Hi, Feng Guang On 09/05/2012 09:11 AM, Fengguang Wu wrote: > Hi, > > Here is an old problem that happens also in 3.4. It's very unreliable: > it may only happen once per 3000 boots.. > > [ 10.968565] reboot: machine restart > [ 10.983510] ------------[ cut here ]------------ > [ 10.984218] WARNING: at > /c/kernel-tests/src/stable/arch/x86/kernel/smp.c:123 > native_smp_send_reschedule+0x46/0x50() > [ 10.985880] Pid: 88, comm: kpktgend_0 Not tainted 3.6.0-rc3-00005-gb374aa1 > #10 > [ 10.987185] Call Trace: > [ 10.987506] [<7902f42a>] warn_slowpath_common+0x5a/0x80 > [ 10.987506] [<7901ee16>] ? native_smp_send_reschedule+0x46/0x50 > [ 10.987506] [<7901ee16>] ? native_smp_send_reschedule+0x46/0x50 > [ 10.987506] [<7902f4fd>] warn_slowpath_null+0x1d/0x20 > [ 10.987506] [<7901ee16>] native_smp_send_reschedule+0x46/0x50
So this cpu try to fire a nohz balance kick ipi to an offline cpu? May be we are choosing a wrong cpu to kick but that's not the point, what I can't understand is why this cpu could do this kick. We have nohz_kick_needed() to check whether current cpu should do kick , and the first condition we need to match is that current cpu should be idle, but the trace show current pid is 88 not 0. We should add Peter to cc list, may be he will be interested on what happened. Regards, Michael Wang > [ 10.987506] [<7905fdad>] trigger_load_balance+0x1bd/0x250 > [ 10.987506] [<79056d14>] scheduler_tick+0xd4/0x100 > [ 10.987506] [<7903bde5>] update_process_times+0x55/0x70 > [ 10.987506] [<79071187>] tick_sched_timer+0x57/0xb0 > [ 10.987506] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 10.987506] [<7904e0b7>] __run_hrtimer.isra.29+0x57/0x100 > [ 10.987506] [<79071130>] ? tick_nohz_handler+0xe0/0xe0 > [ 10.987506] [<7904ed45>] hrtimer_interrupt+0xe5/0x280 > [ 10.987506] [<7905a5a7>] ? sched_clock_cpu+0xc7/0x150 > [ 10.987506] [<7901f9a4>] smp_apic_timer_interrupt+0x54/0x90 > [ 10.987506] [<79882631>] apic_timer_interrupt+0x31/0x40 > [ 10.987506] [<7905007b>] ? call_srcu+0x2b/0x70 > [ 10.987506] [<793a00e0>] ? __bitmap_intersects+0x10/0x80 > [ 10.987506] [<7988194f>] ? _raw_spin_unlock_irq+0x1f/0x40 > [ 10.987506] [<7905307f>] finish_task_switch+0x7f/0xd0 > [ 10.987506] [<79053038>] ? finish_task_switch+0x38/0xd0 > [ 10.987506] [<7988044a>] __schedule+0x38a/0x770 > [ 10.987506] [<7905a5a7>] ? sched_clock_cpu+0xc7/0x150 > [ 10.987506] [<7987ea40>] ? schedule_timeout+0x100/0x1b0 > [ 10.987506] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 10.987506] [<7988084e>] schedule+0x1e/0x50 > [ 10.987506] [<7987ea45>] schedule_timeout+0x105/0x1b0 > [ 10.987506] [<7903adb0>] ? __internal_add_timer+0xb0/0xb0 > [ 10.987506] [<796842f2>] pktgen_thread_worker+0x1342/0x1390 > [ 10.987506] [<7988044a>] ? __schedule+0x38a/0x770 > [ 10.987506] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 10.987506] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 10.987506] [<7904aa40>] ? abort_exclusive_wait+0x80/0x80 > [ 10.987506] [<7904aa40>] ? abort_exclusive_wait+0x80/0x80 > [ 10.987506] [<79682fb0>] ? pktgen_if_write+0x2210/0x2210 > [ 10.987506] [<79049ff8>] kthread+0x78/0x80 > [ 10.987506] [<79880000>] ? __up.isra.0+0xd/0x2d > [ 10.987506] [<79049f80>] ? insert_kthread_work+0x70/0x70 > [ 10.987506] [<798830c6>] kernel_thread_helper+0x6/0xd > > Here are all the oops messages I collected in the past days: > > [ 4.815145] Restarting system. > [ 4.815644] reboot: machine restart > [ 4.824591] ------------[ cut here ]------------ > [ 4.825423] WARNING: at /c/wfg/linux/arch/x86/kernel/smp.c:123 > native_smp_send_reschedule+0x46/0x50() > [ 4.826881] Pid: 18, comm: kworker/0:1 Not tainted > 3.6.0-rc3-bisect-00007-g6320675 #25 > [ 4.828116] Call Trace: > [ 4.828533] [<7902f42a>] warn_slowpath_common+0x5a/0x80 > [ 4.828585] [<7901ee16>] ? native_smp_send_reschedule+0x46/0x50 > [ 4.828585] [<7901ee16>] ? native_smp_send_reschedule+0x46/0x50 > [ 4.828585] [<7902f4fd>] warn_slowpath_null+0x1d/0x20 > [ 4.828585] [<7901ee16>] native_smp_send_reschedule+0x46/0x50 > [ 4.828585] [<7905fdad>] trigger_load_balance+0x1bd/0x250 > [ 4.828585] [<79056d14>] scheduler_tick+0xd4/0x100 > [ 4.828585] [<7903bde5>] update_process_times+0x55/0x70 > [ 4.828585] [<79071187>] tick_sched_timer+0x57/0xb0 > [ 4.828585] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 4.828585] [<7904e0b7>] __run_hrtimer.isra.29+0x57/0x100 > [ 4.828585] [<79071130>] ? tick_nohz_handler+0xe0/0xe0 > [ 4.828585] [<7904ed45>] hrtimer_interrupt+0xe5/0x280 > [ 4.828585] [<7905a5a7>] ? sched_clock_cpu+0xc7/0x150 > [ 4.828585] [<7901f9a4>] smp_apic_timer_interrupt+0x54/0x90 > [ 4.828585] [<79882401>] apic_timer_interrupt+0x31/0x40 > [ 4.828585] [<7905007b>] ? call_srcu+0x2b/0x70 > [ 4.828585] [<793a00e0>] ? __bitmap_intersects+0x10/0x80 > [ 4.828585] [<7988171f>] ? _raw_spin_unlock_irq+0x1f/0x40 > [ 4.828585] [<7905307f>] finish_task_switch+0x7f/0xd0 > [ 4.828585] [<79053038>] ? finish_task_switch+0x38/0xd0 > [ 4.828585] [<7988021a>] __schedule+0x38a/0x770 > [ 4.828585] [<79045529>] ? worker_thread+0x1a9/0x380 > [ 4.828585] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 4.828585] [<7988061e>] schedule+0x1e/0x50 > [ 4.828585] [<7904552e>] worker_thread+0x1ae/0x380 > [ 4.828585] [<79056ed9>] ? complete+0x49/0x60 > [ 4.828585] [<79045380>] ? manage_workers.isra.23+0x250/0x250 > [ 4.828585] [<79049ff8>] kthread+0x78/0x80 > [ 4.828585] [<79880000>] ? __schedule+0x170/0x770 > [ 4.828585] [<79049f80>] ? insert_kthread_work+0x70/0x70 > [ 4.828585] [<79882e86>] kernel_thread_helper+0x6/0xd > [ 4.828585] ---[ end trace ae11d0bb45464df2 ]--- > -- > [ 9.195595] Restarting system. > [ 9.196142] reboot: machine restart > [ 9.228474] ------------[ cut here ]------------ > [ 9.229329] WARNING: at /c/wfg/linux/arch/x86/kernel/smp.c:123 > native_smp_send_reschedule+0x46/0x50() > [ 9.230890] Pid: 88, comm: kpktgend_0 Not tainted 3.6.0-rc3-bisect #28 > [ 9.232034] Call Trace: > [ 9.232468] [<7902f42a>] warn_slowpath_common+0x5a/0x80 > [ 9.232468] [<7901ee16>] ? native_smp_send_reschedule+0x46/0x50 > [ 9.232468] [<7901ee16>] ? native_smp_send_reschedule+0x46/0x50 > [ 9.232468] [<7902f4fd>] warn_slowpath_null+0x1d/0x20 > [ 9.232468] [<7901ee16>] native_smp_send_reschedule+0x46/0x50 > [ 9.232468] [<7905fdad>] trigger_load_balance+0x1bd/0x250 > [ 9.232468] [<79056d14>] scheduler_tick+0xd4/0x100 > [ 9.232468] [<7903bde5>] update_process_times+0x55/0x70 > [ 9.232468] [<79071187>] tick_sched_timer+0x57/0xb0 > [ 9.232468] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 9.232468] [<7904e0b7>] __run_hrtimer.isra.29+0x57/0x100 > [ 9.232468] [<79071130>] ? tick_nohz_handler+0xe0/0xe0 > [ 9.232468] [<7904ed45>] hrtimer_interrupt+0xe5/0x280 > [ 9.232468] [<7905a5a7>] ? sched_clock_cpu+0xc7/0x150 > [ 9.232468] [<7901f9a4>] smp_apic_timer_interrupt+0x54/0x90 > [ 9.232468] [<79882401>] apic_timer_interrupt+0x31/0x40 > [ 9.232468] [<7905007b>] ? call_srcu+0x2b/0x70 > [ 9.232468] [<793a00e0>] ? __bitmap_intersects+0x10/0x80 > [ 9.232468] [<7988171f>] ? _raw_spin_unlock_irq+0x1f/0x40 > [ 9.232468] [<7905307f>] finish_task_switch+0x7f/0xd0 > [ 9.232468] [<79053038>] ? finish_task_switch+0x38/0xd0 > [ 9.232468] [<7988021a>] __schedule+0x38a/0x770 > [ 9.232468] [<7905a5a7>] ? sched_clock_cpu+0xc7/0x150 > [ 9.232468] [<7987e810>] ? schedule_timeout+0x100/0x1b0 > [ 9.232468] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 9.232468] [<7988061e>] schedule+0x1e/0x50 > [ 9.232468] [<7987e815>] schedule_timeout+0x105/0x1b0 > [ 9.232468] [<7903adb0>] ? __internal_add_timer+0xb0/0xb0 > [ 9.232468] [<796840c2>] pktgen_thread_worker+0x1342/0x1390 > [ 9.232468] [<7988021a>] ? __schedule+0x38a/0x770 > [ 9.232468] [<793accee>] ? do_raw_spin_unlock+0x4e/0x90 > [ 9.232468] [<7904aa40>] ? abort_exclusive_wait+0x80/0x80 > [ 9.232468] [<7904aa40>] ? abort_exclusive_wait+0x80/0x80 > [ 9.232468] [<79682d80>] ? pktgen_if_write+0x2210/0x2210 > [ 9.232468] [<79049ff8>] kthread+0x78/0x80 > [ 9.232468] [<79880000>] ? __schedule+0x170/0x770 > [ 9.232468] [<79049f80>] ? insert_kthread_work+0x70/0x70 > [ 9.232468] [<79882e86>] kernel_thread_helper+0x6/0xd > [ 9.232468] ---[ end trace 4b6a1d80f6ec28ef ]--- > -- > [ 4.665800] Restarting system. > [ 4.666537] machine restart > [ 4.682180] ------------[ cut here ]------------ > [ 4.683044] WARNING: at /c/wfg/linux/arch/x86/kernel/smp.c:120 > native_smp_send_reschedule+0x46/0x50() > [ 4.684515] Pid: 115, comm: reboot Not tainted 3.4.0-bisect #30 > [ 4.685459] Call Trace: > [ 4.685919] [<7984acc5>] ? printk+0x18/0x1a > [ 4.686173] [<7902ab6a>] warn_slowpath_common+0x5a/0x80 > [ 4.686173] [<7901ac46>] ? native_smp_send_reschedule+0x46/0x50 > [ 4.686173] [<7901ac46>] ? native_smp_send_reschedule+0x46/0x50 > [ 4.686173] [<7902ac3d>] warn_slowpath_null+0x1d/0x20 > [ 4.686173] [<7901ac46>] native_smp_send_reschedule+0x46/0x50 > [ 4.686173] [<79058885>] trigger_load_balance+0x295/0x340 > [ 4.686173] [<7904f43f>] scheduler_tick+0xcf/0x100 > [ 4.686173] [<79035985>] update_process_times+0x55/0x70 > [ 4.686173] [<7906979b>] tick_sched_timer+0x5b/0xc0 > [ 4.686173] [<7939794e>] ? do_raw_spin_unlock+0x4e/0x90 > [ 4.686173] [<79047167>] __run_hrtimer.isra.29+0x57/0x100 > [ 4.686173] [<79069740>] ? tick_nohz_handler+0xe0/0xe0 > [ 4.686173] [<79047dd5>] hrtimer_interrupt+0xd5/0x240 > [ 4.686173] [<7901b844>] smp_apic_timer_interrupt+0x54/0x90 > [ 4.686173] [<79858e31>] apic_timer_interrupt+0x31/0x40 > [ 4.686173] [<793877be>] ? delay_tsc+0x3e/0xa0 > [ 4.686173] [<7938785b>] __const_udelay+0x1b/0x20 > [ 4.686173] [<7901adff>] native_nmi_stop_other_cpus+0xaf/0xe0 > [ 4.686173] [<7901a237>] native_machine_shutdown+0x57/0x80 > [ 4.686173] [<79021c79>] kvm_shutdown+0x19/0x20 > [ 4.686173] [<7901a1c2>] native_machine_restart+0x22/0x40 > [ 4.686173] [<7901a3f9>] machine_restart+0x9/0x10 > [ 4.686173] [<7903a643>] kernel_restart+0x33/0x50 > [ 4.686173] [<7903a86d>] sys_reboot+0x13d/0x1e0 > [ 4.686173] [<79052cd7>] ? sched_clock_cpu+0xc7/0x150 > [ 4.686173] [<7902409f>] ? do_page_fault+0x19f/0x3e0 > [ 4.686173] [<79021b60>] ? kvm_read_and_reset_pf_reason+0x40/0x40 > [ 4.686173] [<79048466>] ? up_read+0x16/0x30 > [ 4.686173] [<7902409f>] ? do_page_fault+0x19f/0x3e0 > [ 4.686173] [<790c805b>] ? iterate_supers+0xab/0xd0 > [ 4.686173] [<7939794e>] ? do_raw_spin_unlock+0x4e/0x90 > [ 4.686173] [<79857eed>] ? _raw_spin_unlock+0x1d/0x30 > [ 4.686173] [<790c805b>] ? iterate_supers+0xab/0xd0 > [ 4.686173] [<790e9bd0>] ? __sync_filesystem+0xa0/0xa0 > [ 4.686173] [<79859327>] sysenter_do_call+0x12/0x37 > [ 4.686173] ---[ end trace 9d854ed5f523a273 ]--- > -- > [ 5.815913] Restarting system. > [ 5.816485] machine restart > [ 5.824887] ------------[ cut here ]------------ > [ 5.825679] WARNING: at /c/wfg/linux/arch/x86/kernel/smp.c:123 > native_smp_send_reschedule+0x46/0x50() > [ 5.827263] Pid: 16, comm: kworker/0:1 Not tainted 3.5.0-bisect #29 > [ 5.828336] Call Trace: > [ 5.828793] [<7902bb2a>] warn_slowpath_common+0x5a/0x80 > [ 5.828882] [<7901ba16>] ? native_smp_send_reschedule+0x46/0x50 > [ 5.828882] [<7901ba16>] ? native_smp_send_reschedule+0x46/0x50 > [ 5.828882] [<7902bbfd>] warn_slowpath_null+0x1d/0x20 > [ 5.828882] [<7901ba16>] native_smp_send_reschedule+0x46/0x50 > [ 5.828882] [<7905bc0d>] trigger_load_balance+0x1bd/0x250 > [ 5.828882] [<790529e4>] scheduler_tick+0xd4/0x100 > [ 5.828882] [<79038245>] update_process_times+0x55/0x70 > [ 5.828882] [<7906cc4b>] tick_sched_timer+0x5b/0xc0 > [ 5.828882] [<7939fcce>] ? do_raw_spin_unlock+0x4e/0x90 > [ 5.828882] [<79049f37>] __run_hrtimer.isra.29+0x57/0x100 > [ 5.828882] [<7906cbf0>] ? tick_nohz_handler+0xe0/0xe0 > [ 5.828882] [<7904abc5>] hrtimer_interrupt+0xe5/0x280 > [ 5.828882] [<7901c5a4>] smp_apic_timer_interrupt+0x54/0x90 > [ 5.828882] [<79870351>] apic_timer_interrupt+0x31/0x40 > [ 5.828882] [<7904007b>] ? start_worker+0x2b/0x30 > [ 5.828882] [<793900e0>] ? insn_get_prefixes+0x150/0x240 > [ 5.828882] [<7986f46f>] ? _raw_spin_unlock_irq+0x1f/0x40 > [ 5.828882] [<7904edcf>] finish_task_switch+0x7f/0xd0 > [ 5.828882] [<7904ed88>] ? finish_task_switch+0x38/0xd0 > [ 5.828882] [<7986df6a>] __schedule+0x38a/0x770 > [ 5.828882] [<790415fc>] ? worker_thread+0x16c/0x270 > [ 5.828882] [<7939fcce>] ? do_raw_spin_unlock+0x4e/0x90 > [ 5.828882] [<7986e36e>] schedule+0x1e/0x50 > [ 5.828882] [<79041601>] worker_thread+0x171/0x270 > [ 5.828882] [<79052ba9>] ? complete+0x49/0x60 > [ 5.828882] [<79041490>] ? manage_workers.isra.26+0x1c0/0x1c0 > [ 5.828882] [<79045f18>] kthread+0x78/0x80 > [ 5.828882] [<79860000>] ? ip_vs_control_net_cleanup+0xc0/0xd0 > [ 5.828882] [<79045ea0>] ? kthread_flush_work_fn+0x10/0x10 > [ 5.828882] [<79870dc6>] kernel_thread_helper+0x6/0xd > [ 5.828882] ---[ end trace df77608ce627a14d ]--- > > Thanks, > Fengguang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

