linbao111 created HADOOP-10960:
----------------------------------

             Summary: hadoop cause system crash with “soft lock” and “hard lock”
                 Key: HADOOP-10960
                 URL: https://issues.apache.org/jira/browse/HADOOP-10960
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.2.0
         Environment: redhat rhel 6.3, 6.4, 6.5
                      jdk1.7.0_45
                      hadoop2.2
            Reporter: linbao111
            Priority: Critical

I am running Hadoop 2.2 on Red Hat 6.3-6.5, and all of my machines crashed after a while.

/var/log/messages repeatedly shows:

Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: CPU 1
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel:
Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G W --------------- 2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>] [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8 EFLAGS: 00000202
Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
Aug 11 06:30:42 jn4_73_128 kernel: FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
Aug 11 06:30:42 jn4_73_128 kernel: Stack:
Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

Finally, the machine crashed. Output from crash on the vmcore:

crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore

crash 6.1.0-5.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

please wait... (determining panic task)
WARNING: active task ffff881071850040 on cpu 12 not found in PID hash

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
    DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore [PARTIAL DUMP]
        CPUS: 24
        DATE: Sun Aug 10 09:47:32 2014
      UPTIME: 7 days, 16:00:19
LOAD AVERAGE: 11.01, 3.11, 1.08
       TASKS: 724
    NODENAME: master1.otocyon.com
     RELEASE: 2.6.32-431.5.1.el6.x86_64
     VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
     MACHINE: x86_64 (1895 Mhz)
      MEMORY: 64 GB
       PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
         PID: 23976
     COMMAND: "sh"
        TASK: ffff881071850aa0 [THREAD_INFO: ffff880a05c80000]
         CPU: 0
       STATE: TASK_INTERRUPTIBLE (PANIC)

crash> bt
PID: 23976 TASK: ffff881071850aa0 CPU: 0 COMMAND: "sh"
 #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
 #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
 #2 [ffff880028207c80] panic at ffffffff8152751a
 #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
 #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
 #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
 #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
 #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
 #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
 #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
#10 [ffff880028207ef0] notify_die at ffffffff810a153e
#11 [ffff880028207f20] do_nmi at ffffffff8152b4eb

It happened on machines from different vendors, and I have tried updating to the latest kernel from Red Hat.

Can anyone with the same experience help?

--
This message was sent by Atlassian JIRA
(v6.2#6252)