[ https://issues.apache.org/jira/browse/HADOOP-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arpit Agarwal resolved HADOOP-10960.
------------------------------------
    Resolution: Invalid

Hadoop core has no kernel-mode components, so it cannot cause a kernel panic. You likely have a buggy device driver or hit a kernel bug.

Resolving as Invalid.

> hadoop cause system crash with “soft lock” and “hard lock”
> ----------------------------------------------------------
>
>                 Key: HADOOP-10960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10960
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: redhat rhel 6.3, 6.4, 6.5
>                      jdk1.7.0_45
>                      hadoop2.2
>            Reporter: linbao111
>            Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I am running hadoop2.2 on redhat 6.3-6.5, and all of my machines crash after a while.
> /var/log/messages repeatedly shows:
> Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel: CPU 1
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel:
> Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G W --------------- 2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
> Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>] [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
> Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8 EFLAGS: 00000202
> Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
> Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
> Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
> Aug 11 06:30:42 jn4_73_128 kernel: FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
> Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
> Aug 11 06:30:42 jn4_73_128 kernel: Stack:
> Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> and finally crashed:
> crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore
> crash 6.1.0-5.el6
> Copyright (C) 2002-2012 Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
> Copyright (C) 1999-2006 Hewlett-Packard Co
> Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
> Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> Copyright (C) 2005, 2011 NEC Corporation
> Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions. Enter "help copying" to see the conditions.
> This program has absolutely no warranty. Enter "help warranty" for details.
> GNU gdb (GDB) 7.3.1
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
> please wait... (determining panic task)
> WARNING: active task ffff881071850040 on cpu 12 not found in PID hash
>       KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
>     DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore [PARTIAL DUMP]
>         CPUS: 24
>         DATE: Sun Aug 10 09:47:32 2014
>       UPTIME: 7 days, 16:00:19
> LOAD AVERAGE: 11.01, 3.11, 1.08
>        TASKS: 724
>     NODENAME: master1.otocyon.com
>      RELEASE: 2.6.32-431.5.1.el6.x86_64
>      VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
>      MACHINE: x86_64 (1895 Mhz)
>       MEMORY: 64 GB
>        PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
>          PID: 23976
>      COMMAND: "sh"
>         TASK: ffff881071850aa0 [THREAD_INFO: ffff880a05c80000]
>          CPU: 0
>        STATE: TASK_INTERRUPTIBLE (PANIC)
> crash> bt
> PID: 23976 TASK: ffff881071850aa0 CPU: 0 COMMAND: "sh"
> #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
> #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
> #2 [ffff880028207c80] panic at ffffffff8152751a
> #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
> #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
> #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
> #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
> #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
> #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
> #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
> #10 [ffff880028207ef0] notify_die at ffffffff810a153e
> #11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
> It happened on machines from different vendors, and I have tried updating to
> the latest kernel from redhat. Can anyone with the same experience help?

--
This message was sent by Atlassian JIRA
(v6.2#6252)