linbao111 created HADOOP-10960:
----------------------------------

             Summary: hadoop cause system crash with “soft lock” and “hard lock”
                 Key: HADOOP-10960
                 URL: https://issues.apache.org/jira/browse/HADOOP-10960
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.2.0
         Environment: Red Hat RHEL 6.3, 6.4, 6.5
jdk1.7.0_45
hadoop2.2
            Reporter: linbao111
            Priority: Critical


I am running Hadoop 2.2 on Red Hat Enterprise Linux 6.3 through 6.5, and all of
my machines crashed after a while. /var/log/messages repeatedly shows:

Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! 
[jsvc:11508]
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc 
iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode 
dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 
ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: CPU 1 
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc 
iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode 
dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 
ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: 
Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G        W  
---------------    2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>]  
[<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8  EFLAGS: 00000202
Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 
RCX: ffff880028216680
Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 
RDI: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 
R09: 0000000000000001
Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 
R12: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e 
R15: ffff8807786c3e48
Aug 11 06:30:42 jn4_73_128 kernel: FS:  0000000000000000(0000) 
GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 
CR4: 00000000000006e0
Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400
Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo 
ffff8807786c2000, task ffff880c1def3500)
Aug 11 06:30:42 jn4_73_128 kernel: Stack:
Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 
0000000000000000 ffff8807786c3f28
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 
ffff880c1def39c8 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 
ffff8807786c3f78 00007f092d0ad700
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? 
system_call_fastpath+0x16/0x1b
Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 
c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 
00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00 
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? 
system_call_fastpath+0x16/0x1b
Eventually the machine crashed. Opening the vmcore with crash shows:

crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux  
/opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore

crash 6.1.0-5.el6
Copyright (C) 2002-2012  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

please wait... (determining panic task)         
WARNING: active task ffff881071850040 on cpu 12 not found in PID hash

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
    DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Sun Aug 10 09:47:32 2014
      UPTIME: 7 days, 16:00:19
LOAD AVERAGE: 11.01, 3.11, 1.08
       TASKS: 724
    NODENAME: master1.otocyon.com
     RELEASE: 2.6.32-431.5.1.el6.x86_64
     VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
     MACHINE: x86_64  (1895 Mhz)
      MEMORY: 64 GB
       PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
0"
         PID: 23976
     COMMAND: "sh"
        TASK: ffff881071850aa0  [THREAD_INFO: ffff880a05c80000]
         CPU: 0
       STATE: TASK_INTERRUPTIBLE (PANIC)

crash> bt
PID: 23976  TASK: ffff881071850aa0  CPU: 0   COMMAND: "sh"
 #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
 #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
 #2 [ffff880028207c80] panic at ffffffff8152751a
 #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
 #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
 #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
 #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
 #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
 #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
 #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
#10 [ffff880028207ef0] notify_die at ffffffff810a153e
#11 [ffff880028207f20] do_nmi at ffffffff8152b4eb


This happened on machines from different vendors, and I have tried updating to
the latest kernel from Red Hat. Can anyone who has seen the same problem help?
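
For comparing affected machines, it may help to count how often the soft lockup
message recurs and which process and CPU it names. The following is only a
minimal sketch and not part of the original report: the function name
summarize_soft_lockups and the regular expression are illustrative assumptions
based on the message format quoted above, and the sample text is the first
log entry from this report.

```python
import re
from collections import Counter

# Matches the message format quoted in this report, for example:
#   kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
# The "[comm:pid]" part may be wrapped onto the next line, so any
# whitespace (including a newline) is allowed before it.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s*"
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(text):
    """Count soft lockup reports per (command name, CPU number)."""
    counts = Counter()
    for match in SOFT_LOCKUP.finditer(text):
        counts[(match.group("comm"), int(match.group("cpu")))] += 1
    return counts


# First soft lockup entry from this report, including the mail line wrap.
sample = (
    "Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 "
    "stuck for 67s! \n[jsvc:11508]\n"
)

for (comm, cpu), n in summarize_soft_lockups(sample).most_common():
    print(f"{comm} on CPU {cpu}: {n} soft lockup report(s)")
```

To run it against a real log, read the contents of /var/log/messages (or a
copy of it) into a string and pass that to summarize_soft_lockups instead of
the sample.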



--
This message was sent by Atlassian JIRA
(v6.2#6252)
