We have a Linux cluster of 1000 nodes, which I wasn't involved in setting up. The nodes run Red Hat 6.2 with kernel 2.2.19. Each has dual 1.2GHz AMD CPUs, 2GB of memory, 2GB of swap, and gigabit Ethernet.
Several nodes hang and/or get kernel errors every day. The first causes that come to mind are bad RAM and running out of virtual memory. I've pasted some logs below. The slaves mostly run FORTRAN code compiled with Lahey F95 v6.0 and g77 (0.5.24-19981002). What else could cause these errors? Are there special kernel config issues for AMD chips? I've run Linux for 9 years, always on Intel CPUs, and have used Debian since before the first official release ("buzz"), but I have never heard of so many problems.

ch_binary_handler+67/168] [do_execve+417/516] [sys_execve+75/124] [system_call+52/56]
Aug 21 06:35:07 hou000752cs kernel: Code: f6 46 24 01 74 52 8b 4c 24 68 39 4e 14 75 49 8b 4c 24 64 31
Aug 21 06:35:07 hou000752cs inetd[458]: pid 11124: exit signal 11
Aug 21 06:35:07 hou000752cs kernel: Unable to handle kernel paging request at virtual address 00ff0024
Aug 21 06:35:07 hou000752cs kernel: current->tss.cr3 = 1463e000, %cr3 = 1463e000
Aug 21 06:35:07 hou000752cs kernel: *pde = 00000000
Aug 21 06:35:07 hou000752cs kernel: Oops: 0000
Aug 21 06:35:07 hou000752cs kernel: CPU: 0
Aug 21 06:35:07 hou000752cs kernel: EIP: 0010:[locks_remove_posix+44/152]
Aug 21 06:35:07 hou000752cs kernel: EFLAGS: 00010206
Aug 21 06:35:07 hou000752cs kernel: eax: 94629b04 ebx: be6b35a0 ecx: 94629a94 edx: 947f6920
Aug 21 06:35:07 hou000752cs kernel: esi: 00ff0000 edi: 942157c0 ebp: 94629b04 esp: 93a9bc28
Aug 21 06:35:07 hou000752cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 21 06:35:07 hou000752cs kernel: Process in.ftpd (pid: 11125, process nr: 30, stackpage=93a9b000)
Aug 21 06:35:07 hou000752cs kernel: Stack: 942157c0 bcc13f60 94629b04 94629a94 8012699a 94785f00 93a9a000 94785f00
Aug 21 06:35:07 hou000752cs kernel: fffffff7 00000202 93f45aa0 00013000 93f45a40 2aabf000 93f45adc 80135619
Aug 21 06:35:07 hou000752cs kernel: 80135626 93f45a40 08085fc0 0806b800 00000000 bcc13f60 80126991 be6b35a0
Aug 21 06:35:07 hou000752cs kernel: Call Trace: [filp_close+82/92] [load_elf_interp+677/708] [load_elf_interp+690/708] [filp
------------------------------------------------------------------------
Aug 21 04:02:00 hou000721cs anacron[5515]: Updated timestamp for job `cron.daily' to 2001-08-21
Aug 21 04:02:01 hou000721cs kernel: Unable to handle kernel paging request at virtual address 11008010
Aug 21 04:02:01 hou000721cs kernel: current->tss.cr3 = 145aa000, %cr3 = 145aa000
Aug 21 04:02:01 hou000721cs kernel: *pde = 00000000
Aug 21 04:02:01 hou000721cs kernel: Oops: 0000
Aug 21 04:02:01 hou000721cs kernel: CPU: 0
Aug 21 04:02:01 hou000721cs kernel: EIP: 0010:[d_lookup+100/224]
Aug 21 04:02:01 hou000721cs kernel: EFLAGS: 00010217
Aug 21 04:02:01 hou000721cs kernel: eax: beee9a88 ebx: 11007ff8 ecx: 00000022 edx: bee00000
Aug 21 04:02:01 hou000721cs kernel: esi: 322f6ef6 edi: ac72f00a ebp: 11008010 esp: 8542bf3c
Aug 21 04:02:01 hou000721cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 21 04:02:01 hou000721cs kernel: Process slocate (pid: 5612, process nr: 18, stackpage=8542b000)
Aug 21 04:02:01 hou000721cs kernel: Stack: ac72f00a 00000000 beee9a88 ac72f000 322f6ef6 0000000a 8012df0c aa7363e0
Aug 21 04:02:01 hou000721cs kernel: 8542bf84 8542bf84 8012e187 aa7363e0 8542bf84 00000000 ac72f000 ac72f000
Aug 21 04:02:01 hou000721cs kernel: 8542a000 7ffffc38 ac72f000 0000000a 322f6ef6 8012e284 ac72f000 aa7363e0
Aug 21 04:02:01 hou000721cs kernel: Call Trace: [cached_lookup+16/84] [lookup_dentry+275/488] [__namei+40/88] [sys_newlstat+42/140] [system_call+52/56]
Aug 21 04:02:01 hou000721cs kernel: Code: 8b 6d 00 8b 74 24 18 39 73 48 75 5c 8b 74 24 24 39 73 0c 75
------------------------------------------------------------------------
Aug 19 12:10:00 hou000669cs kernel: Unable to handle kernel paging request at virtual address d2040200
Aug 19 12:10:00 hou000669cs kernel: current->tss.cr3 = 11c09000, %cr3 = 11c09000
Aug 19 12:10:00 hou000669cs kernel: *pde = 00000000
Aug 19 12:10:00 hou000669cs kernel: Oops: 0000
Aug 19 12:10:00 hou000669cs kernel: CPU: 0
Aug 19 12:10:00 hou000669cs kernel: EIP: 0010:[flush_old_exec+196/552]
Aug 19 12:10:00 hou000669cs kernel: EFLAGS: 00010246
Aug 19 12:10:00 hou000669cs kernel: eax: 00000000 ebx: 9b040000 ecx: 9b041e5c edx: 11c09000
Aug 19 12:10:00 hou000669cs kernel: esi: 00000000 edi: 801e59c3 ebp: 9a5c4000 esp: 9b041ca0
Aug 19 12:10:00 hou000669cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 19 12:10:00 hou000669cs kernel: Process crond (pid: 15182, process nr: 24, stackpage=9b041000)
Aug 19 12:10:00 hou000669cs kernel: Stack: 801e59c3 befddf80 00000000 9b040000 80135d52 9b041e5c 8021e718 fffffff8
Aug 19 12:10:00 hou000669cs kernel: 9b040000 00000000 00000000 00000000 00030003 00000001 00001990 00000034
Aug 19 12:10:00 hou000669cs kernel: 464c457f 00010101 00000000 00000080 9b041d6c befcf400 9b041da4 805427b0
Aug 19 12:10:00 hou000669cs kernel: Call Trace: [cprt+1315/42661] [load_elf_binary+1546/3480] [update_atime+94/100] [do_generic_file_read+1524/1536] [cprt+1312/42661] [search_binary_handler+67/168] [do_execve+417/516]
Aug 19 12:10:00 hou000669cs kernel: [sys_execve+75/124] [system_call+52/56]
Aug 19 12:10:00 hou000669cs kernel: Code: 66 39 83 00 02 00 00 75 29 8b 7c 24 14 66 8b 87 06 02 00 00
------------------------------------------------------------------------
Aug 21 04:02:00 hou000587cs kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000040
Aug 21 04:02:00 hou000587cs kernel: current->tss.cr3 = 20c50000, %cr3 = 20c50000
Aug 21 04:02:00 hou000587cs kernel: *pde = 00000000
Aug 21 04:02:00 hou000587cs kernel: Oops: 0000
Aug 21 04:02:00 hou000587cs kernel: CPU: 0
Aug 21 04:02:00 hou000587cs kernel: EIP: 0010:[dput+295/328]
Aug 21 04:02:00 hou000587cs kernel: EFLAGS: 00010286
Aug 21 04:02:00 hou000587cs kernel: eax: 00000000 ebx: 8aa1d680 ecx: a14faf80 edx: a14fad7c
Aug 21 04:02:00 hou000587cs kernel: esi: ffffffff edi: 00001004 ebp: 00000001 esp: 9cf7be64
Aug 21 04:02:00 hou000587cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 21 04:02:00 hou000587cs kernel: Process slocate (pid: 32162, process nr: 30, stackpage=9cf7b000)
Aug 21 04:02:00 hou000587cs kernel: Stack: 8aa1d680 80132c0c 8aa1d680 9cf7beb0 9cf7beb0 8021e644 00001004 00001004
Aug 21 04:02:00 hou000587cs kernel: 80133d68 fffff7f6 00000806 00000000 8024a198 8021e644 8024a198 a53672a0
Aug 21 04:02:00 hou000587cs kernel: a53672a0 00000000 98bea3fc 9cf7beb0 9cf7beb0 80133df6 00001004 00000000
Aug 21 04:02:00 hou000587cs kernel: Call Trace: [prune_dcache+288/340] [try_to_free_inodes+316/396] [grow_inodes+30/440] [get_new_inode+197/312] [iget4+134/144] [iget+19/24] [ext2_lookup+84/124]
Aug 21 04:02:00 hou000587cs kernel: [real_lookup+80/160] [lookup_dentry+296/488] [__namei+40/88] [sys_newlstat+42/140] [system_call+52/56]
Aug 21 04:02:00 hou000587cs kernel: Code: 8b 40 40 50 56 68 e0 55 1e 80 e8 7a 20 fe ff c7 05 00 00 00

...RickM...