Have you considered hardware failure? Since the errors are "random" and not reproducible, that seems the most likely cause.
On Sun, 27 Feb 2005 01:40:01 +0200, Alex Efros <[EMAIL PROTECTED]> wrote:
> Hi!
>
> I've problem with my server - it hangs every 3-14 days with different
> kernel oops error messages. I've already post message about this issue
> at 30 Sep 2004 in this maillist. After that time I did a lot of different
> things:
> - try different kernels (currently use 2.6.10 vanilla-sources)
> - configure netconsole to catch kernel oops messages on second server
> - post a number of bugreports in kernel bugzilla:
>   -- invalid operand. EIP is at schedule_timeout+0x35/0xb7
>      http://bugme.osdl.org/show_bug.cgi?id=4085
>   -- udp queue Recv-Q overloaded because socket was not closed
>      http://bugme.osdl.org/show_bug.cgi?id=4086
>   -- oops in e1000
>      http://bugme.osdl.org/show_bug.cgi?id=4088
>   -- Recursive die() failure
>      http://bugme.osdl.org/show_bug.cgi?id=4096
>   -- REISERFS: panic: journal-601, buffer write failed
>      http://bugme.osdl.org/show_bug.cgi?id=4101
>   -- invalid operand: 0000 at include/linux/netdevice.h:879
>      http://bugme.osdl.org/show_bug.cgi?id=4111
>
> In short, kernel just hangs from time to time with different errors, and
> this continuing in last 1.5 year on 2 different hostings and 3 different
> servers (so this isn't probably hardware issue) and different kernel versions.
>
> Only "unusual" things running on my servers is perl scripts which doing
> web spidering 24x7 using a lot of non-blocking sockets (datamining of
> about 100 special sites). Right now I've only two ideas why my server hangs:
> 1) probably some race condition bug in kernel related to non-blocking IO
> 2) hacker attack :-/ (don't really believe in this)
>
> Searching google for same errors don't help - sometimes I see something
> like single non-answered post about similar issue (usually happens with
> squid) in some forum and nothing more.
>
> Can anybody help me with this @#$??? :-( Any ideas what to do or to check?
>
> Right now I've noticed non-fatal kernel oops error in logs (looks like
> sometime critical error happens which hang server while sometime
> non-critical error happens... usually 1-2 days after non-critical error
> critical error will happens too). Here is log:
>
> 2005-02-26_05:20:53.46432 kern.alert: Unable to handle kernel paging request
> at virtual address bf155a80
> 2005-02-26_05:20:53.68198 kern.alert: printing eip:
> 2005-02-26_05:20:53.68201 kern.warn: bf155a80
> 2005-02-26_05:20:53.68202 kern.alert: *pde = 00000000
> 2005-02-26_05:20:53.68203 kern.alert: Oops: 0000 [#1]
> 2005-02-26_05:20:53.68204 kern.warn: CPU: 0
> 2005-02-26_05:20:53.68205 kern.warn: EIP: 0060:[<bf155a80>] Not tainted
> VLI
> 2005-02-26_05:20:53.68206 kern.warn: EFLAGS: 00010246 (2.6.10)
> 2005-02-26_05:20:53.68210 kern.warn: EIP is at 0xbf155a80
> 2005-02-26_05:20:53.68212 kern.warn: eax: 00000000 ebx: 00000000 ecx:
> 00000000 edx: f7087f40
> 2005-02-26_05:20:53.68213 kern.warn: esi: 00000000 edi: 00000000 ebp:
> bffffd38 esp: f7087eec
> 2005-02-26_05:20:53.68214 kern.warn: ds: 007b es: 007b ss: 0068
> 2005-02-26_05:20:53.68215 kern.warn: Process apache2 (pid: 888,
> threadinfo=f7086000 task=f7045580)
> 2005-02-26_05:20:53.68216 kern.warn: Stack: c0155d01 f7087f40 f7087f90
> 00000001 f7045580 ed3ba0cc f7045624 c1b145ac
> 2005-02-26_05:20:53.68217 kern.warn: c18d007b c18d007b 00000296
> 00000000 f7086000 c0114b61 ffffffff 00000007
> 2005-02-26_05:20:53.68218 kern.warn: f721b0a0 c1bc0a20 000003e8
> 00000000 00000000 00000246 00000000 f7045580
> 2005-02-26_05:20:53.68219 kern.warn: Call Trace:
> 2005-02-26_05:20:53.68219 kern.warn: [<c0155d01>] do_select+0x41/0x2b0
> 2005-02-26_05:20:53.68220 kern.warn: [<c0114b61>] do_wait+0x1c1/0x470
> 2005-02-26_05:20:53.68221 kern.warn: [<c01562ab>] sys_select+0x2fb/0x530
> 2005-02-26_05:20:53.68223 kern.warn: [<c0114f25>] sys_waitpid+0x25/0x29
> 2005-02-26_05:20:53.68224 kern.warn: [<c01022e3>] syscall_call+0x7/0xb
> 2005-02-26_05:20:53.68225 kern.warn: Code: Bad EIP value.
> 2005-02-26_05:20:53.68226 kern.warn: <7>IN=eth0 OUT=
> MAC=00:30:48:42:63:fc:00:d0:02:49:64:00:08:00 SRC=212.31.242.103
> DST=XXX.XXX.XXX.XXX LEN=40 TOS=0x00 PREC=0x00 TTL=48 ID=6892 DF PROTO=TCP
> SPT=3039 DPT=443 WINDOW=0 RES=0x00 RST URGP=0
>
> P.S. A lot information about my hardware/software you can see in first
> bugreport in kernel bugzilla (http://bugme.osdl.org/show_bug.cgi?id=4085).
>
> --
> WBR, Alex.
>
