Re: vpf-10680, minor corruptions
Hello!

On Mon, Jun 23, 2003 at 03:38:20PM +0200, Christian Kujau wrote:
as stated before, the corruptions occur only on this very alpha machine,

Well, I still cannot build the kernel myself and am still working on it (I get make: *** [vmlinux] Error 139 and a zero-length vmlinux). BTW, I realised that I have not looked into your kernel config for that box; can you send it to me, please?

bread: Cannot read the block (523914): (Input/output error).

Hm, but it still means the kernel returned some error for a read request.

hah! i was not aware that the disk might have an hw problem, not a single error ever showed up in my logs. this was weird. so i re-partitioned the disk with a 10MB sde (to circumvent the bread error) at the beginning and a 2 GB sde2. now reiserfsck/cp/diff are all working fine under 2.4.21, but 2.5.72 is still erroneous.

Sigh.

btw: i am still using reiserfsprogs 3.6.8 now (since debian/testing has 3.6.6) and i have compiled these utils under a 2.5.72 kernel. is it safe to use them under 2.4 ?

I see that you have used 2.5.70 and earlier kernels on alpha too. Do you have any idea of when stuff broke for you?

Bye,
Oleg
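To see where the unreadable block from the bread error sits, the block number can be turned into a byte or sector offset. This is a minimal sketch under two assumptions that the error message does not state: that 523914 is counted in filesystem blocks, and that the block size is 4096 bytes (the usual reiserfs default). block_byte_offset() and block_sector_offset() are hypothetical helper names. The result is relative to the start of the partition that held the filesystem; the partition's own starting sector would have to be added to get an absolute disk position.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a filesystem block from the start of its partition.
 * block_size is assumed (e.g. 4096 for a default reiserfs). */
static uint64_t block_byte_offset(uint64_t block, uint32_t block_size)
{
    return block * (uint64_t)block_size;
}

/* The same offset in 512-byte sectors, the unit most disk tools use.
 * Only meaningful when the block size is a multiple of 512. */
static uint64_t block_sector_offset(uint64_t block, uint32_t block_size)
{
    assert(block_size % 512 == 0);
    return block_byte_offset(block, block_size) / 512;
}
```

For block 523914 with 4096-byte blocks this gives 2145951744 bytes (just under 2 GiB), or sector 4191312, into that partition.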
Re: 2.4.21 reiserfs oops
On Tue, 24 Jun 2003, Oleg Drokin moaned:
Hello!
On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 0001

This is a very strange address to oops on.

I'll say! Looks almost like it JMPed to a null pointer or something.

No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.

JMPed to ((long)NULL)+1 or something then :) The fact remains that it's not somewhere that even a memory error would make us likely to jump to.

Jun 22 13:52:43 loki kernel: EIP:0010:[c0092df4]Not tainted

And the EIP is before the kernel start, which is also very strange. On the other hand, the address c0192df4 is somewhere inside reiserfs code, so it looks like a single-bit error, I'd say.

I think it unlikely to be a RAM problem, given that the problem happened shortly after upgrading to 2.4.21; this was about half a day after I rebooted it because it threw a pile of never-seen-again, un-syslogged SCSI abort errors at me (sym53c875); and *that* was a few minutes after I rebooted into 2.4.21 for the first time.

Hm, so first there were some SCSI problems and then a reiserfs oops?

Different boots. I upgraded; the first boot crashed within five minutes with weird SCSI errors, so I rebooted again and this happened six hours later. I'm willing to write off the SCSI errors to the shock effect of having just been powered down for the first time in a year (the shutdown scripts didn't quite work and the reset button is disconnected).

Actually, since the RAM is good, I see no good reason for this to happen. (Actually, I see no good reason for valid code before _text, either.) I wonder if 2.4.21 constantly crashes like that for you, then?

No obvious sign of it:
9:34pm up 1 day 22:30, 14 users, load average: 0.09, 0.12, 0.16
(it is of course waiting until I am hundreds of miles away. *Then* it'll crash.)
-- `It is an unfortunate coincidence that the date locarchive.h was written (in hex) matches Ritchie's birthday (in octal).' -- Roland McGrath on the libc-alpha list
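The single-bit-error reading above can be checked mechanically: XOR the reported EIP (c0092df4) with the reiserfs address Oleg mentions (c0192df4) and count the set bits. This is a minimal C sketch; bit_distance() and single_flipped_bit() are hypothetical helper names, and the check only shows that two values differ in exactly one bit, not what flipped it.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ (Hamming distance),
 * computed by counting the set bits of a XOR b. */
static int bit_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    int count = 0;
    while (diff) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* If a and b differ in exactly one bit, return that bit's index
 * (0 = least significant); otherwise return -1. */
static int single_flipped_bit(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    int index = 0;
    if (bit_distance(a, b) != 1)
        return -1;
    while ((diff & 1u) == 0) {
        diff >>= 1;
        index++;
    }
    return index;
}
```

Both the EIP pair (bit 20, i.e. 0x00100000) and the faulting address 0001 compared with NULL (bit 0) come out as single-bit differences.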
Re: 2.4.21 reiserfs oops
On Tue, 2003-06-24 at 16:34, Nix wrote:
[...]
Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 0001
[...]
Jun 22 13:52:43 loki kernel: EIP:0010:[c0092df4]Not tainted

The EIP isn't zero or 1; you've got a bad null pointer dereference at address 1. You get this when you do something like *(char *)1 = some_val. The RAM is most likely bad (you're 1 bit away from zero), but you might try a reiserfsck on any drives affected by the SCSI errors.

-chris
Re: vpf-10680, minor corruptions
Christian Kujau wrote:
of course, the best thing i can do is the el-cheapo-hacking approach: compiling 2.5.60...up to 2.5.72 and see *when* it breaks. hm, compiling a 2.5 kernel takes 180min on this machine. but anyway, i'll start with 2.5.60 now, see what it gives.

no, i started with 2.5.66 but the kernel did not compile. 2.5.65 did compile (don't ask how long) and has already booted. but trying to mount the newly created reiserfs gives:

module reiserfs: Relocation overflow vs section 9

in the log. the reiserfs module was not loaded. modprobe reiserfs gives:

lila:~# modprobe reiserfs
FATAL: Error inserting reiserfs (/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
lila:~# uname -a
Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux

i compiled the module with CONFIG_REISERFS_CHECK=y. shall i go on with 2.5.64 or better 2.5.67 ?

good night,
Christian.
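The plan above walks every release from 2.5.60 to 2.5.72. At roughly 180 minutes per build, a binary search (bisection) over the same range would need fewer builds, provided that 2.5.60 is good, 2.5.72 is bad, and the bug, once introduced, stays in every later release. This C sketch illustrates that swapped-in technique; it is not something proposed in the thread. first_bad_release() is a hypothetical name, and simulated_bug_since_68() only fakes a bug appearing in 2.5.68 so the search can be exercised.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: "build, boot and test 2.5.<minor>; is it bad?" */
typedef bool (*is_bad_fn)(int minor);

/* Find the first bad 2.5.x release, assuming 2.5.<good> is good,
 * 2.5.<bad> is bad (good < bad), and every release after the first
 * bad one is also bad. Stores the number of releases tested in *builds. */
static int first_bad_release(int good, int bad, is_bad_fn is_bad, int *builds)
{
    assert(good < bad);
    *builds = 0;
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        (*builds)++;
        if (is_bad(mid))
            bad = mid;      /* bug already present: look earlier */
        else
            good = mid;     /* still fine: look later */
    }
    return bad;
}

/* Stand-in for a real build-and-test run: pretends the corruption first
 * appeared in 2.5.68. Purely illustrative, not a result from the thread. */
static bool simulated_bug_since_68(int minor)
{
    return minor >= 68;
}
```

With good = 60 and bad = 72, this tests 2.5.66, 2.5.69, 2.5.67 and 2.5.68: four builds, roughly 12 hours at 180 minutes each. In practice a release that will not build, as 2.5.66 did not here, has to be replaced by a neighbour (2.5.65 or 2.5.67); if the bug was introduced in the unbuildable release itself, bisection can only narrow it down to that gap.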
Re: vpf-10680, minor corruptions
Hello!

On Wed, Jun 25, 2003 at 02:42:24AM +0200, Christian Kujau wrote:
(/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
lila:~# uname -a
Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux
i compiled the module with CONFIG_REISERFS_CHECK=y. shall i go on with 2.5.64 or better 2.5.67 ?

Try compiling the kernel that is known-bad for you (e.g. 2.5.72/2.5.73) with CONFIG_REISERFS_CHECK=y.

Bye,
Oleg