Re: vpf-10680, minor corruptions

2003-06-24 Thread Oleg Drokin
Hello!

On Mon, Jun 23, 2003 at 03:38:20PM +0200, Christian Kujau wrote:

 as stated before, the corruptions occur only on this very alpha machine, 

Well, I still cannot build the kernel myself and am still working on it.
(I get make: *** [vmlinux] Error 139 and a zero-length vmlinux.)

BTW, I realised that I have not looked into your kernel config for that box,
can you send it to me please?

 bread: Cannot read the block (523914): (Input/output error).

Hm, but it still means the kernel returned some error for the read request.

 hah! i was not aware that the disk might have an hw problem, not a 
 single error ever showed up in my logs. this was weird. so i 
 re-partitioned the disk with a 10MB sde (to circumvent the bread error) 
 on the beginning and a 2 GB sde2. now reiserfsck/cp/diff are all working 
 fine under 2.4.21, but 2.5.72 is still erroneous.

Sigh.

 
 btw: i am still using reiserfsprogs 3.6.8 now (since debian/testing has 
 3.6.6) and i have compiled these utils under a 2.5.72 kernel. is it safe 
 to use them under 2.4 ?

I see that you have used 2.5.70 and earlier kernels on alpha too.
Do you have any idea of when stuff broke for you?

Bye,
Oleg


Re: 2.4.21 reiserfs oops

2003-06-24 Thread Nix
On Tue, 24 Jun 2003, Oleg Drokin moaned:
 Hello!
 
 On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
 
  Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at 
  virtual address 0001 
  This is a very strange address to oops on.
 I'll say! Looks almost like it JMPed to a null pointer or something.
 
 No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.

JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
not somewhere that even a memory error would make us likely to jump to.

  Jun 22 13:52:43 loki kernel: EIP: 0010:[c0092df4] Not tainted
  And the EIP is prior to kernel start which is also very strange.
  On the other hand the address c0192df4 is somewhere inside reiserfs code,
  so it looks like a single bit error, I'd say.
 I think it unlikely to be RAM problems given that the problem happened
 shortly after upgrading to 2.4.21; this was about half a day after I
 rebooted it because it threw a pile of never-seen-again, un-syslogged
 SCSI abort errors at me (sym53c875); and *that* was a few minutes after
 I rebooted into 2.4.21 for the first time.
 
 Hm, so first there were some scsi problems and then reiserfs oops?

Different boots. I upgraded, the first boot crashed within five minutes
with weird SCSI errors, so I rebooted again and this happened six hours
later.

I'm willing to write off the SCSI errors to the shock effect of having
just been powered down for the first time in a year (the shutdown
scripts didn't quite work and the reset button is disconnected).

 Actually since the RAM is good, I see no good reason for this to happen.
 (actually I see no good reason for valid code before _text, either).
 
 I wonder if 2.4.21 constantly crashes like that for you, then?

No obvious sign of it:

  9:34pm  up 1 day 22:30,  14 users,  load average: 0.09, 0.12, 0.16

(it is of course waiting until I am hundreds of miles away. *Then* it'll
crash.)

-- 
`It is an unfortunate coincidence that the date locarchive.h was
 written (in hex) matches Ritchie's birthday (in octal).'
   -- Roland McGrath on the libc-alpha list


Re: 2.4.21 reiserfs oops

2003-06-24 Thread Chris Mason
On Tue, 2003-06-24 at 16:34, Nix wrote:
 On Tue, 24 Jun 2003, Oleg Drokin moaned:
  Hello!
  
  On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
  
   Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference 
   at virtual address 0001 
   This is a very strange address to oops on.
  I'll say! Looks almost like it JMPed to a null pointer or something.
  
  No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.
 
 JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
 not somewhere that even a memory error would make us likely to jump to.
 
   Jun 22 13:52:43 loki kernel: EIP: 0010:[c0092df4] Not tainted

The EIP isn't zero or 1; you've got a bad null pointer dereference at
address 1.  You get this when you do something like *(char *)1 =
some_val.

The ram is most likely bad, you're 1 bit away from zero, but you might
try a reiserfsck on any drives affected by the scsi errors.

-chris




Re: vpf-10680, minor corruptions

2003-06-24 Thread Christian Kujau
Christian Kujau wrote:
of course, the best thing i can do is the el-cheapo-hacking approach: 
compiling 2.5.60...up to 2.5.72 and see *when* it breaks. hm, compiling 
a 2.5 kernel takes 180min on this machine. but anyway, i'll start with 
2.5.60 now, see what it gives.
no, i started with 2.5.66 but the kernel did not compile. 2.5.65 did 
compile (don't ask how long) and has already booted. but trying to 
mount the newly created reiserfs gives:

module reiserfs: Relocation overflow vs section 9

in the log. the reiserfs module was not loaded. modprobe reiserfs gives:

lila:~# modprobe reiserfs
FATAL: Error inserting reiserfs 
(/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
lila:~# uname -a
Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux

i compiled the module with CONFIG_REISERFS_CHECK=y.

shall i go on with 2.5.64 or better 2.5.67 ?

good night,
Christian.


Re: vpf-10680, minor corruptions

2003-06-24 Thread Oleg Drokin
Hello!

On Wed, Jun 25, 2003 at 02:42:24AM +0200, Christian Kujau wrote:
 (/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
 lila:~# uname -a
 Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux
 i compiled the module with CONFIG_REISERFS_CHECK=y.
 shall i go on with 2.5.64 or better 2.5.67 ?

Try compiling a kernel that is known-bad for you (e.g. 2.5.72/2.5.73)
with CONFIG_REISERFS_CHECK=y.

Bye,
Oleg