Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-11-11 Thread Edgar Fuß
Sorry for replying to the list, but Mouse refuses to accept mails from .de 
domains.

 If you think that particular problem might be it I have a
 test program you might want, designed to catch just such things.
Yes, I could try that on a spare drive.


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-11-11 Thread Mouse
 Sorry for replying to the list, but Mouse refuses to accept mails
 from .de domains.

Yeah, I should train myself to stop responding to list mail from .de
addresses, or at least use something like my netbsd.org address, until
and unless .de...no *smack* bad Mouse!  No ranting onlist!

 If you think that particular problem might be it I have a test
 program you might want, designed to catch just such things.
 Yes, I could try that on a spare drive.

ftp.rodents-montreal.org:/mouse/hacks/disk-check.c

Unless you also have, or pick up, my md5 and idea libraries, or write
glue code, you'll need to rip out the MD5 and IDEA code, probably also
effectively removing the -D, -E, and -e options.  (Unless I suppose by
some fluke you have MD5 and IDEA code with a compatible interface,
which strikes me as unlikely, since my versions were not designed with
interface compatibility with any other version as a goal.)

If you want to pick them up, see /mouse/local/src/libmd5/ and
.../libidea/ on the same FTP server.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-20 Thread Edgar Fuß
[I tried to send this to Mouse in private email, but he refuses to accept it.]

 I'd have a close look at the containing directory.
[Asterisks denote information manually hidden for privacy.
 The filenames, owner and group names all make sense.]

I=158304666 MODE=40770 SIZE=512
   MTIME=Oct 18 16:54:06 2013 [575961711 nsec]
   CTIME=Oct 18 16:54:06 2013 [575961711 nsec]
   ATIME=Oct 19 03:30:49 2013 [445762748 nsec]
OWNER= GRP= LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x2b3d037c
fsdb (inum: 158304666) ls
slot 0 ino 158304666 reclen 12: directory, `.'
slot 1 ino 158304653 reclen 12: directory, `..'
slot 2 ino 158304678 reclen 40: directory, `.svn'
slot 3 ino 158304690 reclen 20: regular, `.tex'
slot 4 ino 158304760 reclen 20: regular, `.tex'
slot 5 ino 158304761 reclen 24: regular, `.tex'
slot 6 ino 158304762 reclen 28: regular, `.tex'
slot 7 ino 158304763 reclen 24: regular, `.tex'
slot 8 ino 158304764 reclen 24: regular, `.tex'
slot 9 ino 158304765 reclen 24: regular, `.tex'
slot 10 ino 158304766 reclen 24: regular, `.tex'
slot 11 ino 158304767 reclen 32: regular, `.tex'
slot 12 ino 158304768 reclen 28: regular, `.tex'
slot 13 ino 158304792 reclen 200: regular, `.ps'
fsdb (inum: 158304666) lookup .ps
component `Blatt02.ps': current inode 158304792: unallocated inode
fsdb (inum: 158304792) print
current inode 158304792: unallocated inode
fsdb (inum: 158304792) back
current inode: directory
I=158304666 MODE=40770 SIZE=512
   MTIME=Oct 18 16:54:06 2013 [575961711 nsec]
   CTIME=Oct 18 16:54:06 2013 [575961711 nsec]
   ATIME=Oct 19 03:30:49 2013 [445762748 nsec]
OWNER= GRP= LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x2b3d037c
fsdb (inum: 158304666) lookup .svn
component `.svn': current inode 158304678: unallocated inode
fsdb (inum: 158304678) print
current inode 158304678: unallocated inode

 In particular, I'd make sure the d_type values match the types of the 
 pointed-to inodes.
They look correct. I didn't inspect the other inodes apart from slots 2 
and 13, though.

 I'd also have an intensive look at the entries which produce errors,
 and at the inodes named by them.
Well, I just get "unallocated inode".

 If you just want to repair it, rather than figuring out what's going on, 
I would like to find out what's going on, but I need to repair it by Monday 
morning. Moreover, I'm rather frightened by these errors, especially the two
panics on the same dir.

 and you can afford to lose what's in the file
Given that it's obviously an SVN working copy, I can live with losing the 
whole directory. I guess simply running fsck will clear slots 2 and 13.

 If you have the space
I don't.

How do I find out which disc block a directory or inode resides in?
I guess it will be hard enough to translate that to a physical block 
given it's a RAID, but it would be a starting point.
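
For the inode half of this question, the arithmetic can be sketched as
below. It follows the ino_to_cg, ino_to_fsba and ino_to_fsbo macros in
ufs/ffs/fs.h as best understood for FFSv2, where a cylinder group starts
at cg * fs_fpg. FsParams, locate_inode and every number in the example
are made up for illustration; real values would have to come from
dumpfs(8). The result is relative to the start of the file system and
assumes 512-byte sectors; mapping it onto a RAID component is not
attempted here.

```python
from dataclasses import dataclass


@dataclass
class FsParams:
    """Subset of superblock fields; real values come from dumpfs(8)."""
    ipg: int      # fs_ipg: inodes per cylinder group
    fpg: int      # fs_fpg: fragments per cylinder group
    iblkno: int   # fs_iblkno: start of the inode area in a group, in fragments
    bsize: int    # fs_bsize: block size in bytes
    fsize: int    # fs_fsize: fragment size in bytes
    inode_size: int = 256  # FFSv2 on-disk inode size in bytes


def locate_inode(fs, ino):
    """Return (fragment address, byte offset within that block) for ino.

    Assumes the FFSv2 layout in which cylinder group cg begins at
    fragment cg * fs_fpg.
    """
    inopb = fs.bsize // fs.inode_size          # inodes per block
    frags_per_block = fs.bsize // fs.fsize
    cg = ino // fs.ipg                         # which cylinder group
    block_in_cg = (ino % fs.ipg) // inopb      # which inode block in it
    frag_addr = cg * fs.fpg + fs.iblkno + block_in_cg * frags_per_block
    byte_off = (ino % inopb) * fs.inode_size
    return frag_addr, byte_off


# HYPOTHETICAL parameters, not taken from the file system in this thread.
fs = FsParams(ipg=46976, fpg=187912, iblkno=48, bsize=16384, fsize=2048)
frag, off = locate_inode(fs, 158304666)
sector = frag * fs.fsize // 512  # assumes 512-byte sectors
print(f"fragment {frag}, sector {sector}, byte offset {off} within the block")
```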


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-20 Thread Edgar Fuß
 On a FFSv2/WAPBL file system successfully fsck -f'd less than two hours 
 before, I got a "bad dir: mangled entry" panic.
 I fsck'd again, finding missing dot/dotdot entries [...]
 Just within minutes after going live again, I experienced the same panic, 
 and fsck found missing dot/dotdot in exactly the same directory.
Would a mis-behaving NFS client be able to unlink dot or dotdot?


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-20 Thread Edgar Fuß
EF> Can there be some weird file system inconsistency fsck doesn't spot?
dM> Yes.  Well, almost certainly.
So, in case we experience more panics under load on Monday, would it make 
sense to dump/newfs/restore the file system? I.e., can there be any 
inconsistencies that survive a dump/restore?


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-20 Thread Michael van Elst
e...@math.uni-bonn.de (Edgar Fuß) writes:

 EF> Can there be some weird file system inconsistency fsck doesn't spot?
 dM> Yes.  Well, almost certainly.
 So, in case we experience more panics under load on Monday, would it make
 sense to dump/newfs/restore the file system? I.e., can there be any
 inconsistencies that survive a dump/restore?

Restore uses normal filesystem operations. If you have a corrupted
filesystem afterwards, you either have found a bug, or more likely,
there is a hardware issue.



Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-20 Thread Mouse
 Can there be some weird file system inconsistency fsck doesn't spot?
 Yes.  Well, almost certainly.
 So, in case we experience more panics under load on Monday, would it
 make sense to dump/newfs/restore the file system?  I.e., can there be
 any inconsistencies that survive a dump/restore?

There certainly shouldn't be.  (Well, assuming you mean
dump/newfs/restore, not dump/rm -rf */restore.)

But, of course, depending on what made things weird to begin with, it
could potentially happen again, which amounts to much the same thing in
practice.



Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Mouse
 Can there be some weird file system inconsistency fsck doesn't spot?

Yes.  Well, almost certainly.  When I was writing the program that
became resize_ffs when it was imported into NetBSD, I had a bug which
led to the kernel panicking when using the resized filesystem.  jtk
found it - but the relevance now is that fsck completely missed the
subtle filesystem corruption it involved.  I don't know whether fsck
has been fixed to catch that, but, if so, I'm sure there's another
inconsistency nobody's taught fsck to catch, and it could be yours.
(Mind you, fsck does know how to catch many . and .. errors.  But maybe
someone missed one.  And, lest you chase a red herring, the issue I
dealt with does not match your symptoms; see resize_ffs.c for more.)

I'd suggest poking around the oddities with fsdb or some such userland
tool.  I'd also have suggested using clri (and then fsck) rather than
rmdir to deal with the other directory, but that's water under the
bridge now.  (rmdir writes to places other than the directory being
removed)

Has the place you're getting the odd EBADF errors been created after
you deleted the former mystery?  I'm wondering if perhaps it got the
same inode or the same disk block or some such.

Another possibility occurs to me.  You write of a 23-hour RAIDframe
reconstruction, so you probably are dealing with large filesystems.
When I was working with a 7T filesystem, I found peculiar errors which
I eventually tracked down to two effective sectors actually being
represented by the same piece of disk, meaning each one held whatever
was last written to either of them, something filesystems do not deal
with well.  In my case this was due to a 32-bit bug, but I'm thinking
perhaps you're dealing with something more subtle - perhaps one of your
drives works fine both below and above the LBA48 boundary but folds
the sector exactly at the boundary onto some other or some such;
reading quirk lists makes me think such a thing is depressingly
plausible.  (If you think that particular problem might be it I have a
test program you might want, designed to catch just such things.)
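
The idea behind such a check can be sketched as follows. This is not
disk-check.c, only a minimal illustration of the technique: stamp every
sector with its own number, then read everything back and report sectors
that return someone else's stamp. The names stamp, find_aliases and
FoldingDisk are invented here, FoldingDisk is a fake in-memory disk used
only for the demo, and on real hardware one would have to use a spare
drive (the test destroys all data) and make sure reads are not served
from a cache.

```python
import io
import struct

SECTOR = 512  # assumed sector size in bytes


def stamp(n):
    """Sector-sized pattern that encodes the sector number n."""
    return struct.pack("<Q", n) * (SECTOR // 8)


def find_aliases(dev, nsectors):
    """Write a unique stamp to every sector, then read all of them back.

    DESTROYS the contents of dev. Returns (sector, stamp_found) pairs
    for every sector that did not return its own stamp.
    """
    for n in range(nsectors):
        dev.seek(n * SECTOR)
        dev.write(stamp(n))
    bad = []
    for n in range(nsectors):
        dev.seek(n * SECTOR)
        found = struct.unpack_from("<Q", dev.read(SECTOR))[0]
        if found != n:
            bad.append((n, found))
    return bad


class FoldingDisk(io.BytesIO):
    """Fake disk for the demo: sector src is silently stored at dst."""

    def __init__(self, nsectors, src, dst):
        super().__init__(bytes(nsectors * SECTOR))
        self.src = src * SECTOR
        self.dst = dst * SECTOR

    def seek(self, pos, whence=0):
        # Redirect accesses to the folded sector onto its alias.
        return super().seek(self.dst if pos == self.src else pos, whence)


print(find_aliases(FoldingDisk(64, src=32, dst=5), 64))
```

Only the sector that was overwritten through its alias is reported; the
folded sector itself reads back its own stamp, which is why a plain
write-then-verify of one sector at a time would not notice the problem.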



Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Edgar Fuß
 I'd suggest poking around the oddities with fsdb or some such userland
 tool.
I'll try that.

 I'd also have suggested using clri (and then fsck) rather than
 rmdir to deal with the other directory, but that's water under the
 bridge now.  (rmdir writes to places other than the directory being
 removed)
I'm getting two "Bad file descriptor" errors, one on a directory and another 
on a regular file, both in the same directory. What do you suggest to do?

 Has the place you're getting the odd EBADF errors been created after
 you deleted the former mystery?  I'm wondering if perhaps it got the
 same inode or the same disk block or some such.
I'll have a look at that. I do have photographs of the fsck dealing with the 
first dot-lacking directory.

 Another possibility occurs to me.  You write of a 23-hour RAIDframe
 reconstruction, so you probably are dealing with large filesystems.
It's a 6TB (5.8) FFSv2 file system on a 12TB RAIDframe level 5 made from five 
3TB dk components each consisting of a single GPT partition on a 3TB SAS disc.

 meaning each one held whatever was last written to either of them, 
 something filesystems do not deal with well.
With all due respect to FFS's stability, I would expect more havoc if that 
were the case.

 perhaps one of your drives works fine both below and above the LBA48 
 boundary but folds the sector exactly at the boundary onto some other 
 or some such;
Fortunately, the components are SAS discs.

 reading quirk lists makes me think such a thing is depressingly
 plausible.
Even on SAS?


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Mouse
 I'm getting two "Bad file descriptor" errors, one on a directory and
 another on a regular file, both in the same directory.  What do you
 suggest to do?

Hm.

What do you get those errors from?  find(1)?  I think the first thing
I'd try to do is provoke them deliberately by hand - eg, try using find
on a directory one or two levels up rather than the whole filesystem,
if possible - and try to capture the error with ktrace.  I'm wondering
what syscall is producing the error.
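
As a complement to ktrace, a small walker can narrow down which path and
which kind of call fails. The sketch below is only an illustration: the
function name probe is invented, and it reports the Python-level call
(listdir or lstat) rather than the exact syscall, so ktrace is still the
tool for the final answer.

```python
import errno
import os
import stat


def probe(top):
    """Walk top; return (path, call, errno name) for every failing call."""
    failures = []
    stack = [top]
    while stack:
        d = stack.pop()
        try:
            names = os.listdir(d)
        except OSError as e:
            name = errno.errorcode.get(e.errno, str(e.errno))
            failures.append((d, "listdir", name))
            continue
        for entry in names:
            path = os.path.join(d, entry)
            try:
                st = os.lstat(path)  # do not follow symlinks
            except OSError as e:
                name = errno.errorcode.get(e.errno, str(e.errno))
                failures.append((path, "lstat", name))
                continue
            if stat.S_ISDIR(st.st_mode):
                stack.append(path)
    return failures
```

Calling probe() on a directory one or two levels above the suspect one
and printing the result keeps the run short and gives exact paths to
feed to ktrace afterwards.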

You say it's an FFSv2 filesystem; I don't know much about how FFSv2
differs from FFSv1, and it's v1 I know well.  So the rest of this will
be written for v1, in the hope it's similar enough to v2 for the
remarks to be useful.

I'd have a close look at the containing directory.  In particular, I'd
make sure the d_type values match the types of the pointed-to inodes.
(It wouldn't surprise me if two entries with different d_type values
but the same inumber could produce surprising results, for example.)
I'd also have an intensive look at the entries which produce errors,
and at the inodes named by them.  If you just want to repair it, rather
than figuring out what's going on, and you can afford to lose what's in
the file, I'd suggest clri on all three inodes (containing directory,
file, and contained directory), then fsck and fishing things out of
lost+found to clean up the damage.  (If any two of those inodes are the
same, something is definitely corrupt.)

I can't really give full instructions, since this would be an
exploratory sort of investigation, with most of it guided by what
earlier work found.

 I'll have a look at that.  I do have photographs of the fsck dealing
 with the first dot-lacking directory.

That would be interesting, though I'm not sure how informative it would
be; the thing of real interest is something fsck missed and thus
probably won't be mentioned in fsck's output.

 meaning each one held whatever was last written to either of them,
 something filesystems do not deal with well.
 With all due respect to FFS's stability, I would expect more havoc if
 that were the case.

Depends on what the blocks get used for.  If, for example, each of them
happened to be a block of inodes, nothing will happen until inodes that
happen to end up on the same piece of disk get used - and the defaults,
in my experience, provide _way_ more inodes than needed, so that could
be a rare event, and will strike only a handful of inodes in any case.
If they happen to be data blocks, nothing will happen except that file
contents will get corrupted.  It's when they're indirect blocks or
superblocks, or one's inodes and the other isn't, that I'd expect
serious havoc.

 Fortunately, the components are SAS discs.

 reading quirk lists makes me think such a thing is depressingly
 plausible.
 Even on SAS?

Well, I wouldn't totally rule it out - at this point there's very
little I'd totally rule out - but I'd definitely investigate other
possibilities first.

If you have the space - which you may well not, given how large the
filesystem is - I'd suggest capturing a snapshot of the filesystem for
investigation, then look at the live copy with an eye to optimizing for
time-to-repair, with understanding deferred to later investigations on
the copy.



Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Edgar Fuß
 What do you get those errors from?  find(1)?
Yes.

What I've found out so far (from using fsdb) is that those two directory 
entries (one directory and one regular file) point to unallocated inodes.


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Mouse
 What do you get those errors from?  find(1)?

 Yes.

 What I've found out so far (from using fsdb) is that those two
 directory entries (one directory and one regular file) point to
 unallocated inodes.

Okay, that's weird.  fsck should have caught that - indeed, that's what
you'll see if you clri inodes.

The plausibility of the "different blocks using the same disk sectors"
scenario just went up in my mind - if an inode block overlaps with a
data block, this is one of the possible results.  The inode-free bitmap
may also be involved; an inode that's marked free but contains sane
data, or conversely, might cause such things.  I don't know whether
fsdb is using the inode-free bitmap or the contents of the inode itself
when telling you the inode is unallocated; it might be worth
investigating.

There are other things that can do this, too; it might be a good idea
to check, if you haven't already, that you don't have two partitions
overlapping.
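
Checking for overlap is simple interval arithmetic once the start and
size of each partition are known (for example from the GPT or disklabel
listing). A minimal sketch, with an invented function name and a
HYPOTHETICAL layout that is not the one from this thread:

```python
def overlapping(parts):
    """parts: iterable of (name, start, size), all in sectors.

    Returns the list of (name_a, name_b) pairs whose ranges intersect.
    """
    ordered = sorted(parts, key=lambda p: p[1])  # sort by start sector
    clashes = []
    for i, (name_a, start_a, size_a) in enumerate(ordered):
        end_a = start_a + size_a  # first sector past partition a
        for name_b, start_b, _ in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so nothing later can overlap a
            clashes.append((name_a, name_b))
    return clashes


# HYPOTHETICAL layout: dk1 runs 34 sectors into dk2.
parts = [("dk0", 34, 1000), ("dk1", 1034, 2000), ("dk2", 3000, 500)]
print(overlapping(parts))
```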



Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Edgar Fuß
 Has the place you're getting the odd EBADF errors been created after
 you deleted the former mystery?
Probably yes (fsdb didn't give me the birth time, but ctime and mtime are 
identical and later than the last fsck/deletion of the missing-dot/dotdot 
directory).

 I'm wondering if perhaps it got the same inode or the same disk block 
 or some such.
No. The directory that twice caused a panic because of missing dot/dotdot 
was inode 463357837, the one with the two entries pointing to unallocated 
inodes is inode 158304666, the two unallocated inodes being 158304792 and 
158304678.


Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Mouse
 I'm wondering if perhaps it got the same inode or the same disk
 block or some such.
 No.  The directory that twice caused a panic because of missing
 dot/dotdot was inode 463357837, the one with the two entries pointing
 to unallocated inodes is inode 158304666, the two unallocated inodes
 being 158304792 and 158304678.

I don't know whether fsdb can give it to you easily, but I'd be curious
what disk sectors those four inodes fall into.  In particular, if the
first one is close to the last two, that reinforces the "blocks
overlapping" theory.

In hex, those inumbers are (in order of appearance above) 1b9e478d,
96f899a, 96f8a18, and 96f89a6.  The two unallocated inodes could fall
into the same disk block, if your blocks are big enough.  FFSv2 inodes
are 256 bytes; FFSv1 blocks max out at 64K, but in a quick glance I
haven't found the maximum FFSv2 size.  If your filesystem uses 256K or
larger blocks, those two inodes fall into the same block; otherwise,
not.  If they do, it might be interesting to look at the rest of that
block, in case its contents look recognizable.
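
The arithmetic above can be reproduced with a few lines. This is a
simplified sketch: same_inode_block is an invented name, and it assumes
inodes are packed from inumber 0 in block-aligned groups (that is, that
the number of inodes per cylinder group is a multiple of the number of
inodes per block), which ignores the real per-group layout.

```python
INODE_SIZE = 256  # bytes per FFSv2 on-disk inode, as stated above


def same_inode_block(ino_a, ino_b, block_size, inode_size=INODE_SIZE):
    """True if both inodes fall into the same block of the inode area.

    Simplification: treats the inode area as one array starting at
    inumber 0 and aligned to the block size.
    """
    per_block = block_size // inode_size
    return ino_a // per_block == ino_b // per_block


# The four inumbers from the message above, in the same order.
for ino in (463357837, 158304666, 158304792, 158304678):
    print(f"{ino} = 0x{ino:x}")

# The two unallocated inodes, for a range of block sizes.
for kib in (16, 64, 128, 256):
    shared = same_inode_block(158304792, 158304678, kib * 1024)
    print(f"{kib}K blocks: same block? {shared}")
```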



Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot

2013-10-19 Thread Edgar Fuß
 If your filesystem uses 256K or larger blocks,
 those two inodes fall into the same block;
No, it uses 16k blocks.