Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
Sorry for replying to the list, but Mouse refuses to accept mails from .de domains.

> If you think that particular problem might be it I have a test program you might want, designed to catch just such things.

Yes, I could try that on a spare drive.
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> Sorry for replying to the list, but Mouse refuses to accept mails from .de domains.

Yeah, I should train myself to stop responding to list mail from .de addresses, or at least use something like my netbsd.org address, until and unless .de...no *smack* bad Mouse! No ranting onlist!

>> If you think that particular problem might be it I have a test program you might want, designed to catch just such things.
> Yes, I could try that on a spare drive.

ftp.rodents-montreal.org:/mouse/hacks/disk-check.c

Unless you also have, or pick up, my md5 and idea libraries, or write glue code, you'll need to rip out the MD5 and IDEA code, probably also effectively removing the -D, -E, and -e options. (Unless I suppose by some fluke you have MD5 and IDEA code with a compatible interface, which strikes me as unlikely, since my versions were not designed with interface compatibility with any other version as a goal.) If you want to pick them up, see /mouse/local/src/libmd5/ and .../libidea/ on the same FTP server.

/~\ The ASCII                 Mouse
\ / Ribbon Campaign
 X  Against HTML              mo...@rodents-montreal.org
/ \ Email!                    7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
[I tried to send this to mouse in private email, but he refuses to accept it]

> I'd have a close look at the containing directory.

[Asterisks denote information manually hidden for privacy. The filenames, owner and group names all make sense.]

I=158304666 MODE=40770 SIZE=512
MTIME=Oct 18 16:54:06 2013 [575961711 nsec]
CTIME=Oct 18 16:54:06 2013 [575961711 nsec]
ATIME=Oct 19 03:30:49 2013 [445762748 nsec]
OWNER= GRP= LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x2b3d037c
fsdb (inum: 158304666) ls
slot 0 ino 158304666 reclen 12: directory, `.'
slot 1 ino 158304653 reclen 12: directory, `..'
slot 2 ino 158304678 reclen 40: directory, `.svn'
slot 3 ino 158304690 reclen 20: regular, `.tex'
slot 4 ino 158304760 reclen 20: regular, `.tex'
slot 5 ino 158304761 reclen 24: regular, `.tex'
slot 6 ino 158304762 reclen 28: regular, `.tex'
slot 7 ino 158304763 reclen 24: regular, `.tex'
slot 8 ino 158304764 reclen 24: regular, `.tex'
slot 9 ino 158304765 reclen 24: regular, `.tex'
slot 10 ino 158304766 reclen 24: regular, `.tex'
slot 11 ino 158304767 reclen 32: regular, `.tex'
slot 12 ino 158304768 reclen 28: regular, `.tex'
slot 13 ino 158304792 reclen 200: regular, `.ps'
fsdb (inum: 158304666) lookup .ps
component `Blatt02.ps': current inode 158304792: unallocated inode
fsdb (inum: 158304792) print
current inode 158304792: unallocated inode
fsdb (inum: 158304792) back
current inode: directory
I=158304666 MODE=40770 SIZE=512
MTIME=Oct 18 16:54:06 2013 [575961711 nsec]
CTIME=Oct 18 16:54:06 2013 [575961711 nsec]
ATIME=Oct 19 03:30:49 2013 [445762748 nsec]
OWNER= GRP= LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x2b3d037c
fsdb (inum: 158304666) lookup .svn
component `.svn': current inode 158304678: unallocated inode
fsdb (inum: 158304678) print
current inode 158304678: unallocated inode

> In particular, I'd make sure the d_type values match the types of the pointed-to inodes.

They look correct. I didn't inspect the other inodes apart from slots 2 and 13, though.
> I'd also have an intensive look at the entries which produce errors, and at the inodes named by them.

Well, I just get unallocated inode.

> If you just want to repair it, rather than figuring out what's going on,

I would like to find out what's going on, but I need to repair it by monday morning. Moreover, I'm rather frightened by these errors, especially the two panics on the same dir.

> and you can afford to lose what's in the file

Given that it's obviously an SVN working copy, I can live with losing the whole directory. I guess simply running fsck will clear slots 2 and 13.

> If you have the space

I don't.

How do I find out which disc block a directory or inode resides in? I guess it will be hard enough to translate that to a physical block given it's a RAID, but it would be a starting point.
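As a starting point for the "which disc block does an inode live in" question, here is a rough sketch of the arithmetic. It assumes the layout follows the ino_to_cg / ino_to_fsba macros in ufs/ffs/fs.h as I remember them for FFSv2 (cylinder group start = group number * fs_fpg), so check it against your source tree before trusting it. The geometry values are made up for illustration; the real ones would have to come from the superblock (dumpfs prints them). The result is an offset inside the file system, not a physical block on a RAID component.

```python
from dataclasses import dataclass

UFS2_INODE_SIZE = 256  # bytes per on-disk FFSv2 inode


@dataclass(frozen=True)
class FsGeometry:
    """Superblock fields needed to locate an inode (values are assumptions)."""
    ipg: int     # inodes per cylinder group (fs_ipg)
    fpg: int     # fragments per cylinder group (fs_fpg)
    iblkno: int  # start of the inode area within a group, in fragments (fs_iblkno)
    frag: int    # fragments per block (fs_frag)
    fsize: int   # fragment size in bytes (fs_fsize)

    @property
    def inopb(self):
        """Inodes per file system block."""
        return self.frag * self.fsize // UFS2_INODE_SIZE


def inode_location(geo, ino):
    """Return (cylinder group, fragment address, byte offset) of inode `ino`."""
    cg = ino // geo.ipg
    block_in_cg = (ino % geo.ipg) // geo.inopb
    # Fragment address of the block that holds the inode.
    fsba = cg * geo.fpg + geo.iblkno + block_in_cg * geo.frag
    # Byte offset of the inode itself, relative to the start of the file system.
    byte_offset = fsba * geo.fsize + (ino % geo.inopb) * UFS2_INODE_SIZE
    return cg, fsba, byte_offset


if __name__ == "__main__":
    # HYPOTHETICAL geometry: 16K blocks, 2K fragments; ipg/fpg/iblkno invented.
    geo = FsGeometry(ipg=25600, fpg=94208, iblkno=80, frag=8, fsize=2048)
    for ino in (158304666, 158304678, 158304792):
        print(ino, inode_location(geo, ino))
```

The byte offset divided by 512 gives a sector number relative to the start of the file system; mapping that through RAIDframe striping to a component disc is a separate step not covered here.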
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> On a FFSv2/WAPBL file system successfully fsck -f'd less than two hours before, I got a bad dir: mangled entry panic. I fsck'd again, finding missing dot/dotdot entries [...]
> Just within minutes after going live again, I experienced the same panic, and fsck found missing dot/dotdot in exactly the same directory.

Would a mis-behaving NFS client be able to unlink dot or dotdot?
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
EF> Can there be some weird file system inconsistency fsck doesn't spot?
dM> Yes. Well, almost certainly.

So, in case we experience more panics under load on monday, would it make sense to dump/newfs/restore the file system? I.e., can there be any inconsistencies that survive a dump/restore?
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
e...@math.uni-bonn.de (Edgar Fuß) writes:

> EF> Can there be some weird file system inconsistency fsck doesn't spot?
> dM> Yes. Well, almost certainly.
> So, in case we experience more panics under load on monday, would it make sense to dump/newfs/restore the file system? I.e., can there be any inconsistencies that survive a dump/restore?

Restore uses normal filesystem operations. If you have a corrupted filesystem afterwards, you either have found a bug, or more likely, there is a hardware issue.
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
>>> Can there be some weird file system inconsistency fsck doesn't spot?
>> Yes. Well, almost certainly.
> So, in case we experience more panics under load on monday, would it make sense to dump/newfs/restore the file system? I.e., can there be any inconsistencies that survive a dump/restore?

There certainly shouldn't be. (Well, assuming you mean dump/newfs/restore, not dump/rm -rf */restore.)

But, of course, depending on what made things weird to begin with, it could potentially happen again, which amounts to much the same thing in practice.
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> Can there be some weird file system inconsistency fsck doesn't spot?

Yes. Well, almost certainly.

When I was writing the program that became resize_ffs when it was imported into NetBSD, I had a bug which led to the kernel panicking when using the resized filesystem. jtk found it - but the relevance now is that fsck completely missed the subtle filesystem corruption it involved. I don't know whether fsck has been fixed to catch that, but, if so, I'm sure there's another inconsistency nobody's taught fsck to catch, and it could be yours. (Mind you, fsck does know how to catch many . and .. errors. But maybe someone missed one. And, lest you chase a red herring, the issue I dealt with does not match your symptoms; see resize_ffs.c for more.)

I'd suggest poking around the oddities with fsdb or some such userland tool.

I'd also have suggested using clri (and then fsck) rather than rmdir to deal with the other directory, but that's water under the bridge now. (rmdir writes to places other than the directory being removed)

Has the place you're getting the odd EBADF errors been created after you deleted the former mystery? I'm wondering if perhaps it got the same inode or the same disk block or some such.

Another possibility occurs to me. You write of a 23-hour RAIDframe reconstruction, so you probably are dealing with large filesystems. When I was working with a 7T filesystem, I found peculiar errors which I eventually tracked down to two effective sectors actually being represented by the same piece of disk, meaning each one held whatever was last written to either of them, something filesystems do not deal with well. In my case this was due to a 32-bit bug, but I'm thinking perhaps you're dealing with something more subtle - perhaps one of your drives works fine both below and above the LBA48 boundary but folds the sector exactly at the boundary onto some other or some such; reading quirk lists makes me think such a thing is depressingly plausible.
(If you think that particular problem might be it I have a test program you might want, designed to catch just such things.)
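For readers wondering what a test for "two sectors backed by the same piece of disk" might look like, here is a minimal sketch. It is not disk-check.c and makes no claim about how that program works; it only illustrates the general idea of stamping every block with its own index and then reading everything back. The FoldingDisk class is a toy stand-in invented for the example. Running this against a real device destroys its contents, and on a real device the buffer cache would have to be bypassed or flushed between the two passes.

```python
import io
import struct

BLOCK = 512  # bytes per block in this sketch


def stamp(index, block_size=BLOCK):
    """A block-sized pattern that encodes the block's own index."""
    head = struct.pack(">Q", index)
    return head * (block_size // len(head))


def find_aliased_blocks(dev, nblocks, block_size=BLOCK):
    """Write a unique stamp to each block, then verify every block.

    Returns a list of (block, stamp_found) pairs for blocks that no longer
    hold their own stamp, i.e. something else was written on top of them.
    """
    for i in range(nblocks):
        dev.seek(i * block_size)
        dev.write(stamp(i, block_size))
    dev.flush()
    bad = []
    for i in range(nblocks):
        dev.seek(i * block_size)
        data = dev.read(block_size)
        if data != stamp(i, block_size):
            found = struct.unpack(">Q", data[:8])[0] if len(data) >= 8 else None
            bad.append((i, found))
    return bad


class FoldingDisk:
    """Toy device (hypothetical): block `src` silently lands on block `dst`."""

    def __init__(self, nblocks, src, dst, block_size=BLOCK):
        self._buf = io.BytesIO(bytes(nblocks * block_size))
        self._src, self._dst, self._bs = src, dst, block_size

    def seek(self, pos):
        blk, off = divmod(pos, self._bs)
        if blk == self._src:
            blk = self._dst
        self._buf.seek(blk * self._bs + off)

    def write(self, data):
        return self._buf.write(data)

    def read(self, n):
        return self._buf.read(n)

    def flush(self):
        pass


if __name__ == "__main__":
    print(find_aliased_blocks(FoldingDisk(16, src=8, dst=3), 16))
```

Only one block of an aliased pair fails verification; the stamp found in it names the other.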
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> I'd suggest poking around the oddities with fsdb or some such userland tool.

I'll try that.

> I'd also have suggested using clri (and then fsck) rather than rmdir to deal with the other directory, but that's water under the bridge now. (rmdir writes to places other than the directory being removed)

I'm getting two Bad file descriptor errors, one on a directory and another on a regular file, both in the same directory. What do you suggest to do?

> Has the place you're getting the odd EBADF errors been created after you deleted the former mystery? I'm wondering if perhaps it got the same inode or the same disk block or some such.

I'll have a look at that. I do have photographs of the fsck dealing with the first dot-lacking directory.

> Another possibility occurs to me. You write of a 23-hour RAIDframe reconstruction, so you probably are dealing with large filesystems.

It's a 6TB (5.8) FFSv2 file system on a 12TB RAIDframe level 5 made from five 3TB dk components each consisting of a single GPT partition on a 3TB SAS disc.

> meaning each one held whatever was last written to either of them, something filesystems do not deal with well.

With all due respect to FFS's stability, I would expect more havoc if that were the case.

> perhaps one of your drives works fine both before and above the LBA48 boundary but folds the sector exactly at the boundary onto some other or some such;

Fortunately, the components are SAS discs.

> reading quirk lists makes me think such a thing is depressingly plausible.

Even on SAS?
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> I'm getting two Bad file descriptor errors, one on a directory and another on a regular file, both in the same directory. What do you suggest to do?

Hm. What do you get those errors from? find(1)?

I think the first thing I'd try to do is provoke them deliberately by hand - eg, try using find on a directory one or two levels up rather than the whole filesystem, if possible - and try to capture the error with ktrace. I'm wondering what syscall is producing the error.

You say it's an FFSv2 filesystem; I don't know much about how FFSv2 differs from FFSv1, and it's v1 I know well. So the rest of this will be written for v1, in the hope it's similar enough to v2 for the remarks to be useful.

I'd have a close look at the containing directory. In particular, I'd make sure the d_type values match the types of the pointed-to inodes. (It wouldn't surprise me if two entries with different d_type values but the same inumber could produce surprising results, for example.) I'd also have an intensive look at the entries which produce errors, and at the inodes named by them.

If you just want to repair it, rather than figuring out what's going on, and you can afford to lose what's in the file, I'd suggest clri on all three inodes (containing directory, file, and contained directory), then fsck and fishing things out of lost+found to clean up the damage. (If any two of those inodes are the same, something is definitely corrupt.)

I can't really give full instructions, since this would be an exploratory sort of investigation, with most of it guided by what earlier work found.

> I'll have a look at that. I do have photographs of the fsck dealing with the first dot-lacking directory.

That would be interesting, though I'm not sure how informative it would be; the thing of real interest is something fsck missed and thus probably won't be mentioned in fsck's output.

>> meaning each one held whatever was last written to either of them, something filesystems do not deal with well.
> With all due respect to FFS's stability, I would expect more havoc if that were the case.

Depends on what the blocks get used for. If, for example, each of them happened to be a block of inodes, nothing will happen until inodes that happen to end up on the same piece of disk get used - and the defaults, in my experience, provide _way_ more inodes than needed, so that could be a rare event, and will strike only a handful of inodes in any case. If they happen to be data blocks, nothing will happen except that file contents will get corrupted. It's when they're indirect blocks or superblocks, or one's inodes and the other isn't, that I'd expect serious havoc.

> Fortunately, the components are SAS discs.
>> reading quirk lists makes me think such a thing is depressingly plausible.
> Even on SAS?

Well, I wouldn't totally rule it out - at this point there's very little I'd totally rule out - but I'd definitely investigate other possibilities first.

If you have the space - which you may well not, given how large the filesystem is - I'd suggest capturing a snapshot of the filesystem for investigation, then look at the live copy with an eye to optimizing for time-to-repair, with understanding deferred to later investigations on the copy.
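The d_type check suggested in this message can be done by hand, but a small sketch may make the rule concrete. On BSD systems the directory entry type is derived from the inode mode by the IFTODT macro in sys/dirent.h, i.e. the file-type bits of the mode shifted right by 12. The helper names below are invented for the example; the sample modes are the directory mode from the fsdb output in this thread and an assumed ordinary file mode.

```python
import stat

# Directory entry type codes as defined in <sys/dirent.h>.
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8


def mode_to_dtype(mode):
    """Equivalent of IFTODT(mode): file-type bits of the mode, shifted down."""
    return stat.S_IFMT(mode) >> 12


def dtype_matches(d_type, mode):
    """True if a directory entry's d_type agrees with the inode's mode."""
    return d_type == mode_to_dtype(mode)


if __name__ == "__main__":
    # MODE=40770 (octal) is the containing directory from the fsdb output.
    print(dtype_matches(DT_DIR, 0o40770))
    # An assumed regular file mode, for comparison.
    print(dtype_matches(DT_REG, 0o100644))
    # A cleared inode has mode 0, so no directory or regular entry can match it.
    print(dtype_matches(DT_DIR, 0))
```

An entry marked as a directory or regular file that points at an inode whose mode is 0 fails this check, which is consistent with the "unallocated inode" result reported elsewhere in the thread.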
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> What do you get those errors from? find(1)?

Yes. What I've found out so far (from using fsdb) is that those two directory entries (one directory and one regular file) point to unallocated inodes.
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
>> What do you get those errors from? find(1)?
> Yes. What I've found out so far (from using fsdb) is that those two directory entries (one directory and one regular file) point to unallocated inodes.

Okay, that's weird. fsck should have caught that - indeed, that's what you'll see if you clri inodes.

The plausibility of the different blocks using the same disk sectors scenario just went up in my mind - if an inode block overlaps with a data block, this is one of the possible results. The inode-free bitmap may also be involved; an inode that's marked free but contains sane data, or conversely, might cause such things. I don't know whether fsdb is using the inode-free bitmap or the contents of the inode itself when telling you the inode is unallocated; it might be worth investigating.

There are other things that can do this, too; it might be a good idea to check, if you haven't already, that you don't have two partitions overlapping.
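The overlapping-partitions check is simple interval arithmetic, sketched below. The partition names, start sectors and sizes are invented for illustration; real values would be copied from whatever tool prints your partition table.

```python
def find_overlaps(parts):
    """Return pairs of partition names whose sector ranges overlap.

    `parts` is a list of (name, start_sector, size_in_sectors) tuples.
    """
    ordered = sorted(parts, key=lambda p: p[1])
    overlaps = []
    for i, (name_a, start_a, size_a) in enumerate(ordered):
        end_a = start_a + size_a  # first sector past partition a
        for name_b, start_b, _size_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so nothing later can overlap a
            overlaps.append((name_a, name_b))
    return overlaps


if __name__ == "__main__":
    # HYPOTHETICAL table: "c" starts 30 sectors before "b" ends.
    table = [("a", 0, 100), ("b", 100, 50), ("c", 120, 100)]
    print(find_overlaps(table))
```

Partitions that merely touch (one starts exactly where the previous one ends) are not reported.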
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> Has the place you're getting the odd EBADF errors been created after you deleted the former mystery?

Probably yes (fsdb didn't give me the birth time, but ctime and mtime are identical and later than the last fsck/deletion of the missing-dot/dotdot directory).

> I'm wondering if perhaps it got the same inode or the same disk block or some such.

No. The directory that twice caused a panic because of missing dot/dotdot was inode 463357837, the one with the two entries pointing to unallocated inodes is inode 158304666, the two unallocated inodes being 158304792 and 158304678.
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
>> I'm wondering if perhaps it got the same inode or the same disk block or some such.
> No. The directory that twice caused a panic because of missing dot/dotdot was inode 463357837, the one with the two entries pointing to unallocated inodes is inode 158304666, the two unallocated inodes being 158304792 and 158304678.

I don't know whether fsdb can give it to you easily, but I'd be curious what disk sectors those four inodes fall into. In particular, if the first one is close to the last two, that reinforces the blocks overlapping theory.

In hex, those inumbers are (in order of appearance above) 1b9e478d, 96f899a, 96f8a18, and 96f89a6. The two unallocated inodes could fall into the same disk block, if your blocks are big enough. FFSv2 inodes are 256 bytes; FFSv1 blocks max out at 64K, but in a quick glance I haven't found the maximum FFSv2 size. If your filesystem uses 256K or larger blocks, those two inodes fall into the same block; otherwise, not. If they do, it might be interesting to look at the rest of that block, in case its contents look recognizable.
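The hex conversion and the "256K or larger" figure above can be reproduced with a few lines of arithmetic. This sketch assumes 256-byte inodes, as stated in the message, and additionally assumes that the number of inodes per cylinder group is a multiple of the number of inodes per block, so that dividing the inode number by inodes-per-block identifies the block. The function names are invented for the example.

```python
INODE_SIZE = 256  # bytes per FFSv2 inode, as stated in the thread

# Inode numbers quoted in the thread, in order of appearance.
PANIC_DIR = 463357837
ODD_DIR = 158304666
UNALLOCATED = (158304792, 158304678)


def inodes_per_block(block_size):
    """How many inodes fit into one file system block."""
    return block_size // INODE_SIZE


def same_inode_block(a, b, block_size):
    """True if inodes a and b share a block (under the alignment assumption)."""
    n = inodes_per_block(block_size)
    return a // n == b // n


if __name__ == "__main__":
    for ino in (PANIC_DIR, ODD_DIR) + UNALLOCATED:
        print(ino, format(ino, "x"))
    for kib in (16, 64, 128, 256):
        shared = same_inode_block(*UNALLOCATED, kib * 1024)
        print(f"{kib}K blocks: unallocated inodes share a block: {shared}")
```

The two unallocated inodes are 114 inode slots apart, so they can only share a block once a block holds well over a hundred inodes and the block boundaries happen to fall outside that gap.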
Re: panic: bad dir: mangled entry, fsck: missing dot/dotdot
> If your filesystem uses 256K or larger blocks, those two inodes fall into the same block;

No, it uses 16k blocks.