Re: readdir/telldir/seekdir problem (i think)
On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. I have a patch for review at https://reviews.freebsd.org/D2410 which fixes this special case needed for samba, and improves the behaviour generally, however the real fix is to utilise cookies throughout the VFS and library so that this workaround is not needed. The patch, when installed, allows Samba's torture tests to succeed on FreeBSD 8 through 11 here's a little test program #include #include #include #include #include #include #include #include #include #include #define CHUNKSIZE 5 #define TOTALFILES 40 static void SeekDir(DIR *dirp, long loc) { printf("Seeking back to location %ld\n", loc); seekdir(dirp, loc); } static long TellDir(DIR *dirp) { long loc; loc = telldir(dirp); printf("telldir assigned location %ld\n", loc); return (loc); } int main(int argc, char *argv[]) { DIR*dirp; inti; intj; longoffset = 0, prev_offset = 0; char *files[100]; charfilename[100]; intfd; struct dirent *dp = NULL; if (chdir("./test2") != 0) { err(EX_OSERR, "chdir"); } /*/ /* Put a set of sample files in the target directory */ /*/ for (i=1; i < TOTALFILES ; i++) { sprintf(filename, "file-%d", i); fd = open(filename, O_CREAT, 0666); if (fd == -1) { err(EX_OSERR, "open"); } close(fd); } dirp = opendir("."); offset = TellDir(dirp); for (i = 0; i < 20; i++) files[i] = malloc(20); /***/ /* enumerate and delete small sets of files, one group */ /* at a time. */ /***/ do { /*/ /* Read in up to CHUNKSIZE file names*/ /* i will be the number of files we hold */ /*/ for (i = 0; i < CHUNKSIZE; i++) { if ((dp = readdir(dirp)) != NULL) { strcpy(files[i], dp->d_name); printf("readdir (%ld) returned file %s\n", offset, files[i]); prev_offset = offset; offset = TellDir(dirp); } else { printf("readdir returned null\n"); break; } } // /* Simuate the last entry not fitting into our (samba's) buffer */ /* If we read someting in on the last slot, push it back*/ /* Pretend it didn't fit. This is approximately what SAMBA does.*/ // if (dp != NULL) { /* Step back */ SeekDir(dirp, prev_offset); offset = TellDir(dirp); i--; printf("file %s returned\n", files[i]); } /*/ /* i is the number of names we have left.*/ /* Delete them. */ /*/ for (j = 0; j < i; j++) { if (*files[j] == '.') { printf ("skipping %s\n", files[j]); } else { printf("Unlinking file %s\n", files[j]); if (unlink(files[j]) != 0) { err(EX_OSERR, "unlink"); } } } } while (dp != NULL); closedir(dirp); //chdir(".."); } The program is simulating what Samba does when fails. (doing a recursive delete of a directory) What it does is reads a chunk of names using readdir() until it's (small) buffer s full, then it uses seekdir() to seek back before the last entry it read, (which fails to fit), theortically leaving it for the next read. It then deletes the entries it found and repeats the cycle. Eventually it should have found all the files in the directory and deleted them. Except that it doesn't. What actually happens is that some files are not enumerated, even though the seekdir() should have made the readdir() find them. for added fun. the FIRST seekdir appears to work. but all subsequent ones don't. It behaves this way in -current , all the way back to at least 8.0. if there's a bug in my program please let me know, but samba has the same problem.. e.g. on freeNAS. to use the program make a directory called "./test2" and then run it in the current directory.. It fills it with files and then tried to (fails) delete them in small batches. here's some (annotated) output: ./testit t
Re: readdir/telldir/seekdir problem (i think)
John Baldwin wrote: > On Thursday, April 30, 2015 02:56:25 PM Julian Elischer wrote: > > We really need to do something because the current system is really > > broken. > > And the fact that dirent has *32 bit* inode number in it was a > > shock.. I'd presumed > > that had gone the way of the dinosaurs and dodo. > > I think 11 needs to have a new dirent structure given out by a new > > syscall. > > (old one still present for compat reasons). Whether we need a > > readdir64() and friends > > I have not yet decided. Maybe it's time to bump libc's number again > > :-) > > This is the entire point of the ino64 branch (and project): to > rototill > struct stat and related structures so we have one ABI jump instead of > lots of separate ABI jumps. It bumps ino_t to 64 bits, dev_t to 32 > (IIRC), > adds d_off to dirent, etc. I believe the branch is able to do it all > with > symbol versioning rather than bumping libc. However, this is why > several of > us keep harping on this as the real long-term solution. :) > Yep. Btw, some of the stuff I did for the 64bit d_fileno patch I have isn't in projects/ino64 because they felt it could go in after the main patch. I haven't looked at how the new getdirentries() syscall + opendir() and friends actually need to change to use d_off/d_cookie yet, but will do so. (Btw, some like d_cookie since they feel it is more representative of what it is while others prefer d_off because that is what others used. I, personally, couldn't care less what it ends up being called.) I will admit to having some concern w.r.t. progress on this. Last I heard Gleb Kurtsou didn't have time to work on it and he's the main author (from 2011, if I recall correctly). I agree we really need to figure out how to get this in FreeBSD11. rick ps: With a naming change of dirent->dirent64, my patch could go into head before projects/ino64, since all that happens is the the high order 32bits of d_fileno == 0 until projects/ino64 goes in. If others think it appropriate, I can do that. > -- > John Baldwin > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Thursday, April 30, 2015 02:56:25 PM Julian Elischer wrote: > We really need to do something because the current system is really > broken. > And the fact that dirent has *32 bit* inode number in it was a > shock.. I'd presumed > that had gone the way of the dinosaurs and dodo. > I think 11 needs to have a new dirent structure given out by a new > syscall. > (old one still present for compat reasons). Whether we need a > readdir64() and friends > I have not yet decided. Maybe it's time to bump libc's number again :-) This is the entire point of the ino64 branch (and project): to rototill struct stat and related structures so we have one ABI jump instead of lots of separate ABI jumps. It bumps ino_t to 64 bits, dev_t to 32 (IIRC), adds d_off to dirent, etc. I believe the branch is able to do it all with symbol versioning rather than bumping libc. However, this is why several of us keep harping on this as the real long-term solution. :) -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/25/15 5:52 AM, Jilles Tjoelker wrote: On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: Yes, this isn't at all safe. There's no guarantee whatsoever that the offset on the directory fd that isn't something returned by getdirentries has any meaning. In particular, the size of the directory entry in a random filesystem might be a different size than the structure returned by getdirentries (since it converts things into a FS-independent format). This might work for UFS by accident, but this is probably why ZFS doesn't work. Turns out they are all broken in the face of deletes. I managed to make a patch that fixes things in the simple case of "backing up by one (and only one time) record", which is what samba needs to do. Samba reads until the latest entry won't fit (buffer is full) and then effectively pushes the last one back by doing a seekdir back one record. It then sends the full buffer back to the client and continus reading at the next record, which is the one it pushed back. It turs out that because the buffer holding the last dirent presented is always still resident in memory, it is ALWAYS possible to slip the 'current entry' pointer back ONE entry, (but no more). This unbreaks samba, even in the case of deletes. (at least with UFS and ZFS). I'll make the patch available when I've tested it more and cleaned it up. However, this might be properly fixed by the thing that ino64 is doing where each directory entry returned by getdirentries gives you a seek offset that you _can_ directly seek to (as opposed to seeking to the start of the block and then walking forward N entries until you get an inter-block entry that is the same). The ino64 branch only reserves space for d_off and does not use it in any way. This is appropriate since actually using d_off is a major feature addition. d_cookie please.. :-) A proper d_off would still be useful even if UFS's readdir keeps masking off the offset so a directory read always starts at the beginning of a 512-byte directory block, since this allows more distinct offset values than safely using getdirentries()'s *basep. With d_off, one outer loop must read at least one directory block to avoid spinning indefinitely, while using getdirentries()'s *basep requires reading the whole getdirentries() buffer. not sure I follow what you mean by that. Some Linux filesystems go further and provide a unique d_off for each entry. yes ZFS provides that. we currently don't export it beyond teh VFS, but there is code that is supposed to supply a cookie for each entry. It could be put into the dirent structure should we want to do that. Currently it is put into a separate memory array in parallel to the dirents, except that we give it a NULL pointer, which means "we don't want cookies thanks" . We really need to do something because the current system is really broken. And the fact that dirent has *32 bit* inode number in it was a shock.. I'd presumed that had gone the way of the dinosaurs and dodo. I think 11 needs to have a new dirent structure given out by a new syscall. (old one still present for compat reasons). Whether we need a readdir64() and friends I have not yet decided. Maybe it's time to bump libc's number again :-) Another idea would be to store the last d_ino instead of dd_loc into the struct ddloc. On seekdir(), this would seek to loc_seek as before and skip entries until that d_ino is found, or to the start of the buffer if not found (and possibly return some entries again that should not be returned, but Samba copes with that). I didn't see it coping with that.. an earlier version of my patch suffered from duplicate entries in samba. it is also dangerous because multiple files linked to the same inode would get confused. it may be possible however to link the telldir cookies to a HASH of the filename and inode number. (assuming of course we don't have cookies directly from the fiesystem itself.) the current scheme of seeking to a buffer boundary and seeking forward in the buffer looking for a matching entry would become quite robust using this scheme, assuming that the lseek actually worked on that file system. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/28/15 4:03 AM, Rick Macklem wrote: Julian Elischer wrote: On 4/25/15 4:28 AM, John Baldwin wrote: On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: On 4/25/15 1:30 AM, Julian Elischer wrote: On 4/24/15 10:59 PM, John Baldwin wrote: Index: head/lib/libc/gen/telldir.c === --- head/lib/libc/gen/telldir.c (revision 281929) +++ head/lib/libc/gen/telldir.c (working copy) @@ -101,8 +101,10 @@ return; if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek) return; - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); - dirp->dd_seek = lp->loc_seek; + if (lp->loc_seek != dirp->dd_seek) { + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); + dirp->dd_seek = lp->loc_seek; + } yes I did that yesterday but it still fails when you transition blocks.. (badly). I also tried bigger blocks.. also fails (eventually) I did find a way to make it work... you had to seek back to the first block you deleted on each set.. then work forward from there again.. unfortunately since I'm trying to make a microsoft program not fail (via samba) I have no control over how it does things and seekdir doesn't know what was deleted anyway... (so the fix is fine for the test program but not for real life) I think I can make the BSD one act like the linux one by changing the lseek being done to use the offset (loc) plus the buffer seek address of the target, instead of just going for the buffer base and stepping forward through the entries.. maybe tomorrow. The following conditional code makes ours behave the same as the linux one. it breaks several 'rules' but works where ours is clean but fails.. as Rick said.. "maybe that's what we should do too." this is at the end of seekdir() The new code does what linux does.. and shouldn't work.. but does // at least in the limited conditions I need it to. // We'll probably need to do this at work...: The original code is what we have now, but gets mightily confused sometimes. // This is clean(er) but fails in specific situations(when doing commands // from Microft windows, via samba). root@vps1:/tmp # diff -u dir.c.orig dir.c --- dir.c.orig2015-04-24 11:29:36.855317000 -0700 +++ dir.c2015-04-24 11:15:49.05850 -0700 @@ -1105,6 +1105,13 @@ dirp->dd_loc = lp->loc_loc; return; } +#ifdef GLIBC_SEEK +(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc, SEEK_SET); +dirp->dd_seek = lp->loc_seek + lp->loc_loc; +dirp->dd_loc = 0; +lp->loc_seek = dirp->dd_seek; +lp->loc_loc = 0; +#else (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); dirp->dd_seek = lp->loc_seek; dirp->dd_loc = 0; @@ -1114,6 +1121,7 @@ if (dp == NULL) break; } +#endif } Yes, this isn't at all safe. There's no guarantee whatsoever that the offset on the directory fd that isn't something returned by getdirentries has any meaning. In particular, the size of the directory entry in a random filesystem might be a different size than the structure returned by getdirentries (since it converts things into a FS-independent format). This might work for UFS by accident, but this is probably why ZFS doesn't work. However, this might be properly fixed by the thing that ino64 is doing where each directory entry returned by getdirentries gives you a seek offset that you _can_ directly seek to (as opposed to seeking to the start of the block and then walking forward N entries until you get an inter-block entry that is the same). I just made the stunning discovery that our seekdir/readdir/telldir code in libc works with FreeBSD 8.0. so maybe the problem is that the kernel changed it's behaviour, and no-one thought to fix libc.. (at least it works on one of our 8.0 base appliances.. I'll do more testing tomorrow.. it's past midnight.) actually that was a mistake.. it fails on 8.0 as much as it fails on 10.x and 11. the patch above also fixes it on 8.0. (UFS and ZFS) I suspect that pre-r252438 systems work better for UFS than r252438 or later. That patch changed ufs_readdir() so that it no longer returned the on-disk directory structure. (Among other things, it added code that skipped over d_ino == 0 entries.) yes but it was broken even before that. basically here's the difference between what Linux (and mu patched code) does and what we do.. in Linux they seek directly to the base address of the block PLUS the offset of the entry being seeked. The filesystem somehow correctly interprets this as teh start of the correct entry. I guess it must return the correct information in the dirent structure in the first place. in the unmodified BSD version, we seek to the start of the 'block' (actually the start of the buff
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 4:28 AM, John Baldwin wrote: > > On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: > >> On 4/25/15 1:30 AM, Julian Elischer wrote: > >>> On 4/24/15 10:59 PM, John Baldwin wrote: > Index: head/lib/libc/gen/telldir.c > === > --- head/lib/libc/gen/telldir.c (revision 281929) > +++ head/lib/libc/gen/telldir.c (working copy) > @@ -101,8 +101,10 @@ > return; > if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == > dirp->dd_seek) > return; > - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, > SEEK_SET); > - dirp->dd_seek = lp->loc_seek; > + if (lp->loc_seek != dirp->dd_seek) { > + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, > SEEK_SET); > + dirp->dd_seek = lp->loc_seek; > + } > >>> yes I did that yesterday but it still fails when you transition > >>> blocks.. (badly). > >>> > >>> I also tried bigger blocks.. also fails (eventually) > >>> > >>> I did find a way to make it work... you had to seek back > >>> to the first block you deleted on each set.. > >>> then work forward from there again.. unfortunately since > >>> I'm trying to make a microsoft program not fail (via samba) > >>> I have no control over how it does things and seekdir doesn't > >>> know what was deleted anyway... (so the fix is fine for the > >>> test program but not for real life) > >>> > >>> I think I can make the BSD one act like the linux one by changing > >>> the lseek being done to use the offset (loc) plus the buffer seek > >>> address of the target, instead of just going for the buffer base > >>> and > >>> stepping forward through the entries.. > >>> > >>> maybe tomorrow. > >>> > >> The following conditional code makes ours behave the same as the > >> linux > >> one. > >> it breaks several 'rules' but works where ours is clean but > >> fails.. > >> as Rick said.. "maybe that's what we should do too." > >> > >> > >> this is at the end of seekdir() > >> > >> > >> The new code does what linux does.. and shouldn't work.. but does > >> // at least in the limited conditions I need it to. > >> // We'll probably need to do this at work...: > >> > >> > >> The original code is what we have now, but gets mightily confused > >> sometimes. > >> // This is clean(er) but fails in specific > >> situations(when > >> doing commands > >> // from Microft windows, via samba). > >> > >> > >> root@vps1:/tmp # diff -u dir.c.orig dir.c > >> --- dir.c.orig2015-04-24 11:29:36.855317000 -0700 > >> +++ dir.c2015-04-24 11:15:49.05850 -0700 > >> @@ -1105,6 +1105,13 @@ > >>dirp->dd_loc = lp->loc_loc; > >>return; > >>} > >> +#ifdef GLIBC_SEEK > >> +(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc, > >> SEEK_SET); > >> +dirp->dd_seek = lp->loc_seek + lp->loc_loc; > >> +dirp->dd_loc = 0; > >> +lp->loc_seek = dirp->dd_seek; > >> +lp->loc_loc = 0; > >> +#else > >>(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); > >>dirp->dd_seek = lp->loc_seek; > >>dirp->dd_loc = 0; > >> @@ -1114,6 +1121,7 @@ > >>if (dp == NULL) > >>break; > >>} > >> +#endif > >>} > > Yes, this isn't at all safe. There's no guarantee whatsoever that > > the offset on the directory fd that isn't something returned by > > getdirentries has any meaning. In particular, the size of the > > directory entry in a random filesystem might be a different size > > than the structure returned by getdirentries (since it converts > > things into a FS-independent format). > > > > This might work for UFS by accident, but this is probably why ZFS > > doesn't work. > > > > However, this might be properly fixed by the thing that ino64 is > > doing where each directory entry returned by getdirentries gives > > you a seek offset that you _can_ directly seek to (as opposed to > > seeking to the start of the block and then walking forward N > > entries until you get an inter-block entry that is the same). > I just made the stunning discovery that our seekdir/readdir/telldir > code in libc works with > FreeBSD 8.0. > so maybe the problem is that the kernel changed it's behaviour, and > no-one thought to fix libc.. > > (at least it works on one of our 8.0 base appliances.. I'll do more > testing tomorrow.. it's past midnight.) > I suspect that pre-r252438 systems work better for UFS than r252438 or later. That patch changed ufs_readdir() so that it no longer returned the on-disk directory structure. (Among other things, it added code that skipped over d_ino == 0 entries.) As such, r252438 and later systems have UFS where the "logical" offset of a directory entry returned by getdirentries() isn't the same as the "physical" offset for it
Re: readdir/telldir/seekdir problem (i think)
On 4/25/15 4:28 AM, John Baldwin wrote: On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: On 4/25/15 1:30 AM, Julian Elischer wrote: On 4/24/15 10:59 PM, John Baldwin wrote: Index: head/lib/libc/gen/telldir.c === --- head/lib/libc/gen/telldir.c (revision 281929) +++ head/lib/libc/gen/telldir.c (working copy) @@ -101,8 +101,10 @@ return; if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek) return; - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); - dirp->dd_seek = lp->loc_seek; + if (lp->loc_seek != dirp->dd_seek) { + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); + dirp->dd_seek = lp->loc_seek; + } yes I did that yesterday but it still fails when you transition blocks.. (badly). I also tried bigger blocks.. also fails (eventually) I did find a way to make it work... you had to seek back to the first block you deleted on each set.. then work forward from there again.. unfortunately since I'm trying to make a microsoft program not fail (via samba) I have no control over how it does things and seekdir doesn't know what was deleted anyway... (so the fix is fine for the test program but not for real life) I think I can make the BSD one act like the linux one by changing the lseek being done to use the offset (loc) plus the buffer seek address of the target, instead of just going for the buffer base and stepping forward through the entries.. maybe tomorrow. The following conditional code makes ours behave the same as the linux one. it breaks several 'rules' but works where ours is clean but fails.. as Rick said.. "maybe that's what we should do too." this is at the end of seekdir() The new code does what linux does.. and shouldn't work.. but does // at least in the limited conditions I need it to. // We'll probably need to do this at work...: The original code is what we have now, but gets mightily confused sometimes. // This is clean(er) but fails in specific situations(when doing commands // from Microft windows, via samba). root@vps1:/tmp # diff -u dir.c.orig dir.c --- dir.c.orig2015-04-24 11:29:36.855317000 -0700 +++ dir.c2015-04-24 11:15:49.05850 -0700 @@ -1105,6 +1105,13 @@ dirp->dd_loc = lp->loc_loc; return; } +#ifdef GLIBC_SEEK +(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc, SEEK_SET); +dirp->dd_seek = lp->loc_seek + lp->loc_loc; +dirp->dd_loc = 0; +lp->loc_seek = dirp->dd_seek; +lp->loc_loc = 0; +#else (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); dirp->dd_seek = lp->loc_seek; dirp->dd_loc = 0; @@ -1114,6 +1121,7 @@ if (dp == NULL) break; } +#endif } Yes, this isn't at all safe. There's no guarantee whatsoever that the offset on the directory fd that isn't something returned by getdirentries has any meaning. In particular, the size of the directory entry in a random filesystem might be a different size than the structure returned by getdirentries (since it converts things into a FS-independent format). This might work for UFS by accident, but this is probably why ZFS doesn't work. However, this might be properly fixed by the thing that ino64 is doing where each directory entry returned by getdirentries gives you a seek offset that you _can_ directly seek to (as opposed to seeking to the start of the block and then walking forward N entries until you get an inter-block entry that is the same). I just made the stunning discovery that our seekdir/readdir/telldir code in libc works with FreeBSD 8.0. so maybe the problem is that the kernel changed it's behaviour, and no-one thought to fix libc.. (at least it works on one of our 8.0 base appliances.. I'll do more testing tomorrow.. it's past midnight.) ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 9:39 AM, Rick Macklem wrote: > > Jilles Tjoelker wrote: > >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > >>> Yes, this isn't at all safe. There's no guarantee whatsoever > >>> that > >>> the offset on the directory fd that isn't something returned by > >>> getdirentries has any meaning. In particular, the size of the > >>> directory entry in a random filesystem might be a different size > >>> than the structure returned by getdirentries (since it converts > >>> things into a FS-independent format). > >>> This might work for UFS by accident, but this is probably why ZFS > >>> doesn't work. > >>> However, this might be properly fixed by the thing that ino64 is > >>> doing where each directory entry returned by getdirentries gives > >>> you a seek offset that you _can_ directly seek to (as opposed to > >>> seeking to the start of the block and then walking forward N > >>> entries until you get an inter-block entry that is the same). > >> The ino64 branch only reserves space for d_off and does not use it > >> in > >> any way. This is appropriate since actually using d_off is a major > >> feature addition. > >> > > Well, at some point ino64 will need to define a new > > getdirentries(2) > > syscall and I believe this new syscall can have > > different/additional > > arguments. > yes, posix only specifies 2 mandatory fields (d_ino and d_name) and > everything else is implementation dependent. > > I'd suggest that the new gtedirentries(2) syscall should return a > > flag to indicate that the underlying file system is filling in > > d_off. > > Then the libc functions can use d_off if it it available. > > (They will still need to "work" at least as well as they do now if > > the file system doesn't support d_off. The old getdirentries(2) > > syscall > > will be returning the old/current "struct dirent" which doesn't > > have > > the field anyhow.) > > > > Another bit of fun is that the argument for seekdir()/telldir() is > > a > > long and ends up 32bits for some arches. d_off is 64bits, since > > that > > is what some file systems require. > what does linux use? Btw, I found this: https://bugs.centos.org/view.php?id=5496 It appears that Linux has been having fun with this too, at least for NFS. I still think that reading the whole directory is the only way to fix NFS. (Unfortunately, they don't say how the Linux distros fixed it.;-) Have fun with it, rick > -- >In glibc up to version 2.1.1, the return type of telldir() was > off_t. > POSIX.1-2001 specifies long, and this is the type used since > glibc > 2.1.2. > > also from the linux man page: this is interesting.. > > > In early filesystems, the value returned by telldir() was a > simple > file offset within a directory. Modern filesystems use tree > or hash > structures, rather than flat tables, to represent > directories. On > such filesystems, the value returned by telldir() (and used > internally by readdir(3)) is a "cookie" that is used by the > implementation to derive a position within a directory. > Application > programs should treat this strictly as an opaque value, > making no > assumptions about its contents. > -- > but glibc uses the contents in a nonopaque (and possibly wrong) way > itself in seekdir. . > (not following their own advice.) > > > > Maybe the library code can only use d_off if it is a 64bit arch and > > the file system is filling it in. (Or maybe the library can keep > > track > > of 32<->64bit mappings for the offsets. I haven't looked at the > > libc > > functions for a while, so I can't remember what they keep track > > of.) > > one supposes a 32 bit system would not have such large file systems > on > it.. > (maybe?) > > > > rick > > > >> A proper d_off would still be useful even if UFS's readdir keeps > >> masking > >> off the offset so a directory read always starts at the beginning > >> of > >> a > >> 512-byte directory block, since this allows more distinct offset > >> values > >> than safely using getdirentries()'s *basep. With d_off, one outer > >> loop > >> must read at least one directory block to avoid spinning > >> indefinitely, > >> while using getdirentries()'s *basep requires reading the whole > >> getdirentries() buffer. > >> > >> Some Linux filesystems go further and provide a unique d_off for > >> each > >> entry. > >> > >> Another idea would be to store the last d_ino instead of dd_loc > >> into > >> the > >> struct ddloc. On seekdir(), this would seek to loc_seek as before > >> and > >> skip entries until that d_ino is found, or to the start of the > >> buffer > >> if > >> not found (and possibly return some entries again that should not > >> be > >> returned, but Samba copes with that). > >> > >> -- > >> Jilles Tjoelker > >> ___ > >> freebsd-current@freebsd
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 9:39 AM, Rick Macklem wrote: > > Jilles Tjoelker wrote: > >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > >>> Yes, this isn't at all safe. There's no guarantee whatsoever > >>> that > >>> the offset on the directory fd that isn't something returned by > >>> getdirentries has any meaning. In particular, the size of the > >>> directory entry in a random filesystem might be a different size > >>> than the structure returned by getdirentries (since it converts > >>> things into a FS-independent format). > >>> This might work for UFS by accident, but this is probably why ZFS > >>> doesn't work. > >>> However, this might be properly fixed by the thing that ino64 is > >>> doing where each directory entry returned by getdirentries gives > >>> you a seek offset that you _can_ directly seek to (as opposed to > >>> seeking to the start of the block and then walking forward N > >>> entries until you get an inter-block entry that is the same). > >> The ino64 branch only reserves space for d_off and does not use it > >> in > >> any way. This is appropriate since actually using d_off is a major > >> feature addition. > >> > > Well, at some point ino64 will need to define a new > > getdirentries(2) > > syscall and I believe this new syscall can have > > different/additional > > arguments. > yes, posix only specifies 2 mandatory fields (d_ino and d_name) and > everything else is implementation dependent. > > I'd suggest that the new gtedirentries(2) syscall should return a > > flag to indicate that the underlying file system is filling in > > d_off. > > Then the libc functions can use d_off if it it available. > > (They will still need to "work" at least as well as they do now if > > the file system doesn't support d_off. The old getdirentries(2) > > syscall > > will be returning the old/current "struct dirent" which doesn't > > have > > the field anyhow.) > > > > Another bit of fun is that the argument for seekdir()/telldir() is > > a > > long and ends up 32bits for some arches. d_off is 64bits, since > > that > > is what some file systems require. Now that I've taken a look at the libc code, I can see that loc_XX structures handle the telldir/seekdir stuff (mapping the seek position plus offset within a block to an index). I think d_off could be saved there as well (if the file system provides it) and the new getdirentries() could take a "physical offset" argument where this d_off could be passed down to the file system. I think this would allow the problem to be handled properly "in general". It still doesn't quite solve the NFS case. - For NFS, a server is supposed to have a "directory_cookie_verifier", which allows the server to decide if a "directory offset cookie" is still valid. --> This has never been implemented for the FreeBSD server and it isn't easy. In part, real NFS servers don't do what the RFC describes, at least not in an obvious way. They "understand" if offset cookies still work after removal of an entry and also must store this "directory_offset_verifier" in the file's attributes. FreeBSD just never checks this verifier. rick > what does linux use? > -- >In glibc up to version 2.1.1, the return type of telldir() was > off_t. > POSIX.1-2001 specifies long, and this is the type used since > glibc > 2.1.2. > > also from the linux man page: this is interesting.. > > > In early filesystems, the value returned by telldir() was a > simple > file offset within a directory. Modern filesystems use tree > or hash > structures, rather than flat tables, to represent > directories. On > such filesystems, the value returned by telldir() (and used > internally by readdir(3)) is a "cookie" that is used by the > implementation to derive a position within a directory. > Application > programs should treat this strictly as an opaque value, > making no > assumptions about its contents. > -- > but glibc uses the contents in a nonopaque (and possibly wrong) way > itself in seekdir. . > (not following their own advice.) > > > > Maybe the library code can only use d_off if it is a 64bit arch and > > the file system is filling it in. (Or maybe the library can keep > > track > > of 32<->64bit mappings for the offsets. I haven't looked at the > > libc > > functions for a while, so I can't remember what they keep track > > of.) > > one supposes a 32 bit system would not have such large file systems > on > it.. > (maybe?) > > > > rick > > > >> A proper d_off would still be useful even if UFS's readdir keeps > >> masking > >> off the offset so a directory read always starts at the beginning > >> of > >> a > >> 512-byte directory block, since this allows more distinct offset > >> values > >> than safely using getdirentries()'s *basep. With d_off, one outer > >> loop > >> must read at least one
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 5:52 AM, Jilles Tjoelker wrote: > > On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > >> Yes, this isn't at all safe. There's no guarantee whatsoever that > >> the offset on the directory fd that isn't something returned by > >> getdirentries has any meaning. In particular, the size of the > >> directory entry in a random filesystem might be a different size > >> than the structure returned by getdirentries (since it converts > >> things into a FS-independent format). > >> This might work for UFS by accident, but this is probably why ZFS > >> doesn't work. I took a quick look at UFS, ZFS and NFS: - UFS appears to use the offset as a block + offset within the block, so it is understandable that it works for the simple "seek to byte offset" case. - One oddity here is that the loop that copies out directory entries (about line#2188 in ufs_vnops.c) skips over d_ino == 0. Then uiomove() copies the data out and offset is updated. --> Seems to me that uio_offset "skips" entries with d_ino == 0, so it isn't the actual byte offset in the on-disk directory. *** Somehow, this seems it would break seekdir/telldir at some point? (This handling of uio_offset seems just plain broken to me?) - ZFS appears to number directory entries 1, 2, 3, 4, ... But, except for 1, 2, 3 (which are somewhat special, indicating ".", ".." and one other I've already forgotten, maybe ".zfs"), the rest are "serialized", which seems to mean that the byte offset is kept in a list along with some hash function used to find them. --> I didn't really follow the code, but it appears to "offset >> 28", so I don't think the low order 28bits are used to get the directory block. As such, I'm not too surprised that adding in the byte position within the block when doing lseek() doesn't break it. (I didn't really follow the stuff in zap_micro.c, so I could be all wrong here.) - I don't know if removal of an entry changes these indices 3, 4, ...? - NFS gets the block and mostly ignores the byte position within the block. (ie. It does "uio_offset / NFS_DIRBLKSIZ" and then uses that to get a buffer cache block. This will, in turn, look up this value to find a cached cookie.) --> The only effect of "uio_offset % NFS_DIRBLKSIZ" != 0 is that it will return data from this logical position within a block for the readdir. --> If a Readdir RPC is done for the cached cookie after a Remove, it is up to the NFS server's implementation w.r.t. what you'll get back. Not sure if the above is actually useful, at least until all file systems are evaluated, rick. > >> However, this might be properly fixed by the thing that ino64 is > >> doing where each directory entry returned by getdirentries gives > >> you a seek offset that you _can_ directly seek to (as opposed to > >> seeking to the start of the block and then walking forward N > >> entries until you get an inter-block entry that is the same). > > The ino64 branch only reserves space for d_off and does not use it > > in > > any way. This is appropriate since actually using d_off is a major > > feature addition. > > > > A proper d_off would still be useful even if UFS's readdir keeps > > masking > > off the offset so a directory read always starts at the beginning > > of a > > 512-byte directory block, since this allows more distinct offset > > values > > than safely using getdirentries()'s *basep. With d_off, one outer > > loop > > must read at least one directory block to avoid spinning > > indefinitely, > > while using getdirentries()'s *basep requires reading the whole > > getdirentries() buffer. > > > > Some Linux filesystems go further and provide a unique d_off for > > each > > entry. > > > > Another idea would be to store the last d_ino instead of dd_loc > > into the > > struct ddloc. On seekdir(), this would seek to loc_seek as before > > and > > skip entries until that d_ino is found, or to the start of the > > buffer if > > not found (and possibly return some entries again that should not > > be > > returned, but Samba copes with that). > > yes.. though maybe a hash of hte inode number and the name may be a > better thing to search on.. > > I need to check on your statement about samba to see if its true for > 3.6.. > > > > > > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 4:28 AM, John Baldwin wrote: > > On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: > >> On 4/25/15 1:30 AM, Julian Elischer wrote: > >>> On 4/24/15 10:59 PM, John Baldwin wrote: > On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: > > On 4/23/15 9:54 PM, John Baldwin wrote: > >> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > >>> On 4/23/15 11:20 AM, Julian Elischer wrote: > I'm debugging a problem being seen with samba 3.6. > > basically telldir/seekdir/readdir don't seem to work as > advertised.. > >>> ok so it looks like readdir() (and friends) is totally broken > >>> in > >>> the face > >>> of deletes unless you read the entire directory at once or > >>> reset > >>> to the > >>> the first file before the deletes, or earlier. > >> I'm not sure that Samba isn't assuming non-portable behavior. > >> For example: > >> > >> >From > >> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > >> > >> > >> If a file is removed from or added to the directory after the > >> most recent call > >> to opendir() or rewinddir(), whether a subsequent call to > >> readdir() returns an > >> entry for that file is unspecified. > >> > >> While this doesn't speak directly to your case, it does note > >> that > >> you will > >> get inconsistencies if you scan a directory concurrent with > >> add > >> and remove. > >> > >> UFS might kind of work actually since deletes do not compact > >> the > >> backing > >> directory, but I suspect NFS and ZFS would not work. In > >> addition, our > >> current NFS support for seekdir is pretty flaky and can't be > >> fixed without > >> changes to return the seek offset for each directory entry (I > >> believe that > >> the projects/ino64 patches include this since they are > >> breaking > >> the ABI of > >> the relevant structures already). The ABI breakage makes this > >> a > >> very > >> non-trivial task. However, even if you have that per-item > >> cookie, it is > >> likely meaningless in the face of filesystems that use any > >> sort > >> of more > >> advanced structure than an array (such as trees, etc.) to > >> store > >> directory > >> entries. POSIX specifically mentions this in the rationale > >> for > >> seekdir: > >> > >> One of the perceived problems of implementation is that > >> returning > >> to a given point in a directory is quite difficult to describe > >> formally, in spite of its intuitive appeal, when systems that > >> use > >> B-trees, hashing functions, or other similar mechanisms to > >> order > >> their directories are considered. The definition of seekdir() > >> and > >> telldir() does not specify whether, when using these > >> interfaces, > >> a given directory entry will be seen at all, or more than > >> once. > >> > >> In fact, given that quote, I would argue that what Samba is > >> doing is > >> non-portable. This would seem to indicate that a conforming > >> seekdir could > >> just change readdir to immediately return EOF until you call > >> rewinddir. > > how does readdir know that the directory has been changed? > > fstat? > Undefined. FreeBSD's libc doesn't cache the entire directory > (unless you > are using a union mount), instead it just caches the most recent > call to > getdirentries(). It then serves up entries from that block > until > you hit > the end. When you hit the end it reads the next block, etc. > When you > call telldir(), the kernel saves the seek offset from the start > of the > current block and the offset within that block and returns a > cookie to > you. seekdir() looks up the cookie to find the (seek offset, > block > offset) > tuple. If the location matches the directory's current location > it > doesn't > do anything, otherwise it seeks to the seek offset and reads a > new > block via > getdirentries(). There is no check for seeing if a directory is > changed. Instead, you can only be stale by one "block". When > you > read > a new block it is relative to the directory's state at that > time. > > Rick's suggestion of reusing the block for a seek within a block > would be > fairly easy to implement, I think it would just be: > > Index: head/lib/libc/gen/telldir.c > === > --- head/lib/libc/gen/telldir.c (revision 281929) > +++ head/lib/libc/gen/telldir.c (working copy) > @@ -101,8 +101,10 @@ > return; >
Re: readdir/telldir/seekdir problem (i think)
Jilles Toelker wrote: > On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > > Yes, this isn't at all safe. There's no guarantee whatsoever that > > the offset on the directory fd that isn't something returned by > > getdirentries has any meaning. In particular, the size of the > > directory entry in a random filesystem might be a different size > > than the structure returned by getdirentries (since it converts > > things into a FS-independent format). > > > This might work for UFS by accident, but this is probably why ZFS > > doesn't work. > > > However, this might be properly fixed by the thing that ino64 is > > doing where each directory entry returned by getdirentries gives > > you a seek offset that you _can_ directly seek to (as opposed to > > seeking to the start of the block and then walking forward N > > entries until you get an inter-block entry that is the same). > > The ino64 branch only reserves space for d_off and does not use it in > any way. This is appropriate since actually using d_off is a major > feature addition. > Just fyi, I did a patch: http://people.freebsd.org/~rmacklem/64bitfileno.patch which does fill this field in for the various file systems and copies the new "struct dirent" to "struct dirent32", so it works with an unchanged userland. Since the ino64 folks felt this should come after the ino64 change, I didn't pursue committing it. To commit it to head, it would have to be modified slightly, so that the new structure is called "struct dirent64", so it doesn't screw up userland builds using "struct dirent", but I think that is the only change required for it to go into head/current if people felt that was appropriate. (Oh, and it called d_off "d_cookie", although I have no strong preference w.r.t. what it is called.) If you look at it, you'll see most of the changes are for NFS, but the rest were pretty trivial. rick > A proper d_off would still be useful even if UFS's readdir keeps > masking > off the offset so a directory read always starts at the beginning of > a > 512-byte directory block, since this allows more distinct offset > values > than safely using getdirentries()'s *basep. With d_off, one outer > loop > must read at least one directory block to avoid spinning > indefinitely, > while using getdirentries()'s *basep requires reading the whole > getdirentries() buffer. > > Some Linux filesystems go further and provide a unique d_off for each > entry. > > Another idea would be to store the last d_ino instead of dd_loc into > the > struct ddloc. On seekdir(), this would seek to loc_seek as before and > skip entries until that d_ino is found, or to the start of the buffer > if > not found (and possibly return some entries again that should not be > returned, but Samba copes with that). > > -- > Jilles Tjoelker > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 9:39 AM, Rick Macklem wrote: > > Jilles Tjoelker wrote: > >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > >>> Yes, this isn't at all safe. There's no guarantee whatsoever > >>> that > >>> the offset on the directory fd that isn't something returned by > >>> getdirentries has any meaning. In particular, the size of the > >>> directory entry in a random filesystem might be a different size > >>> than the structure returned by getdirentries (since it converts > >>> things into a FS-independent format). > >>> This might work for UFS by accident, but this is probably why ZFS > >>> doesn't work. > >>> However, this might be properly fixed by the thing that ino64 is > >>> doing where each directory entry returned by getdirentries gives > >>> you a seek offset that you _can_ directly seek to (as opposed to > >>> seeking to the start of the block and then walking forward N > >>> entries until you get an inter-block entry that is the same). > >> The ino64 branch only reserves space for d_off and does not use it > >> in > >> any way. This is appropriate since actually using d_off is a major > >> feature addition. > >> > > Well, at some point ino64 will need to define a new > > getdirentries(2) > > syscall and I believe this new syscall can have > > different/additional > > arguments. > yes, posix only specifies 2 mandatory fields (d_ino and d_name) and > everything else is implementation dependent. > > I'd suggest that the new gtedirentries(2) syscall should return a > > flag to indicate that the underlying file system is filling in > > d_off. > > Then the libc functions can use d_off if it it available. > > (They will still need to "work" at least as well as they do now if > > the file system doesn't support d_off. The old getdirentries(2) > > syscall > > will be returning the old/current "struct dirent" which doesn't > > have > > the field anyhow.) > > > > Another bit of fun is that the argument for seekdir()/telldir() is > > a > > long and ends up 32bits for some arches. d_off is 64bits, since > > that > > is what some file systems require. > what does linux use? > -- >In glibc up to version 2.1.1, the return type of telldir() was > off_t. > POSIX.1-2001 specifies long, and this is the type used since > glibc > 2.1.2. > > also from the linux man page: this is interesting.. > > > In early filesystems, the value returned by telldir() was a > simple > file offset within a directory. Modern filesystems use tree > or hash > structures, rather than flat tables, to represent > directories. On > such filesystems, the value returned by telldir() (and used > internally by readdir(3)) is a "cookie" that is used by the > implementation to derive a position within a directory. > Application > programs should treat this strictly as an opaque value, > making no > assumptions about its contents. > -- > but glibc uses the contents in a nonopaque (and possibly wrong) way > itself in seekdir. . > (not following their own advice.) > I believe that most of the FreeBSD file systems except UFS and ZFS just copy the fields of their internal directory structure to fields in "struct dirent", filling blocks sequentially. (Actually, I only took a quick look, but ZFS might also be this way.) As such, the "offsets" for FreeBSD are byte offsets into these "logical directory" blocks. The problem is (as already discussed) that there is no way to predict how these will change for a given file system when entries are removed/added. (I think the only way to "know" what the modified "logical directory" looks like is to read it again from the beginning, so that all the directory entries go through the conversion to logical again.) UFS and the NFS client ensure that no "struct dirent" crosses a 512byte block boundary. I don't think the other file systems do this. I mention this, since the libc functions can't assume the UFS behaviour for this. (At one time, UFS just "consumed" removed entries into the preceding "struct direct" or set it invalid, if it was the first entry in a 512byte block. This implied that the byte offsets (logical == physical) didn't change for subsequent entries upon a removal. It sounds like UFS is no longer doing this, from one of your posts?) I am curious to see what glibc does, since I had assumed it just read the entire directory at opendir/first-readdir. rick ps: This is what I recall from fooling with "struct dirent" a few months ago and I'm getting old, so it may all be wrong;-) > > > Maybe the library code can only use d_off if it is a 64bit arch and > > the file system is filling it in. (Or maybe the library can keep > > track > > of 32<->64bit mappings for the offsets. I haven't looked at the > > libc > > functions for a while, so I can't remember what they keep track > > of.) > > one supposes a 32
Re: readdir/telldir/seekdir problem (i think)
Julian Elischer wrote: > On 4/25/15 9:39 AM, Rick Macklem wrote: > > Jilles Tjoelker wrote: > >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > >>> Yes, this isn't at all safe. There's no guarantee whatsoever > >>> that > >>> the offset on the directory fd that isn't something returned by > >>> getdirentries has any meaning. In particular, the size of the > >>> directory entry in a random filesystem might be a different size > >>> than the structure returned by getdirentries (since it converts > >>> things into a FS-independent format). > >>> This might work for UFS by accident, but this is probably why ZFS > >>> doesn't work. > >>> However, this might be properly fixed by the thing that ino64 is > >>> doing where each directory entry returned by getdirentries gives > >>> you a seek offset that you _can_ directly seek to (as opposed to > >>> seeking to the start of the block and then walking forward N > >>> entries until you get an inter-block entry that is the same). > >> The ino64 branch only reserves space for d_off and does not use it > >> in > >> any way. This is appropriate since actually using d_off is a major > >> feature addition. > >> > > Well, at some point ino64 will need to define a new > > getdirentries(2) > > syscall and I believe this new syscall can have > > different/additional > > arguments. > yes, posix only specifies 2 mandatory fields (d_ino and d_name) and > everything else is implementation dependent. > > I'd suggest that the new gtedirentries(2) syscall should return a > > flag to indicate that the underlying file system is filling in > > d_off. > > Then the libc functions can use d_off if it it available. > > (They will still need to "work" at least as well as they do now if > > the file system doesn't support d_off. The old getdirentries(2) > > syscall > > will be returning the old/current "struct dirent" which doesn't > > have > > the field anyhow.) > > > > Another bit of fun is that the argument for seekdir()/telldir() is > > a > > long and ends up 32bits for some arches. d_off is 64bits, since > > that > > is what some file systems require. > what does linux use? > -- >In glibc up to version 2.1.1, the return type of telldir() was > off_t. > POSIX.1-2001 specifies long, and this is the type used since > glibc > 2.1.2. > > also from the linux man page: this is interesting.. > > > In early filesystems, the value returned by telldir() was a > simple > file offset within a directory. Modern filesystems use tree > or hash > structures, rather than flat tables, to represent > directories. On > such filesystems, the value returned by telldir() (and used > internally by readdir(3)) is a "cookie" that is used by the > implementation to derive a position within a directory. > Application > programs should treat this strictly as an opaque value, > making no > assumptions about its contents. > -- > but glibc uses the contents in a nonopaque (and possibly wrong) way > itself in seekdir. . > (not following their own advice.) > > > > Maybe the library code can only use d_off if it is a 64bit arch and > > the file system is filling it in. (Or maybe the library can keep > > track > > of 32<->64bit mappings for the offsets. I haven't looked at the > > libc > > functions for a while, so I can't remember what they keep track > > of.) > > one supposes a 32 bit system would not have such large file systems > on > it.. > (maybe?) For NFS, the cookie is always an opaque 64bits. These cookies are cached in the kernel by the client, with one for each "logical UFS-like directory block generated for getdirentries(2)". As such, the NFS case does a 64bit->32bit mapping. (Because of endianness etc, there is no guarantee that most of these cookies are 0 for the high order 32bits.) I need to look at the library code (both glibc and ours) before I understand this better and can say more. Have fun with it, rick ps: The ino64 stuff will never be MFC'd, so it would be nice to "improve" what the libc functions do without use of d_off. > > > > rick > > > >> A proper d_off would still be useful even if UFS's readdir keeps > >> masking > >> off the offset so a directory read always starts at the beginning > >> of > >> a > >> 512-byte directory block, since this allows more distinct offset > >> values > >> than safely using getdirentries()'s *basep. With d_off, one outer > >> loop > >> must read at least one directory block to avoid spinning > >> indefinitely, > >> while using getdirentries()'s *basep requires reading the whole > >> getdirentries() buffer. > >> > >> Some Linux filesystems go further and provide a unique d_off for > >> each > >> entry. > >> > >> Another idea would be to store the last d_ino instead of dd_loc > >> into > >> the > >> struct ddloc. On seekdir(), this would seek to loc_seek as before > >> and > >> skip entries un
Re: readdir/telldir/seekdir problem (i think)
On 4/25/15 9:39 AM, Rick Macklem wrote: Jilles Tjoelker wrote: On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: Yes, this isn't at all safe. There's no guarantee whatsoever that the offset on the directory fd that isn't something returned by getdirentries has any meaning. In particular, the size of the directory entry in a random filesystem might be a different size than the structure returned by getdirentries (since it converts things into a FS-independent format). This might work for UFS by accident, but this is probably why ZFS doesn't work. However, this might be properly fixed by the thing that ino64 is doing where each directory entry returned by getdirentries gives you a seek offset that you _can_ directly seek to (as opposed to seeking to the start of the block and then walking forward N entries until you get an inter-block entry that is the same). The ino64 branch only reserves space for d_off and does not use it in any way. This is appropriate since actually using d_off is a major feature addition. Well, at some point ino64 will need to define a new getdirentries(2) syscall and I believe this new syscall can have different/additional arguments. yes, posix only specifies 2 mandatory fields (d_ino and d_name) and everything else is implementation dependent. I'd suggest that the new gtedirentries(2) syscall should return a flag to indicate that the underlying file system is filling in d_off. Then the libc functions can use d_off if it it available. (They will still need to "work" at least as well as they do now if the file system doesn't support d_off. The old getdirentries(2) syscall will be returning the old/current "struct dirent" which doesn't have the field anyhow.) Another bit of fun is that the argument for seekdir()/telldir() is a long and ends up 32bits for some arches. d_off is 64bits, since that is what some file systems require. what does linux use? -- In glibc up to version 2.1.1, the return type of telldir() was off_t. POSIX.1-2001 specifies long, and this is the type used since glibc 2.1.2. also from the linux man page: this is interesting.. In early filesystems, the value returned by telldir() was a simple file offset within a directory. Modern filesystems use tree or hash structures, rather than flat tables, to represent directories. On such filesystems, the value returned by telldir() (and used internally by readdir(3)) is a "cookie" that is used by the implementation to derive a position within a directory. Application programs should treat this strictly as an opaque value, making no assumptions about its contents. -- but glibc uses the contents in a nonopaque (and possibly wrong) way itself in seekdir. . (not following their own advice.) Maybe the library code can only use d_off if it is a 64bit arch and the file system is filling it in. (Or maybe the library can keep track of 32<->64bit mappings for the offsets. I haven't looked at the libc functions for a while, so I can't remember what they keep track of.) one supposes a 32 bit system would not have such large file systems on it.. (maybe?) rick A proper d_off would still be useful even if UFS's readdir keeps masking off the offset so a directory read always starts at the beginning of a 512-byte directory block, since this allows more distinct offset values than safely using getdirentries()'s *basep. With d_off, one outer loop must read at least one directory block to avoid spinning indefinitely, while using getdirentries()'s *basep requires reading the whole getdirentries() buffer. Some Linux filesystems go further and provide a unique d_off for each entry. Another idea would be to store the last d_ino instead of dd_loc into the struct ddloc. On seekdir(), this would seek to loc_seek as before and skip entries until that d_ino is found, or to the start of the buffer if not found (and possibly return some entries again that should not be returned, but Samba copes with that). -- Jilles Tjoelker ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/25/15 5:52 AM, Jilles Tjoelker wrote: On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: Yes, this isn't at all safe. There's no guarantee whatsoever that the offset on the directory fd that isn't something returned by getdirentries has any meaning. In particular, the size of the directory entry in a random filesystem might be a different size than the structure returned by getdirentries (since it converts things into a FS-independent format). This might work for UFS by accident, but this is probably why ZFS doesn't work. However, this might be properly fixed by the thing that ino64 is doing where each directory entry returned by getdirentries gives you a seek offset that you _can_ directly seek to (as opposed to seeking to the start of the block and then walking forward N entries until you get an inter-block entry that is the same). The ino64 branch only reserves space for d_off and does not use it in any way. This is appropriate since actually using d_off is a major feature addition. A proper d_off would still be useful even if UFS's readdir keeps masking off the offset so a directory read always starts at the beginning of a 512-byte directory block, since this allows more distinct offset values than safely using getdirentries()'s *basep. With d_off, one outer loop must read at least one directory block to avoid spinning indefinitely, while using getdirentries()'s *basep requires reading the whole getdirentries() buffer. Some Linux filesystems go further and provide a unique d_off for each entry. Another idea would be to store the last d_ino instead of dd_loc into the struct ddloc. On seekdir(), this would seek to loc_seek as before and skip entries until that d_ino is found, or to the start of the buffer if not found (and possibly return some entries again that should not be returned, but Samba copes with that). yes.. though maybe a hash of hte inode number and the name may be a better thing to search on.. I need to check on your statement about samba to see if its true for 3.6.. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/25/15 4:28 AM, John Baldwin wrote: On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: On 4/25/15 1:30 AM, Julian Elischer wrote: On 4/24/15 10:59 PM, John Baldwin wrote: On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: On 4/23/15 9:54 PM, John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: >From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. how does readdir know that the directory has been changed? fstat? Undefined. FreeBSD's libc doesn't cache the entire directory (unless you are using a union mount), instead it just caches the most recent call to getdirentries(). It then serves up entries from that block until you hit the end. When you hit the end it reads the next block, etc. When you call telldir(), the kernel saves the seek offset from the start of the current block and the offset within that block and returns a cookie to you. seekdir() looks up the cookie to find the (seek offset, block offset) tuple. If the location matches the directory's current location it doesn't do anything, otherwise it seeks to the seek offset and reads a new block via getdirentries(). There is no check for seeing if a directory is changed. Instead, you can only be stale by one "block". When you read a new block it is relative to the directory's state at that time. Rick's suggestion of reusing the block for a seek within a block would be fairly easy to implement, I think it would just be: Index: head/lib/libc/gen/telldir.c === --- head/lib/libc/gen/telldir.c (revision 281929) +++ head/lib/libc/gen/telldir.c (working copy) @@ -101,8 +101,10 @@ return; if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek) return; - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); - dirp->dd_seek = lp->loc_seek; + if (lp->loc_seek != dirp->dd_seek) { + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); + dirp->dd_seek = lp->loc_seek; + } yes I did that yesterday but it still fails when you transition blocks.. (badly). I also tried bigger blocks.. also fails (eventually) I did find a way to make it work... you had to seek back to the first block you deleted on each set.. then work forward from there again.. unfortunately since I'm trying to make a microsoft program not fail (via samba) I have no control over how it does things and seekdir doesn't know what was deleted anyway... (so the fix is fine for the test program but not for real life) I think I can make the BSD one act like the linux one by changing the lseek being done to use the offset (loc) plus the buffer seek address of the target, instead
Re: readdir/telldir/seekdir problem (i think)
Jilles Tjoelker wrote: > On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > > Yes, this isn't at all safe. There's no guarantee whatsoever that > > the offset on the directory fd that isn't something returned by > > getdirentries has any meaning. In particular, the size of the > > directory entry in a random filesystem might be a different size > > than the structure returned by getdirentries (since it converts > > things into a FS-independent format). > > > This might work for UFS by accident, but this is probably why ZFS > > doesn't work. > > > However, this might be properly fixed by the thing that ino64 is > > doing where each directory entry returned by getdirentries gives > > you a seek offset that you _can_ directly seek to (as opposed to > > seeking to the start of the block and then walking forward N > > entries until you get an inter-block entry that is the same). > > The ino64 branch only reserves space for d_off and does not use it in > any way. This is appropriate since actually using d_off is a major > feature addition. > Well, at some point ino64 will need to define a new getdirentries(2) syscall and I believe this new syscall can have different/additional arguments. I'd suggest that the new gtedirentries(2) syscall should return a flag to indicate that the underlying file system is filling in d_off. Then the libc functions can use d_off if it it available. (They will still need to "work" at least as well as they do now if the file system doesn't support d_off. The old getdirentries(2) syscall will be returning the old/current "struct dirent" which doesn't have the field anyhow.) Another bit of fun is that the argument for seekdir()/telldir() is a long and ends up 32bits for some arches. d_off is 64bits, since that is what some file systems require. Maybe the library code can only use d_off if it is a 64bit arch and the file system is filling it in. (Or maybe the library can keep track of 32<->64bit mappings for the offsets. I haven't looked at the libc functions for a while, so I can't remember what they keep track of.) rick > A proper d_off would still be useful even if UFS's readdir keeps > masking > off the offset so a directory read always starts at the beginning of > a > 512-byte directory block, since this allows more distinct offset > values > than safely using getdirentries()'s *basep. With d_off, one outer > loop > must read at least one directory block to avoid spinning > indefinitely, > while using getdirentries()'s *basep requires reading the whole > getdirentries() buffer. > > Some Linux filesystems go further and provide a unique d_off for each > entry. > > Another idea would be to store the last d_ino instead of dd_loc into > the > struct ddloc. On seekdir(), this would seek to loc_seek as before and > skip entries until that d_ino is found, or to the start of the buffer > if > not found (and possibly return some entries again that should not be > returned, but Samba copes with that). > > -- > Jilles Tjoelker > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote: > Yes, this isn't at all safe. There's no guarantee whatsoever that > the offset on the directory fd that isn't something returned by > getdirentries has any meaning. In particular, the size of the > directory entry in a random filesystem might be a different size > than the structure returned by getdirentries (since it converts > things into a FS-independent format). > This might work for UFS by accident, but this is probably why ZFS > doesn't work. > However, this might be properly fixed by the thing that ino64 is > doing where each directory entry returned by getdirentries gives > you a seek offset that you _can_ directly seek to (as opposed to > seeking to the start of the block and then walking forward N > entries until you get an inter-block entry that is the same). The ino64 branch only reserves space for d_off and does not use it in any way. This is appropriate since actually using d_off is a major feature addition. A proper d_off would still be useful even if UFS's readdir keeps masking off the offset so a directory read always starts at the beginning of a 512-byte directory block, since this allows more distinct offset values than safely using getdirentries()'s *basep. With d_off, one outer loop must read at least one directory block to avoid spinning indefinitely, while using getdirentries()'s *basep requires reading the whole getdirentries() buffer. Some Linux filesystems go further and provide a unique d_off for each entry. Another idea would be to store the last d_ino instead of dd_loc into the struct ddloc. On seekdir(), this would seek to loc_seek as before and skip entries until that d_ino is found, or to the start of the buffer if not found (and possibly return some entries again that should not be returned, but Samba copes with that). -- Jilles Tjoelker ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: > On 4/25/15 1:30 AM, Julian Elischer wrote: > > On 4/24/15 10:59 PM, John Baldwin wrote: > >> On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: > >>> On 4/23/15 9:54 PM, John Baldwin wrote: > On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > > On 4/23/15 11:20 AM, Julian Elischer wrote: > >> I'm debugging a problem being seen with samba 3.6. > >> > >> basically telldir/seekdir/readdir don't seem to work as > >> advertised.. > > ok so it looks like readdir() (and friends) is totally broken in > > the face > > of deletes unless you read the entire directory at once or reset > > to the > > the first file before the deletes, or earlier. > I'm not sure that Samba isn't assuming non-portable behavior. > For example: > > >From > http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > > > If a file is removed from or added to the directory after the > most recent call > to opendir() or rewinddir(), whether a subsequent call to > readdir() returns an > entry for that file is unspecified. > > While this doesn't speak directly to your case, it does note that > you will > get inconsistencies if you scan a directory concurrent with add > and remove. > > UFS might kind of work actually since deletes do not compact the > backing > directory, but I suspect NFS and ZFS would not work. In > addition, our > current NFS support for seekdir is pretty flaky and can't be > fixed without > changes to return the seek offset for each directory entry (I > believe that > the projects/ino64 patches include this since they are breaking > the ABI of > the relevant structures already). The ABI breakage makes this a > very > non-trivial task. However, even if you have that per-item > cookie, it is > likely meaningless in the face of filesystems that use any sort > of more > advanced structure than an array (such as trees, etc.) to store > directory > entries. POSIX specifically mentions this in the rationale for > seekdir: > > One of the perceived problems of implementation is that returning > to a given point in a directory is quite difficult to describe > formally, in spite of its intuitive appeal, when systems that use > B-trees, hashing functions, or other similar mechanisms to order > their directories are considered. The definition of seekdir() and > telldir() does not specify whether, when using these interfaces, > a given directory entry will be seen at all, or more than once. > > In fact, given that quote, I would argue that what Samba is doing is > non-portable. This would seem to indicate that a conforming > seekdir could > just change readdir to immediately return EOF until you call > rewinddir. > >>> how does readdir know that the directory has been changed? fstat? > >> Undefined. FreeBSD's libc doesn't cache the entire directory > >> (unless you > >> are using a union mount), instead it just caches the most recent > >> call to > >> getdirentries(). It then serves up entries from that block until > >> you hit > >> the end. When you hit the end it reads the next block, etc. When you > >> call telldir(), the kernel saves the seek offset from the start of the > >> current block and the offset within that block and returns a cookie to > >> you. seekdir() looks up the cookie to find the (seek offset, block > >> offset) > >> tuple. If the location matches the directory's current location it > >> doesn't > >> do anything, otherwise it seeks to the seek offset and reads a new > >> block via > >> getdirentries(). There is no check for seeing if a directory is > >> changed. Instead, you can only be stale by one "block". When you > >> read > >> a new block it is relative to the directory's state at that time. > >> > >> Rick's suggestion of reusing the block for a seek within a block > >> would be > >> fairly easy to implement, I think it would just be: > >> > >> Index: head/lib/libc/gen/telldir.c > >> === > >> --- head/lib/libc/gen/telldir.c (revision 281929) > >> +++ head/lib/libc/gen/telldir.c (working copy) > >> @@ -101,8 +101,10 @@ > >> return; > >> if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == > >> dirp->dd_seek) > >> return; > >> - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); > >> - dirp->dd_seek = lp->loc_seek; > >> + if (lp->loc_seek != dirp->dd_seek) { > >> + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, > >> SEEK_SET); > >> + dirp->dd_seek = lp->loc_seek; > >> + } > > > > yes I did that yesterda
Re: readdir/telldir/seekdir problem (i think)
On 4/25/15 1:30 AM, Julian Elischer wrote: On 4/24/15 10:59 PM, John Baldwin wrote: On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: On 4/23/15 9:54 PM, John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: >From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. how does readdir know that the directory has been changed? fstat? Undefined. FreeBSD's libc doesn't cache the entire directory (unless you are using a union mount), instead it just caches the most recent call to getdirentries(). It then serves up entries from that block until you hit the end. When you hit the end it reads the next block, etc. When you call telldir(), the kernel saves the seek offset from the start of the current block and the offset within that block and returns a cookie to you. seekdir() looks up the cookie to find the (seek offset, block offset) tuple. If the location matches the directory's current location it doesn't do anything, otherwise it seeks to the seek offset and reads a new block via getdirentries(). There is no check for seeing if a directory is changed. Instead, you can only be stale by one "block". When you read a new block it is relative to the directory's state at that time. Rick's suggestion of reusing the block for a seek within a block would be fairly easy to implement, I think it would just be: Index: head/lib/libc/gen/telldir.c === --- head/lib/libc/gen/telldir.c (revision 281929) +++ head/lib/libc/gen/telldir.c (working copy) @@ -101,8 +101,10 @@ return; if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek) return; - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); - dirp->dd_seek = lp->loc_seek; + if (lp->loc_seek != dirp->dd_seek) { + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); + dirp->dd_seek = lp->loc_seek; + } yes I did that yesterday but it still fails when you transition blocks.. (badly). I also tried bigger blocks.. also fails (eventually) I did find a way to make it work... you had to seek back to the first block you deleted on each set.. then work forward from there again.. unfortunately since I'm trying to make a microsoft program not fail (via samba) I have no control over how it does things and seekdir doesn't know what was deleted anyway... (so the fix is fine for the test program but not for real life) I think I can make the BSD one act like the linux one by changing the lseek being done to use the offset (loc) plus the buffer seek address of the target, instead of just going for the buffer base and stepping forwar
Re: readdir/telldir/seekdir problem (i think)
On 4/24/15 10:59 PM, John Baldwin wrote: On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: On 4/23/15 9:54 PM, John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: >From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. how does readdir know that the directory has been changed? fstat? Undefined. FreeBSD's libc doesn't cache the entire directory (unless you are using a union mount), instead it just caches the most recent call to getdirentries(). It then serves up entries from that block until you hit the end. When you hit the end it reads the next block, etc. When you call telldir(), the kernel saves the seek offset from the start of the current block and the offset within that block and returns a cookie to you. seekdir() looks up the cookie to find the (seek offset, block offset) tuple. If the location matches the directory's current location it doesn't do anything, otherwise it seeks to the seek offset and reads a new block via getdirentries(). There is no check for seeing if a directory is changed. Instead, you can only be stale by one "block". When you read a new block it is relative to the directory's state at that time. Rick's suggestion of reusing the block for a seek within a block would be fairly easy to implement, I think it would just be: Index: head/lib/libc/gen/telldir.c === --- head/lib/libc/gen/telldir.c (revision 281929) +++ head/lib/libc/gen/telldir.c (working copy) @@ -101,8 +101,10 @@ return; if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek) return; - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); - dirp->dd_seek = lp->loc_seek; + if (lp->loc_seek != dirp->dd_seek) { + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); + dirp->dd_seek = lp->loc_seek; + } yes I did that yesterday but it still fails when you transition blocks.. (badly). I also tried bigger blocks.. also fails (eventually) I did find a way to make it work... you had to seek back to the first block you deleted on each set.. then work forward from there again.. unfortunately since I'm trying to make a microsoft program not fail (via samba) I have no control over how it does things and seekdir doesn't know what was deleted anyway... (so the fix is fine for the test program but not for real life) I think I can make the BSD one act like the linux one by changing the lseek being done to use the offset (loc) plus the buffer seek address of the target, instead of just going for the buffer base and stepping forward through the entries.. maybe tomorrow. dirp->dd_loc = 0; wh
Re: readdir/telldir/seekdir problem (i think)
On 4/24/15 10:43 PM, John Baldwin wrote: On Friday, April 24, 2015 06:42:01 PM Julian Elischer wrote: On 4/24/15 6:12 AM, Rick Macklem wrote: John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked, but I strongly suspect that it reads the entire directory upon either opendir() or the first readdir(). Oh, and I hate to say it, but I suspect Linux defines the "standard" on this and not POSIX. (In other words, if it works on Linux, it isn't broken;-) rick here's an interesting datapoint. If the test program is run on kFreeBSD using glibc, it runs without flaw. OS-X (bsd derived libc) HFS+ fails FreeBSD libc (UFS) fails FreeBSD libc (ZFS) fails FreeBSD glibc succceeds Centos 6.5 glibc succeeds some NFS tests would be nice to do too I guess... glibc authors seem to have done something right.. it even copes with FreeBSD kernel.. It's probably just reading the entire directory and caching it until rewinddir is called. FreeBSD's libc does this if you have a unionfs mount. It would be a trivial change to always do this, is just means you will never notice any concurrent deletes, adds, or renames until you call rewinddir again. At that point you might as well have the client just do rewinddir each time. You are just moving the caching that Samba should be doing to be portable from samba into libc. I'm not sure that's really an improvement so much as shuffling deck chairs. Also, that is going to keep giving you directory entries for the files you've already removed (unless you patch libc to explicitly hack around that case by stating each entry and skipping ones that fail with ENOENT which would add a lot of overhead). SO I rewrote/ported glibc telldir/readdir to our FreeBSD.. Firstly, a slight addition.. BSD libc also fails on tmpfs ( I found that out by accident.. I thought I was on UFS and forgot I had a tmpfs there) ported glibc readdir/friends works an all three. and it is not caching the whole thing.. I'm doing the deletes in batches of 5. and every 5 deletes I see it doing: 95985 testit2 RET write 30/0x1e 95985 testit2 CALL write(0x1,0x801007000,0x18) 95985 testit2 GIO fd 1 wrote 24 bytes "file file-1756 returned " 95985 testit2 RET write 24/0x18 95985 testit2 CALL write(0x1,0x801007000,0x1d) 95985 testit2 GIO fd 1 wrote 29 bytes "Seeking back to location 144 " 95985 testit2 RET write 29/0x1d 95985 testit2 CALL lseek(0x3,0x90,SEEK_SET) 95985 testit2 RET lseek 144/0x90 95985 testit2 CALL write(0x1,0x801007000,0x1e) 95985 t
Re: readdir/telldir/seekdir problem (i think)
On 4/24/15 6:48 PM, Garrett Cooper wrote: On Apr 24, 2015, at 3:42, Julian Elischer wrote: here's an interesting datapoint. If the test program is run on kFreeBSD using glibc, it runs without flaw. OS-X (bsd derived libc) HFS+ fails FreeBSD libc (UFS) fails FreeBSD libc (ZFS) fails FreeBSD glibc succceeds Centos 6.5 glibc succeeds some NFS tests would be nice to do too I guess... glibc authors seem to have done something right.. it even copes with FreeBSD kernel.. Hi Julian, What version of FreeBSD are you running (uname -a) / coming up with the above results on? I’m asking because IIRC there was an issue that pho@ brought up with readdir/telldir/etc that was fixed on CURRENT in the past 1-2 years for us, that might be affecting you. Cheers, -NGie 11-current as of a few weeks ago ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Friday, April 24, 2015 06:42:01 PM Julian Elischer wrote: > On 4/24/15 6:12 AM, Rick Macklem wrote: > > John Baldwin wrote: > >> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > >>> On 4/23/15 11:20 AM, Julian Elischer wrote: > I'm debugging a problem being seen with samba 3.6. > > basically telldir/seekdir/readdir don't seem to work as > advertised.. > >>> ok so it looks like readdir() (and friends) is totally broken in > >>> the face > >>> of deletes unless you read the entire directory at once or reset to > >>> the > >>> the first file before the deletes, or earlier. > >> I'm not sure that Samba isn't assuming non-portable behavior. For > >> example: > >> > >> From > >> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > >> > >> If a file is removed from or added to the directory after the most > >> recent call > >> to opendir() or rewinddir(), whether a subsequent call to readdir() > >> returns an > >> entry for that file is unspecified. > >> > >> While this doesn't speak directly to your case, it does note that you > >> will > >> get inconsistencies if you scan a directory concurrent with add and > >> remove. > >> > >> UFS might kind of work actually since deletes do not compact the > >> backing > >> directory, but I suspect NFS and ZFS would not work. In addition, > >> our > >> current NFS support for seekdir is pretty flaky and can't be fixed > >> without > >> changes to return the seek offset for each directory entry (I believe > >> that > >> the projects/ino64 patches include this since they are breaking the > >> ABI of > >> the relevant structures already). The ABI breakage makes this a very > >> non-trivial task. However, even if you have that per-item cookie, it > >> is > >> likely meaningless in the face of filesystems that use any sort of > >> more > >> advanced structure than an array (such as trees, etc.) to store > >> directory > >> entries. POSIX specifically mentions this in the rationale for > >> seekdir: > >> > >> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html > >> > >> One of the perceived problems of implementation is that returning to > >> a given point in a directory is quite difficult to describe > >> formally, in spite of its intuitive appeal, when systems that use > >> B-trees, hashing functions, or other similar mechanisms to order > >> their directories are considered. The definition of seekdir() and > >> telldir() does not specify whether, when using these interfaces, a > >> given directory entry will be seen at all, or more than once. > >> > >> In fact, given that quote, I would argue that what Samba is doing is > >> non-portable. This would seem to indicate that a conforming seekdir > >> could > >> just change readdir to immediately return EOF until you call > >> rewinddir. > >> > > Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked, > > but I strongly suspect that it reads the entire directory upon either > > opendir() > > or the first readdir(). > > > > Oh, and I hate to say it, but I suspect Linux defines the "standard" on > > this and not POSIX. (In other words, if it works on Linux, it isn't > > broken;-) > > > > rick > > here's an interesting datapoint. If the test program is run on > kFreeBSD using glibc, it runs without flaw. > > OS-X (bsd derived libc) HFS+ fails > FreeBSD libc (UFS) fails > FreeBSD libc (ZFS) fails > FreeBSD glibc succceeds > Centos 6.5 glibc succeeds > > some NFS tests would be nice to do too I guess... > glibc authors seem to have done something right.. it even copes with > FreeBSD kernel.. It's probably just reading the entire directory and caching it until rewinddir is called. FreeBSD's libc does this if you have a unionfs mount. It would be a trivial change to always do this, is just means you will never notice any concurrent deletes, adds, or renames until you call rewinddir again. At that point you might as well have the client just do rewinddir each time. You are just moving the caching that Samba should be doing to be portable from samba into libc. I'm not sure that's really an improvement so much as shuffling deck chairs. Also, that is going to keep giving you directory entries for the files you've already removed (unless you patch libc to explicitly hack around that case by stating each entry and skipping ones that fail with ENOENT which would add a lot of overhead). -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: > On 4/23/15 9:54 PM, John Baldwin wrote: > > On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > >> On 4/23/15 11:20 AM, Julian Elischer wrote: > >>> I'm debugging a problem being seen with samba 3.6. > >>> > >>> basically telldir/seekdir/readdir don't seem to work as advertised.. > >> ok so it looks like readdir() (and friends) is totally broken in the face > >> of deletes unless you read the entire directory at once or reset to the > >> the first file before the deletes, or earlier. > > I'm not sure that Samba isn't assuming non-portable behavior. For example: > > > > >From > > >http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > > > > If a file is removed from or added to the directory after the most recent > > call > > to opendir() or rewinddir(), whether a subsequent call to readdir() returns > > an > > entry for that file is unspecified. > > > > While this doesn't speak directly to your case, it does note that you will > > get inconsistencies if you scan a directory concurrent with add and remove. > > > > UFS might kind of work actually since deletes do not compact the backing > > directory, but I suspect NFS and ZFS would not work. In addition, our > > current NFS support for seekdir is pretty flaky and can't be fixed without > > changes to return the seek offset for each directory entry (I believe that > > the projects/ino64 patches include this since they are breaking the ABI of > > the relevant structures already). The ABI breakage makes this a very > > non-trivial task. However, even if you have that per-item cookie, it is > > likely meaningless in the face of filesystems that use any sort of more > > advanced structure than an array (such as trees, etc.) to store directory > > entries. POSIX specifically mentions this in the rationale for seekdir: > > > > > > > One of the perceived problems of implementation is that returning to a > > given point in a directory is quite difficult to describe formally, in > > spite of its intuitive appeal, when systems that use B-trees, hashing > > functions, or other similar mechanisms to order their directories are > > considered. The definition of seekdir() and telldir() does not specify > > whether, when using these interfaces, a given directory entry will be seen > > at all, or more than once. > > > > In fact, given that quote, I would argue that what Samba is doing is > > non-portable. This would seem to indicate that a conforming seekdir could > > just change readdir to immediately return EOF until you call rewinddir. > > how does readdir know that the directory has been changed? fstat? Undefined. FreeBSD's libc doesn't cache the entire directory (unless you are using a union mount), instead it just caches the most recent call to getdirentries(). It then serves up entries from that block until you hit the end. When you hit the end it reads the next block, etc. When you call telldir(), the kernel saves the seek offset from the start of the current block and the offset within that block and returns a cookie to you. seekdir() looks up the cookie to find the (seek offset, block offset) tuple. If the location matches the directory's current location it doesn't do anything, otherwise it seeks to the seek offset and reads a new block via getdirentries(). There is no check for seeing if a directory is changed. Instead, you can only be stale by one "block". When you read a new block it is relative to the directory's state at that time. Rick's suggestion of reusing the block for a seek within a block would be fairly easy to implement, I think it would just be: Index: head/lib/libc/gen/telldir.c === --- head/lib/libc/gen/telldir.c (revision 281929) +++ head/lib/libc/gen/telldir.c (working copy) @@ -101,8 +101,10 @@ return; if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek) return; - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); - dirp->dd_seek = lp->loc_seek; + if (lp->loc_seek != dirp->dd_seek) { + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); + dirp->dd_seek = lp->loc_seek; + } dirp->dd_loc = 0; while (dirp->dd_loc < lp->loc_loc) { dp = _readdir_unlocked(dirp, 0); (You still want to re-parse the block I think since the telldir offset might be from an earlier version of the block and not point to a valid directory entry since they are variable sized.) -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Apr 24, 2015, at 3:42, Julian Elischer wrote: > here's an interesting datapoint. If the test program is run on kFreeBSD > using glibc, it runs without flaw. > > OS-X (bsd derived libc) HFS+ fails > FreeBSD libc (UFS) fails > FreeBSD libc (ZFS) fails > FreeBSD glibc succceeds > Centos 6.5 glibc succeeds > > some NFS tests would be nice to do too I guess... > glibc authors seem to have done something right.. it even copes with FreeBSD > kernel.. Hi Julian, What version of FreeBSD are you running (uname -a) / coming up with the above results on? I’m asking because IIRC there was an issue that pho@ brought up with readdir/telldir/etc that was fixed on CURRENT in the past 1-2 years for us, that might be affecting you. Cheers, -NGie signature.asc Description: Message signed with OpenPGP using GPGMail
Re: readdir/telldir/seekdir problem (i think)
On 4/24/15 6:12 AM, Rick Macklem wrote: John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked, but I strongly suspect that it reads the entire directory upon either opendir() or the first readdir(). Oh, and I hate to say it, but I suspect Linux defines the "standard" on this and not POSIX. (In other words, if it works on Linux, it isn't broken;-) rick here's an interesting datapoint. If the test program is run on kFreeBSD using glibc, it runs without flaw. OS-X (bsd derived libc) HFS+ fails FreeBSD libc (UFS) fails FreeBSD libc (ZFS) fails FreeBSD glibc succceeds Centos 6.5 glibc succeeds some NFS tests would be nice to do too I guess... glibc authors seem to have done something right.. it even copes with FreeBSD kernel.. -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/24/15 5:50 AM, Rick Macklem wrote: John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. Loosely related to this, I have been tempted to modify readdir() to read the entire directory on the first readdir() for NFS, to avoid the readdir()/unlink() problem. I did find a bug in our readdir/seekdir that makes it a lot worse... We reload the kernel's idea of the directory every time we do a seekdir() back, even if it's within the same block, which makes us a lot more susceptible to the problem.. making it not do that unless the new location is in another block made it work on directories with up to several thousand files (with 32k blocksize) before failing. With that bug it's possible do make every seekdir() produce failures even in a directory of just 3 files.. The downside is that the client continues to see the old contents of the block even though he has done a seekdir() within it. My concern was doing this for a very large directory. Maybe it could be done for directories up to some size? rick -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/23/15 9:54 PM, John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: >From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. how does readdir know that the directory has been changed? fstat? ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/24/15 6:12 AM, Rick Macklem wrote: John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked, but I strongly suspect that it reads the entire directory upon either opendir() or the first readdir(). they make it work for everything apparently.. which is quite a trick. Oh, and I hate to say it, but I suspect Linux defines the "standard" on this and not POSIX. (In other words, if it works on Linux, it isn't broken;-) rick -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
John Baldwin wrote: > On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > > On 4/23/15 11:20 AM, Julian Elischer wrote: > > > I'm debugging a problem being seen with samba 3.6. > > > > > > basically telldir/seekdir/readdir don't seem to work as > > > advertised.. > > > > ok so it looks like readdir() (and friends) is totally broken in > > the face > > of deletes unless you read the entire directory at once or reset to > > the > > the first file before the deletes, or earlier. > > I'm not sure that Samba isn't assuming non-portable behavior. For > example: > > From > http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > > If a file is removed from or added to the directory after the most > recent call > to opendir() or rewinddir(), whether a subsequent call to readdir() > returns an > entry for that file is unspecified. > > While this doesn't speak directly to your case, it does note that you > will > get inconsistencies if you scan a directory concurrent with add and > remove. > > UFS might kind of work actually since deletes do not compact the > backing > directory, but I suspect NFS and ZFS would not work. In addition, > our > current NFS support for seekdir is pretty flaky and can't be fixed > without > changes to return the seek offset for each directory entry (I believe > that > the projects/ino64 patches include this since they are breaking the > ABI of > the relevant structures already). The ABI breakage makes this a very > non-trivial task. However, even if you have that per-item cookie, it > is > likely meaningless in the face of filesystems that use any sort of > more > advanced structure than an array (such as trees, etc.) to store > directory > entries. POSIX specifically mentions this in the rationale for > seekdir: > > http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html > > One of the perceived problems of implementation is that returning to > a given point in a directory is quite difficult to describe > formally, in spite of its intuitive appeal, when systems that use > B-trees, hashing functions, or other similar mechanisms to order > their directories are considered. The definition of seekdir() and > telldir() does not specify whether, when using these interfaces, a > given directory entry will be seen at all, or more than once. > > In fact, given that quote, I would argue that what Samba is doing is > non-portable. This would seem to indicate that a conforming seekdir > could > just change readdir to immediately return EOF until you call > rewinddir. > Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked, but I strongly suspect that it reads the entire directory upon either opendir() or the first readdir(). Oh, and I hate to say it, but I suspect Linux defines the "standard" on this and not POSIX. (In other words, if it works on Linux, it isn't broken;-) rick > -- > John Baldwin > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
John Baldwin wrote: > On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > > On 4/23/15 11:20 AM, Julian Elischer wrote: > > > I'm debugging a problem being seen with samba 3.6. > > > > > > basically telldir/seekdir/readdir don't seem to work as > > > advertised.. > > > > ok so it looks like readdir() (and friends) is totally broken in > > the face > > of deletes unless you read the entire directory at once or reset to > > the > > the first file before the deletes, or earlier. > > I'm not sure that Samba isn't assuming non-portable behavior. For > example: > > From > http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > > If a file is removed from or added to the directory after the most > recent call > to opendir() or rewinddir(), whether a subsequent call to readdir() > returns an > entry for that file is unspecified. > > While this doesn't speak directly to your case, it does note that you > will > get inconsistencies if you scan a directory concurrent with add and > remove. > > UFS might kind of work actually since deletes do not compact the > backing > directory, but I suspect NFS and ZFS would not work. In addition, > our > current NFS support for seekdir is pretty flaky and can't be fixed > without > changes to return the seek offset for each directory entry (I believe > that > the projects/ino64 patches include this since they are breaking the > ABI of > the relevant structures already). The ABI breakage makes this a very > non-trivial task. However, even if you have that per-item cookie, it > is > likely meaningless in the face of filesystems that use any sort of > more > advanced structure than an array (such as trees, etc.) to store > directory > entries. POSIX specifically mentions this in the rationale for > seekdir: > > http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html > > One of the perceived problems of implementation is that returning to > a given point in a directory is quite difficult to describe > formally, in spite of its intuitive appeal, when systems that use > B-trees, hashing functions, or other similar mechanisms to order > their directories are considered. The definition of seekdir() and > telldir() does not specify whether, when using these interfaces, a > given directory entry will be seen at all, or more than once. > > In fact, given that quote, I would argue that what Samba is doing is > non-portable. This would seem to indicate that a conforming seekdir > could > just change readdir to immediately return EOF until you call > rewinddir. > Loosely related to this, I have been tempted to modify readdir() to read the entire directory on the first readdir() for NFS, to avoid the readdir()/unlink() problem. My concern was doing this for a very large directory. Maybe it could be done for directories up to some size? rick > -- > John Baldwin > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/23/15 9:54 PM, John Baldwin wrote: On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: >From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. looking in ufs_getdirent (or whatever it's called) even UFS is packing the list of files and removing deleted entries. it's also what I'm seeing in practice. if I do a seekdir back to the location of the first deleted file, I find a new file in that slot, that has been shuffled down. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > On 4/23/15 11:20 AM, Julian Elischer wrote: > > I'm debugging a problem being seen with samba 3.6. > > > > basically telldir/seekdir/readdir don't seem to work as advertised.. > > ok so it looks like readdir() (and friends) is totally broken in the face > of deletes unless you read the entire directory at once or reset to the > the first file before the deletes, or earlier. I'm not sure that Samba isn't assuming non-portable behavior. For example: >From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. While this doesn't speak directly to your case, it does note that you will get inconsistencies if you scan a directory concurrent with add and remove. UFS might kind of work actually since deletes do not compact the backing directory, but I suspect NFS and ZFS would not work. In addition, our current NFS support for seekdir is pretty flaky and can't be fixed without changes to return the seek offset for each directory entry (I believe that the projects/ino64 patches include this since they are breaking the ABI of the relevant structures already). The ABI breakage makes this a very non-trivial task. However, even if you have that per-item cookie, it is likely meaningless in the face of filesystems that use any sort of more advanced structure than an array (such as trees, etc.) to store directory entries. POSIX specifically mentions this in the rationale for seekdir: http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. In fact, given that quote, I would argue that what Samba is doing is non-portable. This would seem to indicate that a conforming seekdir could just change readdir to immediately return EOF until you call rewinddir. -- John Baldwin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: readdir/telldir/seekdir problem (i think)
On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. ok so it looks like readdir() (and friends) is totally broken in the face of deletes unless you read the entire directory at once or reset to the the first file before the deletes, or earlier. So if you use readdir() to enumerate files in order to delete them, and you do it in sections in order to not have to load a potentially huge list of files all at once, you can not do it unless you rewind to 0 after each set of unlinks. (or you do all the unlinks in one hit). this appears to be because getdirentries (or maybe the filesystem behind) will 'compact' the returned data. At least this appears to be the case looking from the outside. Next step is to look at the getdirentries code.. here's a little test program #include #include #include #include #include #include #include #include #include #include #define CHUNKSIZE 5 #define TOTALFILES 40 static void SeekDir(DIR *dirp, long loc) { printf("Seeking back to location %ld\n", loc); seekdir(dirp, loc); } static long TellDir(DIR *dirp) { long loc; loc = telldir(dirp); printf("telldir assigned location %ld\n", loc); return (loc); } int main(int argc, char *argv[]) { DIR*dirp; inti; intj; longoffset = 0, prev_offset = 0; char *files[100]; charfilename[100]; intfd; struct dirent *dp = NULL; if (chdir("./test2") != 0) { err(EX_OSERR, "chdir"); } /*/ /* Put a set of sample files in the target directory */ /*/ for (i=1; i < TOTALFILES ; i++) { sprintf(filename, "file-%d", i); fd = open(filename, O_CREAT, 0666); if (fd == -1) { err(EX_OSERR, "open"); } close(fd); } dirp = opendir("."); offset = TellDir(dirp); for (i = 0; i < 20; i++) files[i] = malloc(20); /***/ /* enumerate and delete small sets of files, one group */ /* at a time. */ /***/ do { /*/ /* Read in up to CHUNKSIZE file names*/ /* i will be the number of files we hold */ /*/ for (i = 0; i < CHUNKSIZE; i++) { if ((dp = readdir(dirp)) != NULL) { strcpy(files[i], dp->d_name); printf("readdir (%ld) returned file %s\n", offset, files[i]); prev_offset = offset; offset = TellDir(dirp); } else { printf("readdir returned null\n"); break; } } // /* Simuate the last entry not fitting into our (samba's) buffer */ /* If we read someting in on the last slot, push it back*/ /* Pretend it didn't fit. This is approximately what SAMBA does.*/ // if (dp != NULL) { /* Step back */ SeekDir(dirp, prev_offset); offset = TellDir(dirp); i--; printf("file %s returned\n", files[i]); } /*/ /* i is the number of names we have left.*/ /* Delete them. */ /*/ for (j = 0; j < i; j++) { if (*files[j] == '.') { printf ("skipping %s\n", files[j]); } else { printf("Unlinking file %s\n", files[j]); if (unlink(files[j]) != 0) { err(EX_OSERR, "unlink"); } } } } while (dp != NULL); closedir(dirp); //chdir(".."); } The program is simulating what Samba does when fails. (doing a recursive delete of a directory) What it does is reads a chunk of names using readdir() until it's (small) buffer s full, then it uses seekdir() to seek back before the last entry it read, (which fails to fit), theortically leaving it for the next read. It then deletes the entries it found and repeats the cycle. Eventually it should have found all the files in the directory and deleted them. Except that it doesn't. What actually happens is that some files are not enumerated, even though the seekdir() should have made the readdir() find them. for added fun. the FIRST seekdir appears to work. but all subsequent ones don't. It behaves this way in -current , all the way back to at
Re: readdir/telldir/seekdir problem (i think)
On 4/23/15 11:20 AM, Julian Elischer wrote: I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. On the intergooglewebs I see multiple previous reports of similar problems, some going back 30 years, but we supposedly have the fixes for those in -current. added data point.. if I comment out the 'unlink' lines all the files are enumerated, so it's something similar to the previously reported unlink-vs-readdir bugs. but different. If I insert a rewinddir() after each set of deletes it all works right but that means the system is continuously rereading all the prior directory blocks.. It goes wrong on at least UFS and ZFS. maybe others. I'm not totally sure it can be fixed. If teh underlying directory blocs are reshuffled or changed and things are moved, it may just be impossible to keep any information as to where you are in the directory without adding more state. here's a little test program #include #include #include #include #include #include #include #include #include #include #define CHUNKSIZE 5 #define TOTALFILES 40 static void SeekDir(DIR *dirp, long loc) { printf("Seeking back to location %ld\n", loc); seekdir(dirp, loc); } static long TellDir(DIR *dirp) { long loc; loc = telldir(dirp); printf("telldir assigned location %ld\n", loc); return (loc); } int main(int argc, char *argv[]) { DIR*dirp; inti; intj; longoffset = 0, prev_offset = 0; char *files[100]; charfilename[100]; intfd; struct dirent *dp = NULL; if (chdir("./test2") != 0) { err(EX_OSERR, "chdir"); } /*/ /* Put a set of sample files in the target directory */ /*/ for (i=1; i < TOTALFILES ; i++) { sprintf(filename, "file-%d", i); fd = open(filename, O_CREAT, 0666); if (fd == -1) { err(EX_OSERR, "open"); } close(fd); } dirp = opendir("."); offset = TellDir(dirp); for (i = 0; i < 20; i++) files[i] = malloc(20); /***/ /* enumerate and delete small sets of files, one group */ /* at a time. */ /***/ do { /*/ /* Read in up to CHUNKSIZE file names*/ /* i will be the number of files we hold */ /*/ for (i = 0; i < CHUNKSIZE; i++) { if ((dp = readdir(dirp)) != NULL) { strcpy(files[i], dp->d_name); printf("readdir (%ld) returned file %s\n", offset, files[i]); prev_offset = offset; offset = TellDir(dirp); } else { printf("readdir returned null\n"); break; } } // /* Simuate the last entry not fitting into our (samba's) buffer */ /* If we read someting in on the last slot, push it back*/ /* Pretend it didn't fit. This is approximately what SAMBA does.*/ // if (dp != NULL) { /* Step back */ SeekDir(dirp, prev_offset); offset = TellDir(dirp); i--; printf("file %s returned\n", files[i]); } /*/ /* i is the number of names we have left.*/ /* Delete them. */ /*/ for (j = 0; j < i; j++) { if (*files[j] == '.') { printf ("skipping %s\n", files[j]); } else { printf("Unlinking file %s\n", files[j]); if (unlink(files[j]) != 0) { err(EX_OSERR, "unlink"); } } } } while (dp != NULL); closedir(dirp); //chdir(".."); } The program is simulating what Samba does when fails. (doing a recursive delete of a directory) What it does is reads a chunk of names using readdir() until it's (small) buffer s full, then it uses seekdir() to seek back before the last entry it read, (which fails to fit), theortically leaving it for the next read. It then deletes the entries it found and repeats the cycle. Eventually it should have found all the files in the directory and deleted them. Except that it doesn't. What actually happens is that some files are not enumerated, even though the seekdir() should have made the readdir() find them. for added fun. the FIRST seekdir appears to wor
readdir/telldir/seekdir problem (i think)
I'm debugging a problem being seen with samba 3.6. basically telldir/seekdir/readdir don't seem to work as advertised.. here's a little test program #include #include #include #include #include #include #include #include #include #include #define CHUNKSIZE 5 #define TOTALFILES 40 static void SeekDir(DIR *dirp, long loc) { printf("Seeking back to location %ld\n", loc); seekdir(dirp, loc); } static long TellDir(DIR *dirp) { long loc; loc = telldir(dirp); printf("telldir assigned location %ld\n", loc); return (loc); } int main(int argc, char *argv[]) { DIR*dirp; inti; intj; longoffset = 0, prev_offset = 0; char *files[100]; charfilename[100]; intfd; struct dirent *dp = NULL; if (chdir("./test2") != 0) { err(EX_OSERR, "chdir"); } /*/ /* Put a set of sample files in the target directory */ /*/ for (i=1; i < TOTALFILES ; i++) { sprintf(filename, "file-%d", i); fd = open(filename, O_CREAT, 0666); if (fd == -1) { err(EX_OSERR, "open"); } close(fd); } dirp = opendir("."); offset = TellDir(dirp); for (i = 0; i < 20; i++) files[i] = malloc(20); /***/ /* enumerate and delete small sets of files, one group */ /* at a time. */ /***/ do { /*/ /* Read in up to CHUNKSIZE file names*/ /* i will be the number of files we hold */ /*/ for (i = 0; i < CHUNKSIZE; i++) { if ((dp = readdir(dirp)) != NULL) { strcpy(files[i], dp->d_name); printf("readdir (%ld) returned file %s\n", offset, files[i]); prev_offset = offset; offset = TellDir(dirp); } else { printf("readdir returned null\n"); break; } } // /* Simuate the last entry not fitting into our (samba's) buffer */ /* If we read someting in on the last slot, push it back*/ /* Pretend it didn't fit. This is approximately what SAMBA does.*/ // if (dp != NULL) { /* Step back */ SeekDir(dirp, prev_offset); offset = TellDir(dirp); i--; printf("file %s returned\n", files[i]); } /*/ /* i is the number of names we have left.*/ /* Delete them. */ /*/ for (j = 0; j < i; j++) { if (*files[j] == '.') { printf ("skipping %s\n", files[j]); } else { printf("Unlinking file %s\n", files[j]); if (unlink(files[j]) != 0) { err(EX_OSERR, "unlink"); } } } } while (dp != NULL); closedir(dirp); //chdir(".."); } The program is simulating what Samba does when fails. (doing a recursive delete of a directory) What it does is reads a chunk of names using readdir() until it's (small) buffer s full, then it uses seekdir() to seek back before the last entry it read, (which fails to fit), theortically leaving it for the next read. It then deletes the entries it found and repeats the cycle. Eventually it should have found all the files in the directory and deleted them. Except that it doesn't. What actually happens is that some files are not enumerated, even though the seekdir() should have made the readdir() find them. for added fun. the FIRST seekdir appears to work. but all subsequent ones don't. It behaves this way in -current , all the way back to at least 8.0. if there's a bug in my program please let me know, but samba has the same problem.. e.g. on freeNAS. to use the program make a directory called "./test2" and then run it in the current directory.. It fills it with files and then tried to (fails) delete them in small batches. here's some (annotated) output: ./testit telldir assigned location 0 readdir (0) returned file . telldir assigned location 1 readdir (1) returned file .. telldir assigned location 2 readdir (2) returned file file-1 telldir assigned location 3 readdir (3) returned file file-2 telldir assigned location 4 readdir (4) returned file file-3 telldir assigned location 5 > here we pretend the buffer was full and put the file > marker back