On Oct 9, 2006, at 3:23 PM, Phil Carns wrote:
Phil Carns wrote:
I started thinking about some more possible ideas, but I realized
after looking closer at the code that I don't actually see why
duplicates would occur in the first place with the algorithm that
is being used :) I apologize if this has been discussed a few
times already, but could we walk through it one more time?
I know that the request protocol uses a token (integer based) to
keep track of position. However, the pcache converts this into a
particular key based on where the last iteration left off. This
key contains the handle as well as the alphanumeric name of the
entry.
Trove then does a c_get on that key with the DB_SET flag, which
should put the cursor at the proper position. If the entry has
been deleted (which is not happening in my case- I am only
creating files), then it retries the c_get with the DB_SET_RANGE
flag which should set the cursor at the next position. "next" in
this case is defined by the comparison function,
PINT_trove_dbpf_keyval_compare().
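The DB_SET / DB_SET_RANGE fallback described above can be sketched with a self-contained toy, using a sorted string array in place of the Berkeley DB btree. All names here are illustrative, not the actual trove/dbpf code:

```c
#include <string.h>

/* Toy stand-in for the btree: keys already in sort order */
static const char *keys[] = { "apple", "cherry", "plum" };
static const int nkeys = 3;

/* DB_SET semantics: exact match only, or -1 ("DB_NOTFOUND") */
int db_set(const char *target)
{
    for (int i = 0; i < nkeys; i++)
        if (strcmp(keys[i], target) == 0)
            return i;
    return -1;
}

/* DB_SET_RANGE semantics: smallest key >= target, or -1 past the end */
int db_set_range(const char *target)
{
    for (int i = 0; i < nkeys; i++)
        if (strcmp(keys[i], target) >= 0)
            return i;
    return -1;
}

/* Resume iteration at the key saved in the pcache; if that entry was
 * deleted in the meantime, fall through to the next key in sort order. */
int resume(const char *saved_key)
{
    int pos = db_set(saved_key);
    if (pos < 0)
        pos = db_set_range(saved_key);
    return pos;
}
```

So resume("cherry") lands on the exact entry, while resume("banana"), a key that no longer exists, falls through to "cherry", the next key in sort order.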
The keyval_compare() function sorts the keys by handle
value, then by key length, then by strncmp() of the key name.
This means that essentially we are indexing off of the name of
the entry rather than a position in the database.
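That three-level sort order could be sketched roughly as below; the struct layout is an assumption for the example, not the actual dbpf keyval format:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout for illustration only */
struct keyval_key {
    uint64_t handle;   /* directory handle */
    uint32_t len;      /* length of the name */
    const char *name;  /* component name of the entry */
};

/* Sketch of the described sort order: handle first, then key
 * length, then strncmp of the name. */
int keyval_compare_sketch(const struct keyval_key *a,
                          const struct keyval_key *b)
{
    if (a->handle != b->handle)
        return (a->handle < b->handle) ? -1 : 1;
    if (a->len != b->len)
        return (a->len < b->len) ? -1 : 1;
    return strncmp(a->name, b->name, a->len);
}
```

Note that because length sorts before content, a one-character name like "z" orders before "aa" -- the iteration order is not plain alphabetical, but it is still total and stable, which is all the resume logic needs.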
So how could inserting a new entry between readdir requests cause
a duplicate? The old entry stored in the pcache should still be
valid. If the newly inserted entry comes after it (according to
the keyval_compare() sort order), then we should see it as we
continue iterating. If the new entry comes before it, then it
should not show up (we don't back up in the directory listing).
It doesn't seem like any combination should cause an entry to
show up twice.
Is c_get() not traversing the db in the order defined by the
keyval_compare() function?
The only other danger that I see is that if the pcache_lookup()
fails, the code falls back to stepping linearly through the db to
the token position which I could imagine might have ordering
implications. However, I am only talking to the server from a
single client, so I don't see why it would ever miss the pcache
lookup.
I just want to confirm that there is actually an algorithm
problem here rather than just a bug in the code somewhere.
Oh, or is the problem in how the end of the directory is
detected? Does the client do something like issuing a readdir
until it gets a response with zero entries? I haven't looked at
how this works yet, but I imagine that could throw a wrench into
things if the directory gets additional entries between when the
server first indicates that it has reached the end and when the
client gives up on asking for more.
I just tried repeating the test a few times, replacing the "ls" in
my test script with either "pvfs2-ls" or "pvfs2-ls -al". I cannot
trigger the problem when using pvfs2-ls.
If I switch back to "ls" or "/bin/ls" the problem shows up reliably.
Is there anything fundamentally different between how pvfs2-ls
works and how the vfs readdir path works, or is pvfs2-ls somehow
getting luckier with the timing?
I think I may see the problem. There are some bits in the
dir.c:pvfs2_readdir kernel module code that check the directory
version (the mtime), and if it's different from the previous value, we
start over from position 0. But the filldir callback (what the vfs
uses to report each entry back to the user) has already been called
for the previous 32 entries, so you will see those entries twice. The
number of duplicates you got confirms this theory (5024, 5344, 5120,
and 6048 are all multiples of 32).
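A toy model of that restart behavior shows why the duplicate counts come out as multiples of the batch size. Here entries are reported in 32-entry batches, and after some number of batches the version check fails once and the position resets to 0 (the function and its parameters are illustrative, not the kernel code itself):

```c
enum { BATCH = 32 };  /* entries handed to filldir per readdir call */

/* Count entries reported more than once when the position resets to 0
 * after `restart_after_batches` batches (0 means no version change). */
int count_duplicates(int nentries, int restart_after_batches)
{
    int reported[4096] = { 0 };  /* times each entry was emitted */
    int pos = 0, batches = 0, dups = 0;

    while (pos < nentries) {
        for (int i = 0; i < BATCH && pos + i < nentries; i++)
            reported[pos + i]++;
        pos += BATCH;
        if (restart_after_batches > 0 &&
            ++batches == restart_after_batches) {
            pos = 0;                   /* version changed: start over */
            restart_after_batches = 0; /* restart only once */
        }
    }
    for (int i = 0; i < nentries; i++)
        if (reported[i] > 1)
            dups++;
    return dups;
}
```

Every restart re-reports exactly the batches filldir already saw, so the duplicate count is always a whole number of 32-entry batches, matching the observed 5024 (157 * 32), 5344 (167 * 32), 5120 (160 * 32), and 6048 (189 * 32).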
Rob mentioned that this has been around for a while. Maybe the
changes I made to the request scheduler cause it to occur more
frequently, but even then, we don't queue crdirents between readdir
calls, so it seems like it would still have been possible. On the
same machine (same kernel module), I would imagine the directory
version could get updated on the crdirent (file create), but then you
would miss the newly created files. In fact, PVFS_sys_create doesn't
return the parent's mtime, so the version can't get updated right now
anyway.
To be honest, this directory_version check seems a little
unnecessary. If I do an ls, I don't mind seeing only the results from
the start of the command, even if I miss entries added while the
command is running. Better to miss some entries than to report some
twice, but maybe I've got the semantics all wrong.
An interesting side effect of having the pcache (I know I just bagged
on it earlier today) is that it lets us keep track of readdirs that
are "in progress". For a crdirent, we might be able to tack the new
entry onto all the pcache entries for that handle, allowing it to be
returned in the next readdir. This would avoid the need for a client-
side version check entirely, and no entries would be "missed". It
wouldn't be perfect, of course: the pcache is an in-memory structure,
so server restarts and LRU expulsion would probably cause missed
entries again.
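That tack-on idea might look something like the sketch below: each in-progress readdir is a session keyed by directory handle, and a crdirent appends the new name to every open session on that handle so a later readdir can return it. Every name and structure here is hypothetical, not the actual pcache code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry created mid-listing, pending for a session */
struct pending_name {
    char name[64];
    struct pending_name *next;
};

/* Hypothetical in-progress readdir session tracked via the pcache */
struct readdir_session {
    uint64_t dir_handle;           /* directory being listed */
    struct pending_name *extras;   /* entries created mid-listing */
    struct readdir_session *next;
};

/* On crdirent: tack the new entry onto every open session for the
 * directory handle, so the next readdir on that session returns it. */
void crdirent_notify(struct readdir_session *sessions,
                     uint64_t handle, const char *name)
{
    for (struct readdir_session *s = sessions; s; s = s->next) {
        if (s->dir_handle != handle)
            continue;
        struct pending_name *p = malloc(sizeof(*p));
        strncpy(p->name, name, sizeof(p->name) - 1);
        p->name[sizeof(p->name) - 1] = '\0';
        p->next = s->extras;
        s->extras = p;
    }
}
```

The readdir path would then drain a session's extras list alongside the normal db iteration; as noted above, none of this survives a server restart or LRU expulsion of the session.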
This is probably obvious to everyone by now, but using the component
name instead of an index for the position doesn't fix this particular
bug/behavior.
-sam
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers