On Oct 9, 2006, at 3:23 PM, Phil Carns wrote:
Phil Carns wrote:
I started thinking about some more possible ideas, but I realized
after looking closer at the code that I don't actually see why
duplicates would occur in the first place with the algorithm that
is being used :) I apologize if this has been discussed a few
times already, but could we walk through it one more time?
I know that the request protocol uses an integer-based token to
keep track of position. However, the pcache converts this into a
particular key based on where the last iteration left off. This
key contains the handle as well as the alphanumeric name of the
entry.
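
Just to make that concrete, I'm picturing a key layout something
like this (hypothetical sketch; the actual definition in the dbpf
code may differ):

    /* Hypothetical key layout, just to make the discussion
     * concrete; the real struct in the dbpf code may differ. */
    #include <stdint.h>

    #define NAME_MAX_SKETCH 256        /* made-up bound */

    struct dirent_key
    {
        uint64_t handle;               /* directory's handle */
        char name[NAME_MAX_SKETCH];    /* component name */
    };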
Trove then does a c_get() on that key with the DB_SET flag, which
should put the cursor at the proper position. If the entry has
been deleted (which is not happening in my case; I am only
creating files), then it retries the c_get() with the
DB_SET_RANGE flag, which should set the cursor at the next
position. "Next" in this case is defined by the comparison
function, PINT_trove_dbpf_keyval_compare().
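
In Berkeley DB terms I'd expect the positioning logic to look
roughly like this (a sketch written from memory, not the actual
dbpf code; "dbc" is an open cursor, error handling omitted):

    #include <string.h>
    #include <db.h>

    /* Position the cursor at the key saved in the pcache, or at
     * the next greater key if that entry has been deleted. */
    static int position_cursor(DBC *dbc, void *saved_key,
                               u_int32_t saved_len)
    {
        DBT key, data;
        int ret;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = saved_key;    /* handle + component name */
        key.size = saved_len;

        /* exact match on the saved position */
        ret = dbc->c_get(dbc, &key, &data, DB_SET);
        if (ret == DB_NOTFOUND)
        {
            /* saved entry was deleted; resume at the next key
             * in comparison-function order */
            ret = dbc->c_get(dbc, &key, &data, DB_SET_RANGE);
        }
        return ret;  /* iteration then continues with DB_NEXT */
    }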
The keyval_compare() function sorts the keys by handle value,
then by key length, then by strncmp() of the key name.
This means that essentially we are indexing off of the name of
the entry rather than off of a position in the database.
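
In other words, something with the shape of this sketch
(paraphrasing from memory; the real code is
PINT_trove_dbpf_keyval_compare(), and I'm assuming the key layout
sketched above with the handle first):

    #include <string.h>
    #include <stdint.h>
    #include <db.h>

    /* Rough shape of the sort order as I read it. */
    static int keyval_compare_sketch(DB *db, const DBT *a,
                                     const DBT *b)
    {
        uint64_t ha, hb;

        /* 1) handle value */
        memcpy(&ha, a->data, sizeof(ha));
        memcpy(&hb, b->data, sizeof(hb));
        if (ha != hb)
            return (ha < hb) ? -1 : 1;

        /* 2) key length: shorter names sort first */
        if (a->size != b->size)
            return (a->size < b->size) ? -1 : 1;

        /* 3) strncmp of the name portion */
        return strncmp((char *)a->data + sizeof(ha),
                       (char *)b->data + sizeof(hb),
                       a->size - sizeof(ha));
    }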
So how could inserting a new entry between readdir requests cause
a duplicate? The old entry that is stored in the pcache should
still be valid. If the newly inserted entry comes after it
(according to the keyval_compare() sort order), then we should
see it as we continue iterating. If the new entry comes before
it, then it should not show up (we don't back up in the directory
listing). It doesn't seem like there should be any combination
that causes an entry to show up twice.
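
For example, suppose the directory (handle 100) holds "aa", "bb",
and "dd", and the pcache position is saved at "bb":

    (100, "aa")   already returned
    (100, "bb")   saved position
    (100, "dd")   next to return

    create "cc" -> sorts between "bb" and "dd" (same length,
                   strncmp), shows up as we keep iterating
    create "a"  -> sorts before "aa" (shorter name), never shows
                   up, but that shouldn't make "bb" or "dd"
                   repeat either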
Is c_get() not traversing the db in the order defined by the
keyval_compare() function?
The only other danger that I see is that if the pcache_lookup()
fails, the code falls back to stepping linearly through the db to
the token position, which I could imagine might have ordering
implications. However, I am only talking to the server from a
single client, so I don't see why it would ever miss the pcache
lookup.
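
By "stepping linearly" I mean something like this sketch (made-up
function name, error handling omitted); note that the position
reached depends entirely on the current sort order, so any insert
or delete between requests shifts which entry is at "token":

    #include <string.h>
    #include <stdint.h>
    #include <db.h>

    /* Fallback path sketch: ignore the saved key and walk
     * forward "token" entries from the start of the db. */
    static int seek_to_token(DBC *dbc, uint64_t token)
    {
        DBT key, data;
        uint64_t i;
        int ret;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));

        ret = dbc->c_get(dbc, &key, &data, DB_FIRST);
        for (i = 0; i < token && ret == 0; i++)
            ret = dbc->c_get(dbc, &key, &data, DB_NEXT);
        return ret;
    }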
I just want to confirm that there is actually an algorithm
problem here rather than just a bug in the code somewhere.
Oh, or is the problem in how the end of the directory is
detected? Does the client do something like issuing readdir
requests until it gets a response with zero entries? I haven't
looked at how this works yet, but I imagine that could throw a
wrench into things if the directory gains additional entries
between when the server first indicates that it has reached the
end and when the client gives up on asking for more.
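
I'm imagining a client loop along these lines (every name below
is hypothetical; I haven't checked the real client path):

    #include <stdint.h>

    struct dirent_sketch { char name[256]; };
    struct readdir_resp
    {
        uint64_t token;               /* continuation position */
        int nentries;                 /* entries in this batch */
        struct dirent_sketch entries[32];
    };

    int do_readdir(uint64_t dir, uint64_t token, int max,
                   struct readdir_resp *resp);  /* req/response */
    void emit_dirent(struct dirent_sketch *d);  /* hand to ls */

    void list_dir(uint64_t dir_handle)
    {
        struct readdir_resp resp;
        uint64_t token = 0;           /* start-of-directory */
        int i;

        do
        {
            if (do_readdir(dir_handle, token, 32, &resp) != 0)
                break;
            for (i = 0; i < resp.nentries; i++)
                emit_dirent(&resp.entries[i]);
            token = resp.token;       /* resume here next time */
        } while (resp.nentries > 0);  /* stop on an empty batch */
    }

An entry created after the server first reports an empty batch,
but before the client stops asking, is exactly the kind of thing
a loop like this could mishandle.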
I just tried repeating the test a few times, replacing the "ls" in
my test script with either "pvfs2-ls" or "pvfs2-ls -al". I cannot
trigger the problem when using pvfs2-ls.
If I switch back to "ls" or "/bin/ls" the problem shows up reliably.
Is there anything fundamentally different between how pvfs2-ls
works and how the vfs readdir path works, or is pvfs2-ls somehow
getting luckier with the timing?
The kernel module sets the position itself before calling readdir
each time. It also tries to update the position itself if it
can't handle all 32 entries in a readdir response (e.g. when it
runs out of memory). This is why we didn't initially think we
could use the component name as the position. Murali has looked
at making sure the position is getting set to the correct value,
but maybe there are still some bugs there. I'll have to look.
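
Roughly this kind of bookkeeping (a made-up user-space sketch,
not the actual kernel code), which only works because the token
is an integer you can do arithmetic on; a component-name position
wouldn't support this:

    #include <stdint.h>

    /* Sketch: if only a prefix of the response could be copied
     * to the user buffer, back the position up so the remaining
     * entries are re-requested on the next readdir call. */
    static uint64_t next_position(uint64_t resp_token,
                                  int entries_in_resp,
                                  int entries_delivered)
    {
        if (entries_delivered == entries_in_resp)
            return resp_token;        /* consumed the batch */
        return resp_token -
               (uint64_t)(entries_in_resp - entries_delivered);
    }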
-sam
-Phil