Yeah that is odd. Setting the cursor for each call to iterate_handles may be the reason for it starting over. Do you know how many times it starts over? The number of times iterate_handles is called will be (# of files / 4096).
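As a rough sanity check on that count, here is a minimal sketch of the arithmetic. It assumes each call returns a full batch of 4096 handles except possibly the last; the function name is made up for illustration, and a final empty call that signals the end of iteration would add one more.

```python
BATCH = 4096  # handles returned per iterate_handles call, per the estimate above


def expected_calls(n_files, batch=BATCH):
    """Estimate the number of iterate_handles calls needed to cover n_files.

    Rounds up, since a final partial batch still costs one call.
    """
    if n_files < 0:
        raise ValueError("n_files must be non-negative")
    return (n_files + batch - 1) // batch
```

If the scan really restarts on every call, the number of passes over the file seen in strace should be close to this value rather than two.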

It only goes through the file twice if I am looking at the log correctly. Also, I just realized that on both passes (the one jumping backwards 40KB at a time and the one jumping backwards 4KB at a time) it is only reading 4KB per pread. I don't know what it is doing from a db point of view, but from an access point of view it looks like it goes backwards with a strided pattern and then goes backwards reading the entire thing. There are some other reads scattered here and there, but those two cycles represent the overwhelming majority of the total preads in the strace file. By spot checking I don't really see any significant divergence from the patterns.
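To go beyond spot checking, something like the following could tally the offset deltas between consecutive preads on each file descriptor. This is only a sketch: it assumes the usual strace line shape, pread64(fd, buf, count, offset) = ret, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the trailing "count, offset) = ret" of a pread/pread64 line.
# The greedy ".*" skips over the dumped buffer, which may itself contain commas.
PREAD_RE = re.compile(
    r'pread(?:64)?\((\d+),.*,\s*(\d+),\s*(\d+)\)\s*=\s*(-?\d+)'
)


def parse_preads(lines):
    """Yield (fd, count, offset) for each successful pread in strace output."""
    for line in lines:
        m = PREAD_RE.search(line)
        if m and int(m.group(4)) >= 0:
            yield int(m.group(1)), int(m.group(2)), int(m.group(3))


def stride_histogram(lines):
    """Count offset deltas between consecutive preads on the same fd."""
    last_offset = {}
    deltas = Counter()
    for fd, _count, offset in parse_preads(lines):
        if fd in last_offset:
            deltas[offset - last_offset[fd]] += 1
        last_offset[fd] = offset
    return deltas


def report(path, top=10):
    """Print the most common deltas found in the strace log at path."""
    with open(path) as f:
        for delta, n in stride_histogram(f).most_common(top):
            print(f"{delta:+d} bytes: {n} times")
```

If the two cycles are as regular as they look, the histogram should be dominated by two negative deltas, one around 40KB and one around 4KB.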

It also just occurred to me that maybe I should repeat the strace and try to capture it with timestamps; I'm not really sure if both of these pread cycles are actually during the scan or not.

-Phil

Maybe it has to do with setting the iterator with the RECNUM flag, which we set so that we can keep track of the position across iterate_handles calls. Since we already use the handles to sort the entries, maybe the two are conflicting with each other. The Berkeley DB documentation does mention that RECNUM will hinder performance, but only on writes:

--
Configuring a Btree for record numbers should not be done lightly. While often useful, it may significantly slow down the speed at which items can be stored into the database, and can severely impact application throughput. Generally it should be avoided in trees with a need for high write concurrency.
--

If we could return the handle as the position, we could get rid of the RECNUM flag and set the cursor with the last handle, but the position field is only uint32_t. It's really annoying that we only use the first 32 bits of the PVFS_handle right now too. Can we change that PVFS_ds_position type to be 64-bit?
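To illustrate why a 32-bit position can't carry a full handle, here is a small sketch. It assumes the truncation keeps the low 32 bits (the high half would fail the same way), and the helper name and handle values are invented for the example.

```python
MASK32 = 0xFFFFFFFF


def to_position32(handle):
    """Squeeze a 64-bit handle into a 32-bit position by keeping the low 32 bits.

    Assumption for illustration only; this is not taken from the PVFS source.
    """
    return handle & MASK32


# Two distinct (hypothetical) 64-bit handles that differ only in the upper half.
handle_a = 0x0000000100000010
handle_b = 0x0000000200000010

# Both collapse to the same 32-bit position, so a cursor resumed from that
# position could not tell which handle the previous call actually stopped at.
collides = to_position32(handle_a) == to_position32(handle_b)
```

Handles that already fit in 32 bits survive the round trip unchanged, which would be why this only starts to matter once the full 64-bit handle range is in use.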

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers