I have a little bit more information to add to the puzzle. I just ran
the newly patched server with strace -ff -o and then dug through the
output a little bit.
The attributes db is 472272896 bytes, according to ls -al.
The strace output has a lot of pread operations in it. Oddly enough,
they seem to walk backwards through this db. If I grep/cut the
offsets out of all of the pread operations, at some point they settle
into a pattern that looks like this:
472244224
472207360
472170496
472133632
472096768
472059904
472023040
471986176
471949312
...
442368
405504
368640
331776
294912
258048
221184
184320
147456
110592
73728
36864
Then, it seems to start over again, this time with a smaller
difference between the offsets (4096 bytes rather than 36864):
472268800
472264704
472260608
472256512
472252416
472248320
472240128
472236032
472231936
...
57344
53248
49152
45056
40960
32768
28672
24576
20480
16384
12288
Altogether there are over 100,000 preads. I assume, since the offsets
start at nearly the size of the attributes db, that this is the file
being read.
It seems odd that it would go through the entire file backwards twice.
I'm guessing that probably isn't very friendly to whatever
caching/prefetching/etc. is going on in the kernel and storage devices.
I don't know what would cause this, unless it is somehow related to the
access method and/or comparison functions being used. If so, maybe it
could be overcome with a secondary index that is somehow laid out more
favorably for cursors? I'm grasping at berkeley db voodoo at this point :)
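For reference, the Berkeley DB mechanism for that would be DB->associate(),
which keeps a secondary index in sync with the primary automatically.
Whether a different ordering would actually change the page layout the
cursor walks is an open question; the sketch below only shows the
mechanics, and the key-extractor callback is made up for illustration:

/* Rough sketch of the DB->associate() mechanics for a secondary index.
 * get_secondary_key() is made up; it only shows the callback shape. */
#include <db.h>
#include <string.h>

static int get_secondary_key(DB *secondary, const DBT *pkey,
                             const DBT *pdata, DBT *skey)
{
    /* Derive the secondary key from the primary record.  A real index
     * would pick something that sorts the way the cursor wants to walk;
     * this placeholder just reuses the primary key. */
    memset(skey, 0, sizeof(*skey));
    skey->data = pkey->data;
    skey->size = pkey->size;
    return 0;
}

static int attach_secondary(DB *primary, DB *secondary)
{
    /* Once associated, berkeley db keeps the secondary in sync with the
     * primary; DB_CREATE populates it from the existing records. */
    return primary->associate(primary, NULL, secondary,
                              get_secondary_key, DB_CREATE);
}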
I also have one more data point. I mentioned in an earlier email that
simply "warming up" the db files in the buffer cache before starting the
servers was enough to reduce the startup time to a few seconds. For
some reason, that is not necessarily true in the environment that I am
now testing on. Now I see this:
- stock server, cached db files: 1 minute, 46 seconds average
- patched server, cached db files: 7 seconds average
The speedup from pre-caching isn't nearly as dramatic here unless I use
Sam's new and improved code.
-Phil
Phil Carns wrote:
Ok, I have tried several iterations both with and without these patches.
The test system is again using a SAN, this time with a
dataspace_attributes.db file of about 451 MB on a particular server. I'm
not sure how many files are on the file system; I just cranked out files
on it until the db file looked big enough to get good measurements on
the startup time. I was able to turn on the "trove,server" logging mask
along with the "usec" timestamp to see the scan time on both versions
without any logging occurring during the actual scan itself.
For example:
[D 10:00:46.541646] dbpf collection 752900094 - Setting collection
handle ranges to 4-536870914,4294967292-4831838202
[D 10:04:19.414723] dbpf collection 752900094 - Setting HIGH_WATERMARK
to -1
If I unmount between each server start, the original version takes an
average of 3 minutes, 17 seconds to complete the scan.
The patched version takes an average of 2 minutes, 22 seconds to
complete the same scan.
This is definitely a big improvement: almost 30% in my test case.
-Phil
Phil Carns wrote:
Thanks Sam! We will give these patches a try and report back.
-Phil
Sam Lang wrote:
Hi Phil,
Attached mult.patch implements iterating over the dspace db using
DB_MULTIPLE_KEY. This may allow the db get call to do larger
reads from your SAN. I was seeing slightly better performance with
local disk after creating 20K files in a fresh storage space. Running
strace doesn't show fewer mmaps or larger reads, though, so I'm not
sure how berkeley db pulls in its pages. Anyway, if it helps
improve performance for you guys, I can clean it up a bit and commit
it. I don't think anything uses dspace_iterate_handles besides that
ledger handle management code.
You can fiddle with the MAX_NUM_VERIFY_HANDLE_COUNT value to set how many
handles to get at a time. Right now it's set to 4096. Keep in mind
that this requires a much larger buffer allocated in
dbpf_dspace_iterate_handles_op_svc, since we have to get keys and
values, so essentially we do a get with a buffer that's 4096*(sizeof
(handle) + sizeof(stored_attr)), which ends up being about 300K.
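For anyone following along, a bulk cursor read with DB_MULTIPLE_KEY looks
roughly like the sketch below. This is not the mult.patch code itself,
just a minimal illustration of the Berkeley DB bulk-get API it relies on;
the buffer size and the per-record processing step are placeholders:

/* Minimal sketch of a DB_MULTIPLE_KEY bulk cursor read (not mult.patch). */
#include <db.h>
#include <stdlib.h>
#include <string.h>

/* Must be a multiple of 1024; sized along the lines of the ~300K buffer
 * described above. */
#define BULK_BUF_SIZE (512 * 1024)

static int bulk_iterate(DB *dbp)
{
    DBC *dbc = NULL;
    DBT key, data;
    void *p, *retkey, *retdata;
    size_t retklen, retdlen;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    /* Bulk gets require a caller-supplied buffer in the data DBT. */
    data.data = malloc(BULK_BUF_SIZE);
    if (data.data == NULL)
        return -1;
    data.ulen = BULK_BUF_SIZE;
    data.flags = DB_DBT_USERMEM;

    ret = dbp->cursor(dbp, NULL, &dbc, 0);
    if (ret != 0)
        goto out;

    /* Each c_get fills the buffer with as many key/data pairs as fit,
     * so one call covers many handles instead of one. */
    while ((ret = dbc->c_get(dbc, &key, &data,
                             DB_MULTIPLE_KEY | DB_NEXT)) == 0)
    {
        DB_MULTIPLE_INIT(p, &data);
        for (;;)
        {
            DB_MULTIPLE_KEY_NEXT(p, &data, retkey, retklen,
                                 retdata, retdlen);
            if (p == NULL)
                break;
            /* process one (handle, stored attributes) pair here */
        }
    }
    if (ret == DB_NOTFOUND)
        ret = 0;

    dbc->c_close(dbc);
out:
    free(data.data);
    return ret;
}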
I also attached a patch (server-start.patch) that prints out the
start message as well as a ready message after server initialization
has completed. If you set the Logstamp to usec, you'll be able to
see the time it takes to initialize the server. Also, this might
help in knowing when you can mount the clients, although, hopefully
at some point we'll be able to add the zero-conf stuff and then we
can return EAGAIN or something.
I'm not sure it's time to replace the ledger code. It seems to work
ok, and fixing the slowness you're seeing would mean switching to
some kind of range tree that could be serialized to disk so that we
wouldn't have to iterate through the entire dspace db on startup.
That opens up the possibility of the dspace db and the
ledger-on-disk getting out of sync, which I'd rather avoid.
We could hand out new handles by choosing one randomly, and then
checking if it's in the DB, getting rid of the need for a ledger
entirely, but I assume this idea was already scratched to avoid the
potential costs at creation time, especially as the filesystem grows.
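Just to make that alternative concrete, the random-probe idea would look
roughly like the hypothetical sketch below (none of these names come from
the PVFS2 tree, and the retry loop is exactly where the creation-time cost
mentioned above shows up):

/* Hypothetical sketch of ledger-free handle allocation by random probing. */
#include <db.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int alloc_handle_by_probe(DB *dbp, uint64_t lo, uint64_t hi,
                                 uint64_t *handle_out)
{
    DBT key, data;
    uint64_t candidate;
    int ret, tries;

    for (tries = 0; tries < 64; tries++)
    {
        /* rand() is only shown for brevity; a real version would need a
         * proper 64-bit random value spread over the handle range. */
        candidate = lo + (uint64_t)rand() % (hi - lo + 1);

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = &candidate;
        key.size = sizeof(candidate);
        data.flags = DB_DBT_MALLOC;

        ret = dbp->get(dbp, NULL, &key, &data, 0);
        if (ret == DB_NOTFOUND)
        {
            *handle_out = candidate;   /* not in the dspace db: it's free */
            return 0;
        }
        if (ret != 0)
            return ret;                /* real db error */
        free(data.data);               /* collision: handle already in use */
    }
    return -1;  /* too many collisions; cost grows with the filesystem */
}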
-sam
On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
Robert Latham wrote:
On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
Oh, and one other detail; the memory usage of the servers looks
fine during startup, so this doesn't appear to be a memory leak.
There is quite a bit of CPU work, but I am guessing that is just
berkeley db keeping busy in the iteration function.
How long does it take to scan 1.4 million files on startup?
==rob
That's an interesting issue :)
A few observations:
- we were looking at this on SAN; the results may be different on
local disks
- the db files are on the order of 500 MB for this particular setup
- the time to scan varies depending on whether the db files are hot in
the Linux buffer cache
If we start the daemon right after killing another one that just
did the same scan, then the process is CPU intensive, but fast
(about 5 seconds). If we unmount/mount the SAN between the two
runs so that the buffer cache is cleared, then it is very slow
(about 5 minutes).
An interesting trick is to use dd with a healthy buffer size to
read the .db files and throw the output into /dev/null before
starting the servers. This only takes a few seconds, and makes it
so that the scan consistently finishes in just a few seconds as
well. I think the reason is just that it forces the db data into
the Linux buffer cache using an efficient access pattern so that
berkeley db doesn't have to wait on disk latency for whatever small
accesses it is performing.
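For reference, the same warming effect can be had without dd; the sketch
below is a minimal C equivalent, assuming Linux and nothing more than a big
sequential read plus a posix_fadvise hint (not something PVFS2 currently
does):

/* Minimal sketch of warming a db file in the Linux buffer cache, roughly
 * what the dd-to-/dev/null trick does.  Not part of PVFS2. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define WARM_BUF_SIZE (1 << 20)   /* 1 MB reads, i.e. a "healthy buffer size" */

static int prewarm_file(const char *path)
{
    char *buf = malloc(WARM_BUF_SIZE);
    ssize_t n = 0;
    int fd;

    if (buf == NULL)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    /* Ask the kernel to start pulling the whole file in. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    /* One big sequential pass; the data is discarded, but the pages stay
     * in the buffer cache for berkeley db to find later. */
    while ((n = read(fd, buf, WARM_BUF_SIZE)) > 0)
        ;

    close(fd);
    free(buf);
    return (n < 0) ? -1 : 0;
}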
This seems to indicate that the berkeley db access pattern generated
by PVFS2 in this case isn't very friendly, at least to SANs that
aren't specifically tuned for it.
The 5 minute scan time is a problem, because it makes it hard to
tell when you will actually be able to mount the file system after
the daemons appear to have started. We would be happy to try out
any optimizations here :)
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers