I have a little bit more information to add to the puzzle. I just ran
the newly patched server with strace -ff -o and then dug through the
output a little bit.
The attributes db is 472272896 bytes, according to ls -al.
The strace output has a lot of pread operations in it. Oddly enough,
they seem to walk backwards through this db. If I grep/cut the
offsets out of all of the pread operations, at some point they settle
into a pattern that looks like this:
472244224
472207360
472170496
472133632
472096768
472059904
472023040
471986176
471949312
...
442368
405504
368640
331776
294912
258048
221184
184320
147456
110592
73728
36864
Then, it seems to start over again, this time with a smaller
difference between the offsets (4096 bytes rather than 36864):
472268800
472264704
472260608
472256512
472252416
472248320
472240128
472236032
472231936
...
57344
53248
49152
45056
40960
32768
28672
24576
20480
16384
12288
Altogether there are over 100,000 preads. I assume, since the offsets
start at nearly the size of the attributes db, that this is the file
being read.
It seems odd that it would go through the entire file backwards twice.
I'm guessing that probably isn't very friendly to whatever
caching/prefetching/etc. is going on in the kernel and storage devices.
I don't know what would cause this, unless it is somehow related to the
access method and/or comparison functions being used. If so, maybe it
could be overcome with a secondary index that is somehow laid out more
favorably for cursors? I'm grasping at berkeley db voodoo at this point :)
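For reference, the Berkeley DB mechanism for that would be DB->associate(),
which keeps a secondary index in sync with the primary automatically.
Whether a different ordering would actually change the page layout the
cursor walks is an open question; the sketch below only shows the
mechanics, and the key-extractor callback is made up for illustration:

/* Rough sketch of the DB->associate() mechanics for a secondary index.
 * get_secondary_key() is made up; it only shows the callback shape. */
#include <db.h>
#include <string.h>

static int get_secondary_key(DB *secondary, const DBT *pkey,
                             const DBT *pdata, DBT *skey)
{
    /* Derive the secondary key from the primary record.  A real index
     * would pick something that sorts the way the cursor wants to walk;
     * this placeholder just reuses the primary key. */
    memset(skey, 0, sizeof(*skey));
    skey->data = pkey->data;
    skey->size = pkey->size;
    return 0;
}

static int attach_secondary(DB *primary, DB *secondary)
{
    /* Once associated, berkeley db keeps the secondary in sync with the
     * primary; DB_CREATE populates it from the existing records. */
    return primary->associate(primary, NULL, secondary,
                              get_secondary_key, DB_CREATE);
}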
I also have one more data point. I mentioned in an earlier email that
simply "warming up" the db files in the buffer cache before starting the
servers was enough to reduce the startup time to a few seconds. For
some reason, that is not necessarily true in the environment that I am
now testing on. Now I see this:
- stock server, cached db files: 1 minute, 46 seconds average
- patched server, cached db files: 7 seconds average
The speedup from pre-caching isn't nearly as dramatic here unless I use
Sam's new and improved code.
-Phil
Phil Carns wrote:
Ok, I have tried several iterations both with and without these patches.
The test system is again using a SAN, this time with a
dataspace_attributes.db file of about 451 MB on a particular server. I'm
not sure how many files are on the file system; I just cranked out files
on it until the db file looked big enough to get good measurements on
the startup time. I was able to turn on the "trove,server" logging mask
along with the "usec" timestamp to see the scan time on both versions
without any logging occurring during the actual scan itself.
For example:
[D 10:00:46.541646] dbpf collection 752900094 - Setting collection
handle ranges to 4-536870914,4294967292-4831838202
[D 10:04:19.414723] dbpf collection 752900094 - Setting HIGH_WATERMARK
to -1
If I unmount between each server start, the original version takes an
average of 3 minutes, 17 seconds to complete the scan.
The patched version takes an average of 2 minutes, 22 seconds to
complete the same scan.
This is definitely a big improvement: almost 30% in my test case.
-Phil
Phil Carns wrote:
Thanks Sam! We will give these patches a try and report back.
-Phil
Sam Lang wrote:
Hi Phil,
Attached mult.patch implements iterating over the dspace db using
DB_MULTIPLE_KEY. This may allow the db get call to do larger
reads from your SAN. I was seeing slightly better performance with
local disk after creating 20K files in a fresh storage space. Running
strace doesn't show fewer mmaps or larger reads, though, so I'm not
sure how berkeley db pulls in its pages. Anyway, if it helps
improve performance for you guys, I can clean it up a bit and commit
it. I don't think anything uses dspace_iterate_handles besides that
ledger handle management code.
You can fiddle with the MAX_NUM_VERIFY_HANDLE_COUNT value to set how many
handles to get at a time. Right now it's set to 4096. Keep in mind
that this requires a much larger buffer allocated in
dbpf_dspace_iterate_handles_op_svc, since we have to get keys and
values, so essentially we do a get with a buffer that's 4096*(sizeof
(handle) + sizeof(stored_attr)), which ends up being about 300K.
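For anyone following along, a bulk cursor read with DB_MULTIPLE_KEY looks
roughly like the sketch below. This is not the mult.patch code itself,
just a minimal illustration of the Berkeley DB bulk-get API it relies on;
the buffer size and the per-record processing step are placeholders:

/* Minimal sketch of a DB_MULTIPLE_KEY bulk cursor read (not mult.patch). */
#include <db.h>
#include <stdlib.h>
#include <string.h>

/* Must be a multiple of 1024; sized along the lines of the ~300K buffer
 * described above. */
#define BULK_BUF_SIZE (512 * 1024)

static int bulk_iterate(DB *dbp)
{
    DBC *dbc = NULL;
    DBT key, data;
    void *p, *retkey, *retdata;
    size_t retklen, retdlen;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    /* Bulk gets require a caller-supplied buffer in the data DBT. */
    data.data = malloc(BULK_BUF_SIZE);
    if (data.data == NULL)
        return -1;
    data.ulen = BULK_BUF_SIZE;
    data.flags = DB_DBT_USERMEM;

    ret = dbp->cursor(dbp, NULL, &dbc, 0);
    if (ret != 0)
        goto out;

    /* Each c_get fills the buffer with as many key/data pairs as fit,
     * so one call covers many handles instead of one. */
    while ((ret = dbc->c_get(dbc, &key, &data,
                             DB_MULTIPLE_KEY | DB_NEXT)) == 0)
    {
        DB_MULTIPLE_INIT(p, &data);
        for (;;)
        {
            DB_MULTIPLE_KEY_NEXT(p, &data, retkey, retklen,
                                 retdata, retdlen);
            if (p == NULL)
                break;
            /* process one (handle, stored attributes) pair here */
        }
    }
    if (ret == DB_NOTFOUND)
        ret = 0;

    dbc->c_close(dbc);
out:
    free(data.data);
    return ret;
}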
I also attached a patch (server-start.patch) that prints out the
start message as well as a ready message after server initialization
has completed. If you set the Logstamp to usec, you'll be able to
see the time it takes to initialize the server. Also, this might
help in knowing when you can mount the clients, although, hopefully
at some point we'll be able to add the zero-conf stuff and then we
can return EAGAIN or something.
I'm not sure it's time to replace the ledger code. It seems to work
ok, and fixing the slowness you're seeing would mean switching to
some kind of range tree that could be serialized to disk so that we
wouldn't have to iterate through the entire dspace db on startup.
That opens up the possibility of the dspace db and the
ledger-on-disk getting out of sync, which I'd rather avoid.
We could hand out new handles by choosing one randomly, and then
checking if it's in the DB, getting rid of the need for a ledger
entirely, but I assume this idea was already scratched to avoid the
potential costs at creation time, especially as the filesystem grows.
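Just to make that alternative concrete, the random-probe idea would look
roughly like the hypothetical sketch below (none of these names come from
the PVFS2 tree, and the retry loop is exactly where the creation-time cost
mentioned above shows up):

/* Hypothetical sketch of ledger-free handle allocation by random probing. */
#include <db.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int alloc_handle_by_probe(DB *dbp, uint64_t lo, uint64_t hi,
                                 uint64_t *handle_out)
{
    DBT key, data;
    uint64_t candidate;
    int ret, tries;

    for (tries = 0; tries < 64; tries++)
    {
        /* rand() is only shown for brevity; a real version would need a
         * proper 64-bit random value spread over the handle range. */
        candidate = lo + (uint64_t)rand() % (hi - lo + 1);

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = &candidate;
        key.size = sizeof(candidate);
        data.flags = DB_DBT_MALLOC;

        ret = dbp->get(dbp, NULL, &key, &data, 0);
        if (ret == DB_NOTFOUND)
        {
            *handle_out = candidate;   /* not in the dspace db: it's free */
            return 0;
        }
        if (ret != 0)
            return ret;                /* real db error */
        free(data.data);               /* collision: handle already in use */
    }
    return -1;  /* too many collisions; cost grows with the filesystem */
}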
-sam
On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
Robert Latham wrote:
On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
Oh, and one other detail; the memory usage of the servers looks
fine during startup, so this doesn't appear to be a memory leak.
There is quite a bit of CPU work, but I am guessing that is just
berkeley db keeping busy in the iteration function.
How long does it take to scan 1.4 million files on startup?
==rob
That's an interesting issue :)
A few observations:
- we were looking at this on SAN; the results may be different on
local disks
- the db files are on the order of 500 MB for this particular setup
- the time to scan varies depending on whether the db files are hot in
the Linux buffer cache
If we start the daemon right after killing another one that just
did the same scan, then the process is CPU intensive, but fast
(about 5 seconds). If we unmount/mount the SAN between the two
runs so that the buffer cache is cleared, then it is very slow
(about 5 minutes).
An interesting trick is to use dd with a healthy buffer size to
read the .db files and throw the output into /dev/null before
starting the servers. This only takes a few seconds, and makes it
so that the scan consistently finishes in just a few seconds as
well. I think the reason is just that it forces the db data into
the Linux buffer cache using an efficient access pattern so that
berkeley db doesn't have to wait on disk latency for whatever small
accesses it is performing.
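For reference, the same warming effect can be had without dd; the sketch
below is a minimal C equivalent, assuming Linux and nothing more than a big
sequential read plus a posix_fadvise hint (not something PVFS2 currently
does):

/* Minimal sketch of warming a db file in the Linux buffer cache, roughly
 * what the dd-to-/dev/null trick does.  Not part of PVFS2. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define WARM_BUF_SIZE (1 << 20)   /* 1 MB reads, i.e. a "healthy buffer size" */

static int prewarm_file(const char *path)
{
    char *buf = malloc(WARM_BUF_SIZE);
    ssize_t n = 0;
    int fd;

    if (buf == NULL)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    /* Ask the kernel to start pulling the whole file in. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    /* One big sequential pass; the data is discarded, but the pages stay
     * in the buffer cache for berkeley db to find later. */
    while ((n = read(fd, buf, WARM_BUF_SIZE)) > 0)
        ;

    close(fd);
    free(buf);
    return (n < 0) ? -1 : 0;
}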
This seems to indicate that the berkeley db access pattern generated
by PVFS2 in this case isn't very friendly, at least to SANs that
aren't specifically tuned for it.
The 5 minute scan time is a problem, because it makes it hard to
tell when you will actually be able to mount the file system after
the daemons appear to have started. We would be happy to try out
any optimizations here :)
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers