Re: [Pvfs2-developers] server crash on startup with millions of files

Sam Lang Fri, 23 Feb 2007 10:11:33 -0800


On Feb 23, 2007, at 12:16 PM, Phil Carns wrote:

I have a little bit more information to add to the puzzle. I just ran the newly patched server with strace -ff -o and then dug through the output a little bit.
The attributes db is 472272896 bytes, according to ls -al.
The strace output has a lot of pread operations in it. Oddly enough, they seem to go backwards through this db. If I grep/cut out the offsets from all of the pread operations, at some point they go through a pattern that looks like this:
472244224
472207360
472170496
472133632
472096768
472059904
472023040
471986176
471949312
...
442368
405504
368640
331776
294912
258048
221184
184320
147456
110592
73728
36864
Then, it seems to start over again, this time with a smaller difference between the offsets:
472268800
472264704
472260608
472256512
472252416
472248320
472240128
472236032
472231936
...
57344
53248
49152
45056
40960
32768
28672
24576
20480
16384
12288
Altogether there are over 100,000 preads. I assume since they start at nearly the size of the attributes db that this is what they are reading.
It seems odd that it would go through the entire file backwards twice. I'm guessing that probably isn't very friendly to whatever caching/prefetching/etc. is going on in the kernel and storage devices.
I don't know what would cause this, unless it is somehow related to the access method and/or comparison functions being used. If so, maybe it could be overcome with a secondary index that is somehow laid out more favorably for cursors? I'm grasping at berkeley db voodoo at this point :)
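
For anyone who wants to repeat this analysis, here is a minimal sketch of pulling the pread offsets out of strace output and tallying the differences between consecutive offsets. The regular expression assumes the usual strace layout in which the offset is the last argument of pread/pread64; the sample lines are made up for illustration and are not from the actual trace. In the listings above, the first pass steps by 36864 bytes (9 x 4096) and the second by 4096, and the second listing skips 472244224 and 36864, both of which appear in the first, so the two passes may be touching different pages rather than rereading the same ones.

```python
import re
from collections import Counter

# Matches the final (offset) argument of a pread/pread64 call in strace output.
# Assumption: lines look like  pread(5, "..."..., 4096, 472244224) = 4096
PREAD_RE = re.compile(r"\bpread(?:64)?\(.*,\s*(\d+)\)\s*=")


def pread_offsets(lines):
    """Return the offsets of all pread calls, in the order they were issued."""
    offsets = []
    for line in lines:
        match = PREAD_RE.search(line)
        if match:
            offsets.append(int(match.group(1)))
    return offsets


def stride_histogram(offsets):
    """Count differences between consecutive offsets (negative = backwards)."""
    return Counter(b - a for a, b in zip(offsets, offsets[1:]))


# Illustrative lines only; a real run would iterate over the strace output files.
sample = [
    'pread(5, "\\0\\0"..., 4096, 472244224) = 4096',
    'pread(5, "\\0\\0"..., 4096, 472207360) = 4096',
    'pread(5, "\\0\\0"..., 4096, 472170496) = 4096',
    'read(3, "x", 1) = 1',
]
offsets = pread_offsets(sample)
strides = stride_histogram(offsets)
print(strides.most_common(3))
```

A histogram dominated by one negative stride would confirm the backwards walk; a second dominant stride would show up as a separate entry.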

Yeah that is odd. Setting the cursor for each call to iterate_handles may be the reason for it starting over. Do you know how many times it starts over? The number of times iterate_handles is called will be (# of files / 4096).
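
One way to answer the "how many times does it start over" question is to count the runs of decreasing offsets and compare that against the expected number of iterate_handles calls. The sketch below is only an illustration: both helper names are hypothetical, and the 1.4 million figure is the file count mentioned earlier in the thread, which may not match the number of entries on any single server.

```python
import math


def count_descending_passes(offsets):
    """Count maximal runs of decreasing offsets (each run is one backwards pass)."""
    if not offsets:
        return 0
    passes = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur > prev:  # offset jumped back up: a new backwards pass begins
            passes += 1
    return passes


def expected_iterate_calls(num_files, batch=4096):
    """Number of iterate_handles calls if each call returns `batch` handles."""
    return math.ceil(num_files / batch)


# Two short backwards passes, using offsets taken from the listings above.
demo = [472244224, 472207360, 36864, 472268800, 472264704, 12288]
passes = count_descending_passes(demo)
calls = expected_iterate_calls(1_400_000)
print(passes, calls)
```

If the cursor were being reset on every call, the pass count would be expected to be close to the call count rather than two.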

Maybe it has to do with setting the iterator with the RECNUM flag, which we set so that we can keep track of positions over the iterate_handles call. Since we already use the handles to sort the entries, maybe the two are conflicting with each other. The berkeley db doc does mention that RECNUM will hinder performance, but only on writes:

--

Configuring a Btree for record numbers should not be done lightly. While often useful, it may significantly slow down the speed at which items can be stored into the database, and can severely impact application throughput. Generally it should be avoided in trees with a need for high write concurrency.

--

If we could return the handle as the position, we could get rid of the RECNUM flag and set the cursor with the last handle, but the position field is only uint32_t. It's really annoying that we only use the first 32 bits of the PVFS_handle right now too. Can we change that PVFS_ds_position type to be 64 bit?
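
To make the uint32_t limitation concrete, here is a small sketch of what happens when a 64-bit handle is squeezed into a 32-bit position. It assumes the usual C narrowing behaviour (the low 32 bits are kept); the handle values come from the collection handle ranges in Phil's log output quoted later in this message (4-536870914 and 4294967292-4831838202), and the helper name is hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF


def as_uint32_position(handle):
    """Keep only the low 32 bits, as a C conversion to uint32_t would."""
    return handle & UINT32_MASK


low_range = range(4, 536870914 + 1)   # first handle range from the log
high_start = 4294967292               # start of the second range, fits in 32 bits
high_end = 4831838202                 # end of the second range, does not fit

truncated = as_uint32_position(high_end)
print(as_uint32_position(high_start))  # 4294967292
print(truncated)                       # 536870906
print(truncated in low_range)          # True
```

Under that assumption, a handle near the top of the second range maps to a position that is also a valid handle in the first range, so the handle could not serve as an unambiguous cursor position without widening the type.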


-sam

I also have one more data point. I mentioned in an earlier email that simply "warming up" the db files in the buffer cache before starting the servers was enough to reduce the startup time to a few seconds. For some reason, that is not necessarily true on the environment that I am now testing on. Now I see this:
- stock server, cached db files: 1 minute, 46 seconds average

- patched server, cached db files: 7 seconds average
The speedup from pre-caching isn't nearly as dramatic here unless I use Sam's new and improved code.
-Phil


Phil Carns wrote:
Ok, I have tried several iterations both with and without these patches. The test system is again using a SAN, this time with a dataspace_attributes.db file of about 451 MB on a particular server. I'm not sure how many files are on the file system; I just cranked out files on it until the db file looked big enough to get good measurements on the startup time. I was able to turn on the "trove,server" logging mask along with the "usec" timestamp to see the scan time on both versions without any logging occurring during the actual scan itself.
for example:
[D 10:00:46.541646] dbpf collection 752900094 - Setting collection handle ranges to 4-536870914,4294967292-4831838202
[D 10:04:19.414723] dbpf collection 752900094 - Setting HIGH_WATERMARK to -1
If I unmount between each server start, the original version takes an average of 3 minutes, 17 seconds to complete the scan.
The patched version takes an average of 2 minutes, 22 seconds to complete the same scan.
This is definitely a big improvement: almost 30% in my test case.
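
As a quick check on these numbers, the sketch below computes the scan time from the two "usec" log stamps in the example and the relative improvement from the two averages. It assumes both stamps fall on the same day; the function names are made up for illustration.

```python
from datetime import datetime


def elapsed_seconds(start_stamp, end_stamp):
    """Seconds between two HH:MM:SS.usec log stamps (same day assumed)."""
    fmt = "%H:%M:%S.%f"
    start = datetime.strptime(start_stamp, fmt)
    end = datetime.strptime(end_stamp, fmt)
    return (end - start).total_seconds()


def improvement(before_s, after_s):
    """Fraction of the original time that was saved."""
    return (before_s - after_s) / before_s


# The single example run from the log excerpt.
scan = elapsed_seconds("10:00:46.541646", "10:04:19.414723")

# Averages quoted in the message: 3 min 17 s versus 2 min 22 s.
saved = improvement(3 * 60 + 17, 2 * 60 + 22)
print(f"{scan:.6f} s, {saved:.1%} faster")
```

The example run comes out at a little over three and a half minutes, and the averages give a reduction of roughly 28%, consistent with "almost 30%".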
-Phil
Phil Carns wrote:
Thanks Sam!  We will give these patches a try and report back.

-Phil

Sam Lang wrote:
Hi Phil,
Attached mult.patch implements iterating over the dspace db using DB_MULTIPLE_KEY. This may allow for the db get call to do larger reads from your SAN. I was seeing slightly better performance with local disk after creating 20K files in a fresh storage space. Doing strace doesn't show fewer mmaps or larger reads though, so I'm not sure how berkeley db pulls in its pages. Anyway, if it helps improve performance for you guys, I can clean it up a bit and commit it. I don't think anything uses dspace_iterate_handles besides that ledger handle management code.
You can fiddle the MAX_NUM_VERIFY_HANDLE_COUNT value to set how many handles to get at a time. Right now it's set to 4096. Keep in mind that this requires a much larger buffer allocated in dbpf_dspace_iterate_handles_op_svc, since we have to get keys and values, so essentially we do a get with a buffer that's 4096*(sizeof(handle) + sizeof(stored_attr)), which ends up being about 300K.
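
The buffer arithmetic can be sketched as follows. The 8-byte key matches a 64-bit handle; the 64-byte value size is purely an assumed figure chosen to land near the "about 300K" quoted above, not the real sizeof(stored_attr), and the function name is hypothetical.

```python
def multi_get_buffer_bytes(count, key_size, value_size):
    """Payload bytes needed to hold `count` key/value pairs in one bulk get."""
    return count * (key_size + value_size)


HANDLE_COUNT = 4096       # current MAX_NUM_VERIFY_HANDLE_COUNT per the message
HANDLE_SIZE = 8           # a 64-bit handle
ASSUMED_ATTR_SIZE = 64    # assumption for illustration only

buf = multi_get_buffer_bytes(HANDLE_COUNT, HANDLE_SIZE, ASSUMED_ATTR_SIZE)
print(buf, buf / 1024)
```

Doubling the handle count doubles the buffer, so raising MAX_NUM_VERIFY_HANDLE_COUNT trades memory for fewer get calls.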
I also attached a patch (server-start.patch) that prints out the start message as well as ready message after server initialization has completed. If you set the Logstamp to usec, you'll be able to see the time it takes to initialize the server. Also, this might help in knowing when you can mount the clients, although, hopefully at some point we'll be able to add the zero-conf stuff and then we can return EAGAIN or something.
I'm not sure it's time to replace the ledger code. It seems to work ok, and to fix the slowness you're seeing would mean switching to some kind of range tree that could be serialized to disk so that we wouldn't have to iterate through the entire dspace db on startup. That opens up the possibility of the dspace db and the ledger-on-disk getting out of sync, which I'd rather avoid.
We could hand out new handles by choosing one randomly, and then checking if it's in the DB, getting rid of the need for a ledger entirely, but I assume this idea was already scratched to avoid the potential costs at creation time, especially as the filesystem grows.
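
A rough sketch of the random-handle idea, with an in-memory set standing in for the dspace db lookup. Everything here (the function name, the set, the retry limit) is hypothetical and only meant to show where the creation-time cost comes from: each probe is a db lookup, and the expected number of probes is about 1 / (fraction of the range still free).

```python
import random


def allocate_random_handle(in_use, lo, hi, rng=random, max_tries=100000):
    """Pick random handles in [lo, hi] until one is not already in use.

    Returns (handle, number_of_probes). `in_use` stands in for a db lookup.
    """
    for tries in range(1, max_tries + 1):
        candidate = rng.randint(lo, hi)
        if candidate not in in_use:
            in_use.add(candidate)
            return candidate, tries
    raise RuntimeError("no free handle found within max_tries")


# A range of 100 handles with exactly one free slot (99% full).
used = set(range(4, 104)) - {50}
handle, probes = allocate_random_handle(used, 4, 103, rng=random.Random(0))
print(handle, probes)
```

With the range 99% full, around 100 probes are needed on average, which illustrates the growth in creation cost as the file system fills up.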
-sam



On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
Robert Latham wrote:
On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
Oh, and one other detail; the memory usage of the servers looks fine during startup, so this doesn't appear to be a memory leak. There is quite a bit of CPU work, but I am guessing that is just berkeley db keeping busy in the iteration function.
How long does it take to scan 1.4 million files on startup?
==rob
That's an interesting issue :)

A few observations:
- we were looking at this on SAN; the results may be different on local disks
- the db files are on the order of 500 MB for this particular setup
- the time to scan varies depending on if the db files are hot in the Linux buffer cache
If we start the daemon right after killing another one that just did the same scan, then the process is CPU intensive, but fast (about 5 seconds). If we unmount/mount the SAN between the two runs so that the buffer cache is cleared, then it is very slow (about 5 minutes).
An interesting trick is to use dd with a healthy buffer size to read the .db files and throw the output into /dev/null before starting the servers. This only takes a few seconds, and makes it so that the scan consistently finishes in just a few seconds as well. I think the reason is just that it forces the db data into the Linux buffer cache using an efficient access pattern so that berkeley db doesn't have to wait on disk latency for whatever small accesses it is performing.
This seems to indicate that berkeley db's access pattern generated by PVFS2 for this case isn't very friendly, at least to SANs that aren't specifically tuned for it.
The 5 minute scan time is a problem, because it makes it hard to tell when you will actually be able to mount the file system after the daemons appear to have started. We would be happy to try out any optimizations here :)
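
The dd warm-up described above amounts to one sequential read of each .db file with a large block size, discarding the data. Below is a minimal sketch of the same idea; the 4 MiB block size and the function name are assumptions, and the demo reads a small temporary file rather than a real db file.

```python
import os
import tempfile


def warm_file(path, block_size=4 * 1024 * 1024):
    """Read a file front to back in large blocks and discard the data.

    The side effect of interest is that the kernel pulls the file into its
    buffer cache with a sequential access pattern.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Demo on a throwaway file; a real run would pass the path of each .db file.
size_written = 3 * 1024 * 1024 + 123
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * size_written)
    demo_path = tmp.name
bytes_read = warm_file(demo_path, block_size=1024 * 1024)
os.remove(demo_path)
print(bytes_read)
```

Whether this helps depends on the db files fitting in free memory, since otherwise earlier pages may be evicted before the server scans them.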
-Phil

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers