Hi, guys. Milo from CMU here.

I'm looking into small I/O performance on PVFS2. It's actually part of a larger project investigating possible improvements to the performance of cloud computing software, and we're using PVFS2 as a kind of upper bound for performance (e.g. writing a flat file on a parallel filesystem as opposed to updating data in an HBase table).

One barrier I've encountered is the small-I/O nature of many of these cloud workloads. For example, the one we're looking at currently issues 1 KB I/O requests even when performing sequential writes to generate a file.

On large I/O requests, I've managed to tweak PVFS2 to get close to the performance of the underlying filesystem (115 MB/s or so). But on small I/O requests performance is much lower. It seems I can only perform approximately 5,000 I/O operations/second, even for sequential writes on a single-node server: 4.7 MB/s with 1 KB sequential writes, 19.0 MB/s with 4 KB sequential writes. In both cases the throughput works out to roughly 5,000 ops/s times the request size, which suggests a per-operation cost rather than a bandwidth limit. The filesystem is mounted through the PVFS2 kernel module. This seems similar to the Bonnie++ rates in ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1010.pdf
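
For concreteness, the sequential-write test is essentially the following (a rough sketch; the mount point, block size, and total size are placeholders for whatever combination I'm measuring):

/* seqwrite.c - minimal sequential-write microbenchmark.
 * Writes COUNT blocks of BLOCK_SIZE bytes to a file and reports
 * throughput and operations/second.  The path is a placeholder for
 * the PVFS2 kernel mount (or the underlying ext3 mount). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK_SIZE 1024          /* 1 KB requests; also tried 4 KB and 1 MB */
#define COUNT      (100 * 1024)  /* 100 MB total */

int main(void)
{
    char buf[BLOCK_SIZE];
    struct timeval start, end;
    double secs;
    int i, fd;

    memset(buf, 'x', sizeof(buf));

    fd = open("/mnt/pvfs2/seqwrite.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    gettimeofday(&start, NULL);
    for (i = 0; i < COUNT; i++) {
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) {
            perror("write");
            return 1;
        }
    }
    gettimeofday(&end, NULL);
    close(fd);

    secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d x %d-byte writes in %.2f s: %.2f MB/s, %.0f ops/s\n",
           COUNT, BLOCK_SIZE, secs,
           (double)COUNT * BLOCK_SIZE / (1024.0 * 1024.0) / secs,
           COUNT / secs);
    return 0;
}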

None of this is unexpected to me and I'm happy with PVFS2's large I/O performance. But I'd like to get a better handle on where this bottleneck is coming from, codewise (and how I could fix it if I find coding time between research). Here's some experimentation I've done so far:

1) A small pair of C client/server programs that open and close TCP connections in a tight loop, pinging each other with a small amount of data ('Hello World'); there's a sketch of the client below, after this list. I see about 10,000 connections/second with this approach. So if each small I/O is opening and closing two TCP connections, this could be the bottleneck. I haven't yet dug into the pvfs2-client code and the library to see if it reuses TCP connections or makes new ones on each request (that's deeper into the flow code than I remember. =;) )

2) I can write to the underlying filesystem with 1 KB sequential writes almost as quickly as with 1 MB writes. So it's not the underlying ext3.

3) The IO ops/s bottleneck is there even with the null-aio TroveMethod, so I doubt it's Trove.

4) atime is getting updated even with null-aio, so a metadata bottleneck is possible.
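
Here's roughly what the client half of the connection-rate test in (1) looks like (sketched from memory, so it may differ in details from what I actually ran; the host and port are placeholders, and the server side just accepts, echoes the greeting back, and closes):

/* connrate.c - client half of the connection-rate test.
 * Opens a TCP connection, sends a short greeting, reads the echo,
 * closes, and repeats; reports connections/second. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 100000

int main(void)
{
    struct sockaddr_in addr;
    struct timeval start, end;
    char buf[64];
    double secs;
    int i, fd;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(12345);                    /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); /* placeholder host */

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++) {
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        if (write(fd, "Hello World", 11) != 11) { perror("write"); return 1; }
        if (read(fd, buf, sizeof(buf)) < 0)     { perror("read");  return 1; }
        close(fd);
    }
    gettimeofday(&end, NULL);

    secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d connections in %.2f s: %.0f connections/s\n",
           ITERATIONS, secs, ITERATIONS / secs);
    return 0;
}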

Some configuration information about the filesystem:
* version 2.8.1
* The strip_size is 4194304. Not that this should matter a great deal with one server.
* FlowBufferSizeBytes 4194304
* TroveSyncMeta and TroveSyncData are set to no
* I've applied the patch from http://www.pvfs.org/fisheye/rdiff/PVFS?csid=MAIN:slang:20090421161045&u&N to be sure metadata syncing really is off, though I'm not sure how to check. =:)
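
In case it's useful, the relevant parts of my fs.conf look approximately like this (quoted from memory, so the capitalization of the section tags, their exact placement, and the filesystem name may not match my actual file):

<Filesystem>
    Name pvfs2-fs
    FlowBufferSizeBytes 4194304
    <StorageHints>
        TroveSyncMeta no
        TroveSyncData no
    </StorageHints>
    <Distribution>
        Name simple_stripe
        Param strip_size
        Value 4194304
    </Distribution>
</Filesystem>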

Thanks.

~Milo

PS: Should I send this to the pvfs2-developers list instead? Apologies if I've used the wrong venue.