On Sep 27, 2009, at 12:19 AM, Milo wrote:
Hi, guys. Milo from CMU here.
I'm looking into small I/O performance on PVFS2. It's actually part
of a larger project investigating possible improvements to the
performance of cloud computing software, and we're using PVFS2 as a
kind of upper bound for performance (e.g. writing a flat file on a
parallel filesystem as opposed to updating data in an HBase table).
One barrier I've encountered is the small-I/O nature of many of
these cloud workloads. For example, the one we're looking at
currently does 1 KB I/O requests even when performing sequential
writes to generate a file.
On large I/O requests, I've managed to tweak PVFS2 to get close to
the performance of the underlying filesystem (115 MB/s or so). But
on small I/O requests performance is much lower: it seems I can only
sustain approximately 5,000 I/O operations/second even when testing
sequential writes against a single server node (4.7 MB/s with 1 KB
sequential writes, 19.0 MB/s with 4 KB sequential writes). The
filesystem is mounted through the PVFS2 kernel module. This seems
similar to the Bonnie++ rates in ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1010.pdf
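For concreteness, the 1 KB test boils down to a loop like the sketch
below. This is illustrative only, not the exact benchmark I ran; the
mount point /mnt/pvfs2, file name, request size, and count are
placeholders.

/* seqwrite.c: crude sequential-write throughput test.
 * Build: gcc -O2 -o seqwrite seqwrite.c   (add -lrt on older glibc)
 * Run:   ./seqwrite /mnt/pvfs2/testfile 1024 100000
 * Path, request size, and count are placeholders, not the exact
 * values from my runs. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t reqsize;
    long count, i;
    char *buf;
    int fd;
    struct timespec t0, t1;
    double secs;

    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <reqsize> <count>\n", argv[0]);
        return 1;
    }
    reqsize = (size_t)atol(argv[2]);
    count = atol(argv[3]);
    buf = malloc(reqsize);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'x', reqsize);

    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < count; i++) {
        if (write(fd, buf, reqsize) != (ssize_t)reqsize) {
            perror("write");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld writes of %lu bytes in %.2f s: %.0f ops/s, %.2f MB/s\n",
           count, (unsigned long)reqsize, secs, count / secs,
           count * reqsize / secs / 1e6);
    free(buf);
    return 0;
}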
None of this is unexpected to me, and I'm happy with PVFS2's large
I/O performance. But I'd like to get a better handle on where this
bottleneck is coming from in the code (and how I could fix it if I
can find coding time between research). Here's some experimentation
I've done so far:
1) A small pair of C client/server programs that open and close TCP
connections in a tight loop, pinging each other with a small piece
of data ('Hello World'); a simplified sketch of the client side
follows the list below. I see about 10,000 connections/second with
this approach, so if each small I/O is opening and closing two TCP
connections, this could be the bottleneck. I haven't yet dug into
the pvfs2-client code and the library to see if it reuses TCP
connections or makes new ones on each request (that's deeper into
the flow code than I remember. =;) )
Don't waste your time; it keeps the connections open.
2) I can write to the underlying filesystem with 1 KB sequential
writes almost as quickly as with 1 MB writes. So it's not the
underlying ext3.
3) The I/O ops/s bottleneck is there even with the null-aio
TroveMethod, so I doubt it's Trove.
4) atime is getting updated with null-aio, so a metadata bottleneck
is possible.
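The client side of the connection test from (1) is roughly the
following. It's a simplified, illustrative sketch rather than the
exact program; the host, the port, and the matching echo server are
placeholders/assumptions.

/* connping.c: open a TCP connection, send a tiny message, wait for
 * the echo, close, and repeat; reports connections/second.
 * Build: gcc -O2 -o connping connping.c   (add -lrt on older glibc)
 * Run:   ./connping 127.0.0.1 9000 10000
 * The host and port are placeholders, and the matching echo server
 * isn't shown here. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct sockaddr_in addr;
    long count, i;
    char msg[] = "Hello World";
    char reply[64];
    struct timespec t0, t1;
    double secs;

    if (argc != 4) {
        fprintf(stderr, "usage: %s <host> <port> <count>\n", argv[0]);
        return 1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)atoi(argv[2]));
    if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) {
        fprintf(stderr, "bad address: %s\n", argv[1]);
        return 1;
    }
    count = atol(argv[3]);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < count; i++) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        if (write(s, msg, sizeof(msg)) < 0 ||
            read(s, reply, sizeof(reply)) < 0) {
            perror("ping");
            return 1;
        }
        close(s);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld connections in %.2f s: %.0f conn/s\n",
           count, secs, count / secs);
    return 0;
}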
Some configuration information about the filesystem:
* version 2.8.1
* The strip_size is 4194304. Not that this should matter a great
deal with one server.
* FlowBufferSizeBytes 4194304
* TroveSyncMeta and TroveSyncData are set to no
* I've applied the patch from http://www.pvfs.org/fisheye/rdiff/PVFS?csid=MAIN:slang:20090421161045&u&N
to be sure metadata syncing really is off, though I'm not sure how
to check. =:)
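For context, those settings live in fs.conf fragments roughly like
the sketch below. The option names are the ones mentioned above, but
the section names and placement are approximate (written from
memory), so treat the pvfs2-genconfig output as authoritative.

# Rough sketch only: option names as used above; section names and
# placement are approximate.
<FileSystem>
	Name pvfs2-fs
	FlowBufferSizeBytes 4194304
	<StorageHints>
		TroveSyncMeta no
		TroveSyncData no
		# TroveMethod null-aio was set only for the null-aio experiment
		TroveMethod null-aio
	</StorageHints>
	<Distribution>
		Name simple_stripe
		Param strip_size
		Value 4194304
	</Distribution>
</FileSystem>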
It would be interesting to know how much time is spent on the client
(in the kernel module, in the daemon) vs. how much on the server.
This would probably help us rule out quite a few things too.
Thanks.
~Milo
PS: Should I send this to the pvfs2-developers list instead?
Apologies if I've used the wrong venue.