These are great results, Phil. It's nice to have you guys doing this
testing. Did you get a chance to run any of your tests with the
threaded version of pvfs2-client? I added a -threaded option, which
runs pvfs2-client-core-threaded instead of pvfs2-client-core. For
the case where you're running multiple processes concurrently, I
wonder if you would see some improvement, although Dean didn't see
any when he tried it with one process doing concurrent reads/writes
from multiple threads. Just a thought.
I'd also be curious what effect the mallocs have on performance. I
added a fix to Walt's branch for the allocation of all the lookup
segment contexts on every request from the VFS, but that hasn't
propagated into HEAD yet.
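To give an idea of the kind of change involved, here is a rough
sketch of moving a per-request allocation to storage that is reused
across requests (the struct and function names below are made up for
illustration, not the actual PVFS2 identifiers):

    /* illustration only -- hypothetical names, not the actual PVFS2 code */
    #include <stdlib.h>

    #define MAX_LOOKUP_SEGMENTS 64

    struct lookup_seg_ctx {
        int  handle;
        char name[256];
    };

    /* before: a malloc()/free() pair on every VFS request */
    int handle_request_malloc(void)
    {
        struct lookup_seg_ctx *segs =
            malloc(MAX_LOOKUP_SEGMENTS * sizeof(*segs));
        if (!segs)
            return -1;
        /* ... fill in and use segs ... */
        free(segs);
        return 0;
    }

    /* after: reuse storage allocated once, so the hot path allocates nothing */
    static struct lookup_seg_ctx seg_pool[MAX_LOOKUP_SEGMENTS];

    int handle_request_pooled(void)
    {
        struct lookup_seg_ctx *segs = seg_pool;
        /* ... fill in and use segs ... */
        (void)segs;
        return 0;
    }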
-sam
On Nov 29, 2006, at 9:58 AM, Phil Carns wrote:
We recently ran some tests that we thought would be interesting to
share. We used the following setup:
- single client
- 16 servers
- gigabit ethernet
- read/write tests, with 40 GB files
- using reads and writes of 100 MB each in size
- varying number of processes running concurrently on the client
The test application can be configured to run with multiple
processes and/or multiple client nodes. In this case we kept
everything on a single client to focus on bottlenecks on that side.
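To make the access pattern concrete, each process was essentially
doing the equivalent of the loop below (a simplified sketch rather
than our actual benchmark; the mount point and file name are made
up):

    /* simplified sketch of one client process's access pattern */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK ((size_t)100 * 1024 * 1024)            /* 100 MB per call */
    #define TOTAL ((long long)40 * 1024 * 1024 * 1024)   /* 40 GB per file  */

    int main(void)
    {
        char *buf = malloc(CHUNK);
        int fd = open("/mnt/pvfs2/testfile", O_WRONLY | O_CREAT, 0644);
        long long done;

        for (done = 0; done < TOTAL; done += CHUNK)
            write(fd, buf, CHUNK);    /* one request outstanding at a time */

        close(fd);
        free(buf);
        return 0;
    }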
We were looking at the kernel buffer settings controlled in
pint-dev-shared.h (a sketch of the relevant constants follows the
list below). By default PVFS2 uses 5 buffers of 4 MB each. After
experimenting for a while, we made a few observations:
- increasing the buffer size helped performance
- using only 2 buffers (rather than 5) was sufficient to saturate
the client when we were running multiple processes; adding more
made only a marginal difference
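For reference, these are compile-time constants in pint-dev-shared.h,
so trying different values means an edit roughly like the following
plus a rebuild (the macro names here are illustrative rather than the
exact ones in the tree):

    /* pint-dev-shared.h (illustrative macro names) */

    /* default: 5 buffers x 4 MB each, ~20 MB of kernel buffer space  */
    /* #define PVFS2_BUFMAP_DESC_COUNT  5                             */
    /* #define PVFS2_BUFMAP_DESC_SIZE   (4 * 1024 * 1024)             */

    /* tuned:   2 buffers x 32 MB each, ~64 MB of kernel buffer space */
    #define PVFS2_BUFMAP_DESC_COUNT  2
    #define PVFS2_BUFMAP_DESC_SIZE   (32 * 1024 * 1024)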
We found good results using two 32 MB buffers. Here are some
comparisons between the standard settings and the 2 x 32 MB
configuration:
results for RHEL4 (2.6 kernel):
-------------------------------
5 x 4 MB, 1 process: 83.6 MB/s
2 x 32 MB, 1 process: 95.5 MB/s
5 x 4 MB, 5 processes: 107.4 MB/s
2 x 32 MB, 5 processes: 111.2 MB/s

results for RHEL3 (2.4 kernel):
-------------------------------
5 x 4 MB, 1 process: 80.5 MB/s
2 x 32 MB, 1 process: 90.7 MB/s
5 x 4 MB, 5 processes: 91 MB/s
2 x 32 MB, 5 processes: 103.5 MB/s
A few comments based on those numbers:
- on 3 out of 4 tests, we saw a 13-15% performance improvement by
going to two 32 MB buffers
- the remaining test (5 processes on RHEL4) probably did not see as
much improvement because we maxed out the network. In the past,
netpipe has shown that we can get around 112 MB/s out of these nodes.
- the RHEL3 nodes are on a different switch, so it is hard to say
how much of the difference from RHEL3 to RHEL4 is due to network
topology and how much is due to the kernel version
It is also worth noting that even with this tuning, the
single-process tests are about 14% slower than the 5-process tests.
I am guessing that this is due to a lack of pipelining, probably
caused by two things:
- the application only submitting one read/write at a time
- the kernel module itself serializing when it breaks reads/writes
into buffer-sized chunks
The latter could be addressed either by pipelining the I/O through
the bufmap interface (so that a single read or write could keep
multiple buffers busy) or by going to a system like the one Murali
came up with for memory transfers a while back, which isn't limited
by buffer size.
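To spell out the serialization point: conceptually the kernel module
handles a large write roughly like the loop below (pseudocode with
made-up helper names, not the real bufmap code), where each
buffer-sized chunk must finish before the next one starts:

    /* conceptual pseudocode only -- helper names are made up */
    for (offset = 0; offset < io_size; offset += desc_size) {
        chunk = MIN(desc_size, io_size - offset);
        copy_user_data_in(kbuf, ubuf + offset, chunk);  /* stage one buffer */
        post_io(kbuf, offset, chunk);                   /* send it          */
        wait_for_io_completion();                       /* block until done */
    }

    /* a pipelined version would post chunk N and immediately start
     * staging chunk N+1 into a second buffer, so the memory copy and
     * the network transfer overlap instead of strictly alternating  */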
It would also be nice to have a way to set these buffer settings
without recompiling, either via module options or via
pvfs2-client-core command line options. For the time being we are
going to hard-code our tree to run with the 32 MB buffers. The 64 MB
of RAM that this uses up (vs. 20 MB with the old settings) doesn't
really matter for our standard node footprint.
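For the module-option route, something along these lines in the
kernel module would probably do it on 2.6 (the parameter names below
are made up, the values would still have to be plumbed through to the
bufmap setup, and 2.4 would need the older MODULE_PARM interface
instead):

    /* sketch of hypothetical load-time parameters for the pvfs2 module */
    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static unsigned int bufmap_desc_count = 5;                /* 5 buffers */
    static unsigned int bufmap_desc_size  = 4 * 1024 * 1024;  /* 4 MB each */

    module_param(bufmap_desc_count, uint, 0444);
    MODULE_PARM_DESC(bufmap_desc_count, "number of kernel I/O buffers");

    module_param(bufmap_desc_size, uint, 0444);
    MODULE_PARM_DESC(bufmap_desc_size, "size in bytes of each kernel I/O buffer");

Loading with something like "modprobe pvfs2 bufmap_desc_count=2
bufmap_desc_size=33554432" would then give the 2 x 32 MB
configuration without a rebuild.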
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers