Thanks for the conf files Matthieu. It sounds like there might be more than one issue here? Some comments below:

On 04/02/2013 10:47 AM, Matthieu Dorier wrote:
Hi Rob and Phil,

This thread moved to the ofs-support mailing list (probably because the first
person to answer was part of that team), but I didn't get much of an answer to
my problem, so I'll try to summarize here what I have done.

First, to answer Phil: the PVFS config file is attached, and here is the script
file used for IOR:

IOR START
   testFile = pvfs2:/mnt/pvfs2/testfileA
   filePerProc=0
   api=MPIIO
   repetitions=100
   verbose=2
   blockSize=4m
   transferSize=4m
   collective=1
   writeFile=1
   interTestDelay=60
   readFile=0
   RUN
IOR STOP
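
For reference, the script is fed to IOR with its -f option; the launch line is
along the lines of the following (the ior binary path and script file name are
placeholders):

   mpiexec -n 336 /path/to/ior -f ior.script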

Besides the tests I described in my first mail, I also did the same experiments
on another cluster, first with TCP over IB and then on Ethernet, with 336 clients
and 672 clients, and with 2, 4 and 8 storage servers. In every case, this
behavior appears.

I benchmarked the local disk attached to the storage servers and got 42 MB/s, so
the high throughput of over 2 GB/s I get obviously benefits from some caching
mechanism, and the periodic behavior observed at high output frequency could be
explained by that. Yet this does not explain why, overall, the performance
decreases over time.

One thing that you could try as an experiment would be to change this pvfs config file line:

    TroveSyncData no

to

    TroveSyncData yes

That will force PVFS to sync all writes to disk. Unfortunately that would put your max performance at 42*2 = 84 MiB/s, which may not be what you are hoping for, but it might help isolate if there is an issue with the buffer cache flushing on the server side. Depending on your local file system and kernel version, you might then be able to check/adjust some buffer cache parameters to get better behavior in the non-syncing case. In a real application checkpointing scenario you probably want the servers to start flushing data during the idle time between bursts, but in reality it might not flush much of anything until the next burst comes in and forces the issue, unless you do some tuning to help it out.
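
On Linux the relevant knobs are the vm.dirty_* sysctls, which control when and
how aggressively the kernel writes dirty pages back. As a rough sketch (the
values below are only illustrative starting points, not recommendations), making
background writeback kick in earlier and more often would look like:

   sysctl -w vm.dirty_background_ratio=1       # start background flushing at 1% of RAM
   sysctl -w vm.dirty_ratio=10                 # throttle writers once 10% of RAM is dirty
   sysctl -w vm.dirty_expire_centisecs=500     # dirty pages become flushable after 5 s
   sysctl -w vm.dirty_writeback_centisecs=100  # wake the flusher threads every 1 s

That way the servers would spend more of the idle time between bursts actually
writing data out instead of sitting on a full buffer cache.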

I attach a set of graphs summarizing the experiments (the x axis is the
iteration number and the y axis is the aggregate throughput obtained for that
iteration; 100 consecutive iterations are performed).
It seems that the write duration follows the law D = a*T + b, where D is the
duration of the write, T is the wallclock time since the beginning of the
experiment, and "a" and "b" are constants.

When I stop IOR and immediately restart it, I get the good performance back; it
does not continue at the reduced performance where the previous instance finished.

Just stopping IOR and restarting it is enough to improve performance? Or did you mean restarting the pvfs2 servers?

thanks,
-Phil


I also thought it could come from the fact that the same file is rewritten at
every iteration, so I tried the multiFile=1 option to get a new file at every
iteration instead, but this didn't help.

One last thing I can mention: I'm using MPICH 3.0.2, compiled with PVFS support.

Matthieu

----- Original Message -----
From: "Rob Latham" <[email protected]>
To: "Matthieu Dorier" <[email protected]>
Cc: "pvfs2-users" <[email protected]>
Sent: Tuesday, April 2, 2013, 15:57:54
Subject: Re: [Pvfs2-users] Strange performance behavior with IOR

On Sat, Mar 23, 2013 at 03:31:22PM +0100, Matthieu Dorier wrote:
I've installed PVFS (OrangeFS 2.8.7) on a small cluster (2 PVFS nodes, 28
compute nodes of 24 cores each, everything connected through InfiniBand but
using an IP stack on top of it, so the protocol for PVFS is TCP), and I see
some strange performance behavior with IOR (using ROMIO compiled against
PVFS, no kernel support):
IOR is started on 336 processes (14 nodes), writing 4 MB/process in a single
shared file using MPI-I/O (4 MB transfer size also). It completes 100 iterations.
OK, so you have one pvfs client per core.  All these are talking to
two servers.
First, every time I start an instance of IOR, the first I/O operation is
extremely slow. I'm guessing this is because ROMIO has to initialize everything,
get the list of PVFS servers, etc. Is there a way to speed this up?
ROMIO isn't doing a whole lot here, but there is one thing different about
ROMIO's 1st call vs the Nth call. On the 1st call (the first time any pvfs2
file is opened or deleted), ROMIO will call the function
PVFS_util_init_defaults().
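
If you want to check how much that call costs as the client count grows, a rough
sketch like the one below (untested; it assumes the standalone pvfs2 client
library and headers from your OrangeFS build are available to link against)
would time it directly on every rank:

/* Sketch: time PVFS_util_init_defaults() per rank. This is the client-side
 * init ROMIO triggers on the first pvfs2 open/delete: parse the tab file
 * and fetch the fs configuration from a server. */
#include <stdio.h>
#include <mpi.h>
#include <pvfs2.h>
#include <pvfs2-util.h>

int main(int argc, char **argv)
{
    int rank, ret;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    ret = PVFS_util_init_defaults();
    t1 = MPI_Wtime();

    printf("rank %d: PVFS_util_init_defaults returned %d after %.3f s\n",
           rank, ret, t1 - t0);

    if (ret == 0)
        PVFS_sys_finalize();
    MPI_Finalize();
    return 0;
}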

If you have 336 clients banging away on just two servers, I bet that
could explain some slowness.  In the old days, the PVFS server had to
service these requests one at a time.

I don't think this restriction has been relaxed?  Since it is a
read-only operation, though, it sure seems like one could just have
servers shovel out pvfs2 configuration information as fast as
possible.
Then, I set some delay between each iteration, to better reflect the behavior
of an actual scientific application.
Fun! This is kind of like what MADNESS does: it "computes" by sleeping for a
bit. I think Phil's questions will help us understand the highly variable
performance.

Can you experiment with IOR's collective I/O? By default, collective I/O will
select one client per node as an "I/O aggregator". The IOR workload will not
benefit from ROMIO's two-phase optimization, but you've got 336 clients banging
away on two servers. When I last studied pvfs scalability, 100x more clients
than servers wasn't a big deal, but 5-6 years ago nodes did not have 24-way
parallelism.
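
If you end up tuning the aggregation in code you control, it is exposed through
the standard ROMIO hints at open time; IOR has a mechanism for forwarding
MPI-IO hints as well. A minimal sketch (the hint values are only illustrative
for the 14-node runs in this thread, not recommendations):

/* Sketch: open a file on PVFS with ROMIO collective-buffering hints.
 * cb_nodes is the number of I/O aggregators; "14" just mirrors the 14
 * client nodes mentioned in this thread. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable"); /* collective buffering on writes */
    MPI_Info_set(info, "cb_nodes", "14");           /* one aggregator per node */

    MPI_File_open(MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/testfileA",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... MPI_File_set_view / MPI_File_write_all as in the real application ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}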

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
