Hi Julian,

I'm probably being slow, just coming back from the holidays, but I think the issue is that your data is noncontiguous in memory? Current ROMIO doesn't buffer the data into a contiguous region prior to writing to PVFS (i.e., data sieving on writes is disabled). Looking at the PVFS2 ADIO implementation, it appears that by default we instead create an hindexed PVFS type and let PVFS do the work (RobL can verify).

This is sort of too bad, because in the "contiguous in file" case data sieving would be just fine. Opportunity lost.
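If the extra copy is acceptable, one way to get a single contiguous
request in the meantime is to do the packing on the user side before
the write. A rough, untested sketch (it reuses v, vis_datatype,
fd_visualization and status from your snippets below; pack_size,
position, packbuf and file_offset are just illustrative names, with
file_offset standing for the same offset expression you already
compute):

 /* untested sketch: pack the strided diagonal into a contiguous
    scratch buffer, then issue one contiguous write */
 int pack_size, position = 0;
 MPI_Pack_size (1, vis_datatype, MPI_COMM_WORLD, &pack_size);
 void *packbuf = malloc (pack_size);
 MPI_Pack (v, 1, vis_datatype, packbuf, pack_size, &position,
           MPI_COMM_WORLD);
 /* file_offset = the same offset expression as in your current call */
 ret = MPI_File_write_at (fd_visualization, file_offset,
                          packbuf, position, MPI_BYTE, &status);
 free (packbuf);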

Do we agree on what is happening here?

Great to hear that OrangeFS is comparing well.

Regards,

Rob

On Dec 29, 2010, at 6:23 PM, Julian Kunkel wrote:

Dear Rob & others,
regarding derived datatypes & PVFS2 I observed the following with MPICH2 1.3.2
and either PVFS 2.8.1 or orangefs-2.8.3-20101113.

I use a derived memory datatype to write (append) the diagonal of a
matrix to a file with MPI.
The data itself is written to a contiguous file region (without
applying any file view).
Therefore, I would expect MPI to write the data to the file in one
contiguous request; however, what I observe is that many small writes
are issued via small-io.sm.
The volume of the data (i.e. the matrix diagonal) is 64072 bytes and
it starts in the file at offset 41; each I/O generates 125 small-io
requests of 512 bytes and one of 72 bytes.
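(The sizes add up: 125 * 512 byte + 72 byte = 64072 byte, so the
contiguous region is simply chopped into 512-byte pieces.)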

In Trove (alt-aio) I can observe the sequence of writes, including
offsets and sizes, as follows:
<e t="3" time="11.629352" size="4" offset="41"/><un t="3"
time="11.629354"/><rel t="4" time="11.630205" p="0:9"/><s
name="alt-io-write" t="4" time="11.630209"/><e t="4" time="11.630223"
size="512" offset="45"/><un t="4" time="11.630225"/><rel t="5"
time="11.631027" p="0:10"/><s name="alt-io-write" t="5"
time="11.631030"/><e t="5" time="11.631045" size="512"
offset="557"/><un t="5" time="11.631047"/><rel t="6" time="11.631765"
p="0:11"/><s name="alt-io-write" t="6" time="11.631769"/><e t="6"
time="11.631784" size="512" offset="1069"/><un t="6"
time="11.631786"/><rel t="7" time="11.632460" p="0:12"/><s
name="alt-io-write" t="7" time="11.632464"/><e t="7" time="11.632483"
size="512" offset="1581"/>
....
<e t="129" time="11.695048" size="72" offset="64045"/>

The offsets increase linearly, so I could imagine that something in
ROMIO splits the I/O up, perhaps because it assumes the data on disk
is non-contiguous.
Here are the code snippets that produce this issue:

Initialization of the file:
 MPI_File_open (MPI_COMM_WORLD, name,
                MPI_MODE_WRONLY | MPI_MODE_CREATE,
                MPI_INFO_NULL, &fd_visualization);

 /* construct datatype for parts of a Matrix diagonal */
 MPI_Type_vector (myrows,      /*int count */
                  1,           /*int blocklen */
                  N + 2,       /*int stride */
                  MPI_DOUBLE,  /*MPI_Datatype old_type */
                  &vis_datatype);      /*MPI_Datatype *newtype */
 MPI_Type_commit (&vis_datatype);
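(vis_datatype thus selects myrows single doubles, each separated by a
stride of (N+2)*sizeof(double) bytes, so the buffer is noncontiguous
in memory even though the target file region is contiguous.)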

Per iteration, rank 0 writes the iteration number separately (I know
it's suboptimal):
 ret = MPI_File_write_at (fd_visualization,       /* MPI_File fh */
                          (MPI_Offset) (start_row + vis_iter * (N + 1))
                              * sizeof (double)
                              + (vis_iter - 1) * sizeof (int)
                              + offset,           /* MPI_Offset offset */
                          &stat_iteration,        /* void *buf */
                          1,                      /* int count */
                          MPI_INT,                /* MPI_Datatype datatype */
                          &status);               /* MPI_Status *status */
This generates the small writes:
 ret = MPI_File_write_at (fd_visualization,       /* MPI_File fh */
                          (MPI_Offset) (start_row + vis_iter * (N + 1))
                              * sizeof (double)
                              + vis_iter * sizeof (int)
                              + offset,           /* MPI_Offset offset */
                          v,                      /* void *buf */
                          1,                      /* int count */
                          vis_datatype,           /* MPI_Datatype datatype */
                          &status);               /* MPI_Status *status */
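If it helps, I could also check whether the small writes really come
from the noncontiguous memory type by writing the same elements from a
contiguous copy, roughly like this (untested sketch; tmp and i are just
illustrative names, and it assumes v points at the first selected
double):

 /* untested: copy the strided elements into a contiguous buffer and
    write them with a plain MPI_DOUBLE count at the same offset */
 double *tmp = malloc (myrows * sizeof (double));
 for (int i = 0; i < myrows; i++)
     tmp[i] = ((double *) v)[i * (N + 2)]; /* same elements as vis_datatype */
 ret = MPI_File_write_at (fd_visualization,
                          (MPI_Offset) (start_row + vis_iter * (N + 1))
                              * sizeof (double)
                              + vis_iter * sizeof (int) + offset,
                          tmp, myrows, MPI_DOUBLE, &status);
 free (tmp);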

I attached a screenshot which shows the MPI activity and the server
activity in our tracing environment; there one can see that the
operations are processed sequentially on the server (one small request
is processed after another).
Before I dig deeper into this issue, maybe you already have an idea
about what causes it.

By the way, I did some basic tests with the old instrumented version
of PVFS 2.8.1 vs. the instrumented OrangeFS, and I'm happy about the
I/O performance improvements; on our Xeon Westmere cluster the
performance is also more predictable.

Thanks,
Julian
<orangefs1-small.png>
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
