For the benefit of people running into similar problems and ending up reading this thread, we finally found a solution.

One can use the MPI function MPI_TYPE_CREATE_HINDEXED to create an MPI datatype with a 32-bit local element count but 64-bit byte offsets, which works well enough for us for the time being.

Specifically, the code looks like this:

  ! Create an indexed datatype: one block of local_particle_number reals,
  ! placed at a 64-bit displacement measured in bytes
  CALL MPI_TYPE_CREATE_HINDEXED (1, local_particle_number, offset_in_global_particle_array, MPI_REAL, filetype, err)
  CALL MPI_TYPE_COMMIT (filetype, err)
  ! Set the file view and write the local data collectively
  CALL MPI_FILE_SET_VIEW (file_handle, file_position, MPI_REAL, filetype, 'native', MPI_INFO_NULL, err)
  CALL MPI_FILE_WRITE_ALL (file_handle, data, local_particle_number, MPI_REAL, status, err)
  ! Advance the view displacement past the global array; MPI_FILE_SET_VIEW
  ! takes bytes, hence the factor of 4 (bytes per MPI_REAL)
  file_position = file_position + global_particle_number * 4
  ! Free the type
  CALL MPI_TYPE_FREE (filetype, err)
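
For completeness, the declarations assumed above look roughly like this (a sketch; the essential point is that the displacement and the file offset use the 64-bit integer kinds MPI_ADDRESS_KIND and MPI_OFFSET_KIND from mpif.h):

  ! Assumed declarations (not part of the original snippet); with the
  ! mpif.h bindings the scalars pass fine as the length-1 blocklength
  ! and displacement arrays expected by MPI_TYPE_CREATE_HINDEXED
  INTEGER :: local_particle_number, filetype, err
  INTEGER :: status(MPI_STATUS_SIZE)
  INTEGER(KIND=MPI_ADDRESS_KIND) :: offset_in_global_particle_array
  INTEGER(KIND=MPI_OFFSET_KIND)  :: file_position, global_particle_number
  REAL, ALLOCATABLE :: data(:)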

With this we get good performance (upwards of 1 GB/s) when writing files from large runs.

Interestingly, an alternative and conceptually simpler option is to use MPI_FILE_WRITE_ORDERED, but the performance of that function on Blue Gene/P sucks: 20 MB/s instead of GB/s. I do not know why.
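
For reference, the write_ordered variant reduces to something like this (a sketch using the same variable names as above; it relies on the shared file pointer, so no per-rank offsets or derived datatypes are needed):

  ! Sketch: every rank appends its block in rank order through the
  ! shared file pointer
  CALL MPI_FILE_SET_VIEW (file_handle, 0_MPI_OFFSET_KIND, MPI_REAL, MPI_REAL, 'native', MPI_INFO_NULL, err)
  CALL MPI_FILE_WRITE_ORDERED (file_handle, data, local_particle_number, MPI_REAL, status, err)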

best,

Troels

On 6/7/11 15:04, Jeff Squyres wrote:
> On Jun 7, 2011, at 4:53 AM, Troels Haugboelle wrote:

>> In principle yes, but the problem is that we have an unequal number of
>> particles on each node, so the length of each array is not guaranteed to
>> be divisible by 2, 4, or any other number. If I have understood the
>> definition of MPI_TYPE_CREATE_SUBARRAY correctly, the offset can be
>> 64-bit, but not the global array size. So, optimally, what I am looking
>> for is a simple vector type that allows an unequal size on each rank,
>> with 64-bit offsets and a 64-bit global array size.
> It's a bit awkward, but you can still make datatypes to get the offset that
> you want.  E.g., if you need an offset of 2B+31 bytes, you can make datatype
> A with type contig of N=(2B/sizeof(int)) ints.  Then make datatype B with
> type struct, containing one A and 31 MPI_BYTEs.  Then use 1 instance of
> datatype B to get the offset that you want.
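
In code, that construction might look roughly like this (a sketch with made-up names, taking 2B to mean 2^31 bytes and assuming 4-byte integers):

  INTEGER :: type_a, type_b, err, blocklens(2), types(2)
  INTEGER(KIND=MPI_ADDRESS_KIND) :: displs(2)
  ! Datatype A: 2^29 contiguous ints = 2^31 bytes
  CALL MPI_TYPE_CONTIGUOUS (536870912, MPI_INTEGER, type_a, err)
  ! Datatype B: one A followed by 31 single bytes = 2^31+31 bytes in total
  blocklens = (/ 1, 31 /)
  types     = (/ type_a, MPI_BYTE /)
  displs    = (/ 0_MPI_ADDRESS_KIND, 2147483648_MPI_ADDRESS_KIND /)
  CALL MPI_TYPE_CREATE_STRUCT (2, blocklens, displs, types, type_b, err)
  CALL MPI_TYPE_COMMIT (type_b, err)
  ! A single instance of type_b now spans the desired 2B+31-byte offset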

> You could make a utility function that, given a specific (64-bit) offset,
> makes an MPI datatype that matches the offset, and then frees it (and all
> sub-datatypes).

> There is a bit of overhead in creating these datatypes, but it should be
> dwarfed by the amount of data that you're reading/writing, right?

> It's awkward, but it should work.

>> Another possible workaround would be to identify subsections that do not
>> exceed 2B elements, make sub-communicators, and then let each of them dump
>> their elements with the proper offsets. It may work. The problematic
>> architecture is a BG/P. On other clusters, doing simple I/O (letting all
>> ranks open the file, seek to their position, and then write their chunk)
>> works fine, but somehow on BG/P performance drops dramatically. My guess
>> is that there is some file locking, or we are overwhelming the I/O nodes.
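
As an aside, that workaround could be sketched as follows (hypothetical; group_id, my_first_global_element, and my_rank are made-up names, and the grouping criterion is just one plausible choice):

  ! Hypothetical sketch: group ranks by the 2B-element window their data
  ! falls into, then let each group write with 32-bit-safe local offsets
  INTEGER :: group_id, my_rank, sub_comm, err
  INTEGER(KIND=MPI_OFFSET_KIND) :: my_first_global_element
  group_id = INT(my_first_global_element / 2147483648_MPI_OFFSET_KIND)
  CALL MPI_COMM_SPLIT (MPI_COMM_WORLD, group_id, my_rank, sub_comm, err)
  ! each sub_comm then sets its own file view and writes collectively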

> This ticket for the MPI-3 standard is a first step in the right direction,
> but won't do everything you need (this is more FYI):
>
>      https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/265
>
> See the PDF attached to the ticket; it's going up for a "first reading" in a
> month.  It'll hopefully be part of the MPI-3 standard by the end of the year
> (Fab Tillier, CC'ed, has been the chief proponent of this ticket for the
> past several months).

> Quincey Koziol from the HDF Group is going to propose a follow-on to this
> ticket, specifically about the case you're referring to: large counts for
> file functions and datatype constructors.  Quincey, can you expand on what
> you'll be proposing, perchance?
Interesting. I think something along those lines would be very useful and much needed for large applications.

Thanks a lot for the pointers and your suggestions,

cheers,

Troels

