Hello,

Which MPI version are you using?
This looks to me like it triggers https://github.com/open-mpi/ompi/issues/2399

You can check whether you are running into this problem by experimenting with the
mca_io_ompio_cycle_buffer_size parameter.
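For example (the value below is only an illustration, not a tuned recommendation; in recent Open MPI releases the parameter is spelled io_ompio_cycle_buffer_size when passed via --mca):

    mpirun --mca io_ompio_cycle_buffer_size 33554432 -np 2 ./mpi_io_test

If the corruption pattern changes with different buffer sizes, that would point to the same internal cycling logic as in the issue above.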

Best
Christoph Niethammer

--

Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: nietham...@hlrs.de
http://www.hlrs.de/people/niethammer



----- Original Message -----
From: "Nils Moschuering" <ni...@ipp.mpg.de>
To: "Open MPI Users" <users@lists.open-mpi.org>
Sent: Friday, April 28, 2017 12:51:50 PM
Subject: [OMPI users] MPI I/O gives undefined behavior if the amount of bytes 
described by a filetype reaches 2^32

Dear OpenMPI Mailing List, 

I have a problem with MPI I/O running on more than 1 rank with very large
filetypes. To reproduce the problem, please use the attached program
"mpi_io_test.c". After compilation it should be run on 2 nodes.

The program will do the following for a variety of different parameters (a
condensed sketch of the corresponding MPI calls follows the list):
1. Create an elementary datatype (commonly referred to as etype in the MPI
Standard) of a specific size given by the parameter bsize (in bytes). This
datatype is called blk_filetype.
2. Create a complex filetype, which is different for each rank. This filetype
divides the file into nr_blocks blocks of size bsize. Each rank only gets
access to a subarray containing
nr_blocks_per_rank = nr_blocks / size
blocks (where size is the number of participating ranks). The respective
subarray of each rank starts at block
rank * nr_blocks_per_rank
This guarantees that the regions of the different ranks don't overlap.
The resulting datatype is called full_filetype.
3. Allocate enough memory on each rank to write a whole block.
4. Fill the allocated memory with the rank number so that the resulting file
can be checked for correctness.
5. Open a file named fname and set the view using the previously generated
blk_filetype and full_filetype.
6. Write one block on each rank, using the collective routine.
7. Clean up.
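
Here is the promised condensed sketch of these steps in MPI calls. This is not
the attached mpi_io_test.c, just a minimal reconstruction from the description
above (names such as blk_filetype and full_filetype follow the text; writing
rank + 1 instead of rank keeps rank 0 distinguishable from unwritten zero
bytes, matching the hexdump shown further down):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    int bsize = 500 * 1024 * 1024; /* block size in bytes          */
    int nr_blocks = 4;             /* number of blocks in the file */
    const char *fname = "test_4_blocks";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int nr_blocks_per_rank = nr_blocks / size;

    /* 1. etype: one contiguous block of bsize bytes */
    MPI_Datatype blk_filetype;
    MPI_Type_contiguous(bsize, MPI_BYTE, &blk_filetype);
    MPI_Type_commit(&blk_filetype);

    /* 2. filetype: this rank's subarray of nr_blocks_per_rank blocks,
     *    starting at block rank * nr_blocks_per_rank, so the regions
     *    of different ranks never overlap */
    int sizes[1]    = { nr_blocks };
    int subsizes[1] = { nr_blocks_per_rank };
    int starts[1]   = { rank * nr_blocks_per_rank };
    MPI_Datatype full_filetype;
    MPI_Type_create_subarray(1, sizes, subsizes, starts, MPI_ORDER_C,
                             blk_filetype, &full_filetype);
    MPI_Type_commit(&full_filetype);

    /* 3. + 4. one block of memory, filled with a per-rank value */
    char *buf = malloc(bsize);
    memset(buf, rank + 1, bsize);

    /* 5. + 6. open, set the view, write one block collectively */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, blk_filetype, full_filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, 1, blk_filetype, MPI_STATUS_IGNORE);

    /* 7. clean up */
    MPI_File_close(&fh);
    free(buf);
    MPI_Type_free(&full_filetype);
    MPI_Type_free(&blk_filetype);
    MPI_Finalize();
    return 0;
}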

The above will be repeated for different values of bsize and nr_blocks. Please
note that there is no overflow of the basic datatype int used here.
The output is verified using
hexdump fname
which performs a hexdump of the file. This tool collapses runs of identical
lines into a single output line followed by an asterisk. The resulting output
of a call to hexdump has a structure comparable to the following:
00000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................| 
* 
1f400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 
* 
3e800000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 |................| 
* 
5dc00000 
This example is to be read in the following manner:
- From byte 00000000 to 1f400000 (which is equal to 500 MiB) the file contains
the value 01 in each byte.
- From byte 1f400000 to 3e800000 (which is equal to 1000 MiB) the file contains
the value 00 in each byte.
- From byte 3e800000 to 5dc00000 (which is equal to 1500 MiB) the file contains
the value 02 in each byte.
- The file ends here.
This is the correct output of the above outlined program with parameters
bsize=500*1024*1024
nr_blocks=4
running on 2 ranks. The attached file contains a lot of tests for different
cases. These were made to pinpoint the source of the problem and to exclude
other, potentially important, factors.
I deem an output wrong if it doesn't follow from the parameters or if the 
program crashes on execution. 
The only difference between Open MPI and Intel MPI, according to my tests, is
their behavior on error: Open MPI mostly writes wrong data but doesn't crash,
whereas Intel MPI mostly crashes.

The tests and their results are defined in comments in the source. 
The final conclusions I derive from the tests are the following:

1. If the filetype used in the view describes an amount of bytes equaling or
exceeding 2^32 bytes = 4 GiB, the code produces wrong output. For values
slightly smaller (the second example with fname="test_8_blocks" uses a total
filetype size of 4000 MiB, which is smaller than 4 GiB) the code works as
expected.
2. The act of actually writing the described regions is not important. When the
filetype describes an area >= 4 GiB but the program only writes to regions much
smaller than that, the code still produces undefined behavior (please refer to
the 6th example with fname="test_too_large_blocks").
3. It doesn't matter whether it is the block size or the number of blocks that
pushes the filetype over the 4 GiB limit (refer to the 5th and 6th examples,
with filenames "test_16_blocks" and "test_too_large_blocks" respectively).
4. If the binary is launched using only one rank, the output is always as 
expected (refer to the 3rd and 4th example, with filenames 
"test_too_large_blocks_single" and "test_too_large_blocks_single_even_larger", 
respectively). 

There are, of course, many other things one could test. 
It seems that the implementations use 32-bit integer variables to compute the
byte layout inside the filetype. Since the filetype is defined by two 32-bit
integer parameters (the block size and the number of blocks), their product can
easily overflow if the user supplies large values. No implementation seems to
anticipate this problem, and therefore none of them acts gracefully on its
occurrence.
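
A minimal standalone illustration of the suspected failure mode (my own
sketch, not code from any implementation; unsigned arithmetic is used so the
wraparound is well-defined in C):

#include <stdio.h>

int main(void)
{
    int bsize = 500 * 1024 * 1024;  /* 500 MiB per block           */
    int nr_blocks = 16;             /* 16 blocks -> 8000 MiB total */

    /* If the filetype extent is accumulated in a 32-bit variable,
     * it wraps around once it reaches 2^32 bytes: */
    unsigned int wrapped = (unsigned int)bsize * (unsigned int)nr_blocks;

    /* Computing in 64 bits yields the intended value: */
    long long correct = (long long)bsize * nr_blocks;

    printf("32-bit: %u bytes\n", wrapped);   /* prints 4093640704 */
    printf("64-bit: %lld bytes\n", correct); /* prints 8388608000 */
    return 0;
}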

I looked at ILP64 support (https://software.intel.com/en-us/node/528914), but
it only adapts the function parameters, not the internally used variables, and
it is also not available for C.

I looked at integer overflow trapping (FPE_INTOVF_TRAP,
https://www.gnu.org/software/libc/manual/html_node/Program-Error-Signals.html),
which could help to verify the source of the problem, but it doesn't seem to be
possible for C. Intel does not offer any built-in integer overflow trapping
either (https://software.intel.com/en-us/forums/intel-c-compiler/topic/306156).

There are ways to circumvent this problem in most cases. It is only unavoidable
if the logic of a program contains complex, non-repeating data structures with
sizes of 4 GiB or more. Even then, one could split up the filetype and use a
different displacement in two distinct write calls, as sketched below.
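
A rough sketch of that workaround, reusing the names from the test program
above (first_half_filetype and second_half_filetype are hypothetical datatypes,
each built like full_filetype but covering only half of the blocks, so each
view describes less than 4 GiB; the 64-bit MPI_Offset displacement of
MPI_File_set_view then moves the second write past the first region):

/* First half of the file: a view whose filetype spans < 4 GiB */
MPI_File_set_view(fh, 0, blk_filetype, first_half_filetype,
                  "native", MPI_INFO_NULL);
MPI_File_write_all(fh, buf, 1, blk_filetype, MPI_STATUS_IGNORE);

/* Second half: instead of one filetype spanning >= 4 GiB, shift the
 * whole view by a 64-bit byte displacement (MPI_Offset) */
MPI_Offset disp = (MPI_Offset)bsize * (nr_blocks / 2);
MPI_File_set_view(fh, disp, blk_filetype, second_half_filetype,
                  "native", MPI_INFO_NULL);
MPI_File_write_all(fh, buf, 1, blk_filetype, MPI_STATUS_IGNORE);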

Still, this problem violates the standard, as it produces undefined behavior
even when the API is used in a conforming way. The implementations should at
least warn the user, but should ideally use larger datatypes in the filetype
computations. A user who stumbles on this problem will have a hard time
debugging it.

Thank you very much for reading everything ;) 

Kind Regards, 

Nils 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users