Jeff and all,

I already reported the issue and posted a patch for ad_nfs at 
https://github.com/pmodels/mpich/pull/2617

A bug was also identified in Open MPI (related to datatype handling), and 
a first draft of a fix is available at https://github.com/open-mpi/ompi/pull/3439

Cheers,

Gilles

----- Original Message -----



    On Fri, Apr 28, 2017 at 3:51 AM, Nils Moschuering <ni...@ipp.mpg.de> 
wrote:

        Dear OpenMPI Mailing List,

        I have a problem with MPI I/O when running on more than 1 rank 
with very large filetypes. To reproduce the problem, please use the 
attached program "mpi_io_test.c". After compilation it should be run on 
2 nodes.

        The program will do the following for a variety of different 
parameter combinations (a minimal sketch of the corresponding MPI calls 
follows the list):
        1. Create an elementary datatype (commonly referred to as etype 
in the MPI Standard) of a specific size given by the parameter bsize (in 
bytes). This datatype is called blk_filetype.
        2. Create a complex filetype, which is different for each rank. 
This filetype divides the file into nr_blocks blocks of size bsize. Each 
rank only gets access to a subarray containing
        nr_blocks_per_rank = nr_blocks / size
        blocks (where size is the number of participating ranks). The 
subarray of each rank starts at block
        rank * nr_blocks_per_rank
        which guarantees that the regions of the different ranks don't 
overlap. The resulting datatype is called full_filetype.
        3. Allocate enough memory on each rank to hold a whole block.
        4. Fill the allocated memory with the rank number, so that the 
resulting file can be checked for correctness.
        5. Open a file named fname and set the view using the previously 
generated blk_filetype and full_filetype.
        6. Write one block on each rank, using the collective routine.
        7. Clean up.
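
        For reference, here is a minimal sketch of the pattern described 
in the list, assuming MPI_Type_contiguous for the etype, 
MPI_Type_create_subarray for the per-rank filetype and MPI_File_write_all 
for the collective write; the attached mpi_io_test.c may differ in its 
details (for instance, the buffer is filled with rank + 1 here so that 
every rank's block is visible in the hexdump):

        #include <mpi.h>
        #include <stdlib.h>
        #include <string.h>

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int bsize = 500 * 1024 * 1024;             /* block size in bytes (parameter)    */
            int nr_blocks = 4;                         /* total number of blocks (parameter) */
            int nr_blocks_per_rank = nr_blocks / size;

            /* 1. etype: one contiguous block of bsize bytes */
            MPI_Datatype blk_filetype;
            MPI_Type_contiguous(bsize, MPI_BYTE, &blk_filetype);
            MPI_Type_commit(&blk_filetype);

            /* 2. filetype: this rank's range of blocks inside the file */
            MPI_Datatype full_filetype;
            int sizes[1]    = { nr_blocks };
            int subsizes[1] = { nr_blocks_per_rank };
            int starts[1]   = { rank * nr_blocks_per_rank };
            MPI_Type_create_subarray(1, sizes, subsizes, starts, MPI_ORDER_C,
                                     blk_filetype, &full_filetype);
            MPI_Type_commit(&full_filetype);

            /* 3. + 4. one block of memory, filled with a rank-specific byte */
            char *buf = malloc((size_t)bsize);
            memset(buf, rank + 1, (size_t)bsize);

            /* 5. open the file (named fname in the description) and set the view */
            MPI_File fh;
            MPI_File_open(MPI_COMM_WORLD, "fname",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
            MPI_File_set_view(fh, 0, blk_filetype, full_filetype, "native",
                              MPI_INFO_NULL);

            /* 6. collective write of one block per rank */
            MPI_File_write_all(fh, buf, 1, blk_filetype, MPI_STATUS_IGNORE);

            /* 7. clean up */
            MPI_File_close(&fh);
            MPI_Type_free(&full_filetype);
            MPI_Type_free(&blk_filetype);
            free(buf);
            MPI_Finalize();
            return 0;
        }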

        The above will be repeated for different values of bsize and 
nr_blocks. Please note that none of the supplied values overflows the 
basic datatype int.
        The output is verified using
        hexdump fname
        which performs a hexdump of the file. This tool collapses 
consecutive identical lines into a single output line (marked with *). 
The resulting output of a call to hexdump has a structure comparable to 
the following:
        00000000  01 01 01 01 01 01 01 01  01 01 01 01 01 01 01 01  |................|
        *
        1f400000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
        *
        3e800000  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
        *
        5dc00000
        This example is to be read in the following manner:
        - From byte 00000000 to 1f400000 (equal to 500 MiB) the file 
contains the value 01 in each byte.
        - From byte 1f400000 to 3e800000 (equal to 1000 MiB) the file 
contains the value 00 in each byte.
        - From byte 3e800000 to 5dc00000 (equal to 1500 MiB) the file 
contains the value 02 in each byte.
        - The file ends here.
        This is the correct output of the program outlined above with 
parameters
        bsize=500*1024*1024
        nr_blocks=4
        running on 2 ranks. The attached file contains a lot of tests 
for different cases. These were made to pinpoint the source of the 
problem and to rule out other potentially relevant factors.
        I deem an output wrong if it doesn't follow from the given 
parameters or if the program crashes during execution.
        The only difference between OpenMPI and Intel MPI, according to 
my tests, is the behavior on error: OpenMPI will mostly write wrong data 
but won't crash, whereas Intel MPI mostly crashes.


    Intel MPI is based on MPICH, so you should verify that this bug 
appears in MPICH and then report it here: 
https://github.com/pmodels/mpich/issues.  This is particularly useful 
because the person most responsible for MPI-IO in MPICH (Rob Latham) 
also happens to be interested in integer-overflow issues.
     

        The tests and their results are documented in comments in the 
source.
        The final conclusions I derive from the tests are the following:

        1. If the filetype used in the view describes a number of bytes 
equal to or exceeding 2^32 bytes = 4 GiB, the code produces wrong output. 
For slightly smaller values (the second example with fname="test_8_blocks" 
uses a total filetype size of 4000 MiB, which is smaller than 4 GiB) the 
code works as expected.
        2. Actually writing the described regions is not important. When 
the filetype describes an area >= 4 GiB but the program only writes to 
regions much smaller than that, the code still produces undefined 
behavior (please refer to the 6th example with 
fname="test_too_large_blocks").
        3. It doesn't matter whether the block size or the number of 
blocks pushes the filetype over the 4 GiB limit (refer to the 5th and 
6th examples, with filenames "test_16_blocks" and "test_too_large_blocks" 
respectively).
        4. If the binary is launched with only one rank, the output is 
always as expected (refer to the 3rd and 4th examples, with filenames 
"test_too_large_blocks_single" and 
"test_too_large_blocks_single_even_larger", respectively).

        There are, of course, many other things one could test.
        It seems that the implementations use 32-bit integer variables 
to compute the byte layout described by the filetype. Since the filetype 
is defined using two 32-bit integer parameters, this can easily lead to 
integer overflows if the user supplies large values. It seems that no 
implementation expects this problem, and therefore none of them handles 
it gracefully.
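
        The suspected wrap-around can be illustrated with plain C 
arithmetic. This is only an illustration of 32-bit overflow, not code 
taken from any MPI implementation; the values (16 blocks of 500 MiB) are 
simply chosen so that the product exceeds 2^32 bytes:

        #include <stdio.h>
        #include <inttypes.h>

        int main(void)
        {
            int bsize     = 500 * 1024 * 1024;  /* 500 MiB per block           */
            int nr_blocks = 16;                 /* 16 blocks -> 8000 MiB total */

            /* If the total extent is accumulated in 32 bits, it wraps modulo
             * 2^32 (shown unsigned here; with signed int the overflow would
             * even be undefined behavior). */
            uint32_t wrapped = (uint32_t)bsize * (uint32_t)nr_blocks;
            uint64_t correct = (uint64_t)bsize * (uint64_t)nr_blocks;

            printf("32-bit result: %" PRIu32 "\n", wrapped);  /* 4093640704 */
            printf("64-bit result: %" PRIu64 "\n", correct);  /* 8388608000 */
            return 0;
        }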

        I looked at ILP64 support, but it only adapts the function 
parameters, not the internally used variables, and it is also not 
available for C.


    As far as I know, this won't fix anything, because it will run into 
all the internal implementation issues with overflow.  The ILP64 feature 
for Fortran exists only to work around the horrors of default integer 
width promotion by Fortran compilers.
     

        I looked at integer-overflow trapping (FPE_INTOVF_TRAP), which 
could help to verify the source of the problem, but it doesn't seem to 
be possible for C. Intel does not offer any built-in integer overflow 
trapping.


    You might be interested in http://blog.regehr.org/archives/1154 and 
linked material therein.  I think it's possible to implement the 
effective equivalent of a hardware trap using the compiler, although I 
don't know any (production) compiler that supports this.
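
    For illustration, here is a minimal sketch of compiler-inserted 
overflow checking with UBSan, assuming a reasonably recent gcc or clang 
(gcc's -ftrapv is a related option that aborts instead of reporting):

    /* Build with, e.g.:
     *   cc -fsanitize=signed-integer-overflow ovf.c && ./a.out
     * UBSan then reports the overflow at runtime instead of silently
     * wrapping. */
    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        volatile int big = INT_MAX;   /* volatile: keep the compiler from folding */
        int r = big + 1;              /* signed overflow: UBSan reports it here   */
        printf("%d\n", r);
        return 0;
    }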
     

        There are ways to circumvent this problem in most cases. It is 
only unavoidable if the logic of a program requires complex, 
non-repeating data structures with sizes of 4 GiB or more. Even then, 
one could split up the filetype and use a different displacement in two 
distinct write calls, as sketched below.
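
        A rough sketch of that splitting workaround: half_filetype below 
is a hypothetical datatype describing only one half of a rank's region, 
so its extent stays well below 4 GiB, and the large offset is carried by 
the view's displacement, which is an MPI_Offset and therefore 64-bit:

        #include <mpi.h>

        /* Hypothetical helper: write a region whose combined extent would
         * reach 4 GiB by setting the view twice, once per half. */
        static void write_in_two_views(MPI_File fh, MPI_Datatype etype,
                                       MPI_Datatype half_filetype,
                                       const void *first_half,
                                       const void *second_half,
                                       MPI_Offset half_extent)
        {
            /* first half of the region, displacement 0 */
            MPI_File_set_view(fh, 0, etype, half_filetype, "native",
                              MPI_INFO_NULL);
            MPI_File_write_all(fh, first_half, 1, etype, MPI_STATUS_IGNORE);

            /* second half, reached via the 64-bit displacement instead of
             * a larger filetype */
            MPI_File_set_view(fh, half_extent, etype, half_filetype,
                              "native", MPI_INFO_NULL);
            MPI_File_write_all(fh, second_half, 1, etype, MPI_STATUS_IGNORE);
        }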

        Still, this problem violates the standard, as it produces 
undefined behavior even when the API is used correctly. The 
implementation should at least warn the user, but should ideally use 
wider integer types in the filetype computations. A user who stumbles on 
this problem will have a hard time debugging it.


    Indeed, this is a problem.  There is an effort to fix the API in 
MPI-4 (see https://github.com/jeffhammond/bigmpi-paper), but as you 
know, there are implementation defects that break correct MPI-3 programs 
that use datatypes to work around the limits of C int.  We were able to 
find a bunch of problems in MPICH using BigMPI, but clearly not all of 
them.
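
    For readers unfamiliar with that trick, here is a rough sketch of the 
kind of datatype workaround meant above (not BigMPI's actual code): a 
byte count larger than INT_MAX is expressed as 2^30-byte chunks plus a 
remainder, glued into a single derived type, so the int count argument 
passed to MPI stays at 1.

    #include <mpi.h>

    /* Hypothetical sketch: build a datatype describing n bytes, where n
     * may exceed INT_MAX, out of 2^30-byte chunks plus a remainder. */
    static void make_big_type(MPI_Count n, MPI_Datatype *newtype)
    {
        const MPI_Count chunk = (MPI_Count)1 << 30;
        int full = (int)(n / chunk);          /* number of whole chunks   */
        int rest = (int)(n % chunk);          /* remaining bytes (< 2^30) */

        MPI_Datatype chunktype, chunks, remainder;
        MPI_Type_contiguous((int)chunk, MPI_BYTE, &chunktype);
        MPI_Type_contiguous(full, chunktype, &chunks);
        MPI_Type_contiguous(rest, MPI_BYTE, &remainder);

        int          blocklens[2] = { 1, 1 };
        MPI_Aint     displs[2]    = { 0, (MPI_Aint)full * (MPI_Aint)chunk };
        MPI_Datatype types[2]     = { chunks, remainder };
        MPI_Type_create_struct(2, blocklens, displs, types, newtype);
        MPI_Type_commit(newtype);

        MPI_Type_free(&chunktype);
        MPI_Type_free(&chunks);
        MPI_Type_free(&remainder);
    }

    A send or write of those n bytes then uses count = 1 with the 
resulting type.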

    Jeff
     

        Thank you very much for reading everything ;)

        Kind Regards,

        Nils





    -- 
    Jeff Hammond
    jeff.scie...@gmail.com
    http://jeffhammond.github.io/


