Thanks for analyzing this, Gilles - I guess this is a question for Edgar or 
someone who cares about MPI-IO. Should we worry about this for 1.10?

I’m inclined not to delay 1.10.3 over this one, but am open to contrary opinions.


> On May 26, 2016, at 7:22 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> In my environment, the root cause of MPI_File_open failing seems to be NFS.
> 
> MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
>                   MPI_INFO_NULL, &lFile);
> 
> 
> If the file does not already exist, rank 0 creates the file, MPI_Bcast(), 
> and then every rank opens the file.
> 
> That works fine when all the tasks run on the same node as rank 0, but 
> tasks on other nodes fail when opening the file.
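> 
> Roughly, the open path behaves like this simplified sketch (an illustration of 
> the behavior described above, not the actual Open MPI/ROMIO code; the function 
> name is mine):
> 
> /* needs <mpi.h>, <fcntl.h>, <unistd.h>, <errno.h> */
> static int create_then_open(int rank, const char *path)
> {
>     int err = 0;
>     if (rank == 0) {
>         /* rank 0 creates the file */
>         int fd = open(path, O_CREAT | O_RDWR, 0666);
>         if (fd < 0) err = errno; else close(fd);
>     }
>     /* the result is broadcast so every rank agrees the file now exists */
>     MPI_Bcast(&err, 1, MPI_INT, 0, MPI_COMM_WORLD);
>     if (err != 0) return err;
>     /* then every rank opens the file; on NFS, this is the step that fails
>        on nodes other than rank 0's */
>     int fd = open(path, O_RDWR);
>     if (fd < 0) return errno;
>     close(fd);
>     return 0;
> }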
> 
> 
> I ran some more tests and observed quite consistent behavior:
> 
> on n1:
> 
> nc -l 6666 && touch temp
> 
> on n0:
> 
> echo "" | nc n1 6666 ; while true; do date ; ls -l temp && break ; sleep 1; 
> done
> 
> 
> On n0, the temp file is found immediately; no problem so far.
> 
> 
> Now, if I run
> 
> on n1:
> 
> nc -l 6666 && touch temp2
> 
> on n0:
> 
> ls -l temp2; echo "" | nc n1 6666 ; while true; do date ; ls -l temp2 && break ; sleep 1; done
> 
> 
> It takes a few iterations before n0 finds temp2.
> 
> The only difference is that n0 looked up this file before it was created, and 
> it somehow caches that information (i.e. "the file does not exist"); it takes 
> a while before the cache gets updated (i.e. "the file now exists").
> 
> I cannot remember whether this is expected behavior for NFS, nor whether it 
> can be changed with appropriate tuning.
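> 
> If tuning is possible, it is probably on the NFS client side; something along 
> these lines (untested here; the server and mount point are illustrative):
> 
> # in /etc/fstab (or on the mount command line): do not cache negative
> # lookups ("file does not exist") on the client
> nfsserver:/export  /mnt/nfs  nfs  lookupcache=positive  0 0
> # more drastic alternative: disable attribute caching entirely
> nfsserver:/export  /mnt/nfs  nfs  actimeo=0  0 0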
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 5/27/2016 10:32 AM, Gilles Gouaillardet wrote:
>> Ralph,
>> 
>> 
>> The cxx_win_attr issue is dealt with at 
>> https://github.com/open-mpi/ompi/pull/1473
>> IIRC, only big endian and/or sizeof(Fortran integer) > sizeof(int) 
>> configurations are impacted.
>> 
>> 
>> The second error seems a bit weirder.
>> 
>> Once in a while, MPI_File_open fails, and when it does, it always fails 
>> silently.
>> 
>> In that case (MPI_File_open failed), with --mca mpi_param_check true, the 
>> next MPI-IO calls also fail silently.
>> 
>> With --mca mpi_param_check false (or if Open MPI was configured with 
>> --without-mpi-param-check), something goes wrong in MPI_File_close.
>> 
>> 
>> That raises several questions:
>> 
>> - why is the MPI-IO default behavior to fail silently?
>> 
>> (point-to-point or collective operations abort by default)
>> 
>> - why does MPI_File_open fail once in a while?
>> 
>> (an Open MPI bug? a ROMIO bug? an intermittent failure caused by the NFS 
>> filesystem?)
>> 
>> - is there a bug in the test?
>> 
>> (for example, the program could abort with error code 77 (skip) if 
>> MPI_File_open fails)
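>> 
>> i.e. something like the following sketch (the default error handler attached 
>> to files is MPI_ERRORS_RETURN, which is why a failed MPI_File_open is silent 
>> unless the return code is checked):
>> 
>> /* needs <stdio.h> and <mpi.h> */
>> MPI_File fh;
>> /* alternative: make file errors fatal, like point-to-point errors:
>>    MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_ARE_FATAL); */
>> int rc = MPI_File_open(MPI_COMM_WORLD, "temp",
>>                        MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
>> if (MPI_SUCCESS != rc) {
>>     fprintf(stderr, "MPI_File_open failed, skipping this test\n");
>>     MPI_Abort(MPI_COMM_WORLD, 77);   /* 77 is the usual "skip" exit code */
>> }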
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
>> On 5/26/2016 11:14 PM, Ralph Castain wrote:
>>> I’m seeing three errors in MTT today - of these, I only consider the first 
>>> two to be of significant concern:
>>> 
>>> onesided/cxx_win_attr : https://mtt.open-mpi.org/index.php?do_redir=2326
>>> [**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
>>> Win::Get_attr: Got wrong value for disp unit
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> datatype/idx_null : https://mtt.open-mpi.org/index.php?do_redir=2327
>>> home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x82)[0x2aaaab7ef70a]
>>> [mpi031:06729] [ 2]
>>> /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x96)[0x2aaaab7ee047]
>>> [mpi031:06729] [ 3]
>>> /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(+0xd0ed8)[0x2aaaab7eced8]
>>> [mpi031:06729] [ 4]
>>> /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(ompi_file_close+0x101)[0x2aaaaab2963c]
>>> [mpi031:06729] [ 5]
>>> /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(PMPI_File_close+0x18)[0x2aaaaab83216]
>>> [mpi031:06729] [ 6] datatype/idx_null[0x400cb2]
>>> [mpi031:06729] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c2f21ed1d]
>>> [mpi031:06729] [ 8] datatype/idx_null[0x400a89]
>>> [mpi031:06729] *** End of error message ***
>>> [mpi031:06732] *** Process received signal ***
>>> [mpi031:06732] Signal: Segmentation fault (11)
>>> [mpi031:06732] Signal code: Address not mapped (1)
>>> [mpi031:06732] Failing at address: 0x2ab2aba3cea0
>>> [mpi031:06732] [ 0] /lib64/libpthread.so.0[0x3c2f60f710]
>>> [mpi031:06732] [ 1]
>>> 
>>> dynamic/loop_spawn : https://mtt.open-mpi.org/index.php?do_redir=2328
>>> [p10a601:159913] too many retries sending message to 0x000b:0x00427ad6, 
>>> giving up
>>> -------------------------------------------------------
>>> Child job 8 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> 
>>> 
>>> 
>>> 
>> 
>> 
