Thanks for analyzing this, Gilles - I guess this is a question for Edgar or someone who cares about MPI-IO. Should we worry about this for 1.10?
I’m inclined not to delay 1.10.3 over this one, but am open to contrary opinions.

> On May 26, 2016, at 7:22 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> In my environment, the root cause of MPI_File_open failing seems to be NFS.
>
> MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
>               MPI_INFO_NULL, &lFile);
>
> If the file does not already exist, rank 0 creates the file, MPI_Bcast()s,
> and then every rank opens the file.
>
> That works fine when all the tasks run on the same node as rank 0, but
> tasks on other nodes fail when opening the file.
>
> I ran some more tests and observed quite consistent behavior:
>
> on n1:
>
> nc -l 6666 && touch temp
>
> on n0:
>
> echo "" | nc n1 6666 ; while true; do date ; ls -l temp && break ; sleep 1; done
>
> On n0, the temp file is immediately found, so no problem so far.
>
> Now, if I run
>
> on n1:
>
> nc -l 6666 && touch temp2
>
> on n0:
>
> ls -l temp2; echo "" | nc n1 6666 ; while true; do date ; ls -l temp2 && break ; sleep 1; done
>
> it takes a few iterations before n0 finds temp2.
>
> The only difference is that n0 looked up the file before it was created, and it
> somehow caches that information (i.e. that the file does not exist), and it takes
> a while before the cache gets updated (i.e. to reflect that the file now exists).
>
> I cannot remember whether this is what should be expected from NFS, nor
> whether it can be changed with appropriate tuning.
>
> Cheers,
>
> Gilles
>
> On 5/27/2016 10:32 AM, Gilles Gouaillardet wrote:
>> Ralph,
>>
>> The cxx_win_attr issue is dealt with at
>> https://github.com/open-mpi/ompi/pull/1473
>> IIRC, only big endian and/or sizeof(Fortran integer) > sizeof(int) is
>> impacted.
>>
>> The second error seems a bit weirder.
>>
>> Once in a while, MPI_File_open fails, and when it fails, it always fails
>> silently.
>>
>> In this case (after MPI_File_open has failed), if --mca mpi_param_check true,
>> then subsequent MPI-IO calls also fail silently.
>>
>> If --mca mpi_param_check false (or Open MPI was configure'd with
>> --without-mpi-param-check), then something goes wrong in MPI_File_close.
>>
>> That raises several questions:
>>
>> - why is the MPI-IO default behavior to fail silently?
>>   (point-to-point or collective operations abort by default)
>>
>> - why does MPI_File_open fail once in a while?
>>   (Open MPI bug? ROMIO bug? intermittent failure caused by the NFS filesystem?)
>>
>> - is there a bug in the test?
>> For example, the program could abort with error code 77 (skip) if
>> MPI_File_open fails.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 5/26/2016 11:14 PM, Ralph Castain wrote:
>>> I’m seeing three errors in MTT today - of these, I only consider the first
>>> two to be of significant concern:
>>>
>>> onesided/cxx_win_attr : https://mtt.open-mpi.org/index.php?do_redir=2326
>>> [**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
>>> Win::Get_attr: Got wrong value for disp unit
>>> --------------------------------------------------------------------------
>>>
>>> datatype/idx_null : https://mtt.open-mpi.org/index.php?do_redir=2327
>>> /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x82)[0x2aaaab7ef70a]
>>> [mpi031:06729] [ 2] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x96)[0x2aaaab7ee047]
>>> [mpi031:06729] [ 3] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(+0xd0ed8)[0x2aaaab7eced8]
>>> [mpi031:06729] [ 4] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(ompi_file_close+0x101)[0x2aaaaab2963c]
>>> [mpi031:06729] [ 5] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(PMPI_File_close+0x18)[0x2aaaaab83216]
>>> [mpi031:06729] [ 6] datatype/idx_null[0x400cb2]
>>> [mpi031:06729] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c2f21ed1d]
>>> [mpi031:06729] [ 8] datatype/idx_null[0x400a89]
>>> [mpi031:06729] *** End of error message ***
>>> [mpi031:06732] *** Process received signal ***
>>> [mpi031:06732] Signal: Segmentation fault (11)
>>> [mpi031:06732] Signal code: Address not mapped (1)
>>> [mpi031:06732] Failing at address: 0x2ab2aba3cea0
>>> [mpi031:06732] [ 0] /lib64/libpthread.so.0[0x3c2f60f710]
>>> [mpi031:06732] [ 1]
>>>
>>> dynamic/loop_spawn : https://mtt.open-mpi.org/index.php?do_redir=2328
>>> [p10a601:159913] too many retries sending message to 0x000b:0x00427ad6, giving up
>>> -------------------------------------------------------
>>> Child job 8 terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
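[Editor's note] Below is a minimal sketch, not the actual test source, that mimics at the application level the sequence Gilles describes (rank 0 creates the file, MPI_Bcast() synchronizes, then every rank opens it collectively) and the suggested skip behavior: if the collective MPI_File_open fails on any rank, the test exits with code 77 (skip) instead of failing silently. The file name and variable names are illustrative.

/* Hedged sketch: rank 0 creates the file, MPI_Bcast() is the
 * synchronization point, then all ranks open it collectively.
 * The outcome of the collective open is agreed on with an
 * Allreduce so every rank takes the same exit path. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, created = 0, ok, rc;
    MPI_File lFile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* rank 0 creates the file first */
        rc = MPI_File_open(MPI_COMM_SELF, "temp",
                           MPI_MODE_RDWR | MPI_MODE_CREATE,
                           MPI_INFO_NULL, &lFile);
        created = (MPI_SUCCESS == rc);
        if (created) {
            MPI_File_close(&lFile);
        }
    }

    /* no rank proceeds to the collective open before rank 0 is done */
    MPI_Bcast(&created, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* collective open; on NFS, remote nodes may still see a stale
     * "file does not exist" answer here, as described above */
    rc = MPI_File_open(MPI_COMM_WORLD, "temp",
                       MPI_MODE_RDWR | MPI_MODE_CREATE,
                       MPI_INFO_NULL, &lFile);

    /* agree on the outcome so every rank takes the same path */
    ok = (MPI_SUCCESS == rc);
    MPI_Allreduce(MPI_IN_PLACE, &ok, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

    if (!ok) {
        if (MPI_SUCCESS == rc) {
            MPI_File_close(&lFile);
        }
        if (0 == rank) {
            fprintf(stderr, "MPI_File_open failed, skipping test\n");
        }
        MPI_Finalize();
        return 77;   /* skip, as suggested above */
    }

    MPI_File_close(&lFile);
    MPI_Finalize();
    return 0;
}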
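[Editor's note] On the question of why MPI-IO failures are silent by default: per the MPI standard, the default error handler for file handles is MPI_ERRORS_RETURN (unlike communicators, which default to MPI_ERRORS_ARE_FATAL), so an error from MPI_File_open is reported only through its return code. A small sketch, assuming one simply wants file errors to abort the job the way point-to-point errors do, is to set the handler on MPI_FILE_NULL before opening:

/* Sketch: make MPI-IO errors fatal instead of silently returning.
 * Setting the error handler on MPI_FILE_NULL changes the default
 * handler that subsequent MPI_File_open calls inherit. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* from now on, a failed MPI_File_open aborts the job instead of
     * returning an error code that the test might ignore */
    MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_ARE_FATAL);

    MPI_File_open(MPI_COMM_WORLD, "temp",
                  MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}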