Yes, I would love to have a copy of that test program, if you could share it. I'll add it to our internal test suite.
On Nov 5, 2014, at 5:08 AM, <michael.rach...@dlr.de> <michael.rach...@dlr.de> wrote:

> Dear Gilles,
>
> My small downsized Fortran test program for testing the shared memory feature
> (MPI_WIN_ALLOCATE_SHARED, MPI_WIN_SHARED_QUERY, C_F_POINTER) presumes for
> simplicity that all processes are running on the same node (i.e. the
> communicator containing the procs on the same node is just MPI_COMM_WORLD).
> So the hanging of MPI_WIN_ALLOCATE_SHARED when running on 2 nodes could only
> be observed with our large CFD code.
>
> Are the Open MPI developers nevertheless interested in that test program?
>
> Greetings
> Michael
>
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gilles Gouaillardet
> Sent: Wednesday, November 5, 2014 10:46 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Bug in OpenMPI-1.8.3: storage limitation in shared
> memory allocation (MPI_WIN_ALLOCATE_SHARED) in Ftn-code
>
> Michael,
>
> could you please share your test program so we can investigate it?
>
> Cheers,
>
> Gilles
>
> On 2014/10/31 18:53, michael.rach...@dlr.de wrote:
>> Dear developers of OPENMPI,
>>
>> There remains a hang observed in MPI_WIN_ALLOCATE_SHARED.
>>
>> But first:
>> Thank you for your advice to employ shmem_mmap_relocate_backing_file = 1.
>> It turned out that the bad (but silent) allocations by
>> MPI_WIN_ALLOCATE_SHARED, which I had observed in the past after ~140 MB of
>> allocated shared memory, were indeed caused by too little storage being
>> available for the shared-memory backing files. Applying the MCA parameter
>> resolved the problem.
>>
>> Now the allocation of shared data windows by MPI_WIN_ALLOCATE_SHARED in the
>> OpenMPI-1.8.3 release version works on both clusters!
>> I tested it both with my small shared-memory Fortran test program and with
>> our Fortran CFD code.
>> It worked even when allocating 1000 shared data windows containing a total
>> of 40 GB. Very well.
>>
>> But now I come to the remaining problem:
>> Following Jeff's email of 2014-10-24 (attached below), we alternatively
>> installed and tested the bugfixed Open MPI nightly tarball of 2014-10-24
>> (openmpi-dev-176-g9334abc.tar.gz) on Cluster5.
>> That version worked well when our CFD code was running on only 1 node.
>> But I now observe that when running the CFD code on 2 nodes with 2
>> processes per node, after a total of 200 MB of data has been allocated in
>> 20 shared windows, the allocation of the 21st window fails: all 4 processes
>> enter MPI_WIN_ALLOCATE_SHARED but never leave it. The code hangs in that
>> routine, without any message.
>>
>> In contrast, that bug does NOT occur with the OpenMPI-1.8.3 release version
>> with the same program on the same machine.
>>
>> That means for you:
>> In openmpi-dev-176-g9334abc.tar.gz the newly introduced bugfix concerning
>> shared memory allocation may not yet be correctly coded,
>> or that version contains another new bug in shared memory allocation
>> compared to the working(!) 1.8.3 release version.
>>
>> Greetings to you all
>> Michael Rachner
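As a usage note, an MCA parameter like the one Michael mentions above does not require any code change; it can be passed on the mpiexec command line or exported in the environment before launching. The parameter name is taken from the message above, and the command lines are only an illustration:

  mpiexec --mca shmem_mmap_relocate_backing_file 1 -np 4 ./a.out

  export OMPI_MCA_shmem_mmap_relocate_backing_file=1
  mpiexec -np 4 ./a.out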
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
>> Sent: Friday, October 24, 2014 10:45 PM
>> To: Open MPI User's List
>> Subject: Re: [OMPI users] Bug in OpenMPI-1.8.3: storage limitation in
>> shared memory allocation (MPI_WIN_ALLOCATE_SHARED) in Ftn-code
>>
>> Nathan tells me that this may well be related to a fix that was literally
>> just pulled into the v1.8 branch today:
>>
>> https://github.com/open-mpi/ompi-release/pull/56
>>
>> Would you mind testing any nightly tarball after tonight? (i.e., the
>> v1.8 tarballs generated tonight will be the first ones to contain this fix)
>>
>> http://www.open-mpi.org/nightly/master/
>>
>>
>> On Oct 24, 2014, at 11:46 AM, <michael.rach...@dlr.de>
>> <michael.rach...@dlr.de> wrote:
>>
>>> Dear developers of OPENMPI,
>>>
>>> I am running a small downsized Fortran test program for shared memory
>>> allocation (using MPI_WIN_ALLOCATE_SHARED and MPI_WIN_SHARED_QUERY)
>>> on only 1 node of 2 different Linux clusters with OpenMPI-1.8.3 and
>>> Intel-14.0.4 / Intel-13.0.1, respectively.
>>>
>>> The program simply allocates a sequence of shared data windows, each
>>> consisting of 1 integer*4 array.
>>> None of the windows is freed, so the amount of data allocated in shared
>>> windows grows during the course of the execution.
>>>
>>> That worked well on the 1st cluster (Laki, having 8 procs per node)
>>> even when allocating 1000 shared windows, each with 50000 integer*4 array
>>> elements, i.e. a total of 200 MBytes.
>>> On the 2nd cluster (Cluster5, having 24 procs per node) it also worked on
>>> the login node, but it did NOT work on a compute node.
>>> In that error case there occurs something like an internal storage limit
>>> of ~140 MB for the total storage allocated in all shared windows.
>>> When that limit is reached, all later shared memory allocations fail (but
>>> silently).
>>> So the first attempt to use such a badly allocated shared data window
>>> results in a bus error due to the bad storage address encountered.
>>>
>>> That strange behavior could be observed with the small test program as
>>> well as with my large Fortran CFD code.
>>> If the error occurs, then it occurs with both codes, and both at a storage
>>> limit of ~140 MB.
>>> I found that this storage limit depends only weakly on the number of
>>> processes (for np = 2, 4, 8, 16, 24 it is 144.4, 144.0, 141.0, 137.0,
>>> 132.2 MB).
>>>
>>> Note that the shared memory storage available on both clusters was very
>>> large (many GB of free memory).
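For reference, here is a minimal sketch of the allocation pattern described above: a loop that keeps opening node-shared windows with MPI_WIN_ALLOCATE_SHARED, maps each one with MPI_WIN_SHARED_QUERY and C_F_POINTER, and never frees them. It is not Michael's actual test program; the program name, nwin, iarr and the loop structure are illustrative, and it assumes all ranks live on one node and an mpi module that provides the C_PTR-based interfaces (as the test in this thread evidently does).

  program sharedmem_sketch
    use mpi
    use, intrinsic :: iso_c_binding
    implicit none
    integer, parameter :: nwin = 1000          ! number of windows to allocate
    integer, parameter :: idim_1 = 50000       ! integer*4 elements per window
    integer :: ierr, myrank, iwin, disp_unit
    integer :: win(nwin)                       ! window handles, never freed
    integer(kind=MPI_ADDRESS_KIND) :: winsize
    type(c_ptr) :: baseptr
    integer(4), pointer :: iarr(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    do iwin = 1, nwin
       ! rank 0 provides the whole segment; the other ranks contribute 0 bytes
       winsize = 0
       if (myrank == 0) winsize = int(idim_1, MPI_ADDRESS_KIND) * 4
       call MPI_WIN_ALLOCATE_SHARED(winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, &
                                    baseptr, win(iwin), ierr)
       ! every rank looks up rank 0's segment and maps it onto a Fortran pointer
       call MPI_WIN_SHARED_QUERY(win(iwin), 0, winsize, disp_unit, baseptr, ierr)
       call C_F_POINTER(baseptr, iarr, [idim_1])
       ! first write into the segment (in the report, the bus error appeared here)
       if (myrank == 0) iarr(:) = iwin
       call MPI_BARRIER(MPI_COMM_WORLD, ierr)
    end do

    call MPI_FINALIZE(ierr)
  end program sharedmem_sketch

Each pass adds idim_1 * 4 bytes of shared storage, so the reported ceiling corresponds to 722 windows * 50000 elements * 4 bytes/element = 144.4 MB, matching the log quoted below.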
>>>
>>> Here is the error message when running with np=2 and an array dimension
>>> of idim_1=50000 for the integer*4 array allocated per shared window, on
>>> the compute node of Cluster5.
>>> In that case the error occurred at the 723rd shared window, which is the
>>> first badly allocated window in that case
>>> (722 successfully allocated shared windows * 50000 array elements * 4
>>> Bytes/el. = 144.4 MB):
>>>
>>> [1,0]<stdout>: ========on nodemaster: iwin= 722 :
>>> [1,0]<stdout>:   total storage [MByte] alloc. in shared windows so far: 144.400000000000
>>> [1,0]<stdout>: =========== allocation of shared window no. iwin= 723
>>> [1,0]<stdout>:   starting now with idim_1= 50000
>>> [1,0]<stdout>: ========on nodemaster for iwin= 723 : before writing on shared mem
>>> [1,0]<stderr>:[r5i5n13:12597] *** Process received signal ***
>>> [1,0]<stderr>:[r5i5n13:12597] Signal: Bus error (7)
>>> [1,0]<stderr>:[r5i5n13:12597] Signal code: Non-existant physical address (2)
>>> [1,0]<stderr>:[r5i5n13:12597] Failing at address: 0x7fffe08da000
>>> [1,0]<stderr>:[r5i5n13:12597] [ 0] /lib64/libpthread.so.0(+0xf800)[0x7ffff6d67800]
>>> [1,0]<stderr>:[r5i5n13:12597] [ 1] ./a.out[0x408a8b]
>>> [1,0]<stderr>:[r5i5n13:12597] [ 2] ./a.out[0x40800c]
>>> [1,0]<stderr>:[r5i5n13:12597] [ 3] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7ffff69fec36]
>>> [1,0]<stderr>:[r5i5n13:12597] [ 4] ./a.out[0x407f09]
>>> [1,0]<stderr>:[r5i5n13:12597] *** End of error message ***
>>> [1,1]<stderr>:forrtl: error (78): process killed (SIGTERM)
>>> [1,1]<stderr>:Image              PC                Routine    Line     Source
>>> [1,1]<stderr>:libopen-pal.so.6   00007FFFF4B74580  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:libmpi.so.1        00007FFFF7267F3E  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:libmpi.so.1        00007FFFF733B555  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:libmpi.so.1        00007FFFF727DFFD  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:libmpi_mpifh.so.2  00007FFFF779BA03  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:a.out              0000000000408D15  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:a.out              000000000040800C  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:libc.so.6          00007FFFF69FEC36  Unknown    Unknown  Unknown
>>> [1,1]<stderr>:a.out              0000000000407F09  Unknown    Unknown  Unknown
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 0 with PID 12597 on node r5i5n13
>>> exited on signal 7 (Bus error).
>>> --------------------------------------------------------------------------
>>>
>>> The small Fortran test program was built and run with:
>>>   mpif90 sharedmemtest.f90
>>>   mpiexec -np 2 -bind-to core -tag-output ./a.out
>>>
>>> Why does it work on the Laki (both on the login node and on a compute
>>> node) as well as on the login node of Cluster5, but fails on a compute
>>> node of Cluster5?
>>>
>>> Greetings
>>> Michael Rachner
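Given the resolution reported in Michael's 2014-10-31 follow-up earlier in this thread (setting shmem_mmap_relocate_backing_file = 1), the ~140 MB ceiling was apparently not an MPI limit but a lack of free space in the filesystem holding the shared-memory backing files, which evidently differed between the Laki nodes and the Cluster5 compute nodes. One quick way to look for such a difference is to compare free space in the usual temporary filesystems on a login node and a compute node, for example:

  df -h /tmp /dev/shm

(the paths are only illustrative; where the backing files actually live depends on TMPDIR and the Open MPI session-directory settings on each cluster).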
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/