[OMPI devel] Looking for a replacement call for repeated call to MPI_IPROBE
Hello,

I am trying to figure out the most appropriate MPI calls for a certain portion of my code. I will describe the situation here:

Each cell (i,j) of my array A is being updated by a calculation that depends on the values of 1 or 2 of the 4 possible neighbors A(i+1,j), A(i-1,j), A(i,j+1), and A(i,j-1). Say, for example, A(i,j)=A(i-1,j)*A(i,j-1). The thing is, the values of the neighbors A(i-1,j) and A(i,j-1) cannot be used until an auxiliary array B has been updated from 0 to 1. The values B(i-1,j) and B(i,j-1) are changed from 0 -> 1 after the values A(i-1,j) and A(i,j-1) have been communicated to the proc that contains cell (i,j), since cells (i-1,j) and (i,j-1) belong to different procs. Here is pseudocode for how I have the algorithm implemented (in Fortran):

   do while (B(ii,jj,kk).eq.0)
      if (probe_for_message(i0,j0,k0,this_sc)) then
         my_ibuf(1)=my_ibuf(1)+1
         A(i0,j0,k0)=this_sc
         B(i0,j0,k0)=1
      end if
   end do

The function 'probe_for_message' uses an 'MPI_IPROBE' to see if 'MPI_ANY_SOURCE' has a message for my current proc. If there is a message, the function returns a true logical and calls 'MPI_RECV', receiving (i0,j0,k0,this_sc) from the proc that has the message. This works! My concern is that I am probing repeatedly inside the while loop until I receive a message from a proc such that ii=i0, jj=j0, kk=k0. I could potentially call MPI_IPROBE many, many times before this happens, and I'm worried that this is a messy way of doing it. Could I "break" the MPI probe call? Are there MPI routines that would allow me to accomplish the same thing in a more formal or safer way? Maybe a persistent communication or something? For very large computations with many procs, I am observing a hang which I suspect may be due to this. I observe it with openmpi-1.4.4, and the hang seems to disappear if I use mvapich. Any suggestions/comments would be greatly appreciated. Thanks so much!

--
JM
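For reference, here is a minimal sketch of what a helper like 'probe_for_message' could look like, assuming the update is packed as four double precision values (i0, j0, k0, value) and sent with a hypothetical tag TAG_CELL over MPI_COMM_WORLD; the actual routine and message layout in the code above may differ.

   logical function probe_for_message(i0, j0, k0, this_sc)
      use mpi
      implicit none
      integer, intent(out)          :: i0, j0, k0
      double precision, intent(out) :: this_sc
      double precision :: buf(4)
      integer :: status(MPI_STATUS_SIZE), ierr
      integer, parameter :: TAG_CELL = 99   ! hypothetical tag, not from the original code
      logical :: flag

      ! Non-blocking check: has any rank posted a matching send for this proc?
      call MPI_IPROBE(MPI_ANY_SOURCE, TAG_CELL, MPI_COMM_WORLD, flag, status, ierr)

      if (flag) then
         ! Receive from the specific source MPI_IPROBE reported, so this RECV
         ! matches the probed message rather than one from another sender.
         call MPI_RECV(buf, 4, MPI_DOUBLE_PRECISION, status(MPI_SOURCE), &
                       TAG_CELL, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         i0 = nint(buf(1))
         j0 = nint(buf(2))
         k0 = nint(buf(3))
         this_sc = buf(4)
      end if

      probe_for_message = flag
   end function probe_for_message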
Re: [OMPI devel] Looking for a replacement call for repeated call to MPI_IPROBE
Thank you for the feedback. I actually just changed the repeated probing for a message to a blocking MPI_RECV, since the processor waiting to receive does nothing but repeatedly probe until the message is there anyway. This also works, and it makes more sense to do it this way. However, it did not fix my hanging issue. I am wondering if that has something to do with the size of the buffer I use in MPI_BUFFER_ATTACH; I believe I am following the proper MPI_BSEND_OVERHEAD protocol. I am waiting on the admins to install openmpi-1.6.3, and hoping that maybe this will fix my issue.

On Sat, Jan 26, 2013 at 7:32 AM, Jeff Squyres (jsquyres) wrote:

> First off, 1.4.4 is fairly ancient. You might want to try upgrading to 1.6.3.
>
> Second, you might want to use non-blocking receives for B such that you can MPI_WAITALL, or perhaps MPI_WAITSOME or MPI_WAITANY, to wait for some/all of the values to arrive in B. This keeps any looping down in MPI (i.e., as close to the hardware as possible).

--
JM
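A minimal sketch of the non-blocking pattern Jeff suggests, assuming the receiving proc knows up front how many neighbor updates it expects (n_expected) and that each message carries the same four double precision values as above; n_expected, TAG_CELL, and the buffer layout are assumptions rather than details from the original code.

   ! Assumes 'use mpi', and A and B declared as in the original code.
   integer, parameter :: n_expected = 8        ! hypothetical number of expected updates
   integer, parameter :: TAG_CELL   = 99       ! hypothetical tag
   double precision   :: bufs(4, n_expected)
   integer            :: reqs(n_expected), idx, k, ierr

   ! Pre-post one receive per expected update; the polling loop now lives inside MPI.
   do k = 1, n_expected
      call MPI_IRECV(bufs(1,k), 4, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                     TAG_CELL, MPI_COMM_WORLD, reqs(k), ierr)
   end do

   ! Handle the updates in whatever order they complete.
   do k = 1, n_expected
      call MPI_WAITANY(n_expected, reqs, idx, MPI_STATUS_IGNORE, ierr)
      A(nint(bufs(1,idx)), nint(bufs(2,idx)), nint(bufs(3,idx))) = bufs(4,idx)
      B(nint(bufs(1,idx)), nint(bufs(2,idx)), nint(bufs(3,idx))) = 1
   end do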
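On the MPI_BUFFER_ATTACH question: the attached buffer has to be large enough to hold every buffered send that can be outstanding at the same time, each padded by MPI_BSEND_OVERHEAD. Here is a sizing sketch under the same assumed four-value message layout, with a hypothetical max_pending bound on in-flight MPI_BSENDs.

   double precision, allocatable :: bsend_buf(:)
   integer :: msg_bytes, buf_bytes, ierr
   integer, parameter :: max_pending = 64      ! hypothetical upper bound on outstanding MPI_BSENDs

   ! Bytes needed to hold one packed message of four doubles.
   call MPI_PACK_SIZE(4, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, msg_bytes, ierr)

   ! Every buffered send also consumes MPI_BSEND_OVERHEAD bytes of bookkeeping.
   buf_bytes = max_pending * (msg_bytes + MPI_BSEND_OVERHEAD)

   allocate(bsend_buf(buf_bytes/8 + 1))        ! round up to whole 8-byte doubles
   call MPI_BUFFER_ATTACH(bsend_buf, buf_bytes, ierr)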
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi,

Can I please be removed from this list?

Thanks,
Jeremy

On Thu, Sep 15, 2016 at 8:44 AM, r...@open-mpi.org wrote:

> I don’t think a collision was the issue here. We were taking the mpirun-generated jobid and passing it thru the hash, thus creating an incorrect and invalid value. What I’m more surprised by is that it doesn’t -always- fail. Only thing I can figure is that, unlike with PMIx, the usock oob component doesn’t check the incoming identifier of the connecting proc to see if it is someone it knows. So unless you just happened to hash into a daemon jobid form, it would accept the connection (even though the name wasn’t correct).
>
> I think this should fix the issue. Let’s wait and see
>
> On Sep 15, 2016, at 4:47 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> I just realized i screwed up my test, and i was missing some relevant info...
> So on one hand, i fixed a bug in singleton,
> But on the other hand, i cannot tell whether a collision was involved in this issue
>
> Cheers,
>
> Gilles
>
> Joshua Ladd wrote:
> Great catch, Gilles! Not much of a surprise though.
>
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P
>
> Josh
>
> On Thursday, September 15, 2016, Gilles Gouaillardet wrote:
>
>> Eric,
>>
>> a bug has been identified, and a patch is available at
>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>
>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow, it might be easier for you to update it and mpirun -np 1 ./a.out instead of ./a.out
>>
>> basically, increasing verbosity runs some extra code, which include sprintf.
>> so yes, it is possible to crash an app by increasing verbosity by running into a bug that is hidden under normal operation.
>> my intuition suggests this is quite unlikely ... if you can get a core file and a backtrace, we will soon find out
>>
>> Cheers,
>>
>> Gilles
>>
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>
>>> Ok,
>>>
>>> one test segfaulted *but* I can't tell if it is the *same* bug because there has been a segfault:
>>>
>>> stderr:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>
>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
>>> ...
>>> ...
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>
>>> stdout:
>>>
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems. This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>> orte_ess_init failed
>>> --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems. This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>> ompi_mpi_init: ompi_rte_init failed
>>> --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
>>> --------------------------------------------------------------------------
[OMPI devel] Please remove me from this list
--
JM