Re: [OMPI users] mpi send/recv pair hangin

2018-04-10 Thread Nathan Hjelm
Using icc will not change anything unless there is a bug in the gcc version. I personally never build Open MPI with icc, as it is slow and provides no benefit over gcc these days. I do, however, use ifort for the Fortran bindings. -Nathan > On Apr 10, 2018, at 5:56 AM, Reuti wrote: > > >>> Am

Re: [OMPI users] mpi send/recv pair hangin

2018-04-10 Thread Reuti
> On 10.04.2018 at 13:37, Noam Bernstein wrote: > >> On Apr 10, 2018, at 4:20 AM, Reuti wrote: >> >>> >>> On 10.04.2018 at 01:04, Noam Bernstein wrote: >>> On Apr 9, 2018, at 6:36 PM, George Bosilca wrote: Noam, I have a few questions for you. According to your o

Re: [OMPI users] mpi send/recv pair hangin

2018-04-10 Thread Noam Bernstein
> On Apr 10, 2018, at 4:20 AM, Reuti wrote: > >> >> On 10.04.2018 at 01:04, Noam Bernstein wrote: >> >>> On Apr 9, 2018, at 6:36 PM, George Bosilca wrote: >>> >>> Noam, >>> >>> I have a few questions for you. According to

Re: [OMPI users] mpi send/recv pair hangin

2018-04-10 Thread Reuti
> On 10.04.2018 at 01:04, Noam Bernstein wrote: > >> On Apr 9, 2018, at 6:36 PM, George Bosilca wrote: >> >> Noam, >> >> I have a few questions for you. According to your original email you are using >> OMPI 3.0.1 (but the hang can also be reproduced with 3.0.0). > > Correct. > >> Also

Re: [OMPI users] mpi send/recv pair hangin

2018-04-09 Thread Noam Bernstein
On Apr 9, 2018, at 6:36 PM, George Bosilca wrote: Noam, I have a few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with 3.0.0). Correct. Also according to your stack trace I assume it is an x86_64, compiled with

Re: [OMPI users] mpi send/recv pair hangin

2018-04-09 Thread George Bosilca
Noam, I have a few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with 3.0.0). Also, according to your stack trace, I assume it is an x86_64, compiled with icc. Is your application multithreaded? How did you initialize MPI (which
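A minimal sketch (in C, an illustration only, not the VASP source) of the kind of information George is asking for here: how MPI was initialized and which thread support level the library actually granted. A purely single-threaded code would typically just call MPI_Init().

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request the highest thread level; the library reports what it grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("requested MPI_THREAD_MULTIPLE (%d), provided level = %d\n",
               MPI_THREAD_MULTIPLE, provided);

    MPI_Finalize();
    return 0;
}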

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread George Bosilca
Right, it has nothing to do with the tag. The sequence number is an internal counter that helps OMPI deliver messages in the MPI-required order (FIFO ordering per communicator per peer). Thanks for offering your help to debug this issue. We'll need to figure out how this can happen, and we w
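A small sketch (assuming two ranks; not from the thread) of the ordering rule those sequence numbers enforce: messages from one peer on one communicator are matched in the order they were sent, regardless of tag, when the receiver uses MPI_ANY_TAG.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int a = 1, b = 2;
        MPI_Send(&a, 1, MPI_INT, 1, /*tag=*/100, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, /*tag=*/200, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int x;
        MPI_Status st;
        for (int i = 0; i < 2; i++) {
            MPI_Recv(&x, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            /* Matched in send order: first tag 100, then tag 200. */
            printf("received %d with tag %d\n", x, st.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}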

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread Noam Bernstein
> On Apr 8, 2018, at 3:58 PM, George Bosilca wrote: > > Noam, > > Thanks for your output, it highlights an unusual outcome. It shows that a > process (29662) has pending messages from other processes that are tagged > with a past sequence number, something that should not have happened. The > on

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread George Bosilca
Noam, Thanks for your output, it highlights an unusual outcome. It shows that a process (29662) has pending messages from other processes that are tagged with a past sequence number, something that should not have happened. The only way to get that is if somehow we screwed up the sending part and pus

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread Noam Bernstein
> On Apr 6, 2018, at 1:41 PM, George Bosilca wrote: > > Noam, > > According to your stack trace, the correct way to call mca_pml_ob1_dump is > with the communicator from the PMPI call. Thus, this call was successful: > > (gdb) call mca_pml_ob1_dump(0xed932d0, 1) > $1 = 0 > > I should have

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread George Bosilca
Noam, According to your stack trace, the correct way to call mca_pml_ob1_dump is with the communicator from the PMPI call. Thus, this call was successful: (gdb) call mca_pml_ob1_dump(0xed932d0, 1) $1 = 0 I should have been clearer: the output is not in gdb but on the output stream of your

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread Noam Bernstein
> On Apr 5, 2018, at 4:11 PM, George Bosilca wrote: > > I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)". > This allows the debugger to call our function and output internal > information about the library status. OK - after a number of missteps, I recompile

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Gilles Gouaillardet
Noam, you might also want to try mpirun --mca btl tcp,self ... to rule out btl-related (shared memory and/or InfiniBand) issues. Once you rebuild Open MPI with --enable-debug, I recommend you first check the arguments of the MPI_Send() and MPI_Recv() functions and make sure - same communicato
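As a sketch of that argument check (a made-up two-rank example in C, not the VASP calls), printing the peer rank, tag, and count right before each MPI_Send()/MPI_Recv() makes mismatched pairs easy to spot when the logs are compared rank by rank:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0};
    const int tag = 42, count = 4;

    if (rank == 0) {
        fprintf(stderr, "[rank %d] send -> dest=1 tag=%d count=%d\n",
                rank, tag, count);
        MPI_Send(buf, count, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        fprintf(stderr, "[rank %d] recv <- src=0 tag=%d count=%d\n",
                rank, tag, count);
        MPI_Recv(buf, count, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}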

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
Yes, you can do this by adding --enable-debug to OMPI configure (and make sure you don't have the configure flag --with-platform=optimize). George. On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein wrote: > > On Apr 5, 2018, at 4:11 PM, George Bosilca wrote: > > I attach gdb to the proce

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 4:11 PM, George Bosilca wrote: > > I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)". > This allows the debugger to call our function and output internal > information about the library status. Great. But I guess I need to recompile omp

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)". This allows the debugger to call our function and output internal information about the library status. George. On Thu, Apr 5, 2018 at 4:03 PM, Noam Bernstein wrote: > On Apr 5, 2018, at 3:55 PM, George Bo

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 3:55 PM, George Bosilca wrote: > > Noam, > > The OB1 PML provides a mechanism to dump all pending communications in a > particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), > with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
Noam, The OB1 PML provides a mechanism to dump all pending communications in a particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how you can find the pointer to the communicator out of your code, but i
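One possible way to get that pointer (an assumption about how one might locate it, not something stated in the thread): in Open MPI's C bindings an MPI_Comm handle is itself a pointer to the internal ompi_communicator_t, so printing the handle from the application gives an address that can be passed to mca_pml_ob1_dump() from gdb. A Fortran code such as VASP would first convert its integer handle with MPI_Comm_f2c().

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Print the C communicator handle; under Open MPI this is the address
     * of the internal communicator structure. */
    printf("rank %d: MPI_COMM_WORLD pointer = %p\n",
           rank, (void *)MPI_COMM_WORLD);
    /* In gdb, attached to this rank:  call mca_pml_ob1_dump(<pointer>, 1) */

    MPI_Finalize();
    return 0;
}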

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Edgar Gabriel
Is the file I/O that you mentioned done using MPI I/O? If yes, what file system are you writing to? Edgar On 4/5/2018 10:15 AM, Noam Bernstein wrote: On Apr 5, 2018, at 11:03 AM, Reuti wrote: Hi, On 05.04.2018 at 16:16, Noam Bernstein wrote: Hi all - I have a code that uses MPI (va

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 11:32 AM, Edgar Gabriel wrote: > > Is the file I/O that you mentioned done using MPI I/O? If yes, what file > system are you writing to? No MPI I/O. Just MPI calls to gather the data, and plain Fortran I/O on the head node only. I should also say that in lots of ot
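For context, a bare-bones sketch of the pattern described (in C with made-up sizes and file name; the real code is Fortran): gather the data onto the head node with MPI, then write it there with ordinary, non-MPI file I/O.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1000;                 /* elements per rank (illustrative) */
    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        local[i] = rank + 0.001 * i;

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)n * size * sizeof(double));

    /* Collect everything on the head node... */
    MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ...and write it there with plain file I/O; no MPI I/O involved. */
    if (rank == 0) {
        FILE *f = fopen("output.dat", "w");
        for (int i = 0; i < n * size; i++)
            fprintf(f, "%g\n", all[i]);
        fclose(f);
        free(all);
    }

    free(local);
    MPI_Finalize();
    return 0;
}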

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 11:03 AM, Reuti wrote: > > Hi, > >> On 05.04.2018 at 16:16, Noam Bernstein wrote: >> >> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange >> way. Basically, there’s a Cartesian communicator, 4x16 (64 processes >> total), and despite the fact th

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Reuti
Hi, > On 05.04.2018 at 16:16, Noam Bernstein wrote: > > Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange > way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), > and despite the fact that the communication pattern is rather regular, one > pa

[OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across eac
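For readers trying to reproduce the setup, here is a minimal sketch (an illustration only, not the actual VASP communication pattern) of a 4x16 Cartesian communicator with a blocking exchange between neighbours along the second dimension; it needs to be run on at least 64 ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {4, 16}, periods[2] = {0, 0};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /*reorder=*/0, &cart);
    if (cart == MPI_COMM_NULL) {        /* ranks beyond the 4x16 grid, if any */
        MPI_Finalize();
        return 0;
    }

    int rank, left, right;
    MPI_Comm_rank(cart, &rank);
    /* Neighbours along the second (length-16) dimension;
     * edges get MPI_PROC_NULL since the grid is not periodic. */
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    double sendbuf = (double)rank, recvbuf = -1.0;
    /* A combined send/recv avoids the deadlock that a naive ordering of
     * blocking MPI_Send/MPI_Recv calls could cause in such an exchange. */
    MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, right, 0,
                 &recvbuf, 1, MPI_DOUBLE, left, 0,
                 cart, MPI_STATUS_IGNORE);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}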