Re: [OMPI devel] Hints for using an own pmix server

2018-10-17 Thread Stephan Krempel

Hi Ralph.

> > > One point that remains open and is interesting for me is if I can
> > > achieve the same with the 3.1.2 release of OpenMPI. Is it somehow
> > > possible to configure it as if there were the "--with-ompi-pmix-rte"
> > > switch from version 4.x?
> > 
> > I’m afraid we didn’t backport that capability to the v3.x branches.
> > I’ll ask the relevant release managers if they’d like us to do so.
> 
> I checked and we will not be backporting this to the v3.x series. It
> will begin with v4.x.
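
(For anyone following the thread: as I understand it, that switch builds Open
MPI's MPI layer directly on PMIx instead of the ORTE runtime, so processes
started by an external PMIx server can initialize without mpirun. Roughly, the
client side of such a direct launch looks like the sketch below; it is written
against the generic PMIx 2.x/3.x client API and is not code from Open MPI or
from this thread.)

#include <stdio.h>
#include <string.h>
#include <pmix.h>

/* Rough sketch of what a directly launched process does against a PMIx
 * server: connect, then query job-level information from that server. */
int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    /* Rendezvous with the server that spawned us; the contact information
     * is passed through environment variables set by that server. */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Job-level keys are looked up with the wildcard rank. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    (void)strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;

    rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val);
    if (PMIX_SUCCESS == rc) {
        printf("rank %u of %u in namespace %s\n",
               myproc.rank, val->data.uint32, myproc.nspace);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}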

Thanks for checking. I need to find out from our users whether supporting
OpenMPI 4 will be sufficient for them; otherwise I will certainly come back
soon with some more questions about how to manage support for OpenMPI 3.

Thank you again for the assistance.

Best regards

Stephan
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Bug on branch v2.x since October 3

2018-10-17 Thread Eric Chamberland

Hi,

Since commit 18f23724a, our nightly base test has been broken on the v2.x branch.

Strangely, branch v3.x broke the same day with 2fd9510b4b44 but was repaired
some days later (I can't tell exactly when, but at the latest it was fixed by
fa3d92981a).


I get segmentation faults or deadlocks in many cases.

Could this be related to issue 5842 (https://github.com/open-mpi/ompi/issues/5842)?

Here is an example backtrace for a deadlock:

#4
#5  0x7f9dc9151d17 in sched_yield () from /lib64/libc.so.6
#6  0x7f9dccee in opal_progress () at runtime/opal_progress.c:243
#7  0x7f9dbe53cf78 in ompi_request_wait_completion (req=0x46ea000) at ../../../../ompi/request/request.h:392
#8  0x7f9dbe53e162 in mca_pml_ob1_recv (addr=0x7f9dd64a6b30 long, long, PAType*, std::__debug::vectorstd::allocator >&)::slValeurs>, count=3, datatype=0x7f9dca61e2c0 , src=1, tag=32767, comm=0x7f9dca62a840 , status=0x7ffcf4f08170) at pml_ob1_irecv.c:129
#9  0x7f9dca35f3c4 in PMPI_Recv (buf=0x7f9dd64a6b30 long, long, PAType*, std::__debug::vectorstd::allocator >&)::slValeurs>, count=3, type=0x7f9dca61e2c0 , source=1, tag=32767, comm=0x7f9dca62a840 , status=0x7ffcf4f08170) at precv.c:77
#10 0x7f9dd6261d06 in assertionValeursIdentiquesSurTousLesProcessus (pComm=0x7f9dca62a840 , pRang=0, pNbProcessus=2, pValeurs=0x7f9dd5a94da0 girefSynchroniseGroupeProcessusModeDebugImpl(PAGroupeProcessus const&, char const*, int)::slDonnees>, pRequetes=std::__debug::vector of length 1, capacity 1 = {...}) at /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/src/commun/Parallele/mpi_giref.cc:332
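
For what it's worth, the routine at the top of the trace is essentially a
blocking cross-rank consistency check. Below is a minimal sketch of that
pattern (hypothetical code with made-up names, not the actual GIREF sources):
rank 0 receives every peer's values with a blocking MPI_Recv and compares
them, so if any peer never reaches the call, rank 0 spins forever in
opal_progress() exactly as in frames #5-#7 above.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the pattern seen in the backtrace.  Returns 1 on
 * rank 0 when all ranks passed identical values (other ranks just send). */
static int check_values_identical(MPI_Comm comm, const long *values, int count)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        long peer[count];                  /* count == 3 in the trace */
        for (int src = 1; src < size; ++src) {
            /* Frame #9: blocking receive; if 'src' never sends (crash or
             * divergent code path), this spins in opal_progress() forever. */
            MPI_Recv(peer, count, MPI_LONG, src, 32767, comm, MPI_STATUS_IGNORE);
            if (memcmp(peer, values, count * sizeof(long)) != 0) {
                return 0;                  /* values differ across ranks */
            }
        }
    } else {
        MPI_Send((void *)values, count, MPI_LONG, 0, 32767, comm);
    }
    return 1;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    long values[3] = {1, 2, 3};
    int ok = check_values_identical(MPI_COMM_WORLD, values, 3);
    printf("check %s\n", ok ? "passed" : "failed");
    MPI_Finalize();
    return 0;
}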


And some information about the configuration:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2018.10.17.02h16m02s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2018.10.17.02h16m02s_ompi_info_all.txt

Thanks,

Eric


Re: [OMPI devel] Bug on branch v2.x since October 3

2018-10-17 Thread Nathan Hjelm via devel


Ah yes, 18f23724a broke things, so we had to fix the fix. We didn't apply that
follow-up fix to the v2.x branch. I will open a PR to bring it over.

-Nathan


Re: [OMPI devel] Bug on branch v2.x since October 3

2018-10-17 Thread Eric Chamberland

ok, thanks a lot! :)

Eric

