Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Adrian Reber
I was about to try Gilles' patch, but the current master checkout
(b79cffc73b88c2e5e2f2161e096c49aed5b9d2ed) does not build on my ppc64 system:

Making all in mca/coll/ml
make[2]: Entering directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -g -Wall 
-Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes 
-Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -module -avoid-version  -o mca_coll_ml.la -rpath 
/tmp/ompi/lib/openmpi coll_ml_module.lo coll_ml_allocation.lo 
coll_ml_barrier.lo coll_ml_bcast.lo coll_ml_component.lo coll_ml_copy_fns.lo 
coll_ml_descriptors.lo coll_ml_hier_algorithms.lo 
coll_ml_hier_algorithms_setup.lo coll_ml_hier_algorithms_bcast_setup.lo 
coll_ml_hier_algorithms_allreduce_setup.lo 
coll_ml_hier_algorithms_reduce_setup.lo coll_ml_hier_algorithms_common_setup.lo 
coll_ml_hier_algorithms_allgather_setup.lo 
coll_ml_hier_algorithm_memsync_setup.lo coll_ml_custom_utils.lo 
coll_ml_progress.lo coll_ml_reduce.lo coll_ml_allreduce.lo coll_ml_allgather.lo 
coll_ml_mca.lo coll_ml_lmngr.lo coll_ml_hier_algorithms_barrier_setup.lo 
coll_ml_select.lo coll_ml_memsync.lo coll_ml_lex.lo coll_ml_config.lo  -lrt  
-lm -lutil   -lm -lutil  
libtool: link: `coll_ml_bcast.lo' is not a valid libtool object
Makefile:1860: recipe for target 'mca_coll_ml.la' failed
make[2]: *** [mca_coll_ml.la] Error 1
make[2]: Leaving directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
Makefile:3366: recipe for target 'all-recursive' failed
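
For reference, a .lo file is just a small libtool text stub that points at
the real object files, so an interrupted or partially rebuilt tree can leave
one empty or truncated, which produces exactly this "not a valid libtool
object" error. A healthy stub looks roughly like this (contents illustrative,
not copied from my tree):

  # coll_ml_bcast.lo - a libtool object file
  pic_object='.libs/coll_ml_bcast.o'
  non_pic_object='coll_ml_bcast.o'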

On Tue, Sep 08, 2015 at 05:19:56PM +, Jeff Squyres (jsquyres) wrote:
> Thanks Adrian; I turned this into https://github.com/open-mpi/ompi/issues/874.
> 
> > On Sep 8, 2015, at 9:56 AM, Adrian Reber  wrote:
> > 
> > For the last few days, the MTT runs on my ppc64 systems have been failing with:
> > 
> > [bimini:11716] *** Process received signal ***
> > [bimini:11716] Signal: Segmentation fault (11)
> > [bimini:11716] Signal code: Address not mapped (1)
> > [bimini:11716] Failing at address: (nil)
> > [bimini:11716] [ 0] [0x3fffa2bb0448]
> > [bimini:11716] [ 1] /lib64/libc.so.6(+0xcb074)[0x3fffa27eb074]
> > [bimini:11716] [ 2] /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(opal_pmix_pmix1xx_pmix_value_xfer-0x68758)[0x3fffa2158a10]
> > [bimini:11716] [ 3] /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(OPAL_PMIX_PMIX1XX_PMIx_Put-0x48338)[0x3fffa2179f70]
> > [bimini:11716] [ 4] /home/adrian/mtt-scratch/installs/GubX/install/lib/openmpi/mca_pmix_pmix1xx.so(pmix1_put-0x27efc)[0x3fffa21d858c]
> > 
> > I do not think I see this kind of error on any of the other MTT setups,
> > so it might be ppc64-related. Just wanted to point it out.
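> > 
> > A debug build (--enable-debug) plus a core dump should pin the failing
> > frame to a source line; a minimal sketch, where ./ring_c stands in for
> > any small MPI test program:
> > 
> >   ulimit -c unlimited       # allow core files
> >   mpirun -np 2 ./ring_c     # reproduce the crash; the failing rank dumps core
> >   gdb ./ring_c core         # load the core (core file name varies by system)
> >   (gdb) bt full             # full backtrace with local variables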
> > 
> > Adrian


Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Jeff Squyres (jsquyres)
Try running "make clean" (perhaps just in ompi/mca/coll/ml) and building
again -- this looks like it could just be a stale file in your tree.
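
Something like this should flush the stale objects (paths taken from your
build log; the -j level is just an example):

  cd /home/adrian/ompi/build/ompi/mca/coll/ml
  make clean && make
  # if the component still fails to link, scrub the whole build tree:
  cd /home/adrian/ompi/build && make clean && make -j4 all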

> On Sep 9, 2015, at 5:41 AM, Adrian Reber  wrote:
> 
> I was about to try Gilles' patch, but the current master checkout
> (b79cffc73b88c2e5e2f2161e096c49aed5b9d2ed) does not build on my ppc64 system:
> 
> Making all in mca/coll/ml
> make[2]: Entering directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
> [...]
> libtool: link: `coll_ml_bcast.lo' is not a valid libtool object
> Makefile:1860: recipe for target 'mca_coll_ml.la' failed
> make[2]: *** [mca_coll_ml.la] Error 1
> make[2]: Leaving directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
> Makefile:3366: recipe for target 'all-recursive' failed


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Adrian Reber
After a few rounds of "make clean" it builds again. Thanks.

On Wed, Sep 09, 2015 at 10:00:10AM +, Jeff Squyres (jsquyres) wrote:
> Try running "make clean" (perhaps just in ompi/mca/coll/ml) and building
> again -- this looks like it could just be a stale file in your tree.


Re: [OMPI devel] Slurm support in master

2015-09-09 Thread Howard Pritchard
Hi Ralph,

mpirun works for me now on master on the NERSC systems.

Thanks,

Howard



--

sent from my smart phone so no good typing.

Howard
On Sep 8, 2015 7:49 PM, "Ralph Castain"  wrote:

> Hi folks
>
> I’ve poked around this evening and gotten the Slurm support in master to
> at least build, and mpirun now works correctly under a Slurm job
> allocation. This should all be committed as soon as auto-testing completes:
>
> https://github.com/open-mpi/ompi/pull/877
>
> Howard/Nathan: I believe I fixed mpirun for ALPS too - please check.
>
> Direct launch under Slurm still segfaults, and I’m out of time chasing it
> down. Could someone please take a look? It seems to have something to do
> with the hash table support in the base, but I’m not sure of the problem.
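> 
> For anyone picking this up, the two launch modes can be compared inside a
> single allocation along these lines (./hello is a hypothetical test binary;
> direct launch assumes a build configured with --with-pmi):
> 
>   salloc -N 2             # grab a two-node Slurm allocation
>   mpirun -n 4 ./hello     # launch via mpirun/ORTE: works after this PR
>   srun -n 4 ./hello       # direct launch via Slurm: still segfaults here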
>
> Thanks
> Ralph
>


Re: [OMPI devel] Slurm support in master

2015-09-09 Thread Ralph Castain
woohoo!! Thanks!

> On Sep 9, 2015, at 11:59 AM, Howard Pritchard  wrote:
> 
> Hi Ralph,
> 
> mpirun works for me now on master on the NERSC systems.
> 
> Thanks,
> 
> Howard