Re: [OMPI users] Building PMIx and Slurm support

2019-02-28 Thread Bennet Fauber
Here is some additional information about when using
`--with-hwloc=/usr` seems necessary.

Using the tarball for the released, stable version,

13bb410b52becbfa140f5791bd50d580  /sw/src/arcts/ompi/openmpi-1.10.7.tar.gz

and with

  $ ./configure --prefix=/sw/arcts/centos7/intel_14_0_2/openmpi/1.10.7
--mandir=/sw/arcts/centos7/intel_14_0_2/openmpi/1.10.7/share/man
--with-slurm --without-tm --with-verbs --with-hwloc --disable-dlopen
--enable-shared CC=icc CXX=icpc FC=ifort F77=ifort

config.log reports

configure:67748: checking hwloc.h usability
configure:67748: icc -std=gnu99 -c -O3 -DNDEBUG -finline-functions
-fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types
-pthread   conftest.c >&5
configure:67748: $? = 0
configure:67748: result: yes
configure:67748: checking hwloc.h presence
configure:67748: icc -E   conftest.c
configure:67748: $? = 0
configure:67748: result: yes
configure:67748: checking for hwloc.h
configure:67748: result: yes

but make reports

Making all in mca/hwloc
make[2]: Entering directory `/tmp/bennet/build/openmpi-1.10.7/opal/mca/hwloc'
  CC   base/hwloc_base_frame.lo
  CC   base/hwloc_base_util.lo
  CC   base/hwloc_base_dt.lo
  CC   base/hwloc_base_maffinity.lo
In file included from ../../../opal/mca/hwloc/hwloc.h(134),
 from base/hwloc_base_frame.c(23):
../../../opal/mca/hwloc/external/external.h(20): catastrophic error:
cannot open source file "/include/hwloc.h"
  #include MCA_hwloc_external_header
^
compilation aborted for base/hwloc_base_frame.c (code 4)
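
The failure is reproducible outside the OMPI tree: external.h includes
whatever path the MCA_hwloc_external_header macro holds.  A minimal
sketch of the same mechanism (using gcc for brevity where the real
build uses icc; the file name t.c is made up):

  $ printf '#define HDR "%s/include/hwloc.h"\n#include HDR\n' ""   > t.c
  $ gcc -c t.c    # fails: cannot open "/include/hwloc.h"
  $ printf '#define HDR "%s/include/hwloc.h"\n#include HDR\n' /usr > t.c
  $ gcc -c t.c    # compiles, provided /usr/include/hwloc.h is installed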


It looks to me like configure is not prepending `/usr` to the header
path when one uses the bare `--with-hwloc` on the configure line,
leaving the unusable `/include/hwloc.h`, and therefore using
`--with-hwloc=/usr` is called for if one wants hwloc included.
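
Concretely, the workaround is the same configure line with the prefix
made explicit (an abridged sketch; all other flags as in the full
command above):

  $ ./configure --prefix=/sw/arcts/centos7/intel_14_0_2/openmpi/1.10.7 \
      --with-slurm --without-tm --with-verbs --with-hwloc=/usr \
      --disable-dlopen CC=icc CXX=icpc FC=ifort F77=ifort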

Thanks,
-- bennet

Note: this test also fails, but it isn't directly related to finding hwloc.h.

configure:67840: result: looking for library without search path
configure:67842: checking for library containing hwloc_topology_init
configure:67873: icc -std=gnu99 -o conftest -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict
-Qoption,cpp,--extended_float_types -pthread conftest.c -lutil
>&5
/tmp/iccu8nO7c.o: In function `main':
conftest.c:(.text+0x35): undefined reference to `hwloc_topology_init'
configure:67873: $? = 1

There is too much additional output to include all of it here.

It seems to be the same situation with the last available nightly build as well:

bcea63d634d05c0f5a821ce75a1eb2b2  openmpi-v1.10-201705170239-5e373bf.tar.gz


On Sun, Feb 24, 2019 at 8:11 AM Bennet Fauber  wrote:
>
> HI, Gilles,
>
> With respect to your comment about not using --with-FOO=/usr: it is bad
> practice, sure, and it should be unnecessary, but we have had at least
> one instance where it was also necessary for the requested feature to
> actually work.  The case I am thinking of was, in particular, OpenMPI
> 1.10.2, where OMPI did not properly bind processes to cores unless we
> built --with-hwloc=/usr.
>
> When it wasn't used, it mostly ran fine, but for one program it would
> occasionally result in 'hopping processes' and very poor performance.
> After rebuilding --with-hwloc=/usr, that is no longer a problem.  It
> was difficult to pin down because it did not seem to be an issue all
> the time; maybe one run in five showed the dragging performance.
>
> So, while it should not be necessary and is not good practice,
> sometimes it does seem to be necessary.
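>
> A quick sanity check for this kind of problem, for anyone following
> along (a sketch; --report-bindings and --bind-to have been in mpirun
> since the 1.8 series, but check your version's man page):
>
>   mpirun --report-bindings --bind-to core -np 4 ./a.out
>
> mpirun then reports each rank's core binding on stderr at launch,
> which makes 'hopping processes' much easier to spot.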
>
> -- bennet
>
> On Sun, Feb 24, 2019 at 5:21 AM Gilles Gouaillardet
>  wrote:
> >
> > Passant,
> >
> > The fix is included in PMIx 2.2.2
> >
> > The bug is in a public header file, so you might indeed have to
> > rebuild the SLURM plugin for PMIx.
> > I did not check the SLURM sources though, so assuming PMIx was built
> > as a shared library, there is still a chance
> > it might work even if you do not rebuild the SLURM plugin. I'd rebuild
> > at least the SLURM plugin for PMIx to be on the safe side though.
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sun, Feb 24, 2019 at 4:07 PM Passant A. Hafez
> >  wrote:
> > >
> > > Thanks Gilles.
> > >
> > > So do we have to rebuild Slurm after applying this patch?
> > >
> > > Another question: is this fix included in PMIx 2.2.2
> > > (https://github.com/pmix/pmix/releases/tag/v2.2.2)?
> > >
> > >
> > >
> > >
> > > All the best,
> > >
> > >
> > > 
> > > From: users  on behalf of Gilles 
> > > Gouaillardet 
> > > Sent: Sunday, February 24, 2019 4:09 AM
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] Building PMIx and Slurm support
> > >
> > > Passant,
> > >
> > > you have to manually download and apply
> > > https://github.com/pmix/pmix/commit/2e2f4445b45eac5a3fcbd409c81efe318876e659.patch
> > > to PMIx 2.2.1.
> > > That should likely fix your problem.
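> > >
> > > Something along these lines should do it (a sketch; adjust to your
> > > build tree):
> > >
> > >   cd pmix-2.2.1
> > >   curl -LO https://github.com/pmix/pmix/commit/2e2f4445b45eac5a3fcbd409c81efe318876e659.patch
> > >   patch -p1 < 2e2f4445b45eac5a3fcbd409c81efe318876e659.patch
> > >
> > > then rebuild and reinstall PMIx as usual.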
> > >
> > > As a side note, it is bad practice to configure --with-FOO=/usr
> > > since it might have some unex

Re: [OMPI users] Building PMIx and Slurm support

2019-02-28 Thread Jeff Squyres (jsquyres) via users
On Feb 28, 2019, at 11:27 AM, Bennet Fauber  wrote:
> 
> 13bb410b52becbfa140f5791bd50d580  /sw/src/arcts/ompi/openmpi-1.10.7.tar.gz
> bcea63d634d05c0f5a821ce75a1eb2b2  openmpi-v1.10-201705170239-5e373bf.tar.gz

Bennet --

I'm sorry; I don't think we've updated the 1.10.x branch in forever.  The date 
stamp on that nightly 1.10.x tarball is from 2017.

Is it possible to test with, for example, the latest nightly tarball on the 
v4.0.x branch?

https://www.open-mpi.org/nightly/v4.0.x/

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Building PMIx and Slurm support

2019-02-28 Thread Bennet Fauber
Jeff,

We could do that, but alas, we have code we need to build against 1.10.7.

It looks like what is happening is that opal_hwloc_include is not
getting set correctly: a test in configure isn't working right, so it
gets populated with a bare '/include/hwloc.h' rather than with either
'/usr/include/hwloc.h' or, as it looks like it should be, just
'hwloc.h'.

The relevant logic is somewhere around here in the configure that
results from running autogen.pl on the v1.10.7 tag of the Git source:

    if test "$opal_hwloc_dir" != ""; then :
      opal_hwloc_include="$opal_hwloc_dir/include/hwloc.h"
      opal_hwloc_shmem_include="$opal_hwloc_dir/include/hwloc/shmem.h"
      opal_hwloc_openfabrics_include="$opal_hwloc_dir/include/hwloc/openfabrics-verbs.h"
    else
      opal_hwloc_include="hwloc.h"
      opal_hwloc_shmem_include="hwloc/shmem.h"
      opal_hwloc_openfabrics_include="hwloc/openfabrics-verbs.h"
    fi
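
The only way I can see to get the observed '/include/hwloc.h' out of
that fragment is for $opal_hwloc_dir to pass the non-empty test but be
empty by the time the path is assembled.  A toy script illustrating
that guess (the ordering here is an assumption of mine, not something I
verified in the generated configure):

  opal_hwloc_dir=yes                # a bare --with-hwloc leaves "yes" here
  if test "$opal_hwloc_dir" != ""; then :
    opal_hwloc_dir=                 # "yes" is not a path, so it gets cleared
    opal_hwloc_include="$opal_hwloc_dir/include/hwloc.h"
  fi
  echo "$opal_hwloc_include"        # prints /include/hwloc.h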

I'm more a drowning victim when dealing with autotools than a swimmer.

I was pointing out why someone might think using `--with-FEATURE=/usr` is
sometimes necessary.

If 1.10.7 is too old to debug, I understand.

On Thu, Feb 28, 2019 at 12:06 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Feb 28, 2019, at 11:27 AM, Bennet Fauber  wrote:
> >
> > 13bb410b52becbfa140f5791bd50d580
> /sw/src/arcts/ompi/openmpi-1.10.7.tar.gz
> > bcea63d634d05c0f5a821ce75a1eb2b2
> openmpi-v1.10-201705170239-5e373bf.tar.gz
>
> Bennet --
>
> I'm sorry; I don't think we've updated the 1.10.x branch in forever.  The
> date stamp on that nightly 1.10.x tarball is from 2017.
>
> Is it possible to test with, for example, the latest nightly tarball on
> the v4.0.x branch?
>
> https://www.open-mpi.org/nightly/v4.0.x/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>

Re: [OMPI users] Building PMIx and Slurm support

2019-02-28 Thread Bennet Fauber
I will see if I get the same thing from the latest OpenMPI -- it will
take me a while to fit it in.

We're still probably in the 3.x series for production, so this is going
to have to get done in the margins.

Sorry, and thanks for your reply,

-- bennet

On Thu, Feb 28, 2019 at 12:06 PM Jeff Squyres (jsquyres) via users
 wrote:
>
> On Feb 28, 2019, at 11:27 AM, Bennet Fauber  wrote:
> >
> > 13bb410b52becbfa140f5791bd50d580  /sw/src/arcts/ompi/openmpi-1.10.7.tar.gz
> > bcea63d634d05c0f5a821ce75a1eb2b2  openmpi-v1.10-201705170239-5e373bf.tar.gz
>
> Bennet --
>
> I'm sorry; I don't think we've updated the 1.10.x branch in forever.  The 
> date stamp on that nightly 1.10.x tarball is from 2017.
>
> Is it possible to test with, for example, the latest nightly tarball on the 
> v4.0.x branch?
>
> https://www.open-mpi.org/nightly/v4.0.x/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>


Re: [OMPI users] Building PMIx and Slurm support

2019-02-28 Thread Jeff Squyres (jsquyres) via users
On Feb 28, 2019, at 12:20 PM, Bennet Fauber  wrote:
> 
> I was pointing out why someone might think using `--with-FEATURE=/usr` is 
> sometimes necessary.

True, but only in the case of a bug.  :-)

...but in this case, the bug is too old, and we're almost certainly not
going to fix it (sorry! :-( ).  So using --with-BLAH=/usr as a
workaround is perfectly acceptable here.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-28 Thread Adam LeBlanc
Hello all,

Thank you all for the suggestions. Takahiro's suggestion has gotten me to a
point where all of the tests will run, but as soon as it gets to the cleanup
step, IMB will segfault again. I opened an issue on IMB's GitHub, but I
guess I am not going to be able to get much help from them. So I will have
to wait and see what happens next.

Thanks again for all your help,
Adam LeBlanc

On Thu, Feb 21, 2019 at 7:22 AM Peter Kjellström  wrote:

> On Wed, 20 Feb 2019 10:46:10 -0500
> Adam LeBlanc  wrote:
>
> > Hello,
> >
> > When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
> > --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca
> > pml ob1 --mca btl_openib_allow_ib 1 -np 6
> >  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> >
> > I get this error:
> ...
> > # Benchmarking Reduce_scatter
> ...
> >   2097152   20  8738.08  9340.50  9147.89
> > [pandora:04500] *** Process received signal ***
> > [pandora:04500] Signal: Segmentation fault (11)
>
> This is very likely a bug in IMB, not in OpenMPI. It's been discussed on
> the list before, under the thread name:
>
>  MPI_Reduce_Scatter Segmentation Fault with Intel  2019 Update 1
>  Compilers on OPA-1...
>
> You can work around it by using an older IMB version (the bug is in the
> newest version).
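>
> If swapping IMB versions is awkward, you can also just leave the
> crashing benchmark out of the run; IMB-MPI1 accepts benchmark names as
> positional arguments (a sketch reusing the command from above):
>
>   mpirun --map-by node --mca btl openib,vader,self --mca pml ob1 \
>       -np 6 -hostfile /home/aleblanc/ib-mpi-hosts \
>       IMB-MPI1 PingPong Sendrecv Allreduce Bcast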
>
> /Peter K
>