Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-22 Thread Gilles Gouaillardet
Roland,

The easiest way is to use an external hwloc that is configured with
--disable-nvml.

Another option is to hack the embedded hwloc configure.m4 and pass
--disable-nvml to the embedded hwloc configure. Note this requires you to run
autogen.sh, and hence you need recent autotools.

I guess Open MPI 1.8 embeds an older hwloc that is not aware of NVML, hence
the lack of warning.
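
For illustration, a minimal sketch of the external hwloc route (the hwloc
version and install prefixes below are only examples, adjust them to your
site):

# build an external hwloc without NVML support
tar xf hwloc-1.11.6.tar.gz
cd hwloc-1.11.6
./configure --prefix=/opt/hwloc-nonvml --disable-nvml
make && make install

# point Open MPI at that hwloc instead of the embedded one
cd /path/to/openmpi-2.1.0rc4
./configure --prefix=/opt/openmpi-2.1.0 \
  --with-hwloc=/opt/hwloc-nonvml \
  --with-cuda=/usr
make && make install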

Cheers,

Gilles

On Wednesday, March 22, 2017, Roland Fehrenbacher  wrote:

> > "SJ" == Sylvain Jeaugey writes:
>
> SJ> If you installed CUDA libraries and includes in /usr, then it's
> SJ> not surprising hwloc finds them even without defining CFLAGS.
>
> Well, that's the place where distribution packages install to :)
> I don't think a build system should misbehave if libraries are installed
> in default places.
>
> SJ> I'm just saying I think you won't get the error message if Open
> SJ> MPI finds CUDA but hwloc does not.
>
> OK, so I think I need to ask the original question again: Is there a way
> to suppress these warnings with a "normal" build? I guess the answer
> must be yes, since 1.8.x didn't have this problem. The real question
> then would be how ...
>
> Thanks,
>
> Roland
>
> SJ> On 03/21/2017 11:05 AM, Roland Fehrenbacher wrote:
> >>> "SJ" == Sylvain Jeaugey writes:
> >> Hi Silvain,
> >>
> >> I get the "NVIDIA : ..." run-time error messages just by
> >> compiling with "--with-cuda=/usr":
> >>
> >> ./configure --prefix=${prefix} \ --mandir=${prefix}/share/man \
> >> --infodir=${prefix}/share/info \
> >> --sysconfdir=/etc/openmpi/${VERSION} --with-devel-headers \
> >> --disable-memchecker \ --disable-vt \ --with-tm --with-slurm
> >> --with-pmi --with-sge \ --with-cuda=/usr \
> >> --with-io-romio-flags='--with-file-system=nfs+lustre' \
> >> --with-cma --without-valgrind \ --enable-openib-connectx-xrc \
> >> --enable-orterun-prefix-by-default \ --disable-java
> >>
> >> Roland
> >>
> SJ> Hi Siegmar, I think this "NVIDIA : ..." error message comes from
> SJ> the fact that you add CUDA includes in the C*FLAGS. If you just
> SJ> use --with-cuda, Open MPI will compile with CUDA support, but
> SJ> hwloc will not find CUDA and that will be fine. However, setting
> SJ> CUDA in CFLAGS will make hwloc find CUDA, compile CUDA support
> SJ> (which is not needed) and then NVML will show this error message
> SJ> when not run on a machine with CUDA devices.
> >>
> SJ> I guess gcc picks the environment variable, while cc does not
> SJ> hence the different behavior. So again, there is no need to add
> SJ> all those CUDA includes, --with-cuda is enough.
> >>
> SJ> About the opal_list_remove_item, we'll try to reproduce the
> SJ> issue and see where it comes from.
> >>
> SJ> Sylvain
> >>
> SJ> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
> >> >> Hi,
> >> >>
> >> >> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise
> >> >> Server
> >> >> 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get
> >> >>  once
> >> >> more a warning about a missing item for one of my small
> >> >> programs (it doesn't matter if I use my cc or gcc version). My
> >> >> gcc version also displays the message "NVIDIA: no NVIDIA
> >> >> devices found" for the server without NVIDIA devices (I don't
> >> >> get the message for my cc version).  I used the following
> >> >> commands to build the package (${SYSTEM_ENV} is Linux and
> >> >> ${MACHINE_ENV} is x86_64).
> >> >>
> >> >>
> >> >> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc cd
> >> >> openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
> >> >>
> >> >> ../openmpi-2.1.0rc4/configure \
> >> >> --prefix=/usr/local/openmpi-2.1.0_64_cc \
> >> >> --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
> >> >> --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
> >> >> --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
> >> >> JAVA_HOME=/usr/local/jdk1.8.0_66 \ LDFLAGS="-m64 -mt -Wl,-z
> >> >> -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/ lib64" \
> >> >> CC="cc" CXX="CC" FC="f95" \ CFLAGS="-m64 -mt
> >> >> -I/usr/local/include -I/usr/local/cuda/include" \
> >> >> CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include"
> >> >> \ FCFLAGS="-m64" \ CPP="cpp -I/usr/local/include
> >> >> -I/usr/local/cuda/include" \ CXXCPP="cpp -I/usr/local/include
> >> >> -I/usr/local/cuda/include" \ --enable-mpi-cxx \
> >> >> --enable-cxx-exceptions \ --enable-mpi-java \
> >> >> --with-cuda=/usr/local/cuda \
> >> >> --with-valgrind=/usr/local/valgrind \
> >> >> --enable-mpi-thread-multiple \ --with-hwloc=internal \
> >> >> --without-verbs \ --with-wrapper-cflags="-m64 -mt" \
> >> >> 

Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-22 Thread Roland Fehrenbacher
> "SJ" == Sylvain Jeaugey  writes:

SJ> If you installed CUDA libraries and includes in /usr, then it's
SJ> not surprising hwloc finds them even without defining CFLAGS.

Well, that's the place where distribution packages install to :)
I don't think a build system should misbehave if libraries are installed
in default places.

SJ> I'm just saying I think you won't get the error message if Open
SJ> MPI finds CUDA but hwloc does not.

OK, so I think I need to ask the original question again: Is there a way
to suppress these warnings with a "normal" build? I guess the answer
must be yes, since 1.8.x didn't have this problem. The real question
then would be how ...

Thanks,

Roland

SJ> On 03/21/2017 11:05 AM, Roland Fehrenbacher wrote:
>>> "SJ" == Sylvain Jeaugey  writes:
>> Hi Silvain,
>>
>> I get the "NVIDIA : ..." run-time error messages just by
>> compiling with "--with-cuda=/usr":
>>
>> ./configure --prefix=${prefix} \ --mandir=${prefix}/share/man \
>> --infodir=${prefix}/share/info \
>> --sysconfdir=/etc/openmpi/${VERSION} --with-devel-headers \
>> --disable-memchecker \ --disable-vt \ --with-tm --with-slurm
>> --with-pmi --with-sge \ --with-cuda=/usr \
>> --with-io-romio-flags='--with-file-system=nfs+lustre' \
>> --with-cma --without-valgrind \ --enable-openib-connectx-xrc \
>> --enable-orterun-prefix-by-default \ --disable-java
>>
>> Roland
>>
SJ> Hi Siegmar, I think this "NVIDIA : ..." error message comes from
SJ> the fact that you add CUDA includes in the C*FLAGS. If you just
SJ> use --with-cuda, Open MPI will compile with CUDA support, but
SJ> hwloc will not find CUDA and that will be fine. However, setting
SJ> CUDA in CFLAGS will make hwloc find CUDA, compile CUDA support
SJ> (which is not needed) and then NVML will show this error message
SJ> when not run on a machine with CUDA devices.
>>
SJ> I guess gcc picks the environment variable, while cc does not
SJ> hence the different behavior. So again, there is no need to add
SJ> all those CUDA includes, --with-cuda is enough.
>>
SJ> About the opal_list_remove_item, we'll try to reproduce the
SJ> issue and see where it comes from.
>>
SJ> Sylvain
>>
SJ> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
>> >> Hi,
>> >>
>> >> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise
>> >> Server
>> >> 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get
>> >>  once
>> >> more a warning about a missing item for one of my small
>> >> programs (it doesn't matter if I use my cc or gcc version). My
>> >> gcc version also displays the message "NVIDIA: no NVIDIA
>> >> devices found" for the server without NVIDIA devices (I don't
>> >> get the message for my cc version).  I used the following
>> >> commands to build the package (${SYSTEM_ENV} is Linux and
>> >> ${MACHINE_ENV} is x86_64).
>> >>
>> >>
>> >> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc cd
>> >> openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>> >>
>> >> ../openmpi-2.1.0rc4/configure \
>> >> --prefix=/usr/local/openmpi-2.1.0_64_cc \
>> >> --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>> >> --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>> >> --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>> >> JAVA_HOME=/usr/local/jdk1.8.0_66 \ LDFLAGS="-m64 -mt -Wl,-z
>> >> -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/ lib64" \
>> >> CC="cc" CXX="CC" FC="f95" \ CFLAGS="-m64 -mt
>> >> -I/usr/local/include -I/usr/local/cuda/include" \
>> >> CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include"
>> >> \ FCFLAGS="-m64" \ CPP="cpp -I/usr/local/include
>> >> -I/usr/local/cuda/include" \ CXXCPP="cpp -I/usr/local/include
>> >> -I/usr/local/cuda/include" \ --enable-mpi-cxx \
>> >> --enable-cxx-exceptions \ --enable-mpi-java \
>> >> --with-cuda=/usr/local/cuda \
>> >> --with-valgrind=/usr/local/valgrind \
>> >> --enable-mpi-thread-multiple \ --with-hwloc=internal \
>> >> --without-verbs \ --with-wrapper-cflags="-m64 -mt" \
>> >> --with-wrapper-cxxflags="-m64" \ --with-wrapper-fcflags="-m64"
>> >> \ --with-wrapper-ldflags="-mt" \ --enable-debug \ |& tee
>> >> log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> >>
>> >> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc rm -r
>> >> /usr/local/openmpi-2.1.0_64_cc.old mv
>> >> /usr/local/openmpi-2.1.0_64_cc
>> >> /usr/local/openmpi-2.1.0_64_cc.old make install |& tee
>> >> log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc make check |&
>> >> tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> >>
>> >>
>> >> Sometimes everything works as expected.
>> >>
>> >> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2
>> >> 

Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-22 Thread Siegmar Gross

Hi Akshay,


Would it be possible for you to provide the source to reproduce the issue?


Yes, I've appended the file.


Kind regards

Siegmar




Thanks

On Tue, Mar 21, 2017 at 9:52 AM, Sylvain Jeaugey wrote:

Hi Siegmar,

I think this "NVIDIA : ..." error message comes from the fact that you add 
CUDA includes in the C*FLAGS. If you just use --with-cuda, Open MPI will compile
with CUDA support, but hwloc will not find CUDA and that will be fine. 
However, setting CUDA in CFLAGS will make hwloc find CUDA, compile CUDA support
(which is not needed) and then NVML will show this error message when not 
run on a machine with CUDA devices.

I guess gcc picks the environment variable, while cc does not hence the 
different behavior. So again, there is no need to add all those CUDA includes,
--with-cuda is enough.

About the opal_list_remove_item, we'll try to reproduce the issue and see 
where it comes from.

Sylvain


On 03/21/2017 12:38 AM, Siegmar Gross wrote:

Hi,

I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get once
more a warning about a missing item for one of my small programs (it
doesn't matter if I use my cc or gcc version). My gcc version also
displays the message "NVIDIA: no NVIDIA devices found" for the server
without NVIDIA devices (I don't get the message for my cc version).
I used the following commands to build the package (${SYSTEM_ENV}
is Linux and ${MACHINE_ENV} is x86_64).


mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-2.1.0rc4/configure \
  --prefix=/usr/local/openmpi-2.1.0_64_cc \
  --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
  JAVA_HOME=/usr/local/jdk1.8.0_66 \
  LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
  CC="cc" CXX="CC" FC="f95" \
  CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
  CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
  FCFLAGS="-m64" \
  CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --with-cuda=/usr/local/cuda \
  --with-valgrind=/usr/local/valgrind \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-m64 -mt" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --with-wrapper-ldflags="-mt" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.1.0_64_cc.old
mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


Sometimes everything works as expected.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:1

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2



More often I get a warning.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks:  2
 

Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-21 Thread Sylvain Jeaugey
If you installed CUDA libraries and includes in /usr, then it's not 
surprising hwloc finds them even without defining CFLAGS.


I'm just saying I think you won't get the error message if Open MPI 
finds CUDA but hwloc does not.


On 03/21/2017 11:05 AM, Roland Fehrenbacher wrote:

"SJ" == Sylvain Jeaugey  writes:

Hi Silvain,

I get the "NVIDIA : ..." run-time error messages just by compiling
with "--with-cuda=/usr":

./configure --prefix=${prefix} \
 --mandir=${prefix}/share/man \
 --infodir=${prefix}/share/info \
 --sysconfdir=/etc/openmpi/${VERSION} --with-devel-headers \
 --disable-memchecker \
 --disable-vt \
 --with-tm --with-slurm --with-pmi --with-sge \
 --with-cuda=/usr \
 --with-io-romio-flags='--with-file-system=nfs+lustre' \
 --with-cma --without-valgrind \
 --enable-openib-connectx-xrc \
 --enable-orterun-prefix-by-default \
 --disable-java

Roland
 
 SJ> Hi Siegmar, I think this "NVIDIA : ..." error message comes from

 SJ> the fact that you add CUDA includes in the C*FLAGS. If you just
 SJ> use --with-cuda, Open MPI will compile with CUDA support, but
 SJ> hwloc will not find CUDA and that will be fine. However, setting
 SJ> CUDA in CFLAGS will make hwloc find CUDA, compile CUDA support
 SJ> (which is not needed) and then NVML will show this error message
 SJ> when not run on a machine with CUDA devices.

 SJ> I guess gcc picks the environment variable, while cc does not
 SJ> hence the different behavior. So again, there is no need to add
 SJ> all those CUDA includes, --with-cuda is enough.

 SJ> About the opal_list_remove_item, we'll try to reproduce the
 SJ> issue and see where it comes from.

 SJ> Sylvain

 SJ> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
 >> Hi,
 >>
 >> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise
 >> Server
 >> 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get
 >>  once
 >> more a warning about a missing item for one of my small programs
 >> (it doesn't matter if I use my cc or gcc version). My gcc version
 >> also displays the message "NVIDIA: no NVIDIA devices found" for
 >> the server without NVIDIA devices (I don't get the message for my
 >> cc version).  I used the following commands to build the package
 >> (${SYSTEM_ENV} is Linux and ${MACHINE_ENV} is x86_64).
 >>
 >>
 >> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc cd
 >> openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
 >>
 >> ../openmpi-2.1.0rc4/configure \
 >> --prefix=/usr/local/openmpi-2.1.0_64_cc \
 >> --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
 >> --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
 >> --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
 >> JAVA_HOME=/usr/local/jdk1.8.0_66 \ LDFLAGS="-m64 -mt -Wl,-z
 >> -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/ lib64" \
 >> CC="cc" CXX="CC" FC="f95" \ CFLAGS="-m64 -mt -I/usr/local/include
 >> -I/usr/local/cuda/include" \ CXXFLAGS="-m64 -I/usr/local/include
 >> -I/usr/local/cuda/include" \ FCFLAGS="-m64" \ CPP="cpp
 >> -I/usr/local/include -I/usr/local/cuda/include" \ CXXCPP="cpp
 >> -I/usr/local/include -I/usr/local/cuda/include" \
 >> --enable-mpi-cxx \ --enable-cxx-exceptions \ --enable-mpi-java \
 >> --with-cuda=/usr/local/cuda \ --with-valgrind=/usr/local/valgrind
 >> \ --enable-mpi-thread-multiple \ --with-hwloc=internal \
 >> --without-verbs \ --with-wrapper-cflags="-m64 -mt" \
 >> --with-wrapper-cxxflags="-m64" \ --with-wrapper-fcflags="-m64" \
 >> --with-wrapper-ldflags="-mt" \ --enable-debug \ |& tee
 >> log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
 >>
 >> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc rm -r
 >> /usr/local/openmpi-2.1.0_64_cc.old mv
 >> /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
 >> make install |& tee
 >> log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc make check |& tee
 >> log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
 >>
 >>
 >> Sometimes everything works as expected.
 >>
 >> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2
 >> spawn_intra_comm Parent process 0: I create 2 slave processes
 >>
 >> Parent process 0 running on loki MPI_COMM_WORLD ntasks: 1
 >> COMM_CHILD_PROCESSES ntasks_local: 1 COMM_CHILD_PROCESSES
 >> ntasks_remote: 2 COMM_ALL_PROCESSES ntasks: 3 mytid in
 >> COMM_ALL_PROCESSES: 0
 >>
 >> Child process 0 running on nfs1 MPI_COMM_WORLD ntasks: 2
 >> COMM_ALL_PROCESSES ntasks: 3 mytid in COMM_ALL_PROCESSES: 1
 >>
 >> Child process 1 running on nfs2 MPI_COMM_WORLD ntasks: 2
 >> COMM_ALL_PROCESSES ntasks: 3 mytid in COMM_ALL_PROCESSES: 2
 >>
 >>
 >>
 >> More often I get a warning.
 >>
 >> loki spawn 144 mpiexec -np 1 --host 

Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-21 Thread Roland Fehrenbacher
> "SJ" == Sylvain Jeaugey  writes:

Hi Silvain,

I get the "NVIDIA : ..." run-time error messages just by compiling
with "--with-cuda=/usr":

./configure --prefix=${prefix} \
--mandir=${prefix}/share/man \
--infodir=${prefix}/share/info \
--sysconfdir=/etc/openmpi/${VERSION} --with-devel-headers \
--disable-memchecker \
--disable-vt \
--with-tm --with-slurm --with-pmi --with-sge \
--with-cuda=/usr \
--with-io-romio-flags='--with-file-system=nfs+lustre' \
--with-cma --without-valgrind \
--enable-openib-connectx-xrc \
--enable-orterun-prefix-by-default \
--disable-java

Roland

SJ> Hi Siegmar, I think this "NVIDIA : ..." error message comes from
SJ> the fact that you add CUDA includes in the C*FLAGS. If you just
SJ> use --with-cuda, Open MPI will compile with CUDA support, but
SJ> hwloc will not find CUDA and that will be fine. However, setting
SJ> CUDA in CFLAGS will make hwloc find CUDA, compile CUDA support
SJ> (which is not needed) and then NVML will show this error message
SJ> when not run on a machine with CUDA devices.

SJ> I guess gcc picks the environment variable, while cc does not
SJ> hence the different behavior. So again, there is no need to add
SJ> all those CUDA includes, --with-cuda is enough.

SJ> About the opal_list_remove_item, we'll try to reproduce the
SJ> issue and see where it comes from.

SJ> Sylvain

SJ> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
>> Hi,
>>
>> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise
>> Server
>> 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get
>>  once
>> more a warning about a missing item for one of my small programs
>> (it doesn't matter if I use my cc or gcc version). My gcc version
>> also displays the message "NVIDIA: no NVIDIA devices found" for
>> the server without NVIDIA devices (I don't get the message for my
>> cc version).  I used the following commands to build the package
>> (${SYSTEM_ENV} is Linux and ${MACHINE_ENV} is x86_64).
>>
>>
>> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc cd
>> openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>
>> ../openmpi-2.1.0rc4/configure \
>> --prefix=/usr/local/openmpi-2.1.0_64_cc \
>> --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>> --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>> --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>> JAVA_HOME=/usr/local/jdk1.8.0_66 \ LDFLAGS="-m64 -mt -Wl,-z
>> -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/ lib64" \
>> CC="cc" CXX="CC" FC="f95" \ CFLAGS="-m64 -mt -I/usr/local/include
>> -I/usr/local/cuda/include" \ CXXFLAGS="-m64 -I/usr/local/include
>> -I/usr/local/cuda/include" \ FCFLAGS="-m64" \ CPP="cpp
>> -I/usr/local/include -I/usr/local/cuda/include" \ CXXCPP="cpp
>> -I/usr/local/include -I/usr/local/cuda/include" \
>> --enable-mpi-cxx \ --enable-cxx-exceptions \ --enable-mpi-java \
>> --with-cuda=/usr/local/cuda \ --with-valgrind=/usr/local/valgrind
>> \ --enable-mpi-thread-multiple \ --with-hwloc=internal \
>> --without-verbs \ --with-wrapper-cflags="-m64 -mt" \
>> --with-wrapper-cxxflags="-m64" \ --with-wrapper-fcflags="-m64" \
>> --with-wrapper-ldflags="-mt" \ --enable-debug \ |& tee
>> log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc rm -r
>> /usr/local/openmpi-2.1.0_64_cc.old mv
>> /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
>> make install |& tee
>> log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc make check |& tee
>> log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>>
>> Sometimes everything works as expected.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2
>> spawn_intra_comm Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki MPI_COMM_WORLD ntasks: 1
>> COMM_CHILD_PROCESSES ntasks_local: 1 COMM_CHILD_PROCESSES
>> ntasks_remote: 2 COMM_ALL_PROCESSES ntasks: 3 mytid in
>> COMM_ALL_PROCESSES: 0
>>
>> Child process 0 running on nfs1 MPI_COMM_WORLD ntasks: 2
>> COMM_ALL_PROCESSES ntasks: 3 mytid in COMM_ALL_PROCESSES: 1
>>
>> Child process 1 running on nfs2 MPI_COMM_WORLD ntasks: 2
>> COMM_ALL_PROCESSES ntasks: 3 mytid in COMM_ALL_PROCESSES: 2
>>
>>
>>
>> More often I get a warning.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2
>> spawn_intra_comm Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki MPI_COMM_WORLD ntasks: 1
>> COMM_CHILD_PROCESSES ntasks_local: 1 COMM_CHILD_PROCESSES
>> ntasks_remote: 2 COMM_ALL_PROCESSES ntasks: 3 mytid in
>> COMM_ALL_PROCESSES: 0
>>
>> Child process 0 running on nfs1 MPI_COMM_WORLD ntasks: 2
 

Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-21 Thread Akshay Venkatesh
Hi Siegmar,

Would it be possible for you to provide the source to reproduce the issue?

Thanks

On Tue, Mar 21, 2017 at 9:52 AM, Sylvain Jeaugey 
wrote:

> Hi Siegmar,
>
> I think this "NVIDIA : ..." error message comes from the fact that you add
> CUDA includes in the C*FLAGS. If you just use --with-cuda, Open MPI will
> compile with CUDA support, but hwloc will not find CUDA and that will be
> fine. However, setting CUDA in CFLAGS will make hwloc find CUDA, compile
> CUDA support (which is not needed) and then NVML will show this error
> message when not run on a machine with CUDA devices.
>
> I guess gcc picks the environment variable, while cc does not hence the
> different behavior. So again, there is no need to add all those CUDA
> includes, --with-cuda is enough.
>
> About the opal_list_remove_item, we'll try to reproduce the issue and see
> where it comes from.
>
> Sylvain
>
>
> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
>
>> Hi,
>>
>> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
>> 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get once
>> more a warning about a missing item for one of my small programs (it
>> doesn't matter if I use my cc or gcc version). My gcc version also
>> displays the message "NVIDIA: no NVIDIA devices found" for the server
>> without NVIDIA devices (I don't get the message for my cc version).
>> I used the following commands to build the package (${SYSTEM_ENV}
>> is Linux and ${MACHINE_ENV} is x86_64).
>>
>>
>> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>> cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>
>> ../openmpi-2.1.0rc4/configure \
>>   --prefix=/usr/local/openmpi-2.1.0_64_cc \
>>   --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
>>   CC="cc" CXX="CC" FC="f95" \
>>   CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
>>   CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
>>   FCFLAGS="-m64" \
>>   CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
>>   CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
>>   --enable-mpi-cxx \
>>   --enable-cxx-exceptions \
>>   --enable-mpi-java \
>>   --with-cuda=/usr/local/cuda \
>>   --with-valgrind=/usr/local/valgrind \
>>   --enable-mpi-thread-multiple \
>>   --with-hwloc=internal \
>>   --without-verbs \
>>   --with-wrapper-cflags="-m64 -mt" \
>>   --with-wrapper-cxxflags="-m64" \
>>   --with-wrapper-fcflags="-m64" \
>>   --with-wrapper-ldflags="-mt" \
>>   --enable-debug \
>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> rm -r /usr/local/openmpi-2.1.0_64_cc.old
>> mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
>> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>>
>> Sometimes everything works as expected.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
>> Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki
>> MPI_COMM_WORLD ntasks:  1
>> COMM_CHILD_PROCESSES ntasks_local:  1
>> COMM_CHILD_PROCESSES ntasks_remote: 2
>> COMM_ALL_PROCESSES ntasks:  3
>> mytid in COMM_ALL_PROCESSES:0
>>
>> Child process 0 running on nfs1
>> MPI_COMM_WORLD ntasks:  2
>> COMM_ALL_PROCESSES ntasks:  3
>> mytid in COMM_ALL_PROCESSES:1
>>
>> Child process 1 running on nfs2
>> MPI_COMM_WORLD ntasks:  2
>> COMM_ALL_PROCESSES ntasks:  3
>> mytid in COMM_ALL_PROCESSES:2
>>
>>
>>
>> More often I get a warning.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
>> Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki
>> MPI_COMM_WORLD ntasks:  1
>> COMM_CHILD_PROCESSES ntasks_local:  1
>> COMM_CHILD_PROCESSES ntasks_remote: 2
>> COMM_ALL_PROCESSES ntasks:  3
>> mytid in COMM_ALL_PROCESSES:0
>>
>> Child process 0 running on nfs1
>> MPI_COMM_WORLD ntasks:  2
>> COMM_ALL_PROCESSES ntasks:  3
>>
>> Child process 1 running on nfs2
>> MPI_COMM_WORLD ntasks:  2
>> COMM_ALL_PROCESSES ntasks:  3
>> mytid in COMM_ALL_PROCESSES:2
>> mytid in COMM_ALL_PROCESSES:1
>>  Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the list
>> 0x7f96db515998
>> loki spawn 144
>>
>>
>>
>> I would be grateful, if somebody can fix the problem. Do you need anything
>> else? Thank you very much for any help in advance.
>>
>>
>> Kind regards
>>
>> Siegmar
>> 

Re: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-21 Thread Sylvain Jeaugey

Hi Siegmar,

I think this "NVIDIA : ..." error message comes from the fact that you 
add CUDA includes in the C*FLAGS. If you just use --with-cuda, Open MPI 
will compile with CUDA support, but hwloc will not find CUDA and that 
will be fine. However, setting CUDA in CFLAGS will make hwloc find CUDA, 
compile CUDA support (which is not needed) and then NVML will show this 
error message when not run on a machine with CUDA devices.


I guess gcc picks up the environment variable while cc does not, hence the
different behavior. So again, there is no need to add all those CUDA
includes; --with-cuda is enough.
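
For illustration, a trimmed sketch of your configure line with the CUDA
include directory dropped from CFLAGS/CXXFLAGS/CPP (only a subset of the
original options is shown; the remaining options stay as before):

../openmpi-2.1.0rc4/configure \
  --prefix=/usr/local/openmpi-2.1.0_64_cc \
  CC="cc" CXX="CC" FC="f95" \
  CFLAGS="-m64 -mt -I/usr/local/include" \
  CXXFLAGS="-m64 -I/usr/local/include" \
  FCFLAGS="-m64" \
  CPP="cpp -I/usr/local/include" \
  CXXCPP="cpp -I/usr/local/include" \
  --with-cuda=/usr/local/cuda \
  --with-hwloc=internal \
  --enable-debug

That way Open MPI should still build its CUDA support, while the embedded
hwloc no longer sees the CUDA/NVML headers.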


About the opal_list_remove_item, we'll try to reproduce the issue and 
see where it comes from.


Sylvain

On 03/21/2017 12:38 AM, Siegmar Gross wrote:

Hi,

I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get once
more a warning about a missing item for one of my small programs (it
doesn't matter if I use my cc or gcc version). My gcc version also
displays the message "NVIDIA: no NVIDIA devices found" for the server
without NVIDIA devices (I don't get the message for my cc version).
I used the following commands to build the package (${SYSTEM_ENV}
is Linux and ${MACHINE_ENV} is x86_64).


mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-2.1.0rc4/configure \
  --prefix=/usr/local/openmpi-2.1.0_64_cc \
  --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
  JAVA_HOME=/usr/local/jdk1.8.0_66 \
  LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
  CC="cc" CXX="CC" FC="f95" \
  CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
  CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
  FCFLAGS="-m64" \
  CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --with-cuda=/usr/local/cuda \
  --with-valgrind=/usr/local/valgrind \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-m64 -mt" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --with-wrapper-ldflags="-mt" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.1.0_64_cc.old
mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


Sometimes everything works as expected.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:1

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2



More often I get a warning.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2
mytid in COMM_ALL_PROCESSES:1
 Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the 
list 0x7f96db515998

loki spawn 144



I would be grateful if somebody could fix the problem. Do you need anything
else? Thank you very much for any help in advance.


Kind regards

Siegmar
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users




[OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4

2017-03-21 Thread Siegmar Gross

Hi,

I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Sometimes I get once
more a warning about a missing item for one of my small programs (it
doesn't matter if I use my cc or gcc version). My gcc version also
displays the message "NVIDIA: no NVIDIA devices found" for the server
without NVIDIA devices (I don't get the message for my cc version).
I used the following commands to build the package (${SYSTEM_ENV}
is Linux and ${MACHINE_ENV} is x86_64).


mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-2.1.0rc4/configure \
  --prefix=/usr/local/openmpi-2.1.0_64_cc \
  --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
  JAVA_HOME=/usr/local/jdk1.8.0_66 \
  LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
  CC="cc" CXX="CC" FC="f95" \
  CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
  CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
  FCFLAGS="-m64" \
  CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --with-cuda=/usr/local/cuda \
  --with-valgrind=/usr/local/valgrind \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-m64 -mt" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --with-wrapper-ldflags="-mt" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.1.0_64_cc.old
mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


Sometimes everything works as expected.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:1

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2



More often I get a warning.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2
mytid in COMM_ALL_PROCESSES:1
 Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the list 
0x7f96db515998
loki spawn 144



I would be grateful if somebody could fix the problem. Do you need anything
else? Thank you very much for any help in advance.


Kind regards

Siegmar
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users