Hi Siegmar,

I think this "NVIDIA: ..." error message comes from the fact that you add the CUDA include paths to the C*FLAGS. If you just use --with-cuda, Open MPI will compile with CUDA support, hwloc will not find CUDA, and that is fine. However, putting the CUDA paths in CFLAGS makes hwloc find CUDA and compile its own CUDA support (which is not needed); NVML then prints this error message when run on a machine without CUDA devices.

I guess gcc picks up the environment variable while cc does not, hence the different behavior. So again, there is no need to add all those CUDA includes; --with-cuda is enough.
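
As a rough sketch of what that could look like (the values are copied from the quoted configure command with only the entries under /usr/local/cuda dropped; this is untested, and the script only prints the changed arguments instead of calling configure):

```shell
#!/bin/sh
# Sketch only: flag values copied from the quoted configure command,
# with the -I/-L entries under /usr/local/cuda removed.
CFLAGS="-m64 -mt -I/usr/local/include"
CXXFLAGS="-m64 -I/usr/local/include"
CPP="cpp -I/usr/local/include"
CXXCPP="cpp -I/usr/local/include"
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64"

# --with-cuda stays; it is what enables CUDA support in Open MPI itself.
WITH_CUDA="--with-cuda=/usr/local/cuda"

# Dry run: print the changed arguments rather than invoking configure.
printf '%s\n' \
  "CFLAGS=\"$CFLAGS\"" \
  "CXXFLAGS=\"$CXXFLAGS\"" \
  "CPP=\"$CPP\"" \
  "CXXCPP=\"$CXXCPP\"" \
  "LDFLAGS=\"$LDFLAGS\"" \
  "$WITH_CUDA"
```

All other options from the original command would be passed to configure unchanged.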

About the opal_list_remove_item warning, we'll try to reproduce the issue and see where it comes from.

Sylvain

On 03/21/2017 12:38 AM, Siegmar Gross wrote:
Hi,

I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Once again, I sometimes
get a warning about a missing item for one of my small programs (it
doesn't matter whether I use my cc or gcc version). My gcc version also
displays the message "NVIDIA: no NVIDIA devices found" on the server
without NVIDIA devices (I don't get the message with my cc version).
I used the following commands to build the package (${SYSTEM_ENV}
is Linux and ${MACHINE_ENV} is x86_64).


mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-2.1.0rc4/configure \
  --prefix=/usr/local/openmpi-2.1.0_64_cc \
  --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
  JAVA_HOME=/usr/local/jdk1.8.0_66 \
  LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
  CC="cc" CXX="CC" FC="f95" \
  CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
  CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
  FCFLAGS="-m64" \
  CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --with-cuda=/usr/local/cuda \
  --with-valgrind=/usr/local/valgrind \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-m64 -mt" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --with-wrapper-ldflags="-mt" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.1.0_64_cc.old
mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


Sometimes everything works as expected.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        0

Child process 0 running on nfs1
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        1

Child process 1 running on nfs2
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        2



More often I get a warning.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        0

Child process 0 running on nfs1
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3

Child process 1 running on nfs2
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        2
    mytid in COMM_ALL_PROCESSES:        1
Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the list 0x7f96db515998
loki spawn 144



I would be grateful if somebody could fix the problem. Do you need anything
else? Thank you very much in advance for any help.


Kind regards

Siegmar
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

