Hi Siegmar,

Would it be possible for you to provide the source to reproduce the issue?
Thanks

On Tue, Mar 21, 2017 at 9:52 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

> Hi Siegmar,
>
> I think this "NVIDIA: ..." error message comes from the fact that you add
> CUDA includes in the C*FLAGS. If you just use --with-cuda, Open MPI will
> compile with CUDA support, but hwloc will not find CUDA, and that is
> fine. However, setting CUDA in CFLAGS will make hwloc find CUDA, compile
> CUDA support (which is not needed), and then NVML will show this error
> message when not run on a machine with CUDA devices.
>
> I guess gcc picks up the environment variable while cc does not, hence the
> different behavior. So again, there is no need to add all those CUDA
> includes; --with-cuda is enough.
>
> About the opal_list_remove_item warning, we'll try to reproduce the issue
> and see where it comes from.
>
> Sylvain
>
> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
>
>> Hi,
>>
>> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
>> 12.2 (x86_64)" system with Sun C 5.14 and gcc-6.3.0. Sometimes I once
>> more get a warning about a missing item for one of my small programs
>> (it doesn't matter whether I use my cc or my gcc version). My gcc
>> version also displays the message "NVIDIA: no NVIDIA devices found" on
>> the server without NVIDIA devices (I don't get the message with my cc
>> version). I used the following commands to build the package
>> (${SYSTEM_ENV} is Linux and ${MACHINE_ENV} is x86_64).
>>
>> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>> cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>
>> ../openmpi-2.1.0rc4/configure \
>>   --prefix=/usr/local/openmpi-2.1.0_64_cc \
>>   --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
>>   CC="cc" CXX="CC" FC="f95" \
>>   CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
>>   CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
>>   FCFLAGS="-m64" \
>>   CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
>>   CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
>>   --enable-mpi-cxx \
>>   --enable-cxx-exceptions \
>>   --enable-mpi-java \
>>   --with-cuda=/usr/local/cuda \
>>   --with-valgrind=/usr/local/valgrind \
>>   --enable-mpi-thread-multiple \
>>   --with-hwloc=internal \
>>   --without-verbs \
>>   --with-wrapper-cflags="-m64 -mt" \
>>   --with-wrapper-cxxflags="-m64" \
>>   --with-wrapper-fcflags="-m64" \
>>   --with-wrapper-ldflags="-mt" \
>>   --enable-debug \
>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> rm -r /usr/local/openmpi-2.1.0_64_cc.old
>> mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
>> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> Sometimes everything works as expected.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
>> Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki
>> MPI_COMM_WORLD ntasks: 1
>> COMM_CHILD_PROCESSES ntasks_local: 1
>> COMM_CHILD_PROCESSES ntasks_remote: 2
>> COMM_ALL_PROCESSES ntasks: 3
>> mytid in COMM_ALL_PROCESSES: 0
>>
>> Child process 0 running on nfs1
>> MPI_COMM_WORLD ntasks: 2
>> COMM_ALL_PROCESSES ntasks: 3
>> mytid in COMM_ALL_PROCESSES: 1
>>
>> Child process 1 running on nfs2
>> MPI_COMM_WORLD ntasks: 2
>> COMM_ALL_PROCESSES ntasks: 3
>> mytid in COMM_ALL_PROCESSES: 2
>>
>> More often I get a warning.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
>> Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki
>> MPI_COMM_WORLD ntasks: 1
>> COMM_CHILD_PROCESSES ntasks_local: 1
>> COMM_CHILD_PROCESSES ntasks_remote: 2
>> COMM_ALL_PROCESSES ntasks: 3
>> mytid in COMM_ALL_PROCESSES: 0
>>
>> Child process 0 running on nfs1
>> MPI_COMM_WORLD ntasks: 2
>> COMM_ALL_PROCESSES ntasks: 3
>>
>> Child process 1 running on nfs2
>> MPI_COMM_WORLD ntasks: 2
>> COMM_ALL_PROCESSES ntasks: 3
>> mytid in COMM_ALL_PROCESSES: 2
>> mytid in COMM_ALL_PROCESSES: 1
>> Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the list
>> 0x7f96db515998
>> loki spawn 144
>>
>> I would be grateful if somebody could fix the problem. Do you need
>> anything else? Thank you very much in advance for any help.
>>
>> Kind regards
>>
>> Siegmar
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

--
-Akshay