Hi Siegmar,

Would it be possible for you to provide the source code to reproduce the issue?

Thanks

On Tue, Mar 21, 2017 at 9:52 AM, Sylvain Jeaugey <sjeau...@nvidia.com>
wrote:

> Hi Siegmar,
>
> I think this "NVIDIA : ..." error message comes from the fact that you add
> CUDA includes in the C*FLAGS. If you just use --with-cuda, Open MPI will
> compile with CUDA support, but hwloc will not find CUDA and that will be
> fine. However, setting CUDA in CFLAGS will make hwloc find CUDA, compile
> CUDA support (which is not needed) and then NVML will show this error
> message when run on a machine without CUDA devices.
>
> I guess gcc picks up the environment variable while cc does not, hence the
> different behavior. So again, there is no need to add all those CUDA
> includes; --with-cuda is enough.
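>
> As a rough sketch of what this would look like, here is the configure
> call from the original report with the CUDA paths taken out of the
> compiler flags. Dropping -L/usr/local/cuda/lib64 from LDFLAGS is an
> assumption that goes beyond the includes mentioned above, and the
> options not shown are assumed to stay as they were.
>
> ```shell
> # Sketch only: original configure call without the CUDA include paths in
> # CFLAGS, CXXFLAGS, CPP and CXXCPP. Only --with-cuda still refers to CUDA.
> # Assumption: the CUDA library path is also dropped from LDFLAGS.
> # All other options from the original command are unchanged and omitted.
> ../openmpi-2.1.0rc4/configure \
>   --prefix=/usr/local/openmpi-2.1.0_64_cc \
>   --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64" \
>   CC="cc" CXX="CC" FC="f95" \
>   CFLAGS="-m64 -mt -I/usr/local/include" \
>   CXXFLAGS="-m64 -I/usr/local/include" \
>   FCFLAGS="-m64" \
>   CPP="cpp -I/usr/local/include" \
>   CXXCPP="cpp -I/usr/local/include" \
>   --with-cuda=/usr/local/cuda \
>   --with-hwloc=internal
> ```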
>
> About the opal_list_remove_item, we'll try to reproduce the issue and see
> where it comes from.
>
> Sylvain
>
>
> On 03/21/2017 12:38 AM, Siegmar Gross wrote:
>
>> Hi,
>>
>> I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
>> 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Once again, I sometimes
>> get a warning about a missing item for one of my small programs (it
>> doesn't matter whether I use my cc or gcc version). My gcc version also
>> displays the message "NVIDIA: no NVIDIA devices found" on the server
>> without NVIDIA devices (I don't get the message with my cc version).
>> I used the following commands to build the package (${SYSTEM_ENV}
>> is Linux and ${MACHINE_ENV} is x86_64).
>>
>>
>> mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>> cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>
>> ../openmpi-2.1.0rc4/configure \
>>   --prefix=/usr/local/openmpi-2.1.0_64_cc \
>>   --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
>>   CC="cc" CXX="CC" FC="f95" \
>>   CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
>>   CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
>>   FCFLAGS="-m64" \
>>   CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
>>   CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
>>   --enable-mpi-cxx \
>>   --enable-cxx-exceptions \
>>   --enable-mpi-java \
>>   --with-cuda=/usr/local/cuda \
>>   --with-valgrind=/usr/local/valgrind \
>>   --enable-mpi-thread-multiple \
>>   --with-hwloc=internal \
>>   --without-verbs \
>>   --with-wrapper-cflags="-m64 -mt" \
>>   --with-wrapper-cxxflags="-m64" \
>>   --with-wrapper-fcflags="-m64" \
>>   --with-wrapper-ldflags="-mt" \
>>   --enable-debug \
>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> rm -r /usr/local/openmpi-2.1.0_64_cc.old
>> mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
>> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>>
>> Sometimes everything works as expected.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
>> Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki
>>     MPI_COMM_WORLD ntasks:              1
>>     COMM_CHILD_PROCESSES ntasks_local:  1
>>     COMM_CHILD_PROCESSES ntasks_remote: 2
>>     COMM_ALL_PROCESSES ntasks:          3
>>     mytid in COMM_ALL_PROCESSES:        0
>>
>> Child process 0 running on nfs1
>>     MPI_COMM_WORLD ntasks:              2
>>     COMM_ALL_PROCESSES ntasks:          3
>>     mytid in COMM_ALL_PROCESSES:        1
>>
>> Child process 1 running on nfs2
>>     MPI_COMM_WORLD ntasks:              2
>>     COMM_ALL_PROCESSES ntasks:          3
>>     mytid in COMM_ALL_PROCESSES:        2
>>
>>
>>
>> More often I get a warning.
>>
>> loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
>> Parent process 0: I create 2 slave processes
>>
>> Parent process 0 running on loki
>>     MPI_COMM_WORLD ntasks:              1
>>     COMM_CHILD_PROCESSES ntasks_local:  1
>>     COMM_CHILD_PROCESSES ntasks_remote: 2
>>     COMM_ALL_PROCESSES ntasks:          3
>>     mytid in COMM_ALL_PROCESSES:        0
>>
>> Child process 0 running on nfs1
>>     MPI_COMM_WORLD ntasks:              2
>>     COMM_ALL_PROCESSES ntasks:          3
>>
>> Child process 1 running on nfs2
>>     MPI_COMM_WORLD ntasks:              2
>>     COMM_ALL_PROCESSES ntasks:          3
>>     mytid in COMM_ALL_PROCESSES:        2
>>     mytid in COMM_ALL_PROCESSES:        1
>>  Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the list 0x7f96db515998
>> loki spawn 144
>>
>>
>>
>> I would be grateful if somebody could fix the problem. Do you need anything
>> else? Thank you very much in advance for any help.
>>
>>
>> Kind regards
>>
>> Siegmar
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
>
>



-- 
-Akshay