Hello, while testing the new PMI implementation I ran into a problem with OpenIB and/or usNIC support. The cluster I use is built on Mellanox QDR. We don't use Cisco hardware, so there is no Cisco Virtual Interface Card. To rule out any influence from the new PMI code, I used mpirun to launch the job. The Slurm job script is attached.
While investigating the problem I found the following:

1. With the TCP BTL everything works without errors (add export OMPI_MCA_btl="tcp,self" to the attached batch script).

2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" to the attached batch script) I get the following error, printed 10 times in total:

   hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.

   Complete logs are in the tarball; see the "openib-failure" directory.

3. If I do not fix the BTL component (OMPI_MCA_btl is not exported), I get either an immediate failure reporting usNIC/OpenIB problems or the program hangs. Complete logs for both cases are in the tarball: see the "auto-failure" directory for the ompi stdout and stderr of the failure case, and "auto-hang" for the hang case.

I am ready to provide additional information or help with testing, but I will not have time to track down the problem myself for the next several days.

--
Best regards, Artem Y. Polyakov
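For reference, the three BTL configurations described above can be sketched as a small shell fragment for a batch script. This is only an illustration: the helper name select_btl and the BTL_MODE variable are hypothetical and do not come from the attached job script, and the launch line is left commented out because the actual mpirun arguments are not shown in this message.

```shell
#!/bin/sh
# Hypothetical helper (not part of the attached job script):
# choose which BTL components Open MPI is allowed to use.
#   tcp    -> case 1: TCP only
#   openib -> case 2: OpenIB only
#   auto   -> case 3: OMPI_MCA_btl not exported, Open MPI picks the BTLs
select_btl() {
    case "$1" in
        tcp)    export OMPI_MCA_btl="tcp,self" ;;
        openib) export OMPI_MCA_btl="openib,self" ;;
        auto)   unset OMPI_MCA_btl ;;
        *)      echo "unknown BTL mode: $1" >&2; return 1 ;;
    esac
}

# BTL_MODE is an assumed variable name; it defaults to the working TCP case.
select_btl "${BTL_MODE:-tcp}"

# Launch line (assumption: the binary is named hellompi, as in the log output):
# mpirun ./hellompi
```

Switching BTL_MODE between tcp, openib and auto corresponds to cases 1, 2 and 3 respectively.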
task_mpirun.job
Description: Binary data
usnic-openib-faults.tar.bz2
Description: BZip2 compressed data