Hello, while testing the new PMI implementation I ran into a problem with
OpenIB and/or usNIC support.
The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
so there is no Cisco Virtual Interface Card. To rule out any influence from
the new PMI code, I used mpirun to launch the job. The Slurm job script is
attached.

While investigating the problem I found the following:
1. With the TCP BTL everything works without errors (add export
OMPI_MCA_btl="tcp,self" to the attached batch script).

2. With the BTL restricted to OpenIB (add export OMPI_MCA_btl="openib,self"
to the attached batch script) I get the following error:
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
[the same assertion message is printed nine more times, ten in total]

Complete logs are in the attached tarball; see the "openib-failure"
directory.

3. If I do not restrict the BTL component (OMPI_MCA_btl is not exported),
the program either fails immediately with messages about usNIC/OpenIB
problems or hangs.
Complete logs for both cases are in the attached tarball: see the
"auto-failure" directory for the ompi stdout and stderr of the immediate
failure, and "auto-hang" for the hang case.
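
For reference, the three cases above differ only in how OMPI_MCA_btl is set
before mpirun is called. A minimal sketch of that switch, assuming a POSIX
shell batch script; the helper name select_btl and the commented mpirun line
are illustrative and are not taken from the attached task_mpirun.job:

```shell
#!/bin/sh
# Hypothetical helper: choose the BTL setting for one of the three cases.
#   tcp    -> case 1 (works)
#   openib -> case 2 (assertion in mca_btl_openib_del_procs)
#   auto   -> case 3 (no OMPI_MCA_btl exported; immediate failure or hang)
select_btl() {
    case "$1" in
        tcp)    export OMPI_MCA_btl="tcp,self" ;;
        openib) export OMPI_MCA_btl="openib,self" ;;
        auto)   unset OMPI_MCA_btl ;;
        *)      echo "unknown BTL mode: $1" >&2; return 1 ;;
    esac
}

# Example use inside the batch script (launch line is an assumption):
#   select_btl openib
#   mpirun ./hellompi
```

Calling select_btl with tcp, openib or auto before the launch line should
reproduce cases 1, 2 and 3 respectively.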

I am ready to provide additional information or help with testing, but I
will not have time to track down the problem myself for the next several
days.

-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov

Attachment: task_mpirun.job
Description: Binary data

Attachment: usnic-openib-faults.tar.bz2
Description: BZip2 compressed data
