P.S.

1. To double-check, I ran the same program with the older ompi-1.6.5 that
is installed on our cluster, and it worked without any problems.
2. My test program simply passes data around a ring.
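
For reference, the three BTL configurations described in the quoted message
below can be switched with a small shell helper. This is only a sketch: the
function name select_btl is hypothetical and is not part of the attached
batch script. The OMPI_MCA_btl values and the hellompi binary name are taken
from the report.

```shell
#!/bin/sh
# Hypothetical helper (not from the attached batch script): choose one of
# the three BTL configurations described in the report.
select_btl() {
    case "$1" in
        tcp)    export OMPI_MCA_btl="tcp,self" ;;     # case 1: TCP BTL only
        openib) export OMPI_MCA_btl="openib,self" ;;  # case 2: OpenIB BTL only
        auto)   unset OMPI_MCA_btl ;;                 # case 3: let Open MPI choose
        *)      echo "usage: select_btl tcp|openib|auto" >&2
                return 1 ;;
    esac
}

# Example use inside a Slurm batch script (left commented out here):
# select_btl openib
# mpirun ./hellompi
```

Unsetting the variable in the "auto" case matches the report's third case,
where no OMPI_MCA_btl is exported at all.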


2014-06-01 13:57 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:

> Hello, while testing the new PMI implementation I ran into a problem with
> OpenIB and/or usNIC support.
> The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> so there is no Cisco Virtual Interface Card. To rule out any influence of
> the new PMI code, I used mpirun to launch the job. The Slurm job script is
> attached.
>
> While investigating the problem I found the following:
> 1. With the TCP BTL everything works without errors (add export
> OMPI_MCA_btl="tcp,self" to the attached batch script).
>
> 2. With the OpenIB BTL explicitly selected (add export
> OMPI_MCA_btl="openib,self" to the attached batch script) I get the
> following error:
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> (the same assertion message is printed 10 times in total)
>
> Complete logs are in the attached tarball; see the "openib-failure"
> directory.
>
> 3. If I do not select a BTL component explicitly (OMPI_MCA_btl is not
> exported), I get either an immediate failure reporting usNIC/OpenIB
> problems or a hang of the program.
> For both cases I'm attaching complete tar-ed logs: see the "auto-failure"
> directory for the ompi stdout and stderr of the failure case and
> "auto-hang" for the hang case.
>
> I am ready to provide additional information or help with testing, but I
> will not have time to track down the problem myself for the next several
> days.
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
