Just to be clear: it looks like you haven't seen any errors from the usnic BTL, 
right?  The Cisco VIC uses only the usnic BTL; it does not use the openib 
BTL.
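
To make the three cases in the quoted report below easier to reproduce, here is a minimal sketch of how the BTL selection could be switched in a batch script. The helper name select_btl is hypothetical (it is not part of Open MPI or of the attached job script); only the OMPI_MCA_btl values are taken from the report.

```shell
#!/bin/sh
# select_btl: hypothetical helper, not part of Open MPI or the attached script.
# It sets (or clears) OMPI_MCA_btl to match one of the three reported cases.
select_btl() {
    case "$1" in
        tcp)    OMPI_MCA_btl="tcp,self";    export OMPI_MCA_btl ;;  # case 1: works
        openib) OMPI_MCA_btl="openib,self"; export OMPI_MCA_btl ;;  # case 2: assertion
        auto)   unset OMPI_MCA_btl ;;                               # case 3: fail or hang
        *)      echo "usage: select_btl tcp|openib|auto" >&2; return 1 ;;
    esac
}
```

Calling, for example, `select_btl openib` before the mpirun line in the job script should correspond to case 2; `select_btl auto` leaves the choice of BTL to Open MPI, as in case 3.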


On Jun 1, 2014, at 2:57 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> Hello, while testing the new PMI implementation I ran into a problem with 
> OpenIB and/or usNIC support. 
> The cluster I use is built on Mellanox QDR. We don't use Cisco hardware, so 
> there is no Cisco Virtual Interface Card. To rule out any influence from the 
> new PMI code, I used mpirun to launch the job. The Slurm job script is attached.
> 
> While investigating the problem I found the following:
> 1. With the TCP BTL everything works without errors (add export 
> OMPI_MCA_btl="tcp,self" to the attached batch script).
> 
> 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" to the 
> attached batch script) I get the following error:
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> [the same assertion failure is repeated nine more times]
> 
> The complete logs are in the attached tarball; see the "openib-failure" directory.
> 
> 3. If I do not fix the BTL component (OMPI_MCA_btl is not exported), I get 
> either an immediate failure that reports usNIC/OpenIB problems or a hang.
> Complete logs for both cases are in the tarball: see the "auto-failure" 
> directory for the ompi stdout and stderr of the failure case, and "auto-hang" 
> for the hang case.
> 
> I am ready to provide additional information or help with testing, but I will 
> not have time to track down the problem myself in the next several days.
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> <task_mpirun.job><usnic-openib-faults.tar.bz2>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14922.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
