Ah -- I missed the attachment; I only looked at your email text. I'll have a look now...

auto-failure: Ah, I found this late last week and sent a fix around
internally for review. Should have something soon for trunk/v1.8.

If you care: we accidentally still fire up the usnic connectivity checker
even if there are no usNICs present.


On Jun 1, 2014, at 8:33 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> Hello, Jeff.
>
> Please check the attached tar ("auto-failure" dir). There I saw the
> following message:
> --------------------------------------------------------------------------
> An internal error has occurred in the Open MPI usNIC BTL. This is
> highly unusual and shouldn't happen. It suggests that there may be
> something wrong with the usNIC or OpenFabrics configuration on this
> server.
> Server: cn5
> Message:
> usnic connectivity client IPC connect read failed
> File:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c
> Line: 125
> Error: Operation not permitted
> --------------------------------------------------------------------------
>
> I was surprised because, as I said, we don't use Cisco hardware. My guess
> is that it could be a problem in the query function. But I think this shows
> that the usnic BTL somehow participates in the computation.
>
>
> 2014-06-01 19:20 GMT+07:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
> Just to be clear: it looks like you haven't seen any errors from the usnic
> BTL, right? (the Cisco VIC uses the usnic BTL only -- it does not use the
> openib BTL)
>
>
> On Jun 1, 2014, at 2:57 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> > Hello, while testing the new PMI implementation I ran into a problem with
> > OpenIB and/or usNIC support.
> > The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> > thus no Cisco Virtual Interface Card. To rule out any influence from the
> > new PMI code, I used mpirun to launch the job. The Slurm job script is
> > attached.
> >
> > While investigating the problem I found the following:
> > 1. With the TCP BTL everything works without errors (add export
> > OMPI_MCA_btl="tcp,self" in the attached batch script).
> >
> > 2. With the OpenIB BTL selected explicitly (add export
> > OMPI_MCA_btl="openib,self" in the attached batch script) I get the
> > following error:
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> > mca_btl_openib_del_procs: Assertion
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> >
> > Complete logs are tar-ed; check the "openib-failure" directory.
> >
> > 3. If I do not select a BTL component explicitly (no OMPI_MCA_btl is
> > exported) I get either an immediate failure reporting usNIC/OpenIB
> > problems OR the program hangs.
> > For both cases I'm attaching complete tar-ed logs. Check the "auto-failure"
> > dir for ompi stdout and stderr and "auto-hang" for the hang case.
> >
> > I am ready to provide additional info or help with testing, but I have no
> > time to track down the problem myself in the next several days.
> >
> > --
> > С Уважением, Поляков Артем Юрьевич
> > Best regards, Artem Y. Polyakov
> > <task_mpirun.job><usnic-openib-faults.tar.bz2>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/06/14922.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14926.php
>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14927.php


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/