Ah -- I missed the attachment; I only looked at your email text.

I'll have a look now...

auto-failure: Ah, I found this late last week and sent a fix around internally 
for review.  Should have something soon for trunk/v1.8.

If you care: we accidentally still fire up the usnic connectivity checker even 
if there are no usNICs present.
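
Until that fix lands, a possible interim workaround is to exclude the usnic BTL by name. This is only a sketch: it assumes that excluding the component also keeps the connectivity checker from starting, which has not been verified against r31907. The `mpirun` line and its process count are illustrative, not taken from the attached job script.

```shell
# Sketch only: exclude the usnic BTL so Open MPI selects among the remaining BTLs.
# The leading "^" is Open MPI's MCA syntax for "every component except these".
export OMPI_MCA_btl="^usnic"

# Hypothetical launch line (process count made up for illustration):
# mpirun -np 2 ./hellompi

# Show the setting that child processes will inherit.
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"
```

The same selection can be passed on the command line as `mpirun --mca btl ^usnic ...` instead of through the environment.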



On Jun 1, 2014, at 8:33 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> Hello, Jeff.
> 
> Please check the attached tar ("auto-failure" dir). There I saw the 
> following message:
> --------------------------------------------------------------------------
> An internal error has occurred in the Open MPI usNIC BTL.  This is
> highly unusual and shouldn't happen.  It suggests that there may be
> something wrong with the usNIC or OpenFabrics configuration on this
> server.
>
>   Server:       cn5
>   Message:      usnic connectivity client IPC connect read failed
>   File:         /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c
>   Line:         125
>   Error:        Operation not permitted
> --------------------------------------------------------------------------
> 
> I was surprised because, as I said, we don't use Cisco hardware. My guess 
> is that the problem may be in the query function. But I think this shows that 
> the usnic BTL somehow participates in the computation.
> 
> 
> 2014-06-01 19:20 GMT+07:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
> Just to be clear: it looks like you haven't seen any errors from the usnic 
> BTL, right?  (the Cisco VIC uses the usnic BTL only -- it does not use the 
> openib BTL)
> 
> 
> On Jun 1, 2014, at 2:57 AM, Artem Polyakov <artpo...@gmail.com> wrote:
> 
> > Hello, while testing the new PMI implementation I ran into a problem with 
> > OpenIB and/or usNIC support.
> > The cluster I use is built on Mellanox QDR. We don't use Cisco hardware, 
> > thus no Cisco Virtual Interface Card. To rule out any influence from the 
> > new PMI code, I used mpirun to launch the job. The Slurm job script is 
> > attached.
> >
> > While investigating the problem I found the following:
> > 1. With the TCP BTL everything works without errors (add export 
> > OMPI_MCA_btl="tcp,self" to the attached batch script).
> >
> > 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" to 
> > the attached batch script) I get the following error:
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > (the same assertion message is printed 10 times in total)
> >
> > Complete logs are tar-ed, check "openib-failure" directory.
> >
> > 3. If I do not fix the BTL component (no OMPI_MCA_btl is exported) I get 
> > either an immediate failure mentioning usNIC/OpenIB problems OR the program 
> > hangs.
> > For both cases I'm attaching complete tar-ed logs. Check the "auto-failure" 
> > dir for ompi stdout and stderr and "auto-hang" for the hang case.
> >
> > I am ready to provide additional info or help with testing, but I have no 
> > time to track down the problem myself in the next several days.
> >
> > --
> > С Уважением, Поляков Артем Юрьевич
> > Best regards, Artem Y. Polyakov
> > <task_mpirun.job><usnic-openib-faults.tar.bz2>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/06/14922.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14926.php
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14927.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
