P.S.

1. To double-check, I ran the same program with the old ompi-1.6.5 that is installed on our cluster, and it worked without any problems.
2. My test program simply sends data around a ring.
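For anyone trying to reproduce this, the three BTL configurations described in the quoted report below could be switched in the batch script with a small helper along these lines. This is only a sketch: the function name `select_btl` and the mode names `tcp`, `openib`, and `auto` are hypothetical and are not part of the attached script. Only the two `OMPI_MCA_btl` values are taken from the report.

```shell
# Hypothetical helper (not from the attached batch script): choose which
# BTL components Open MPI may use by setting or clearing OMPI_MCA_btl.
select_btl() {
    case "$1" in
        tcp)
            # Case 1 in the report: TCP only, works without errors.
            export OMPI_MCA_btl="tcp,self"
            ;;
        openib)
            # Case 2 in the report: OpenIB only, triggers the assertion.
            export OMPI_MCA_btl="openib,self"
            ;;
        auto)
            # Case 3 in the report: no BTL is forced, Open MPI picks one.
            unset OMPI_MCA_btl
            ;;
        *)
            echo "select_btl: unknown mode '$1'" >&2
            return 1
            ;;
    esac
}

# Example use inside the batch script (left commented out here):
# select_btl tcp && mpirun ./hellompi
```

Running the same binary once per mode makes it easy to see which transport a failure belongs to.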
2014-06-01 13:57 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:

> Hello,
>
> While testing the new PMI implementation I ran into a problem with OpenIB
> and/or usNIC support.
> The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> so there is no Cisco Virtual Interface Card. To rule out any influence
> from the new PMI code, I used mpirun to launch the job. The Slurm job
> script is attached.
>
> While investigating the problem I found the following:
>
> 1. With the TCP BTL everything works without errors (add
>    export OMPI_MCA_btl="tcp,self" to the attached batch script).
>
> 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self"
>    to the attached batch script) I get the following error:
>
>    hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
>
>    [The same message is printed 10 times in total.]
>
>    The complete logs are in the attached tarball; see the
>    "openib-failure" directory.
>
> 3. If I do not fix the BTL component (OMPI_MCA_btl is not exported), I
>    get either an immediate failure that mentions usNIC/OpenIB problems
>    or the program hangs. Complete logs for both cases are in the
>    attached tarball: see the "auto-failure" directory for the ompi
>    stdout and stderr, and "auto-hang" for the hang case.
>
> I am ready to provide additional information or help with testing, but
> I will not have time to track down the problem myself for the next
> several days.
>
> --
> Best regards, Artem Y. Polyakov

--
Best regards, Artem Y. Polyakov