Hi Jeff,

No firewall is enabled. Running the diagnostics, I found that the non-communicating MPI job (hello_c) runs, while ring_c remains stuck. There are of course warnings about OpenFabrics, but in my case I am running the application with openib disabled. Please see below.

[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
--------------------------------------------------------------------------
WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them). This
is most certainly not what you wanted. Check your cables, subnet
manager configuration, etc. The openib BTL will be ignored for this
job.

  Local host: compute-01-01.private.dns.zone
--------------------------------------------------------------------------
Hello, world, I am 0 of 2
Hello, world, I am 1 of 2
[pmd.pakmet.com:06386] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
--------------------------------------------------------------------------
WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them). This
is most certainly not what you wanted. Check your cables, subnet
manager configuration, etc. The openib BTL will be ignored for this
job.

  Local host: compute-01-01.private.dns.zone
--------------------------------------------------------------------------
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
[compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
[pmd.pakmet.com:15965] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Do you have firewalling enabled on either server?
>
> See this FAQ item:
>
> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>
> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>> Dear All
>>
>> I need your advice. While trying to run mpirun job across nodes I get
>> following error. It seems that the two nodes i.e, compute-01-01 and
>> compute-01-06 are not able to communicate with each other. While nodes
>> see each other on ping.
>>
>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>> ^openib ../bin/regcmMPICLM45 regcm.in
>>
>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.14 failed: No route to host (113)
>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.14 failed: No route to host (113)
>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.14 failed: No route to host (113)
>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>> mpirun: killing job...
>>
>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>> ttl=64 time=0.108 ms
>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>> ttl=64 time=0.088 ms
>>
>> --- compute-01-06.private.dns.zone ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>> [pmdtest@compute-01-01 ~]$
>>
>> Thanks in advance.
>>
>> Ahsan
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/11/25761.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25763.php

--
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014
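
A note on the output in this thread: ping resolves compute-01-06 to 10.0.0.8, while the failed connect() calls target 192.168.108.10 and 192.168.108.14. A successful ping therefore does not show that the 192.168.108.x addresses are reachable over TCP. One way to check that directly from a compute node might look like the minimal Python sketch below. The function name probe_tcp and the choice of port 22 (assumed open only because ssh to the nodes works) are illustrative assumptions and are not part of Open MPI.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a single TCP connection.

    Returns (True, None) on success, otherwise (False, name), where name
    is the symbolic errno (for example "EHOSTUNREACH") or "TIMEOUT".
    """
    try:
        # create_connection opens the socket and connects in one step;
        # the with-block closes it again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        return False, "TIMEOUT"
    except OSError as exc:
        # Map the numeric errno to its symbolic name when one exists.
        return False, errno.errorcode.get(exc.errno, str(exc))


# Hypothetical usage, run on compute-01-06 against the address from the log:
#   print(probe_tcp("192.168.108.10", 22))
#   print(probe_tcp("10.0.0.8", 22))
```

A result of EHOSTUNREACH would correspond to the "No route to host (113)" message in the log, whereas ECONNREFUSED would mean the address is reachable but nothing is listening on that port.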