Are you running any kind of firewall on the node where mpirun is invoked? Open MPI needs to be able to use arbitrary TCP ports between the servers on which it runs.
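If there is one, here is a rough sketch of how to check and work around it (I'm assuming a Linux node with iptables; the port numbers below are arbitrary examples, and the MCA parameter names are from memory for the 1.8 series -- please verify them with "ompi_info --param oob tcp --level 9" and "ompi_info --param btl tcp --level 9" on your build):

  # See whether any packet-filtering rules are active on the node where mpirun runs
  $ sudo iptables -L -n

  # If the firewall must stay up, pin Open MPI's out-of-band and TCP BTL traffic
  # to a known port range and open that range between all nodes
  $ mpirun --mca oob_tcp_static_ipv4_ports 10000-10100 \
           --mca btl_tcp_port_min_v4 10200 \
           --mca btl_tcp_port_range_v4 100 \
           -np 2 hello_c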
This second mail seems to imply a bug in OMPI's oob_tcp_if_include param handling, however -- it is supposed to accept an interface name, not just a network specification. Ralph -- can you have a look? (A possible workaround is sketched in the P.S. at the bottom, below the quoted logs.)

On Aug 12, 2014, at 8:41 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> When I add --mca oob_tcp_if_include ib0 (InfiniBand interface) to mpirun (as it was done here: http://www.open-mpi.org/community/lists/users/2014/07/24857.php) I got this output:
>
> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:08792] mca:base:select:( plm) Query of component [isolated] set priority to 0
> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] set priority to 75
> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:08792] mca: base: components_register: registering oob components
> [compiler-2:08792] mca: base: components_register: found loaded component tcp
> [compiler-2:08792] mca: base: components_register: component tcp register function successful
> [compiler-2:08792] mca: base: components_open: opening oob components
> [compiler-2:08792] mca: base: components_open: found loaded component tcp
> [compiler-2:08792] mca: base: components_open: component tcp open function successful
> [compiler-2:08792] mca:oob:select: checking available component tcp
> [compiler-2:08792] mca:oob:select: Querying component [tcp]
> [compiler-2:08792] oob:tcp: component_available called
> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [compiler-2:08792] [[42190,0],0] TCP STARTUP
> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0
> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883
> [compiler-2:08792] mca:oob:select: Adding component to end
> [compiler-2:08792] mca:oob:select: Found 1 active transports
> [compiler-2:08792] mca: base: components_register: registering rml components
> [compiler-2:08792] mca: base: components_register: found loaded component oob
> [compiler-2:08792] mca: base: components_register: component oob has no register or open function
> [compiler-2:08792] mca: base: components_open: opening rml components
> [compiler-2:08792] mca: base: components_open: found loaded component oob
> [compiler-2:08792] mca: base: components_open: component oob open function successful
> [compiler-2:08792] orte_rml_base_select: initializing rml component oob
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08792] [[42190,0],0] posting recv
> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
> Daemon was launched on node1-128-01 - beginning to initialize
> Daemon was launched on node1-128-02 - beginning to initialize
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for oob_tcp_if_include. This
> value will be ignored.
>
> Local host: node1-128-01
> Value: "ib0"
> Message: Invalid specification (missing "/")
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for oob_tcp_if_include. This
> value will be ignored.
>
> Local host: node1-128-02
> Value: "ib0"
> Message: Invalid specification (missing "/")
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> None of the TCP networks specified to be included for out-of-band
> communications could be found:
>
> Value given:
>
> Please revise the specification and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> None of the TCP networks specified to be included for out-of-band
> communications could be found:
>
> Value given:
>
> Please revise the specification and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> No network interfaces were found for out-of-band communications. We require
> at least one available network for out-of-band messaging.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> No network interfaces were found for out-of-band communications. We require
> at least one available network for out-of-band messaging.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_oob_base_select failed
> --> Returned value (null) (-43) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_oob_base_select failed
> --> Returned value (null) (-43) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> srun: error: node1-128-02: task 1: Exited with exit code 213
> srun: Terminating job step 657300.0
> srun: error: node1-128-01: task 0: Exited with exit code 213
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd
> [compiler-2:08792] mca: base: close: component oob closed
> [compiler-2:08792] mca: base: close: unloading component oob
> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN
> [compiler-2:08792] mca: base: close: component tcp closed
> [compiler-2:08792] mca: base: close: unloading component tcp
>
>
> Tue, 12 Aug 2014 16:14:58 +0400 from Timur Ismagilov <tismagi...@mail.ru>:
> Hello!
>
> I have Open MPI v1.8.2rc4r32485
>
> When I run hello_c, I got this error message:
> $mpirun -np 2 hello_c
>
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
>
> When I run with --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 I got this output:
> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
>
> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:08780] mca:base:select:( plm) Query of component [isolated] set priority to 0
> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set priority to 75
> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:08780] mca: base: components_register: registering oob components
> [compiler-2:08780] mca: base: components_register: found loaded component tcp
> [compiler-2:08780] mca: base: components_register: component tcp register function successful
> [compiler-2:08780] mca: base: components_open: opening oob components
> [compiler-2:08780] mca: base: components_open: found loaded component tcp
> [compiler-2:08780] mca: base: components_open: component tcp open function successful
> [compiler-2:08780] mca:oob:select: checking available component tcp
> [compiler-2:08780] mca:oob:select: Querying component [tcp]
> [compiler-2:08780] oob:tcp: component_available called
> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our list of V4 connections
> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our list of V4 connections
> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our list of V4 connections
> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our list of V4 connections
> [compiler-2:08780] [[42202,0],0] TCP STARTUP
> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0
> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420
> [compiler-2:08780] mca:oob:select: Adding component to end
> [compiler-2:08780] mca:oob:select: Found 1 active transports
> [compiler-2:08780] mca: base: components_register: registering rml components
> [compiler-2:08780] mca: base: components_register: found loaded component oob
> [compiler-2:08780] mca: base: components_register: component oob has no register or open function
> [compiler-2:08780] mca: base: components_open: opening rml components
> [compiler-2:08780] mca: base: components_open: found loaded component oob
> [compiler-2:08780] mca: base: components_open: component oob open function successful
> [compiler-2:08780] orte_rml_base_select: initializing rml component oob
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
> [compiler-2:08780] [[42202,0],0] posting recv
> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
> Daemon was launched on node1-130-08 - beginning to initialize
> Daemon was launched on node1-130-03 - beginning to initialize
> Daemon was launched on node1-130-05 - beginning to initialize
> Daemon was launched on node1-130-02 - beginning to initialize
> Daemon was launched on node1-130-01 - beginning to initialize
> Daemon was launched on node1-130-04 - beginning to initialize
> Daemon was launched on node1-130-07 - beginning to initialize
> Daemon was launched on node1-130-06 - beginning to initialize
> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03
> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for commands!
> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02
> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for commands!
> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01
> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for commands!
> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05
> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for commands!
> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08
> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for commands!
> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07
> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for commands!
> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04
> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for commands!
> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06
> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for commands!
> srun: error: node1-130-03: task 2: Exited with exit code 1
> srun: Terminating job step 657040.1
> srun: error: node1-130-02: task 1: Exited with exit code 1
> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> srun: error: node1-130-01: task 0: Exited with exit code 1
> srun: error: node1-130-05: task 4: Exited with exit code 1
> srun: error: node1-130-08: task 7: Exited with exit code 1
> srun: error: node1-130-07: task 6: Exited with exit code 1
> srun: error: node1-130-04: task 3: Killed
> srun: error: node1-130-06: task 5: Killed
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
> [compiler-2:08780] mca: base: close: component oob closed
> [compiler-2:08780] mca: base: close: unloading component oob
> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
> [compiler-2:08780] mca: base: close: component tcp closed
> [compiler-2:08780] mca: base: close: unloading component tcp
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24987.php
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24988.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
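P.S. Timur: until the interface-name handling is fixed, a possible workaround (untested on my end) is to give oob_tcp_if_include a CIDR network specification instead of an interface name -- the parser is clearly expecting a "/". Your first log only added 10.128.0.4 on the head node, which suggests ib0 is on the 10.128.0.x network; adjust the prefix and mask below to whatever your IB subnet actually is:

  $ mpirun --mca oob_tcp_if_include 10.128.0.0/16 --debug-daemons -np 2 hello_c

If that still aborts, the same verbose output (--mca oob_base_verbose 10) from the failing run would help.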