Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2021-01-11 Thread Vincent via users

On 07/01/2021 19:51, Josh Hursey via users wrote:
I posted a fix for the static ports issue (currently on the v4.1.x 
branch):

https://github.com/open-mpi/ompi/pull/8339

If you have time, could you give it a try and confirm that it
fixes your issue?



Hello Josh

Definitely, yes! It no longer crashes, and I can see through ss/netstat that
the orted process is connecting to the port I specified. Good work.
Thank you.
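
For reference, the check was along these lines (a sketch; the exact ss
invocation is one option among several, and 6705 is the static port from my
original command):

  # on the node running orterun: is anything listening on the static OOB port?
  ss -tlnp | grep 6705
  # on the remote node: which connections does the orted process hold?
  ss -tnp | grep orted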


I wish you a happy 2021.

Regards

Vincent.

Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2021-01-07 Thread Josh Hursey via users
I posted a fix for the static ports issue (currently on the v4.1.x branch):
  https://github.com/open-mpi/ompi/pull/8339

If you have time, could you give it a try and confirm that it fixes your
issue?
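
If you build from a git clone, something like this should get you the PR for
testing (a sketch; adjust the prefix and build options to your setup):

  git clone --branch v4.1.x https://github.com/open-mpi/ompi.git
  cd ompi
  git fetch origin pull/8339/head:pr-8339
  git checkout pr-8339
  ./autogen.pl && ./configure --prefix=$HOME/openmpi && make -j install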

Thanks,
Josh



-- 
Josh Hursey
IBM Spectrum MPI Developer


Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2020-12-22 Thread Vincent via users

On 18/12/2020 23:04, Josh Hursey wrote:

Vincent,

Thanks for the details on the bug. Indeed this is a case that seems to 
have been a problem for a little while now when you use static ports 
with ORTE (-mca oob_tcp_static_ipv4_ports option). It must have crept 
in when we refactored the internal regular expression mechanism for 
the v4 branches (and now that I look maybe as far back as v3.1). I 
just hit this same issue in the past day or so working with a 
different user.


Though I do not have a suggestion for a workaround at this time (sorry), I
did file a GitHub issue and am looking into it. With the holiday I don't
know when I will have a fix, but you can watch the ticket for updates.

https://github.com/open-mpi/ompi/issues/8304

In the meantime, you could try the v3.0 series release (which predates 
this change) or the current Open MPI master branch (which approaches 
this a little differently). The same command line should work in both. 
Both can be downloaded from the links below:

https://www.open-mpi.org/software/ompi/v3.0/
https://www.open-mpi.org/nightly/master/

Hello Josh

Thank you for considering the problem. I will certainly keep watching 
the ticket. However, there is nothing really urgent (to me anyway).



Regarding your command line, it looks pretty good:
  orterun --launch-agent /home/boubliki/openmpi/bin/orted \
      -mca btl tcp --mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 \
      --mca oob_tcp_static_ipv4_ports 6705 \
      -host node2:1 -np 1 /path/to/some/program arg1 .. argn


I would suggest, while you are debugging this, that you use a program 
like /bin/hostname instead of a real MPI program. If /bin/hostname 
launches properly then move on to an MPI program. That will assure you 
that the runtime wired up correctly (oob/tcp), and then we can focus 
on the MPI side of the communication (btl/tcp). You will want to 
change "-mca btl tcp" to at least "-mca btl tcp,self" (or better "-mca 
btl tcp,vader,self" if you want shared memory). 'self' is the loopback 
interface in Open MPI.

Yes, this is actually what I did; I just wanted to keep the report generic,
without too much flourish. But it is important that you pointed this out for
new users, to help them understand the real purpose of each layer in an MPI
implementation.




Is there a reason that you are specifying the --launch-agent to the 
orted? Is it installed in a different path on the remote nodes? If 
Open MPI is installed in the same location on all nodes then you 
shouldn't need that.

I recompiled the sources with --enable-orterun-prefix-by-default passed to
./configure. Of course, it helps :)
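
That is, something like the following (a sketch of my rebuild; the prefix is
the one from my setup):

  ./configure --prefix=/home/boubliki/openmpi --enable-orterun-prefix-by-default
  make -j && make install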


Again, thank you.

Kind regards

Vincent.




Thanks,
Josh


Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2020-12-18 Thread Josh Hursey via users
Vincent,

Thanks for the details on the bug. Indeed this is a case that seems to have 
been a problem for a little while now when you use static ports with ORTE (-mca 
oob_tcp_static_ipv4_ports option). It must have crept in when we refactored the 
internal regular expression mechanism for the v4 branches (and now that I look 
maybe as far back as v3.1). I just hit this same issue in the past day or so 
working with a different user.

Though I do not have a suggestion for a workaround at this time (sorry), I did
file a GitHub issue and am looking into it. With the holiday I don't know when
I will have a fix, but you can watch the ticket for updates.
  https://github.com/open-mpi/ompi/issues/8304

In the meantime, you could try the v3.0 series release (which predates this 
change) or the current Open MPI master branch (which approaches this a little 
differently). The same command line should work in both. Both can be downloaded 
from the links below:
  https://www.open-mpi.org/software/ompi/v3.0/
  https://www.open-mpi.org/nightly/master/


Regarding your command line, it looks pretty good:
  orterun --launch-agent /home/boubliki/openmpi/bin/orted \
      -mca btl tcp --mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 \
      --mca oob_tcp_static_ipv4_ports 6705 \
      -host node2:1 -np 1 /path/to/some/program arg1 .. argn

I would suggest, while you are debugging this, that you use a program like 
/bin/hostname instead of a real MPI program. If /bin/hostname launches properly 
then move on to an MPI program. That will assure you that the runtime wired up 
correctly (oob/tcp), and then we can focus on the MPI side of the communication 
(btl/tcp). You will want to change "-mca btl tcp" to at least "-mca btl 
tcp,self" (or better "-mca btl tcp,vader,self" if you want shared memory). 
'self' is the loopback interface in Open MPI.
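
Concretely, something along these lines (a sketch reusing the ports and paths
from your command):

  # step 1: check runtime wire-up only (oob/tcp) -- no MPI traffic involved
  orterun --launch-agent /home/boubliki/openmpi/bin/orted \
      --mca oob_tcp_static_ipv4_ports 6705 -host node2:1 -np 1 /bin/hostname

  # step 2: a real MPI program, with TCP, shared memory, and loopback enabled
  orterun --launch-agent /home/boubliki/openmpi/bin/orted \
      --mca btl tcp,vader,self --mca btl_tcp_port_min_v4 6706 \
      --mca btl_tcp_port_range_v4 10 --mca oob_tcp_static_ipv4_ports 6705 \
      -host node2:1 -np 1 /path/to/some/program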

Is there a reason that you are specifying the --launch-agent to the orted? Is 
it installed in a different path on the remote nodes? If Open MPI is installed 
in the same location on all nodes then you shouldn't need that.


Thanks,
Josh



On Wed, Dec 16, 2020 at 9:23 AM Vincent Letocart via users wrote:
 
 
 Good morning
 
 I am facing a tuning problem while playing with the orterun command, trying
to force a TCP port within a specific range. Part of the difficulty may be
that I am not very familiar with the architecture of the software, and I
sometimes struggle with the documentation.

 Here is what I am trying to do (the problem has been reduced here to
launching a single task on *one* remote node):

  orterun --launch-agent /home/boubliki/openmpi/bin/orted \
      -mca btl tcp --mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 \
      --mca oob_tcp_static_ipv4_ports 6705 \
      -host node2:1 -np 1 /path/to/some/program arg1 .. argn

 These MCA options are mentioned here and there in various mailing lists and
archives on the net. The Open MPI version is 4.0.5.
 
 I tried different combinations, such as:
 - only --mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10
   (--report-uri then shows a randomly picked TCP port number)
 - adding --mca oob_tcp_static_ipv4_ports 6705
   (--report-uri then reports the TCP port I specified, and everything crashes)
 - and many others,
 but the result is:
 [node2:4050181] *** Process received signal ***
 [node2:4050181] Signal: Segmentation fault (11)
 [node2:4050181] Signal code: Address not mapped (1)
 [node2:4050181] Failing at address: (nil)
 [node2:4050181] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7fdaf95a9dd0]
 [node2:4050181] *** End of error message ***
 bash: line 1: 4050181 Segmentation fault  (core dumped) 
/home/boubliki/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid 
"1254293504" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca 
orte_node_regex "node[1:1,2]@0(2)" -mca btl "tcp" --mca btl_tcp_port_min_v4 
"6706" --mca btl_tcp_port_range_v4 "10" --mca oob_tcp_static_ipv4_ports "6705" 
-mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri 
"1254293504.0;tcp://192.168.xxx.xxx:6705" -mca orte_launch_agent 
"/home/boubliki/openmpi/bin/orted" -mca pmix "^s1,s2,cray,isolated"
 I tried on different machines, and also with different compilers (gcc 10.2
and intel 19u1). Version 4.1.0rc5 did not improve things, and neither did
forcing no optimization with -O0.
 
 I am not familiar with debugging this kind of software, but I could add a
latency somewhere (a sleep()) and catch the orted process on the [single]
remote node, reaching line 572 with gdb.
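
 Roughly like this (a sketch; the exact gdb session is from memory, and the
sleep() I patched in was just to buy time to attach):

  # on node2, while the freshly spawned orted is blocked in the added sleep():
  gdb -p $(pgrep -n orted)
  (gdb) break ess_base_std_orted.c:572
  (gdb) continue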
 
 boubliki@node1: ~/openmpi/src/openmpi-4.0.5> cat -n orte/mca/ess/base/ess_base_std_orted.c | sed -n -r -e '562,583p'
    562      if (orte_static_ports || orte_fwd_mpirun_port) {
    563          if (NULL == orte_node_regex) {
    564              /* we didn't get the node info */
    565              error = "cannot construct daemon map for static ports - no node map info";
    566              goto error;
    567          }
    568          /* extract the node info from