Ralph,

I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello".
I found that there is an attempt (by a secondary thread) to open a TCP
connection from the rank process to the eth0 address of the local host (I am
guessing to reach the orted/mpirun).
However, when the "lo" interface is down, the Linux kernel apparently
cannot complete that connection.
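
For anyone who wants to repeat the strace inspection, here is a minimal sketch (Python) that pulls IPv4 connect() calls out of strace output. The regular expression assumes the usual strace layout for AF_INET addresses, which can differ between strace versions, and the sample line is purely illustrative (hypothetical fd, port, and result); it is not taken from the actual trace.

```python
import re

# Matches the AF_INET form that strace typically prints for connect().
# The exact layout can vary between strace versions; treat as a starting point.
CONNECT_RE = re.compile(
    r'connect\((?P<fd>\d+), \{sa_family=AF_INET, '
    r'sin_port=htons\((?P<port>\d+)\), '
    r'sin_addr=inet_addr\("(?P<addr>[\d.]+)"\)\}, \d+\)\s*=\s*(?P<result>.*)'
)

def find_connects(lines):
    """Return (address, port, result) for each IPv4 connect() call found."""
    hits = []
    for line in lines:
        m = CONNECT_RE.search(line)
        if m:
            hits.append((m.group("addr"), int(m.group("port")),
                         m.group("result").strip()))
    return hits

# Illustrative input only (hypothetical values), not lines from the real trace.
sample = [
    'socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 12',
    'connect(12, {sa_family=AF_INET, sin_port=htons(41234), '
    'sin_addr=inet_addr("10.0.2.15")}, 16) = -1 EINPROGRESS (Operation now in progress)',
]
print(find_connects(sample))
```

With "-o trace -ff", strace writes one file per process or thread named trace.<pid>, so the lines of each such file can be passed to find_connects() in turn.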

In fact, if I am sufficiently patient, it turns out the "hang" is bounded,
and eventually one sees:

phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    blcr-armv7
  Remote host:   10.0.2.15
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------

real    2m8.151s
user    0m5.360s
sys     0m57.430s


Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.
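
A back-of-the-envelope check on the length of the pause, as a sketch only. It assumes (not verified in this thread) that the delay is the kernel's own connect() timeout, and that the system uses the documented Linux defaults of a 1-second initial retransmission timeout that doubles on each retry, with net.ipv4.tcp_syn_retries = 6. The function name and parameters are illustrative.

```python
def syn_connect_timeout(syn_retries=6, initial_rto=1.0):
    """Total seconds before connect() gives up when no SYN is ever answered.

    Assumes the initial SYN plus `syn_retries` retransmissions, with the wait
    doubling after each attempt (plain exponential backoff, no cap applied).
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))

print(syn_connect_timeout())  # 127.0
```

Under those assumptions the total is 127 seconds, which is close to the 2m8s of wall-clock time shown above. It does not explain the large "sys" time, so this is at best part of the picture.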

There is no firewall, but in case you doubt me on that, here is a
demonstration using ping to show that 10.0.2.15 is only reachable when the
loopback interface is enabled:

phargrov@blcr-armv7:~$ sudo ifconfig lo up
phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

--- 10.0.2.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms


phargrov@blcr-armv7:~$ sudo ifconfig lo down
phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

--- 10.0.2.15 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1006ms


So, there is no "hang" -- just a 2-minute pause before the error message is
generated.
However, it may still be possible to present a better/earlier error message
when there is no loopback interface (and at least one rank process is to be
launched locally).
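
To illustrate the kind of early check I have in mind, here is a minimal standalone sketch (Python, Linux only). It is not how Open MPI enumerates interfaces; it simply reads the sysfs flags file for "lo" and tests the IFF_UP bit. The function names and the warning text are hypothetical.

```python
import os

IFF_UP = 0x1  # interface is administratively up (value from <linux/if.h>)

def flags_say_up(flags_text):
    """Interpret the hex string found in /sys/class/net/<ifname>/flags."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)

def loopback_is_up(ifname="lo", sysfs_root="/sys/class/net"):
    """Return True or False, or None if the flags file cannot be read."""
    path = os.path.join(sysfs_root, ifname, "flags")
    try:
        with open(path) as f:
            return flags_say_up(f.read())
    except OSError:
        return None

# Warn only when the interface exists and is definitely down.
if loopback_is_up() is False:
    print('warning: loopback interface "lo" is down; '
          'local TCP connections may time out')
```

A launcher could run such a check before starting any local rank and report the problem immediately, rather than after the connection attempt times out.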


-Paul

On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain <r...@open-mpi.org> wrote:

> I'll have to look - there isn't supposed to be such a requirement, and I
> certainly haven't seen it before.
>
>
> On Nov 25, 2014, at 3:26 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Allan,
>
> I am glad things are working for you now.
> I can confirm (on a QEMU-emulated Versatile Express A9 board running
> Ubuntu 14.04) that disabling the "lo" interface reproduces the problem.
> I imagine this is true on other architectures, though I did not attempt to
> verify.
>
> Ralph,
>
> If oob:tcp really does need the loopback interface, shouldn't its lack be
> something that could/should be detected and reported instead of hanging as
> Allan saw?
>
> FWIW, neither of the following resolved the problem:
>     -mca oob_tcp_if_exclude lo
>     -mca oob_tcp_if_include eth0
>
>
> -Paul
>
> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>
>> I think I have found the problem. After inspecting the output with
>>
>> "-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
>> oob_base_verbose 10"
>>
>> on both the old system and the new system, I noticed there is one line
>> that is different: on the old system, where it works correctly, there is a
>> line that says "oob:tcp:init rejecting loopback interface lo", while on the
>> new system there is no such line. Both systems proceed to open interface
>> eth0 afterwards. Then I checked the new system, and found out that somehow
>> the loopback interface is not up by default. After I brought up the lo
>> interface, mpirun executes normally.
>>
>> Does this mean that OpenMPI uses lo for some initial setup? Since the
>> actual socket was created on eth0 I did not think of checking the lo
>> interface. Anyway, thanks everyone for all of your kind help. Let me know
>> if you want me to provide any more information for future reference.
>>
>> Regards,
>> Allan
>>
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>
>>> Thanks Ralph!
>>>
>>> I did not compile my openmpi with --enable-debug, and I am compiling it
>>> now. But your suggested command already provided some output, which I
>>> have attached to this email.
>>>
>>> It seems the process was stuck on the line:
>>> "[fpga2:00962] [[44848,1],0] waiting for connect completion to
>>> [[44848,0],0] - activating send event"
>>>
>>> Then it got stuck and I CTRL+C'ed it. Before that line, it said
>>> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
>>> [[44848,0],0] via interface eth0'.
>>>
>>>
>>> Regards,
>>> Di
>>>
>>> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> This is all running on a single node, correct? If so, did you configure
>>>> OMPI with --enable-debug?
>>>>
>>>> If you can do that, or already have, then let's add the following to
>>>> the mpirun cmd line:
>>>>
>>>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
>>>> oob_base_verbose 10
>>>>
>>>> You'll get a bunch of output, but hopefully it will tell us where
>>>> mpirun is encountering a problem.
>>>> Ralph
>>>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>> wrote:
>>>>
>>>>> Allan,
>>>>>
>>>>> If you send me the .config from your build of the kernel I can compare
>>>>> it against, for instance, my .config for a Raspberry Pi.
>>>>> There will certainly be many differences, but I am hoping my own
>>>>> experience configuring linux kernels will help me filter the "noise" from
>>>>> any differences that might be significant.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>
>>>>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>>>>> linux, and I do not have the configuration file for the old kernel
>>>>>> since it is provided as is. However, I have the new kernel
>>>>>> configuration since I compiled it myself. Would it be helpful if I sent
>>>>>> you the .config file from my kernel build? It may be quite painful to
>>>>>> look through that file though. Is there any other way that I can obtain
>>>>>> the configuration?
>>>>>>
>>>>>> I checked my config for the new kernel, and UNIX-domain sockets and
>>>>>> Sys V IPC are both enabled in the build. Are there any other
>>>>>> possibilities I can check?
>>>>>>
>>>>>> Thanks,
>>>>>> Di
>>>>>>
>>>>>> --
>>>>>> Di Wu (Allan)
>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>> Department of Computer Science, UC Los Angeles
>>>>>> Email: al...@cs.ucla.edu
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> Allan,
>>>>>>>
>>>>>>> A likely possibility is that some important kernel feature (that
>>>>>>> Open MPI assumes is present) is missing.
>>>>>>> That includes not only "kernel modules" as you mention, but also
>>>>>>> features configured in (or out) of the base kernel.
>>>>>>> For instance, some embedded kernels omit UNIX-domain sockets and
>>>>>>> SysV IPC support.
>>>>>>>
>>>>>>> If you can send me (preferably off-list) the kernel config files for
>>>>>>> the old and new kernels I may be able to spot something.
>>>>>>> If present, you are looking for /boot/config-[VERSION]
>>>>>>>
>>>>>>> -Paul
>>>>>>>
>>>>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm sorry, I forgot to change the subject when I replied to the
>>>>>>>> digest issue. Please find my original email below.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Di
>>>>>>>>
>>>>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>>>>> forgot to put an extension on the file. Please find a new one
>>>>>>>>> attached to this email.
>>>>>>>>>
>>>>>>>>> I'm sorry for not providing enough debugging information, but
>>>>>>>>> 'ompi_info' and '--debug-devel' are the only ways I know of
>>>>>>>>> collecting information. Are there any other things I can try to
>>>>>>>>> provide more info?
>>>>>>>>>
>>>>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>>>>> output is the logging information in my last email. It got stuck at
>>>>>>>>> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program
>>>>>>>>> is printed out to the screen. So I think it is mpirun failing to
>>>>>>>>> start my executable, not failing to terminate.
>>>>>>>>>
>>>>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>>>>> version, since it works well in the old case.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> --
>>>>>>>>> Di Wu (Allan)
>>>>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>>> Department of Computer Science, UC Los Angeles
>>>>>>>>> Email: al...@cs.ucla.edu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>>>>>         execution on an embedded ARM Linux kernel version 3.15.0
>>>>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>>>>
>>>>>>>>> I don't know what you put in that log file, but it was an
>>>>>>>>> executable and I'm not feeling that trusting :-)
>>>>>>>>>
>>>>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>>>>> anything. From what little I can see, I'm guessing that the
>>>>>>>>> application ran fine and you got the usual "hello" output and the
>>>>>>>>> helloworld process exited safely - is that correct? And so it is
>>>>>>>>> solely mpirun that is failing to cleanly terminate?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>>>> >
>>>>>>>>> > Hello everyone,
>>>>>>>>> >
>>>>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux.
>>>>>>>>> > Everything works fine for my system based on Linux 3.8.0. I have
>>>>>>>>> > previously submitted a post related to my compilation, which can
>>>>>>>>> > be found here:
>>>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
>>>>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun began
>>>>>>>>> > to get stuck even on the helloworld program. The program consists
>>>>>>>>> > only of simple API calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank,
>>>>>>>>> > MPI_Finalize. The problem occurs even with
>>>>>>>>> > 'mpirun -np 1 ./helloworld', and below is the output with
>>>>>>>>> > --debug-devel (before it got stuck):
>>>>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>>>>> >
>>>>>>>>> > I suspect maybe it is due to an incompatible kernel version or some
>>>>>>>>> > missing kernel modules. I also tried the latest version, 1.8.3,
>>>>>>>>> > and had the same problem. Does anyone have any thoughts? I have
>>>>>>>>> > attached the output of 'ompi_info --all' to this email.
>>>>>>>>> >
>>>>>>>>> > Please let me know if I need to provide more information. Thanks
>>>>>>>>> > in advance!
>>>>>>>>> >
>>>>>>>>> > Regards,
>>>>>>>>> > --
>>>>>>>>> > Di Wu (Allan)
>>>>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>>>>> > Email: al...@cs.ucla.edu
>>>>>>>>> > <log.tar.gz>
>>>>>>>>> > _______________________________________________
>>>>>>>>> > devel mailing list
>>>>>>>>> > de...@open-mpi.org
>>>>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> > Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> Link to this post:
>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/11/16348.php
>>
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>  _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16349.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16350.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
