Ralph,

I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello". I find that there is an attempt (by a secondary thread) to establish a TCP socket from the rank process to the eth0 address of localhost (I am guessing to reach the orted/mpirun). However, when the "lo" interface is down, the Linux kernel apparently cannot establish that socket.
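For anyone wanting to check for this condition before launching, here is a minimal sketch (not Open MPI code) that reads the Linux sysfs flags file for an interface and reports whether it is a loopback device that is up. The function names are hypothetical and chosen for illustration; the sysfs path and the IFF_UP / IFF_LOOPBACK bit values are the Linux ones. It says nothing about how oob:tcp itself selects interfaces.

```python
from pathlib import Path

IFF_UP = 0x1        # interface is administratively up (linux/if.h)
IFF_LOOPBACK = 0x8  # interface is a loopback device (linux/if.h)


def parse_flags(text):
    """Parse the hex string stored in /sys/class/net/<ifname>/flags."""
    return int(text.strip(), 16)


def is_loopback_up(flags):
    """True only if the flags describe a loopback interface that is up."""
    return bool(flags & IFF_LOOPBACK) and bool(flags & IFF_UP)


def loopback_warning(ifname="lo", sysfs_root="/sys/class/net"):
    """Return a warning string if ifname is missing or down, else None.

    Hypothetical helper: the defaults assume a Linux host with sysfs mounted.
    """
    flags_file = Path(sysfs_root) / ifname / "flags"
    try:
        flags = parse_flags(flags_file.read_text())
    except OSError:
        return f"interface {ifname!r} not found under {sysfs_root}"
    if not is_loopback_up(flags):
        return f"interface {ifname!r} is down; local TCP connections may time out"
    return None


if __name__ == "__main__":
    print(loopback_warning() or "loopback interface is up")
```

With "lo" down, the flags file should contain only the loopback bit (0x8), so the sketch would return the "is down" warning immediately rather than after a connect timeout.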
In fact, if I am sufficiently patient, it turns out the "hang" is bounded, and eventually one sees:

    phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out
    ------------------------------------------------------------
    A process or daemon was unable to complete a TCP connection
    to another process:
      Local host:    blcr-armv7
      Remote host:   10.0.2.15
    This is usually caused by a firewall on the remote host. Please
    check that any firewall (e.g., iptables) has been disabled and
    try again.
    ------------------------------------------------------------

    real    2m8.151s
    user    0m5.360s
    sys     0m57.430s

Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.

There is no firewall, but in case you doubt me on that, here is a demonstration using ping to show that 10.0.2.15 is only reachable when the loopback interface is enabled:

    phargrov@blcr-armv7:~$ sudo ifconfig lo up
    phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
    PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

    --- 10.0.2.15 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1002ms
    rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms
    phargrov@blcr-armv7:~$ sudo ifconfig lo down
    phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
    PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

    --- 10.0.2.15 ping statistics ---
    2 packets transmitted, 0 received, 100% packet loss, time 1006ms

So, there is no "hang" -- just a 2 minute pause before the error message is generated.
However, it may still be possible to present a better/earlier error message when there is no loopback interface (and at least one rank process is to be launched locally).

-Paul

On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain <r...@open-mpi.org> wrote:

> I'll have to look - there isn't supposed to be such a requirement, and I
> certainly haven't seen it before.
>
>
> On Nov 25, 2014, at 3:26 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Allan,
>
> I am glad things are working for you now.
> I can confirm (on a QEMU-emulated Versatile Express A9 board running
> Ubuntu 14.04) that disabling the "lo" interface reproduces the problem.
> I imagine this is true on other architectures, though I did not attempt to
> verify.
>
> Ralph,
>
> If oob:tcp really does need the loopback interface, shouldn't its lack be
> something that could/should be detected and reported instead of hanging as
> Allan saw?
>
> FWIW, neither of the following resolved the problem:
>     -mca oob_tcp_if_exclude lo
>     -mca oob_tcp_if_include eth0
>
>
> -Paul
>
> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>
>> I think I have found the problem. After inspecting the output with
>>
>> "-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 100"
>>
>> on both the old system and the new system, I noticed there is one line
>> that is different: on the old system, where it works correctly, there is
>> a line that says:
>> "oob:tcp:init rejecting loopback interface lo",
>> while on the new system there is no such line. Both systems proceed to open
>> interface eth0 afterwards. Then I checked the new system, and found out
>> that somehow the loopback interface is not up by default. After I brought
>> up the lo interface, mpirun executes normally.
>>
>> Does this mean that OpenMPI will use lo for some initial setup? Since the
>> actual socket was created on eth0 I did not think of checking the lo
>> interface. Anyway, thanks everyone for all of your kind help. Let me know
>> if you want me to provide any more information for future reference.
>>
>> Regards,
>> Allan
>>
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>
>>> Thanks Ralph!
>>>
>>> I did not compile my openmpi with --enable-debug, and I am compiling it
>>> now.
>>> But your suggested command already provided
>>> some output, which I attached with this email.
>>>
>>> It seems the process was stuck on the line:
>>> "[fpga2:00962] [[44848,1],0] waiting for connect completion to
>>> [[44848,0],0] - activating send event"
>>>
>>> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said
>>> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
>>> [[44848,0],0] via interface eth0'.
>>>
>>>
>>> Regards,
>>> Di
>>>
>>> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> This is all running on a single node, correct? If so, did you configure
>>>> OMPI with --enable-debug?
>>>>
>>>> If you can do that, or already have, then let's add the following to
>>>> the mpirun cmd line:
>>>>
>>>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10
>>>>
>>>> You'll get a bunch of output, but hopefully it will tell us where
>>>> mpirun is encountering a problem.
>>>> Ralph
>>>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>> wrote:
>>>>
>>>>> Allan,
>>>>>
>>>>> If you send me the .config from your build of the kernel I can compare
>>>>> it against, for instance, my .config for a Raspberry Pi.
>>>>> There will certainly be many differences, but I am hoping my own
>>>>> experience configuring linux kernels will help me filter the "noise" from
>>>>> any differences that might be significant.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>
>>>>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>>>>> linux, and I do not have the configuration file for the old kernel since
>>>>>> it is provided as is. However, I have the new kernel configuration since I
>>>>>> compiled it myself. Would it be helpful if I provided you the .config file
>>>>>> from when I compiled the kernel?
>>>>>> It may be quite painful to look through that file
>>>>>> though. Is there any other way that I can obtain the configuration?
>>>>>>
>>>>>> I checked my config for the new kernel, and UNIX-domain sockets and
>>>>>> Sys V IPC are both enabled in the build. Are there any other
>>>>>> possibilities I can check?
>>>>>>
>>>>>> Thanks,
>>>>>> Di
>>>>>>
>>>>>> --
>>>>>> Di Wu (Allan)
>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>> Department of Computer Science, UC Los Angeles
>>>>>> Email: al...@cs.ucla.edu
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> Allan,
>>>>>>>
>>>>>>> A likely possibility is that some important kernel feature (that
>>>>>>> Open MPI assumes is present) is missing.
>>>>>>> That includes not only "kernel modules" as you mention, but also
>>>>>>> features configured in (or out) of the base kernel.
>>>>>>> For instance, some embedded kernels omit UNIX-domain sockets and
>>>>>>> SysV IPC support.
>>>>>>>
>>>>>>> If you can send me (preferably off-list) the kernel config files for
>>>>>>> the old and new kernels I may be able to spot something.
>>>>>>> If present, you are looking for /boot/config-[VERSION]
>>>>>>>
>>>>>>> -Paul
>>>>>>>
>>>>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm sorry I forgot to change the subject when I replied to the digest
>>>>>>>> issue. Please find my original email below.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Di
>>>>>>>>
>>>>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>>>>> forgot to put an extension on the file. Please find a new one
>>>>>>>>> attached with this email.
>>>>>>>>>
>>>>>>>>> I'm sorry for not providing enough debugging information, but
>>>>>>>>> 'ompi_info' and '--debug-devel' are the only ways I know of
>>>>>>>>> collecting information. Are there any other things I can try to
>>>>>>>>> provide more info?
>>>>>>>>>
>>>>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>>>>> output is the logging information in my last email. It got stuck at
>>>>>>>>> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program
>>>>>>>>> is printed out to the screen. So I think it is mpirun failing to
>>>>>>>>> start my executable, not failing to terminate.
>>>>>>>>>
>>>>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>>>>> version, since it works well in the old case.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> --
>>>>>>>>> Di Wu (Allan)
>>>>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>>> Department of Computer Science, UC Los Angeles
>>>>>>>>> Email: al...@cs.ucla.edu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>>>>> execution on an embedded ARM Linux kernel version 3.15.0
>>>>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>>>>
>>>>>>>>> I don't know what you put in that log file, but it was an
>>>>>>>>> executable and I'm not feeling that trusting :-)
>>>>>>>>>
>>>>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>>>>> anything. From what little I can see, I'm guessing that the
>>>>>>>>> application ran fine and you got the usual "hello" output and the
>>>>>>>>> helloworld process exited safely - is that correct? And so it is
>>>>>>>>> solely mpirun that is failing to cleanly terminate?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>>>> >
>>>>>>>>> > Hello everyone,
>>>>>>>>> >
>>>>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux.
>>>>>>>>> > Everything works fine for my system based on Linux 3.8.0. I have
>>>>>>>>> > previously submitted a post related to my compilation, which can be
>>>>>>>>> > found here:
>>>>>>>>> > <http://www.open-mpi.org/community/lists/devel/2014/04/14440.php>.
>>>>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun began to
>>>>>>>>> > get stuck on even the helloworld program. The program consists of
>>>>>>>>> > only simple API calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank,
>>>>>>>>> > MPI_Finalize. The problem occurs even with
>>>>>>>>> > 'mpirun -np 1 ./helloworld', and below is the output with
>>>>>>>>> > --debug-devel (before it got stuck):
>>>>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>>>>> >
>>>>>>>>> > I suspect it may be due to an incompatible kernel version or some
>>>>>>>>> > missing kernel modules. I also tried with the latest version 1.8.3,
>>>>>>>>> > and had the same problem. Does anyone have any thoughts? I have
>>>>>>>>> > attached the output of 'ompi_info --all' with this email.
>>>>>>>>> >
>>>>>>>>> > Please let me know if I need to provide more information.
>>>>>>>>> > Thanks in advance!
>>>>>>>>> >
>>>>>>>>> > Regards,
>>>>>>>>> > --
>>>>>>>>> > Di Wu (Allan)
>>>>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>>>>> > Email: al...@cs.ucla.edu <mailto:al...@cs.ucla.edu>
>>>>>>>>> > <log.tar.gz>
>>>>>>>>> > _______________________________________________
>>>>>>>>> > devel mailing list
>>>>>>>>> > de...@open-mpi.org
>>>>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> > Link to this post:
>>>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>>
>>>>>>> --
>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900