I’ll have to look - there isn’t supposed to be such a requirement, and I certainly haven’t seen it before.
> On Nov 25, 2014, at 3:26 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Allan,
>
> I am glad things are working for you now.
> I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu
> 14.04) that disabling the "lo" interface reproduces the problem.
> I imagine this is true on other architectures, though I did not attempt to
> verify.
>
> Ralph,
>
> If oob:tcp really does need the loopback interface, shouldn't its lack be
> something that could/should be detected and reported instead of hanging as
> Allan saw?
>
> FWIW, neither of the following resolved the problem:
>    -mca oob_tcp_if_exclude lo
>    -mca oob_tcp_if_include eth0
>
> -Paul
>
> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:
> I think I have found the problem. After inspecting the output with "-mca
> state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 100" on
> both the old system and the new system, I noticed that one line differs: on
> the old system, where it works correctly, there is a line that says
> "oob:tcp:init rejecting loopback interface lo", while on the new system
> there is no such line. Both systems proceed to open interface eth0 afterwards.
> Then I checked the new system and found that the loopback interface is
> somehow not up by default. After I brought up the lo interface, mpirun
> executes normally.
>
> Does this mean that OpenMPI uses lo for some initial setup? Since the
> actual socket was created on eth0, I did not think of checking the lo
> interface. Anyway, thanks everyone for all of your kind help. Let me know if
> you want me to provide any more information for future reference.
>
> Regards,
> Allan
>
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu
>
> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
> Thanks Ralph!
>
> I did not compile my openmpi with --enable-debug, and I am compiling it now.
> But your suggested command already provided some output, which I attached
> with this email.
>
> The process appears to hang at the line:
> "[fpga2:00962] [[44848,1],0] waiting for connect completion to [[44848,0],0]
> - activating send event"
>
> At that point I CTRL+C'ed it. Before that line, it said
> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
> [[44848,0],0] via interface eth0'.
>
> Regards,
> Di
>
> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
> This is all running on a single node, correct? If so, did you configure OMPI
> with --enable-debug?
> If you can do that, or already have, then let’s add the following to the
> mpirun cmd line:
>
> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10
>
> You’ll get a bunch of output, but hopefully it will tell us where mpirun is
> encountering a problem.
> Ralph
>
> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Allan,
>
> If you send me the .config from your build of the kernel I can compare it
> against, for instance, my .config for a Raspberry Pi.
> There will certainly be many differences, but I am hoping my own experience
> configuring linux kernels will help me filter the "noise" from any
> differences that might be significant.
>
> -Paul
>
> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
> Thanks Paul!
> Unfortunately '/boot' is not available in my embedded linux, and
> I do not have the configuration file for the old kernel since it is provided
> as is. However, I have the new kernel configuration since I compiled it
> myself. Would it be helpful if I provide you the .config file from my kernel
> build? It may be quite painful to look through that file though. Is there
> any other way that I can obtain the configuration?
>
> I checked my config for the new kernel, and UNIX-domain sockets and Sys V IPC
> are both enabled in the build. Are there any other possibilities I can check?
>
> Thanks,
> Di
>
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu
>
> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Allan,
>
> A likely possibility is that some important kernel feature (that Open MPI
> assumes is present) is missing.
> That includes not only "kernel modules" as you mention, but also features
> configured in (or out of) the base kernel.
> For instance, some embedded kernels omit UNIX-domain sockets and SysV IPC
> support.
>
> If you can send me (preferably off-list) the kernel config files for the old
> and new kernels I may be able to spot something.
> If present, you are looking for /boot/config-[VERSION]
>
> -Paul
>
> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
> I'm sorry, I forgot to change the subject when I replied to the digest.
> Please find my original email below.
>
> Regards,
> Di
>
> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu> wrote:
> Thanks Ralph for the reply. Sorry about the log file, I think I forgot to put
> an extension on the file. Please find a new one attached with this email.
>
> I'm sorry for not providing enough debugging information, but 'ompi_info' and
> '--debug-devel' are the only ways I know of to collect information. Is there
> anything else I can try to provide more info?
>
> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the output is
> the logging information in my last email. It got stuck at "[fpga1:00718]
> tmp: /tmp", and nothing from my helloworld program is printed out to the
> screen. So I think it is mpirun failing to start my executable, not failing
> to terminate.
>
> I was wondering if this has anything to do with my newer kernel version,
> since it works well in the old case.
>
> Thanks,
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu
>
>
> Date: Tue, 25 Nov 2014 07:29:51 -0800
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>         execution on an embedded ARM Linux kernel version 3.15.0
> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
> Content-Type: text/plain; charset="utf-8"
>
> I don’t know what you put in that log file, but it was an executable and I’m
> not feeling that trusting :-)
>
> I’m afraid there isn’t enough debug output there to really tell anything.
> From what little I can see, I’m guessing that the application ran fine and
> you got the usual “hello” output and the helloworld process exited safely -
> is that correct? And so it is solely mpirun that is failing to cleanly
> terminate?
>
>
> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
> >
> > Hello everyone,
> >
> > I have cross-compiled OpenMPI for an embedded ARM Linux.
> > Everything works fine for my system based on Linux 3.8.0. I have
> > previously submitted a post related to my compilation, which can be found
> > here:
> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
> > When I recently upgraded my Linux kernel to 3.15.0, mpirun began to hang
> > on even the helloworld program. The program uses only simple APIs:
> > MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize. The problem occurs
> > even with 'mpirun -np 1 ./helloworld', and below is the output with
> > --debug-devel (before it got stuck):
> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
> > [fpga1:00716] tmp: /tmp
> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
> > [fpga1:00718] tmp: /tmp
> >
> > I suspect it may be due to an incompatible kernel version or some missing
> > kernel modules. I also tried the latest version, 1.8.3, and had the
> > same problem. Does anyone have any thoughts? I have attached the output of
> > 'ompi_info --all' with this email.
> >
> > Please let me know if I need to provide more information. Thanks in advance!
> >
> > Regards,
> > --
> > Di Wu (Allan)
> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
> > Department of Computer Science, UC Los Angeles
> > Email: al...@cs.ucla.edu
> > <log.tar.gz>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
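
For anyone who hits the same hang: the thread traced it to the loopback interface being down. Below is a minimal sketch of a pre-flight check that could be run before mpirun. It assumes a Linux system with sysfs mounted at /sys. The function names are hypothetical and are not part of Open MPI. It only reads the interface flags that the kernel exposes in /sys/class/net/<name>/flags and tests the IFF_UP bit.

```python
from pathlib import Path

# IFF_UP as defined in <linux/if.h>; set when the interface is administratively up.
IFF_UP = 0x1


def flags_indicate_up(flags_text: str) -> bool:
    """Parse the hex string found in a sysfs 'flags' file and test IFF_UP."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def interface_is_up(name: str = "lo", sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the named interface exists and has IFF_UP set.

    A missing interface (or unreadable sysfs) is reported as not up.
    """
    flags_file = Path(sysfs_root) / name / "flags"
    try:
        return flags_indicate_up(flags_file.read_text())
    except OSError:
        return False


if __name__ == "__main__":
    if interface_is_up("lo"):
        print("lo is up")
    else:
        # Assumption: iproute2 is installed; the thread does not say how lo was enabled.
        print("lo is down or missing; try: ip link set lo up")
```

This does not explain why oob:tcp depends on lo; it only detects the condition that Allan and Paul reported as reproducing the hang.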