Thanks Ralph! I did not compile my Open MPI with --enable-debug, and I am compiling it now. But your suggested command already provided some output, which I have attached to this email.
It seems the process was stuck on the line:
"[fpga2:00962] [[44848,1],0] waiting for connect completion to [[44848,0],0] - activating send event"
Then it got stuck and I CTRL+C'ed it. Previous to that line, it said something about 'orte_tcp_peer_try_connect: attempting to connect to proc [[44848,0],0] via interface eth0'.

Regards,
Di

On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> This is all running on a single node, correct? If so, did you configure
> OMPI with --enable-debug?
>
> If you can do that, or already have, then let’s add the following to the
> mpirun cmd line:
>
> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10
>
> You’ll get a bunch of output, but hopefully it will tell us where mpirun
> is encountering a problem.
>
> Ralph
>
> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
> wrote:
>
>> Allan,
>>
>> If you send me the .config from your build of the kernel I can compare it
>> against, for instance, my .config for a Raspberry Pi.
>> There will certainly be many differences, but I am hoping my own
>> experience configuring linux kernels will help me filter the "noise" from
>> any differences that might be significant.
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>
>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>> linux, and I do not have the configuration file for the old kernel since it
>>> is provided as is. However, I have the new kernel configuration since I
>>> compiled it myself. Would it be helpful if I provide you the .config file
>>> when I compile the kernel? It maybe quite painful to look through that file
>>> though. Is there any other way that I can obtain the configuration?
>>>
>>> I checked my config for the new kernel, and UNIX-domain sockets and Sys
>>> V IPC are both enabled in the build. Are there any other possibilities I
>>> can check?
>>>
>>> Thanks,
>>> Di
>>>
>>> --
>>> Di Wu (Allan)
>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>> Department of Computer Science, UC Los Angeles
>>> Email: al...@cs.ucla.edu
>>>
>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>> wrote:
>>>
>>>> Allan,
>>>>
>>>> A likely possibility is that some important kernel feature (that Open
>>>> MPI assumes is present) is missing.
>>>> That includes not only "kernel modules" as you mention, but also
>>>> features configure in (or out) of the base kernel.
>>>> For instance, some embedded kernels omit UNIX-domain sockets and SysV
>>>> IPC support.
>>>>
>>>> If you can send me (preferably off-list) the kernel config files for
>>>> the old an new kernels I may be able to spot something.
>>>> If present, you are looking for /boot/config-[VERSION]
>>>>
>>>> -Paul
>>>>
>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>
>>>>> I'm sorry I forgot to change the subject when I reply to the digest
>>>>> issue. Please find my original email below.
>>>>>
>>>>> Regards,
>>>>> Di
>>>>>
>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>
>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>> forgot to put an extension to the file. Please find a new one attached
>>>>>> with this email.
>>>>>>
>>>>>> I'm sorry for not enough debugging information, but 'omp_info' and
>>>>>> '--debug-devel' are the only ways I know for collecting information, are
>>>>>> there any other things I can try to provide more info?
>>>>>>
>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>> output is the logging information in my last email. It got stuck at
>>>>>> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program is
>>>>>> printed out to the screen. So I think it is mpirun failing to start my
>>>>>> executable, not failing to terminate.
>>>>>>
>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>> version, since it works well in the old case.
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> Di Wu (Allan)
>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>> Department of Computer Science, UC Los Angeles
>>>>>> Email: al...@cs.ucla.edu
>>>>>>
>>>>>>
>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>> execution on an embedded ARM Linux kernel version 3.15.0
>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>
>>>>>> I don't know what you put in that log file, but it was an executable
>>>>>> and I'm not feeling that trusting :-)
>>>>>>
>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>> anything. From what little I can see, I'm guessing that the application
>>>>>> ran fine and you got the usual "hello" output and the helloworld process
>>>>>> exited safely - is that correct? And so it is solely mpirun that is
>>>>>> failing to cleanly terminate?
>>>>>>
>>>>>>
>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>> >
>>>>>> > Hello everyone,
>>>>>> >
>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux. Everything
>>>>>> > works fine for my system based on Linux 3.8.0. I have previously
>>>>>> > submitted a post related to my compilation, which can be found here:
>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php.
>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun begins to
>>>>>> > stuck at even the helloworld program. The program consists only simple
>>>>>> > APIs: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize.
>>>>>> > The problem occurs even at 'mpirun -np 1 ./helloworld', and below are
>>>>>> > the output with --debug-devel (before it got stuck):
>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>> >
>>>>>> > I suspect maybe it is due to incompatible kernel version or some
>>>>>> > missing kernel modules. I tried also with the latest version 1.8.3, and
>>>>>> > had the same problem. Does anyone have any thoughts? I have attached
>>>>>> > the output of 'ompi-info --all' with this email.
>>>>>> >
>>>>>> > Please let me know if I need to provide more information. Thanks in
>>>>>> > advance!
>>>>>> >
>>>>>> > Regards,
>>>>>> > --
>>>>>> > Di Wu (Allan)
>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>> > Email: al...@cs.ucla.edu <mailto:al...@cs.ucla.edu>
>>>>>> > <log.tar.gz>
>>>>>> > _______________________________________________
>>>>>> > devel mailing list
>>>>>> > de...@open-mpi.org
>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> > Link to this post:
>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>>>>>
>>>>
>>>> --
>>>> Paul H. Hargrove phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>>
>>>
>>
>> --
>> Paul H. Hargrove phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>

output_verbose.tar.gz
Description: GNU Zip compressed data