Thanks Ralph!

I did not compile my Open MPI build with --enable-debug, and I am
recompiling it now. But your suggested command already provided some
output, which I have attached to this email.

The process appears to hang at this line:
"[fpga2:00962] [[44848,1],0] waiting for connect completion to
[[44848,0],0] - activating send event"

Nothing further was printed, so I pressed Ctrl+C. Just before that line,
the output reported
'orte_tcp_peer_try_connect: attempting to connect to proc
[[44848,0],0] via interface eth0'.
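
In case it helps when reading the attached output, here is a minimal Python sketch (not part of Open MPI) that reports the last message each process printed before the hang. The function name and the "[host:pid]" prefix pattern are my own assumptions, inferred from the lines quoted above; it also assumes each log message sits on a single line in the file.

```python
import re

# Assumed line prefix, inferred from lines such as "[fpga2:00962] ...":
# a host name, a colon, and a numeric PID inside square brackets.
PREFIX = re.compile(r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s*(?P<msg>.*)$")


def last_message_per_process(log_text):
    """Map (host, pid) to the last message that process wrote to the log.

    Lines without the "[host:pid]" prefix are ignored.
    """
    last = {}
    for line in log_text.splitlines():
        match = PREFIX.match(line.strip())
        if match:
            key = (match.group("host"), int(match.group("pid")))
            last[key] = match.group("msg")
    return last
```

Passing the contents of the captured mpirun output to last_message_per_process() gives one entry per process, which makes it easy to see which process stopped at the "waiting for connect completion" message.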


Regards,
Di

On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:

> This is all running on a single node, correct? If so, did you configure
> OMPI with --enable-debug?
>
> If you can do that, or already have, then let’s add the following to the
> mpirun cmd line:
>
> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10
>
> You’ll get a bunch of output, but hopefully it will tell us where mpirun
> is encountering a problem.
> Ralph
> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
> wrote:
>
>> Allan,
>>
>> If you send me the .config from your build of the kernel I can compare it
>> against, for instance, my .config for a Raspberry Pi.
>> There will certainly be many differences, but I am hoping my own
>> experience configuring linux kernels will help me filter the "noise" from
>> any differences that might be significant.
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>
>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>> Linux, and I do not have the configuration file for the old kernel since
>>> it was provided as is. However, I have the new kernel configuration since
>>> I compiled it myself. Would it be helpful if I sent you the .config file
>>> from my kernel build? It may be quite painful to look through that file
>>> though. Is there any other way that I can obtain the configuration?
>>>
>>> I checked my config for the new kernel, and UNIX-domain sockets and Sys
>>> V IPC are both enabled in the build. Are there any other possibilities I
>>> can check?
>>>
>>> Thanks,
>>> Di
>>>
>>> --
>>> Di Wu (Allan)
>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>> Department of Computer Science, UC Los Angeles
>>> Email: al...@cs.ucla.edu
>>>
>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>> wrote:
>>>
>>>> Allan,
>>>>
>>>> A likely possibility is that some important kernel feature (that Open
>>>> MPI assumes is present) is missing.
>>>> That includes not only "kernel modules" as you mention, but also
>>>> features configured into (or out of) the base kernel.
>>>> For instance, some embedded kernels omit UNIX-domain sockets and SysV
>>>> IPC support.
>>>>
>>>> If you can send me (preferably off-list) the kernel config files for
>>>> the old and new kernels I may be able to spot something.
>>>> If present, you are looking for /boot/config-[VERSION]
>>>>
>>>> -Paul
>>>>
>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>
>>>>> I'm sorry I forgot to change the subject when I replied to the digest
>>>>> issue. Please find my original email below.
>>>>>
>>>>> Regards,
>>>>> Di
>>>>>
>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>
>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>> forgot to add an extension to the file. Please find a new one attached
>>>>>> to this email.
>>>>>>
>>>>>> I'm sorry for not providing enough debugging information, but 'ompi_info'
>>>>>> and '--debug-devel' are the only ways I know of to collect information.
>>>>>> Is there anything else I can try to provide more info?
>>>>>>
>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>> output is the logging information in my last email. It got stuck at
>>>>>>  "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program is
>>>>>> printed out to the screen. So I think it is mpirun failing to start my
>>>>>> executable, not failing to terminate.
>>>>>>
>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>> version, since it works well in the old case.
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> Di Wu (Allan)
>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>> Department of Computer Science, UC Los Angeles
>>>>>> Email: al...@cs.ucla.edu
>>>>>>
>>>>>>
>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>> From:
>>>>>> Ralph Castain <r...@open-mpi.org>
>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>>         execution       on an embedded ARM Linux kernel version 3.15.0
>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>
>>>>>> I don't know what you put in that log file, but it was an executable
>>>>>> and I'm not feeling that trusting :-)
>>>>>>
>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>> anything. From what little I can see, I'm guessing that the application
>>>>>> ran fine and you got the usual "hello" output and the helloworld process
>>>>>> exited safely - is that correct? And so it is solely mpirun that is
>>>>>> failing to cleanly terminate?
>>>>>>
>>>>>>
>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>> >
>>>>>> > Hello everyone,
>>>>>> >
>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux. Everything
>>>>>> > works fine on my system based on Linux 3.8.0. I previously submitted
>>>>>> > a post related to my compilation, which can be found here:
>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun began to
>>>>>> > hang even on the helloworld program. The program uses only simple
>>>>>> > API calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize. The
>>>>>> > problem occurs even with 'mpirun -np 1 ./helloworld', and below is the
>>>>>> > output with --debug-devel (before it got stuck):
>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>> >
>>>>>> > I suspect it may be due to an incompatible kernel version or some
>>>>>> > missing kernel modules. I also tried the latest version, 1.8.3, and
>>>>>> > had the same problem. Does anyone have any thoughts? I have attached
>>>>>> > the output of 'ompi_info --all' to this email.
>>>>>> >
>>>>>> > Please let me know if I need to provide more information. Thanks in
>>>>>> > advance!
>>>>>> >
>>>>>> > Regards,
>>>>>> > --
>>>>>> > Di Wu (Allan)
>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>> > Email: al...@cs.ucla.edu <mailto:al...@cs.ucla.edu>
>>>>>> > <log.tar.gz>_______________________________________________
>>>>>> > devel mailing list
>>>>>> > de...@open-mpi.org
>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> > Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>>>
>>>
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
>

Attachment: output_verbose.tar.gz
Description: GNU Zip compressed data
