[OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Allan Wu
Hello everyone,

I have cross-compiled OpenMPI for an embedded ARM Linux. Everything works
fine on my system based on Linux 3.8.0. I previously submitted a post
about my compilation, which can be found here:
http://www.open-mpi.org/community/lists/devel/2014/04/14440.php. When I
recently upgraded my Linux kernel to 3.15.0, mpirun began to hang even on
the helloworld program. The program uses only simple API calls: MPI_Init,
MPI_Comm_size, MPI_Comm_rank, MPI_Finalize. The problem occurs even with
'mpirun -np 1 ./helloworld', and below is the output with --debug-devel
(before it got stuck):
[fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
[fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
[fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
[fpga1:00716] top: openmpi-sessions-root@fpga1_0
[fpga1:00716] tmp: /tmp
[fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
[fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
[fpga1:00718] top: openmpi-sessions-root@fpga1_0
[fpga1:00718] tmp: /tmp

I suspect it may be due to an incompatible kernel version or some missing
kernel modules. I also tried the latest version, 1.8.3, and had the
same problem. Does anyone have any thoughts? I have attached the output of
'ompi_info --all' to this email.

Please let me know if I need to provide more information. Thanks in advance!

Regards,
--
Di Wu (Allan)
PhD student, VAST Laboratory,
Department of Computer Science, UC Los Angeles
Email: al...@cs.ucla.edu


log.tar.gz
Description: GNU Zip compressed data


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain
I don’t know what you put in that log file, but it was an executable and I’m 
not feeling that trusting :-)

I’m afraid there isn’t enough debug output there to really tell anything. From 
what little I can see, I’m guessing that the application ran fine and you got 
the usual “hello” output and the helloworld process exited safely - is that 
correct? And so it is solely mpirun that is failing to cleanly terminate?





[OMPI devel] question to OMPI_DECLSPEC

2014-11-25 Thread Edgar Gabriel
Has something changed recently on the trunk/master regarding 
OMPI_DECLSPEC? The reason I ask is that we now get errors about 
unresolved symbols, e.g.


symbol lookup error: 
/home/gabriel/OpenMPI/lib64/openmpi/mca_fcoll_dynamic.so: undefined 
symbol: ompi_io_ompio_decode_datatype



and that problem was not there roughly two weeks ago, the last time I 
tested. I did verify that the function listed there has an 
OMPI_DECLSPEC before its definition.


Thanks
Edgar
--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524  Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-25 Thread Ralph Castain
Hmmm…no, nothing has changed with regard to declspec that I know about. I’ll 
ask the obvious things to check:

* does that component have the proper include to find this function? Could be 
that it used to be found thru some chain, but the chain is now broken and it 
needs to be directly included

* is that function in the base code, or down in a component? If the latter, 
then that’s a problem, but I’m assuming you didn’t make that mistake.





Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-25 Thread Edgar Gabriel

On 11/25/2014 10:18 AM, Ralph Castain wrote:
> Hmmm…no, nothing has changed with regard to declspec that I know
> about. I’ll ask the obvious things to check:
>
> * does that component have the proper include to find this function?
> Could be that it used to be found thru some chain, but the chain is
> now broken and it needs to be directly included

header is included, I double checked.

> * is that function in the base code, or down in a component? If the
> latter, then that’s a problem, but I’m assuming you didn’t make that
> mistake.

I am not sure what you mean. The function is in a component, but I am 
not aware that it is illegal to call a function of a component from 
another component.


Thanks
Edgar










--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524  Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Paul Hargrove
Ralph,

I downloaded the attachment and found it to be a gzipped tar file
containing a single text file "log".
I have attached the bzipped (not tarred) log file.

-Paul




-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


log.bz2
Description: BZip2 compressed data


Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-25 Thread Ralph Castain

> On Nov 25, 2014, at 8:24 AM, Edgar Gabriel  wrote:
> 
> On 11/25/2014 10:18 AM, Ralph Castain wrote:
>> Hmmm…no, nothing has changed with regard to declspec that I know
>> about. I’ll ask the obvious things to check:
>> 
>> * does that component have the proper include to find this function?
>> Could be that it used to be found thru some chain, but the chain is
>> now broken and it needs to be directly included
> 
> header is included, I double checked.
> 
>> * is that function in the base code, or down in a component? If the
>> latter, then that’s a problem, but I’m assuming you didn’t make that
>> mistake.
> 
> 
> I am not sure what you mean. The function is in a component, but I am not 
> aware that it is illegal to call a function of a component from another 
> component.


Of course that is illegal - you can only access a function via the framework 
interface, not directly. You have no way of knowing that the other component 
has been loaded. Doing it directly violates the abstraction rules.




Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain
Thanks - no idea why it was trying to execute on my machine, but I’ve learned 
to be far less trusting.

Looks like it was just a complete output of ompi_info, which doesn’t really 
help here anyway. Will need to hear the answers to my questions before 
suggesting a next step.





Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-25 Thread Edgar Gabriel

On 11/25/2014 11:31 AM, Ralph Castain wrote:





> Of course that is illegal - you can only access a function via the
> framework interface, not directly. You have no way of knowing that the
> other component has been loaded. Doing it directly violates the
> abstraction rules.


Well, OK. I know that the other component has been loaded because that 
component triggered the initialization of these sub-frameworks.


I can move that functionality to the base, however, none of the 20+ 
functions are required for the other components of the io framework 
(i.e. ROMIO). So I would basically add functionality required for one 
component only into the base.


Nevertheless, I think the original question is still valid. We did not 
see this problem before, but it is now showing up on all of our platforms, 
and I am still wondering why that is the case. I *know* that the ompio 
component is loaded, and I still get the error message about the missing 
symbol from the ompio component. I do not understand why that happens.



Thanks
Edgar









--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524  Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-25 Thread Ralph Castain

> On Nov 25, 2014, at 9:36 AM, Edgar Gabriel  wrote:
> 
> On 11/25/2014 11:31 AM, Ralph Castain wrote:
>> 
>> 
>> Of course that is illegal - you can only access a function via the
>> framework interface, not directly. You have no way of knowing that the
>> other component has been loaded. Doing it directly violates the
>> abstraction rules.
> 
> Well, OK. I know that the other component has been loaded because that 
> component triggered the initialization of these sub-frameworks.

I think we’ve seen that before, and run into problems with that approach (i.e., 
components calling framework opens).

> 
> I can move that functionality to the base, however, none of the 20+ functions 
> are required for the other components of the io framework (i.e. ROMIO). So I 
> would basically add functionality required for one component only into the 
> base.

Sounds like you’ve got an abstraction problem. If the fcoll component requires 
certain functions from another framework, then the framework should be exposing 
those APIs. If ROMIO doesn’t provide them, then it needs to return an error if 
someone attempts to call it.

You are welcome to bring this up on next week’s call if you like. IIRC, this 
has come up before when people have tried these hard links between components. 
Maybe someone else will have a better solution, but it just seems to me like 
you have to go thru the framework to avoid the problem.
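
To make the suggestion above concrete, here is a minimal sketch in C of routing a call through a framework-level interface rather than linking one component directly against another. Every name in it (demo_io_module_t, demo_io_decode_datatype, the DEMO_* codes, the two example modules) is hypothetical and chosen for illustration only; it is not Open MPI's actual MCA API, just one way the idea could look, assuming the framework keeps a table of function pointers for the selected component.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; these are NOT Open MPI's real constants. */
enum { DEMO_SUCCESS = 0, DEMO_ERR_NOT_SUPPORTED = -1 };

/* Hypothetical framework interface: a table of function pointers that the
 * selected component fills in. Optional entries may be left NULL. */
typedef struct {
    const char *name;
    int (*decode_datatype)(int datatype_id, size_t *out_size);
} demo_io_module_t;

/* A component that implements the optional entry point. */
static int full_decode_datatype(int datatype_id, size_t *out_size)
{
    *out_size = (size_t)datatype_id * 4;  /* made-up computation */
    return DEMO_SUCCESS;
}

/* One module that provides the function, and one that does not. */
const demo_io_module_t demo_full_module = { "full", full_decode_datatype };
const demo_io_module_t demo_minimal_module = { "minimal", NULL };

/* Framework-level entry point. Callers in other components use this wrapper
 * and never reference a symbol that lives inside another component. */
int demo_io_decode_datatype(const demo_io_module_t *selected,
                            int datatype_id, size_t *out_size)
{
    assert(selected != NULL && out_size != NULL);
    if (selected->decode_datatype == NULL) {
        /* The selected component does not provide this function. */
        return DEMO_ERR_NOT_SUPPORTED;
    }
    return selected->decode_datatype(datatype_id, out_size);
}
```

With this shape, a caller passes whichever module the framework selected to demo_io_decode_datatype and checks the return code. A module that leaves the entry NULL yields DEMO_ERR_NOT_SUPPORTED instead of an unresolved symbol at load time.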

> 
> Nevertheless, I think the original question is still valid. We did not see 
> this problem before, but it is now showing up on all of our platforms, and I am 
> still wondering why that is the case. I *know* that the ompio component is 
> loaded, and I still get the error message about the missing symbol from the 
> ompio component. I do not understand why that happens.

Probably because the fcoll component didn’t explicitly link against the ompio 
component. You were likely getting away with it out of pure luck.


Re: [OMPI devel] devel Digest, Vol 2854, Issue 1

2014-11-25 Thread Allan Wu
Thanks Ralph for the reply. Sorry about the log file; I think I forgot to
add an extension to the file. Please find a new one attached to this
email.

I'm sorry there is not enough debugging information, but 'ompi_info' and
'--debug-devel' are the only ways I know to collect information. Is
there anything else I can try to provide more info?

When I execute 'mpirun --debug-devel -np 1 ./helloworld', the only output is
the logging information from my last email. It gets stuck at
"[fpga1:00718] tmp: /tmp", and nothing from my helloworld program is
printed to the screen. So I think mpirun is failing to start my
executable, not failing to terminate.

I was wondering if this has anything to do with my newer kernel version,
since it works well with the old one.

Thanks,
--
Di Wu (Allan)
PhD student, VAST Laboratory,
Department of Computer Science, UC Los Angeles
Email: al...@cs.ucla.edu


log.tar.gz
Description: GNU Zip compressed data


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Allan Wu
I'm sorry, I forgot to change the subject when I replied to the digest
issue. Please find my original email below.

Regards,
Di

On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu  wrote:

> Thanks Ralph for the reply. Sorry about the log file; I think I forgot to
> add an extension to the file. Please find a new one attached to this
> email.
>
> I'm sorry there is not enough debugging information, but 'ompi_info' and
> '--debug-devel' are the only ways I know to collect information. Is
> there anything else I can try to provide more info?
>
> When I execute 'mpirun --debug-devel -np 1 ./helloworld', the only output
> is the logging information from my last email. It gets stuck at
> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program is
> printed to the screen. So I think mpirun is failing to start my
> executable, not failing to terminate.
>
> I was wondering if this has anything to do with my newer kernel version,
> since it works well with the old one.
>
> Thanks,
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory ,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Paul Hargrove
Allan,

A likely possibility is that some important kernel feature (that Open MPI
assumes is present) is missing.
That includes not only "kernel modules" as you mention, but also features
configured in (or out of) the base kernel.
For instance, some embedded kernels omit UNIX-domain sockets and SysV IPC
support.

If you can send me (preferably off-list) the kernel config files for the
old and new kernels I may be able to spot something.
If present, the file you are looking for is /boot/config-[VERSION]

-Paul
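
One way to check a kernel config for the two features named above is sketched below. CONFIG_UNIX and CONFIG_SYSVIPC are the standard kernel option names for UNIX-domain sockets and System V IPC; the helper name missing_kernel_opts and the example path are assumptions made for illustration only.

```shell
# Print each listed option that is neither built in (=y) nor a module (=m)
# in the given kernel config file; return non-zero if any are missing.
# missing_kernel_opts is a hypothetical helper name, not an existing tool.
missing_kernel_opts() {
    cfg="$1"; shift
    missing=0
    for opt in "$@"; do
        if ! grep -Eq "^${opt}=(y|m)\$" "$cfg"; then
            echo "$opt"
            missing=1
        fi
    done
    return "$missing"
}

# Example (path is an assumption; use whatever config file is available):
#   missing_kernel_opts /boot/config-"$(uname -r)" CONFIG_UNIX CONFIG_SYSVIPC
```

An empty result only says these two options are enabled; it does not rule out other differences between the kernels.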

On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu  wrote:

> I'm sorry I forgot to change the subject when I reply to the digest
> issue. Please find my original email below.
>
> Regards,
> Di
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>



-- 
Paul H. Hargrove  

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Allan Wu
Thanks Paul! Unfortunately '/boot' is not available on my embedded Linux,
and I do not have the configuration file for the old kernel since it was
provided as is. However, I have the new kernel configuration since I
compiled it myself. Would it be helpful if I sent you the .config file
I used to compile the kernel? It may be quite painful to look through that
file though. Is there any other way that I can obtain the configuration?

I checked my config for the new kernel, and UNIX-domain sockets and SysV
IPC are both enabled in the build. Are there any other possibilities I can
check?

Thanks,
Di

--
Di Wu (Allan)
PhD student, VAST Laboratory ,
Department of Computer Science, UC Los Angeles
Email: al...@cs.ucla.edu

On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove  wrote:

> Allan,
>
> A likely possibility is that some important kernel feature (that Open MPI
> assumes is present) is missing.
> That includes not only "kernel modules" as you mention, but also features
> configure in (or out) of the base kernel.
> For instance, some embedded kernels omit UNIX-domain sockets and SysV IPC
> support.
>
> If you can send me (preferably off-list) the kernel config files for the
> old an new kernels I may be able to spot something.
> If present, you are looking for /boot/config-[VERSION]
>
> -Paul

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain
This is all running on a single node, correct? If so, did you configure OMPI 
with --enable-debug?

If you can do that, or already have, then let’s add the following to the mpirun 
cmd line:

-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10

You’ll get a bunch of output, but hopefully it will tell us where mpirun is 
encountering a problem.
Ralph
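
For reference, the three verbosity settings above can be wrapped so that the (long) output is also saved to a file. This is only a sketch: the wrapper name run_verbose and the log file name are assumptions, and only the three -mca settings come from the message above.

```shell
# Run a program under mpirun with the three verbosity settings suggested above,
# saving stdout and stderr to a log file in the current directory.
# run_verbose and mpirun-verbose.log are illustrative names.
run_verbose() {
    mpirun -mca state_base_verbose 10 \
           -mca odls_base_verbose 10 \
           -mca oob_base_verbose 10 \
           "$@" 2>&1 | tee mpirun-verbose.log
}

# Usage on the target board:
#   run_verbose -np 1 ./helloworld
```

If mpirun hangs, interrupting it with CTRL+C still leaves the output captured so far in the log file.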


> On Nov 25, 2014, at 11:11 AM, Allan Wu  wrote:
> 
> Thanks Paul! Unfortunately '/boot' is not available in my embedded linux, and 
> I do not have the configuration file for the old kernel since it is provided 
> as is. However, I have the new kernel configuration since I compiled it 
> myself. Would it be helpful if I provide you the .config file when I compile 
> the kernel? It maybe quite painful to look through that file though. Is there 
> any other way that I can obtain the configuration? 
> 
> I checked my config for the new kernel, and UNIX-domain sockets and Sys V IPC 
> are both enabled in the build. Are there any other possibilities I can check?
> 
> Thanks,
> Di
> 
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory ,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu 

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Larry Baker
Allan,

If you can still boot the old embedded system, the config parameters are 
often saved as /proc/config.gz.  You can then at least compare the two 
configs.

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov
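
A minimal sketch of such a comparison, assuming each kernel's config has first been saved as plain text (for example with 'zcat /proc/config.gz > config-old' on the old system, which works only if the kernel was built with /proc/config.gz support). The helper name diff_kernel_configs is hypothetical.

```shell
# Show only the kernel options whose settings differ between two plain-text
# config files. Returns 0 if they match and 1 if they differ.
# diff_kernel_configs is a hypothetical helper name.
diff_kernel_configs() {
    old_sorted=$(mktemp) || return 2
    new_sorted=$(mktemp) || return 2
    # Keep "CONFIG_X=..." and "# CONFIG_X is not set" lines, drop other comments.
    grep -E '^(# )?CONFIG_' "$1" | sort > "$old_sorted"
    grep -E '^(# )?CONFIG_' "$2" | sort > "$new_sorted"
    diff "$old_sorted" "$new_sorted"
    status=$?
    rm -f "$old_sorted" "$new_sorted"
    return "$status"
}

# Usage:
#   diff_kernel_configs config-old config-new
```

Lines prefixed with "<" come from the first file and lines prefixed with ">" from the second, so an option that was dropped shows up as "=y" on one side and "is not set" on the other.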



On 25 Nov 2014, at 11:11 AM, Allan Wu wrote:

> Thanks Paul! Unfortunately '/boot' is not available in my embedded linux, and 
> I do not have the configuration file for the old kernel since it is provided 
> as is. However, I have the new kernel configuration since I compiled it 
> myself. Would it be helpful if I provide you the .config file when I compile 
> the kernel? It maybe quite painful to look through that file though. Is there 
> any other way that I can obtain the configuration? 
> 
> I checked my config for the new kernel, and UNIX-domain sockets and Sys V IPC 
> are both enabled in the build. Are there any other possibilities I can check?
> 
> Thanks,
> Di
> 
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Allan Wu
Thanks Ralph!

I did not compile my openmpi with --enable-debug, and I am compiling it
now. But your suggested command already provided some output, which I have
attached to this email.

It seems the process was stuck on the line:
"[fpga2:00962] [[44848,1],0] waiting for connect completion to
[[44848,0],0] - activating send event"

It hung there and I pressed CTRL+C. Just before that line, it said
something about 'orte_tcp_peer_try_connect: attempting to connect to proc
[[44848,0],0] via interface eth0'.


Regards,
Di

On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain  wrote:

> This is all running on a single node, correct? If so, did you configure
> OMPI with --enable-debug?
>
> If you can do that, or already have, then let’s add the following to the
> mpirun cmd line:
>
> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose
> 10
>
> You’ll get a bunch of output, but hopefully it will tell us where mpirun
> is encountering a problem.
> Ralph
> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove 
> wrote:
>
>> Allan,
>>
>> If you send me the .config from your build of the kernel I can compare it
>> against, for instance, my .config for a Raspberry Pi.
>> There will certainly be many differences, but I am hoping my own
>> experience configuring linux kernels will help me filter the "noise" from
>> any differences that might be significant.
>>
>> -Paul

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Paul Hargrove
Following Larry's suggestion to use /proc/config.gz, Allan sent me kernel
configs for the old (3.8) and new (3.15) kernels.
While there were more changes than I expected, none relates to removing an
API/feature that Open MPI is likely to be using.

-Paul

On Tue, Nov 25, 2014 at 11:28 AM, Larry Baker  wrote:

> Allan,
>
> If you can still boot the old embedded system, a lot of times the config
> parameters are saved as /proc/config.gz.  You can at least them compare the
> two configs.
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Allan Wu
I think I have found the problem. After inspecting the output with
"-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose
100" on both the old system and the new system, I noticed there is one line
that is different: on the old system, where it works correctly, there is a
line that says "oob:tcp:init rejecting loopback interface lo", while on the
new system there is no such line. Both systems proceed to open interface
eth0 afterwards. Then I checked the new system and found that the loopback
interface is somehow not up by default. After I brought up the lo
interface, mpirun executed normally.

Does this mean that Open MPI uses lo for some initial setup? Since the
actual socket was created on eth0, I did not think of checking the lo
interface. Anyway, thanks everyone for all of your kind help. Let me know
if you want me to provide any more information for future reference.
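
For anyone hitting the same symptom, the state of the loopback interface can be checked before launching mpirun. The following is a minimal sketch, not part of Open MPI: it assumes a Linux system that exposes /sys/class/net/<interface>/flags, and the function name interface_is_up and its sysfs_root parameter are made up for illustration.

```python
import os

IFF_UP = 0x1  # "interface is up" flag bit, as defined in <linux/if.h>


def interface_is_up(name, sysfs_root="/sys/class/net"):
    """Return True if the interface has IFF_UP set, False if it is clear,
    and None if the interface does not appear under sysfs_root."""
    flags_path = os.path.join(sysfs_root, name, "flags")
    try:
        with open(flags_path) as f:
            # The kernel writes the flags as a hex string such as "0x9".
            flags = int(f.read().strip(), 16)
    except FileNotFoundError:
        return None
    return bool(flags & IFF_UP)


if __name__ == "__main__":
    print("lo up:", interface_is_up("lo"))
```

If this reports False for "lo", bringing the interface up (for example with 'ifconfig lo up', as used later in this thread) is the first thing to try.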

Regards,
Allan

--
Di Wu (Allan)
PhD student, VAST Laboratory ,
Department of Computer Science, UC Los Angeles
Email: al...@cs.ucla.edu

On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu  wrote:

> Thanks Ralph!
>
> I did not compile my openmpi with --enable-debug, and I am compiling it
> now. But your suggested command already provided some output, which I
> attached with this email.
>
> It seems the process was stuck on the line:
> "[fpga2:00962] [[44848,1],0] waiting for connect completion to
> [[44848,0],0] - activating send event"
>
> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said
> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
> [[44848,0],0] via interface eth0'.
>
>
> Regards,
> Di
>
> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain  wrote:
>
>> ​
>> This is all running on a single node, correct? If so, did you configure
>> OMPI with --enable-debug?
>>
>> If you can do that, or already have, then let’s add the following to
>> the mpirun cmd line:
>>
>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
>> oob_base_verbose 10
>>
>> You’ll get a bunch of output, but hopefully it will tell us where
>> mpirun is encountering a problem.
>> Ralph
>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove 
>> wrote:
>>
>>> Allan,
>>>
>>> If you send me the .config from your build of the kernel I can compare
>>> it against, for instance, my .config for a Raspberry Pi.
>>> There will certainly be many differences, but I am hoping my own
>>> experience configuring linux kernels will help me filter the "noise" from
>>> any differences that might be significant.
>>>
>>> -Paul
>>>
>>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu  wrote:
>>>
>>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>>> Linux, and I do not have the configuration file for the old kernel since it
>>>> is provided as is. However, I have the new kernel configuration since I
>>>> compiled it myself. Would it be helpful if I provide you the .config file
>>>> from when I compiled the kernel? It may be quite painful to look through
>>>> that file, though. Is there any other way that I can obtain the
>>>> configuration?
>>>>
>>>> I checked my config for the new kernel, and UNIX-domain sockets and Sys
>>>> V IPC are both enabled in the build. Are there any other possibilities I
>>>> can check?
>>>>
>>>> Thanks,
>>>> Di
>>>>
>>>> --
>>>> Di Wu (Allan)
>>>> PhD student, VAST Laboratory,
>>>> Department of Computer Science, UC Los Angeles
>>>> Email: al...@cs.ucla.edu
>>>>
>>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove
>>>> wrote:
>>>>
>>>>> Allan,
>>>>>
>>>>> A likely possibility is that some important kernel feature (that Open
>>>>> MPI assumes is present) is missing.
>>>>> That includes not only "kernel modules" as you mention, but also
>>>>> features configured in (or out) of the base kernel.
>>>>> For instance, some embedded kernels omit UNIX-domain sockets and SysV
>>>>> IPC support.
>>>>>
>>>>> If you can send me (preferably off-list) the kernel config files for
>>>>> the old and new kernels I may be able to spot something.
>>>>> If present, you are looking for /boot/config-[VERSION]
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu  wrote:
>>>>>
>>>>>> I'm sorry I forgot to change the subject when I replied to the digest
>>>>>> issue. Please find my original email below.
>>>>>>
>>>>>> Regards,
>>>>>> Di
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu  wrote:
>>>>>>
>>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>>> forgot to put an extension to the file. Please find a new one
>>>>>>> attached with this email.
>>>>>>>
>>>>>>> I'm sorry for not enough debugging information, but 'ompi_info' and
>>>>>>> '--debug-devel' are the only ways I know for collecting information.
>>>>>>> Are there any other things I can try to provide more info?
>>>>>>>
>>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>>> output is the log

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Paul Hargrove
Allan,

I am glad things are working for you now.
I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu
14.04) that disabling the "lo" interface reproduces the problem.
I imagine this is true on other architectures, though I did not attempt to
verify.

Ralph,

If oob:tcp really does need the loopback interface, shouldn't its lack be
something that could/should be detected and reported instead of hanging as
Allan saw?

FWIW, neither of the following resolved the problem:
-mca oob_tcp_if_exclude lo
-mca oob_tcp_if_include eth0


-Paul


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain
I’ll have to look - there isn’t supposed to be such a requirement, and I 
certainly haven’t seen it before.


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Paul Hargrove
Ralph,

I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello"
I find that there is an attempt (by a secondary thread) to establish a TCP
socket from the rank process to the eth0 address of localhost (I am
guessing to reach the orted/mpirun).
However, when the "lo" interface is down, the Linux kernel apparently
cannot establish that socket.
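
The behaviour described above can be probed outside of Open MPI. The sketch below is only an illustration under stated assumptions, not what oob:tcp does internally: the function name can_reach_own_address is hypothetical, and the address passed in must belong to the local host (for example the eth0 address). It listens on an ephemeral port on that address and then tries to connect to it from the same machine with a short timeout.

```python
import socket


def can_reach_own_address(addr, timeout=2.0):
    """Listen on an ephemeral TCP port bound to addr, then connect to it
    from this same host. Returns True if the connection completes and
    False on a timeout or any other connect error. A bind failure (addr is
    not a local address) is not caught and raises OSError."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((addr, 0))  # port 0: let the kernel pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.settimeout(timeout)
            try:
                client.connect((addr, port))
            except OSError:  # socket.timeout is a subclass of OSError
                return False
            return True
```

On a Linux host showing the problem, one would expect can_reach_own_address("10.0.2.15") to return False while "lo" is down and True once it is up, matching the ping results below; that expectation is inferred from this thread rather than independently verified.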

In fact, if I am sufficiently patient, it turns out the "hang" is bounded,
and eventually one sees:

phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    blcr-armv7
  Remote host:   10.0.2.15
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.


real    2m8.151s
user    0m5.360s
sys     0m57.430s


Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.

There is no firewall, but in case you doubt me on that, here is a
demonstration using ping to show that 10.0.2.15 is only reachable when the
loopback interface is enabled:

phargrov@blcr-armv7:~$ sudo ifconfig lo up
phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

--- 10.0.2.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms


phargrov@blcr-armv7:~$ sudo ifconfig lo down
phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

--- 10.0.2.15 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1006ms


So, there is no "hang" -- just a 2 minute pause before the error message is
generated.
However, it may still be possible to present a better/earlier error message
when there is no loopback interface (and at least one rank process is to be
launched locally).
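
Since the failure shows up only after about two minutes, it helps to run the launch under a generous time limit so that a slow failure can be told apart from a true hang. This is a minimal sketch with a hypothetical helper name (run_bounded); it relies only on the Python standard library and makes no assumption about mpirun beyond its being an ordinary command.

```python
import subprocess
import time


def run_bounded(cmd, limit_s):
    """Run cmd and return (exit_code, elapsed_seconds). exit_code is None
    if the command was still running after limit_s seconds and was killed."""
    start = time.monotonic()
    try:
        code = subprocess.run(cmd, timeout=limit_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this.
        code = None
    return code, time.monotonic() - start
```

For example, run_bounded(["mpirun", "-np", "1", "./a.out"], 300) would return an exit code after roughly two minutes in the case shown above, whereas a real hang would return None at the 300 second limit. Only the direct child is killed on timeout, so processes it launched may need separate cleanup.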


-Paul


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain
I would never doubt you about the firewall, Paul :-)

So it looks like the issue isn’t so much with our code as it is with the OS 
stack, yes? We aren’t requiring that the loopback be “up”, but the stack is in 
order to establish the connection, even when we are trying a non-lo interface.

I can look into generating a faster timeout on the socket creation. In the 
trunk, we now use unix domain sockets instead of TCP to avoid such issues, but 
that won’t help with the 1.8 series.
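
As an illustration of the "faster timeout" idea, here is a minimal sketch in Python rather than in Open MPI's C code. It is not the actual oob:tcp implementation or its eventual fix; the helper name connect_or_explain and the wording of the message are assumptions. It bounds the connection attempt and reports the loopback possibility alongside the firewall one.

```python
import socket


def connect_or_explain(host, port, timeout=5.0):
    """Connect to (host, port) within timeout seconds and return the
    socket. On failure, raise RuntimeError listing likely causes."""
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # covers timeouts, refusals, unreachable hosts
        raise RuntimeError(
            f"could not connect to {host}:{port} within {timeout:g}s ({exc}). "
            "Possible causes: a firewall on the remote host, or, if the "
            "address belongs to this host, a loopback interface (lo) that "
            "is down."
        ) from exc
```

The design choice here is simply to keep the original exception attached (via "from exc") while giving the user both candidate explanations up front.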



Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Paul Hargrove
On Tue, Nov 25, 2014 at 5:37 PM, Ralph Castain  wrote:

> So it looks like the issue isn't so much with our code as it is with the
> OS stack, yes? We aren't requiring that the loopback be "up", but the stack
> is in order to establish the connection, even when we are trying a non-lo
> interface.



Correct, as far as I can tell.
It looks to me as if the stack says "Hey, that is my own address" and uses
the loopback interface instead of the one associated with the address.

I have checked Mac OSX and Solaris and neither one exhibits this behavior.
I can, if requested, check {Net,Open,Free}BSD as well.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain
No need - looks like I just need to fail faster and add this possibility to the 
error message.

Thanks!



Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Gilles Gouaillardet
Ralph and Paul,

On 2014/11/26 10:37, Ralph Castain wrote:
> So it looks like the issue isn't so much with our code as it is with the OS 
> stack, yes? We aren't requiring that the loopback be "up", but the stack is 
> in order to establish the connection, even when we are trying a non-lo 
> interface.
this is correct (imho)
> I can look into generating a faster timeout on the socket creation. In the 
> trunk, we now use unix domain sockets instead of TCP to avoid such issues, 
> but that won't help with the 1.8 series.
i was about to suggest this situation could have been avoided in the
first place by using unix domain sockets instead of TCP sockets :-)

is a backport (since this is already available in the trunk/master)
simply out of the question ?

Cheers,

Gilles

>
>> On Nov 25, 2014, at 4:50 PM, Paul Hargrove  wrote:
>>
>> Ralph,
>>
>> I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello"
>> I find that there is an attempt (by a secondary thread) to establish a TCP 
>> socket from the rank process to the eth0 address of localhost (I am guessing 
>> to reach the orted/mpirun).
>> However, when the "lo" interface is down, the Linux kernel apparently cannot 
>> establish that socket.
>>
>> In fact, if I am sufficiently patient, it turns out the "hang" is bounded, 
>> and eventually one sees:
>>
>> phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out
>> 
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>>   Local host:    blcr-armv7
>>   Remote host:   10.0.2.15
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> 
>>
>> real    2m8.151s
>> user    0m5.360s
>> sys     0m57.430s
>>
>>
>> Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.
>>
>> There is no firewall, but in case you doubt me on that, here is a 
>> demonstration using ping to show that 10.0.2.15 is only reachable when the 
>> loopback interface is enabled:
>>
>> phargrov@blcr-armv7:~$ sudo ifconfig lo up
>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>
>> --- 10.0.2.15 ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 1002ms
>> rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms
>>
>>
>> phargrov@blcr-armv7:~$ sudo ifconfig lo down
>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>
>> --- 10.0.2.15 ping statistics ---
>> 2 packets transmitted, 0 received, 100% packet loss, time 1006ms
>>
>>
>> So, there is no "hang" -- just a 2 minute pause before the error message is 
>> generated.
>> However, it may still be possible to present a better/earlier error message 
>> when there is no loopback interface (and at least one rank process is to be 
>> launched locally).
>>
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain wrote:
>> I'll have to look - there isn't supposed to be such a requirement, and I 
>> certainly haven't seen it before.
>>
>>
>>> On Nov 25, 2014, at 3:26 PM, Paul Hargrove wrote:
>>>
>>> Allan,
>>>
>>> I am glad things are working for you now.
>>> I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu 
>>> 14.04) that disabling the "lo" interface reproduces the problem.
>>> I imagine this is true on other architectures, though I did not attempt to 
>>> verify.
>>>
>>> Ralph,
>>>
>>> If oob:tcp really does need the loopback interface, shouldn't its lack be 
>>> something that could/should be detected and reported instead of hanging as 
>>> Allan saw?
>>>
>>> FWIW, neither of the following resolved the problem:
>>> -mca oob_tcp_if_exclude lo
>>> -mca oob_tcp_if_include eth0
>>>
>>>
>>> -Paul
>>>
>>> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu wrote:
>>> I think I have found the problem. After inspecting the output with "-mca 
>>> state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 100" 
>>> on both the old system and the new system, I noticed there is one line that 
>>> is different: on the old system where it works correctly, there is a line 
>>> that says: "oob:tcp:init rejecting loopback interface lo", while on the new 
>>> system there is no such line. Both systems proceed to open interface eth0 
>>> afterwards. Then I checked the new system, and found out that somehow the 
>>> loopback interface is not up by default. After I brought the lo interface 
>>> up, mpirun executes normally.
>>>
>>> Does it mean that OpenMPI will use lo for some initial setup? Since the 
>>> actual socket was created on eth0 I did not think of checking the lo 
>>> interface. Anyway, thanks everyone for all of your kind help. Let me know 
>>> if you want me to provide any more information for 

Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution on an embedded ARM Linux kernel version 3.15.0

2014-11-25 Thread Ralph Castain

> On Nov 25, 2014, at 6:15 PM, Gilles Gouaillardet wrote:
> 
> Ralph and Paul,
> 
> On 2014/11/26 10:37, Ralph Castain wrote:
>> So it looks like the issue isn’t so much with our code as it is with the OS 
>> stack, yes? We aren’t requiring that the loopback be “up”, but the stack is 
>> in order to establish the connection, even when we are trying a non-lo 
>> interface.
> this is correct (imho)
>> I can look into generating a faster timeout on the socket creation. In the 
>> trunk, we now use unix domain sockets instead of TCP to avoid such issues, 
>> but that won’t help with the 1.8 series.
> i was about to suggest this situation could have been avoided in the first 
> place by using unix domain sockets instead of TCP sockets :-)

There were some historical reasons for not doing so - mostly because it 
generally isn’t necessary on a cluster.
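
To illustrate why a Unix domain socket sidesteps this class of problem: it is addressed by a filesystem path rather than an IP address, so no network interface (lo or otherwise) is involved. The sketch below is generic Python, not how the Open MPI trunk implements it; the helper name and the socket file name oob.sock are hypothetical.

```python
import os
import socket
import tempfile


def unix_socket_round_trip(payload=b"hello"):
    """Send payload over an AF_UNIX stream socket and return what arrives."""
    sock_dir = tempfile.mkdtemp()
    path = os.path.join(sock_dir, "oob.sock")  # hypothetical rendezvous file
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)      # creates the socket file on disk
        server.listen(1)
        client.connect(path)   # addressed by path, not by IP address
        conn, _ = server.accept()
        with conn:
            client.sendall(payload)
            client.shutdown(socket.SHUT_WR)  # signal end of data
            received = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                received += chunk
            return received
    finally:
        client.close()
        server.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(sock_dir)
```

Because both ends live on the same host, this only covers local traffic such as a rank process talking to a daemon on the same node.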

> 
> is a backport (since this is already available in the trunk/master) simply 
> out of the question ?

It would be against our normal procedures, but I can raise it at next week’s 
meeting.
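
As a rough illustration of the faster timeout mentioned above, a connection attempt can be given an explicit deadline so that a failure is reported after a few seconds instead of the two-minute pause Paul measured. This is a generic Python sketch with a hypothetical helper name and an arbitrary 5-second default, not the oob:tcp code.

```python
import socket


def try_connect(host, port, timeout_s=5.0):
    """Attempt a TCP connection with a deadline.

    Returns None on success, or a short error string on failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return None
    except socket.timeout:
        # No answer within the deadline (the case seen with "lo" down).
        return f"timed out after {timeout_s}s connecting to {host}:{port}"
    except OSError as exc:
        # Refused, unreachable, or another socket-level failure.
        return f"could not connect to {host}:{port}: {exc}"
```

A caller could turn a timeout result into a hint to check the loopback interface rather than the firewall.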

> 
> Cheers,
> 
> Gilles
> 
>> 
>>> On Nov 25, 2014, at 4:50 PM, Paul Hargrove wrote:
>>> 
>>> Ralph,
>>> 
>>> I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello"
>>> I find that there is an attempt (by a secondary thread) to establish a TCP 
>>> socket from the rank process to the eth0 address of localhost (I am 
>>> guessing to reach the orted/mpirun).
>>> However, when the "lo" interface is down, the Linux kernel apparently 
>>> cannot establish that socket.
>>> 
>>> In fact, if I am sufficiently patient, it turns out the "hang" is bounded, 
>>> and eventually one sees:
>>> 
>>> phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out
>>> 
>>> A process or daemon was unable to complete a TCP connection
>>> to another process:
>>>   Local host:    blcr-armv7
>>>   Remote host:   10.0.2.15
>>> This is usually caused by a firewall on the remote host. Please
>>> check that any firewall (e.g., iptables) has been disabled and
>>> try again.
>>> 
>>> 
>>> real    2m8.151s
>>> user    0m5.360s
>>> sys     0m57.430s
>>> 
>>> 
>>> Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.
>>> 
>>> There is no firewall, but in case you doubt me on that, here is a 
>>> demonstration using ping to show that 10.0.2.15 is only reachable when the 
>>> loopback interface is enabled:
>>> 
>>> phargrov@blcr-armv7:~$ sudo ifconfig lo up
>>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>> 
>>> --- 10.0.2.15 ping statistics ---
>>> 2 packets transmitted, 2 received, 0% packet loss, time 1002ms
>>> rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms
>>> 
>>> 
>>> phargrov@blcr-armv7:~$ sudo ifconfig lo down
>>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>> 
>>> --- 10.0.2.15 ping statistics ---
>>> 2 packets transmitted, 0 received, 100% packet loss, time 1006ms
>>> 
>>> 
>>> So, there is no "hang" -- just a 2 minute pause before the error message is 
>>> generated.
>>> However, it may still be possible to present a better/earlier error message 
>>> when there is no loopback interface (and at least one rank process is to be 
>>> launched locally).
>>> 
>>> 
>>> -Paul
>>> 
>>> On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain wrote:
>>> I’ll have to look - there isn’t supposed to be such a requirement, and I 
>>> certainly haven’t seen it before.
>>> 
>>> 
>>>> On Nov 25, 2014, at 3:26 PM, Paul Hargrove wrote:
>>>>
>>>> Allan,
>>>>
>>>> I am glad things are working for you now.
>>>> I can confirm (on a QEMU-emulated Versatile Express A9 board running 
>>>> Ubuntu 14.04) that disabling the "lo" interface reproduces the problem.
>>>> I imagine this is true on other architectures, though I did not attempt to 
>>>> verify.
>>>>
>>>> Ralph,
>>>>
>>>> If oob:tcp really does need the loopback interface, shouldn't its lack be 
>>>> something that could/should be detected and reported instead of hanging as 
>>>> Allan saw?
>>>>
>>>> FWIW, neither of the following resolved the problem:
>>>> -mca oob_tcp_if_exclude lo
>>>> -mca oob_tcp_if_include eth0
>>>>
>>>>
>>>> -Paul
>>>>
>>>> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu wrote:
>>>> I think I have found the problem. After inspecting the output with "-mca 
>>>> state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 100" 
>>>> on both the old system and the new system, I noticed there is one line 
>>>> that is different: on the old system where it works correctly, there is