[OMPI devel] Master: Segfault seen while running imb tests

2016-06-28 Thread Potnuri Bharat Teja
Hi All,
I am seeing the following segfault with openmpi-master.


[root@maneybhanjang ~]# /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun
--allow-run-as-root --hostfile /root/mpd.hosts -np 8 --prefix
/usr/mpi/gcc/openmpi-2.0-dev/ --map-by node --display-allocation
--oversubscribe --mca btl openib,sm,self
/usr/mpi/gcc/openmpi-2.0-dev/tests/IMB/IMB-MPI1

==   ALLOCATED NODES   ==
maneybhanjang: flags=0x01 slots=8 max_slots=0 slots_inuse=0 state=UP
10.193.184.162: flags=0x03 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=
[maneybhanjang:28532] *** Process received signal ***
[maneybhanjang:28532] Signal: Segmentation fault (11)
[maneybhanjang:28532] Signal code: Invalid permissions (2)
[maneybhanjang:28532] Failing at address: 0x106ca70
[maneybhanjang:28532] [ 0]
/lib64/libpthread.so.0[0x3aea40f710]
[maneybhanjang:28532] [ 1] [0x106ca70]
[maneybhanjang:28532] *** End of error message ***
[tonglu:02068] *** Process received signal ***
[tonglu:02068] Signal: Segmentation fault (11)
[tonglu:02068] Signal code: Invalid permissions (2)
[tonglu:02068] Failing at address: 0x2478500
[tonglu:02068] [ 0] /lib64/libpthread.so.0[0x3ef5c0f710]
[tonglu:02068] [ 1] [0x2478500]
[tonglu:02068] *** End of error message ***
bash: line 1:  2068 Segmentation fault  (core
dumped) /usr/mpi/gcc/openmpi-2.0-dev/bin/orted
--hnp-topo-sig 0N:2S:0L3:4L2:8L1:8C:8H:x86_64 -mca ess
"env" -mca ess_base_jobid "3921674240" -mca
ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
orte_hnp_uri

"3921674240.0;usock;tcp://10.193.184.161,102.1.1.161,102.2.2.161:43160"
--mca btl "openib,sm,self" -mca plm "rsh" -mca
rmaps_base_mapping_policy "node" -mca orte_display_alloc
"1" -mca rmaps_base_oversubscribe "1"
Segmentation fault (core dumped)
[root@maneybhanjang ~]# dmesg
mpirun[28532]: segfault at 106ca70 ip 0106ca70 sp 7fffc00a7f28 error 15

Segfault is seen on the other peer too.
[root@tonglu ~]# dmesg
orted[2068]: segfault at 2478500 ip 02478500 sp 7fff521c2e68 error 15

gdb on the core dump points to orted/pmix/pmix_server_gen.c:80.
Following is the backtrace.
[root@maneybhanjang ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun core.28532
Program terminated with signal 11, Segmentation fault.
#0  0x0106ca70 in ?? ()
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
libudev-147-2.57.el6.x86_64
(gdb) bt
#0  0x0106ca70 in ?? ()
#1  0x2b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260)
    at orted/pmix/pmix_server_gen.c:80
#2  0x2b217fad5a7c in event_process_active_single_queue (base=0xfcc730, flags=1)
    at event.c:1370
#3  event_process_active (base=0xfcc730, flags=1) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0xfcc730, flags=1) at event.c:1644
#5  0x004014d3 in orterun (argc=16, argv=0x7fffc00a81e8) at orterun.c:192
#6  0x00400f04 in main (argc=16, argv=0x7fffc00a81e8) at main.c:13
(gdb) frame
#0  0x0106ca70 in ?? ()
(gdb) up
#1  0x2b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260)
    at orted/pmix/pmix_server_gen.c:80
80          cd->cbfunc(OPAL_SUCCESS, cd->cbdata);


Here is the backtrace from the peer machine, pointing to the same line:

[root@tonglu ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/orted core.2068
Program terminated with signal 11, Segmentation fault.
#0  0x02478500 in ?? ()
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
libudev-147-2.57.el6.x86_64 numactl-2.0.9-2.el6.x86_64
(gdb) bt
#0  0x02478500 in ?? ()
#1  0x2af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260)
    at orted/pmix/pmix_server_gen.c:80
#2  0x2af451474cac in event_process_active_single_queue (base=0x2408e90, flags=1)
    at event.c:1370
#3  event_process_active (base=0x2408e90, flags=1) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0x2408e90, flags=1) at event.c:1644
#5  0x2af451123c57 in orte_daemon (argc=33, argv=0x7fff521c33d8)
    at orted/orted_main.c:859
#6  0x0040081a in main (argc=33, argv=0x7fff521c33d8) at orted.c:60
(gdb) frame
#0  0x02478500 in ?? ()
(gdb) up
#1  0x2af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260)
    at orted/pmix/pmix_server_gen.c:80
80          cd->cbfunc(OPAL_SUCCESS, cd->cbdata);


Re: [OMPI devel] Master: Segfault seen while running imb tests

2016-06-28 Thread Jeff Squyres (jsquyres)
This looks like a segv in mpirun itself -- can you file an issue on github so 
that we can track this?

Thanks.



[OMPI devel] Open MPI infrastructure moving (over the next few months)

2016-06-28 Thread Jeff Squyres (jsquyres)
Heads up for those who were not on the webex today: our faculty sponsor at 
Indiana University (IU), Dr. Andrew Lumsdaine, is moving to a different 
organization.

IU has been incredibly helpful over the past 12 years (!) of the Open MPI 
project: they have provided all kinds of hosting and infrastructure to the Open 
MPI community, completely free of charge.  We are deeply grateful for all the 
time, effort, and resources that have been freely given to the Open MPI 
project.  Thank you, IU!

However, now that Andrew is leaving IU, we will need to move our infrastructure 
elsewhere -- probably within the next 3 months.

Discussions are now occurring about the logistics of how to move this 
infrastructure.  We'll likely get some wiki pages up detailing what needs to 
move, how, etc.  Stay tuned.

Two main notes:

1. As we move our infrastructure to new location(s), some user-noticeable 
behavior may change (e.g., functionality in the web sites and/or mailing 
lists).  We'll try to give heads up when this happens, but just be aware that 
new infrastructure may mean different capabilities.

2. We've been fortunate that all of our costs so far have essentially been $0, 
but we're now going to start incurring infrastructure hosting costs as a 
community.  Who will pay these costs?

Because of #2, today on the webex, we proposed moving to the MPI Forum model of 
funding: charging a small registration fee for the Open MPI face-to-face 
developer meetings.  All that money will go into some kind of community account 
somewhere, and we'll use that money to pay community bills (e.g., github and 
other hosting/infrastructure fees).

Unless there is serious objection, we plan to start using this model for the 
upcoming Open MPI developer meeting (i.e., there will be a registration fee).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] August Open MPI dev meeting: Dallas, TX, USA

2016-06-28 Thread Jeff Squyres (jsquyres)
On the webex today, we decided that the next face-to-face developer meeting 
will be:

- When: 9am, Tue Aug 15 - 1pm, Thu Aug 18, 2016
- Where: Dallas, TX, USA (at the same IBM facility that we used in Feb 2016)

*** PLEASE ADD YOUR NAME TO THE WIKI IF YOU PLAN TO ATTEND:

https://github.com/open-mpi/ompi/wiki/Meeting-2016-08
**>> please add your name by COB Fri, 1 July -- see below <<**

Also per discussion on the webex today (and per 
http://www.open-mpi.org/community/lists/devel/2016/06/19139.php), there will 
likely be a registration fee.  

We haven't done the math yet to figure out how much the fee will be (we tossed 
around $50 on the webex today, but that is with no accounting/math backing it 
up).

Putting your name on the wiki by the end of this week will greatly help us in 
calculating how much we need to charge for the registration fee.

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] August Open MPI dev meeting: Dallas, TX, USA

2016-06-28 Thread Jeff Squyres (jsquyres)
On Jun 28, 2016, at 12:43 PM, Jeff Squyres (jsquyres)  
wrote:
> 
> - When: 9am, Tue Aug 15 - 1pm, Thu Aug 18, 2016

That should be:

9am Tue, Aug ***16*** - 1pm, Thu Aug 18, 2016.

Sorry for the confusion!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/