[OMPI devel] Master: Segfault seen while running imb tests
Hi All,
I am seeing the following segfault with openmpi-master.

[root@maneybhanjang ~]# /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun --allow-run-as-root --hostfile /root/mpd.hosts -np 8 --prefix /usr/mpi/gcc/openmpi-2.0-dev/ --map-by node --display-allocation --oversubscribe --mca btl openib,sm,self /usr/mpi/gcc/openmpi-2.0-dev/tests/IMB/IMB-MPI1

== ALLOCATED NODES ==
maneybhanjang: flags=0x01 slots=8 max_slots=0 slots_inuse=0 state=UP
10.193.184.162: flags=0x03 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=
[maneybhanjang:28532] *** Process received signal ***
[maneybhanjang:28532] Signal: Segmentation fault (11)
[maneybhanjang:28532] Signal code: Invalid permissions (2)
[maneybhanjang:28532] Failing at address: 0x106ca70
[maneybhanjang:28532] [ 0] /lib64/libpthread.so.0[0x3aea40f710]
[maneybhanjang:28532] [ 1] [0x106ca70]
[maneybhanjang:28532] *** End of error message ***
[tonglu:02068] *** Process received signal ***
[tonglu:02068] Signal: Segmentation fault (11)
[tonglu:02068] Signal code: Invalid permissions (2)
[tonglu:02068] Failing at address: 0x2478500
[tonglu:02068] [ 0] /lib64/libpthread.so.0[0x3ef5c0f710]
[tonglu:02068] [ 1] [0x2478500]
[tonglu:02068] *** End of error message ***
bash: line 1:  2068 Segmentation fault      (core dumped) /usr/mpi/gcc/openmpi-2.0-dev/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:8L1:8C:8H:x86_64 -mca ess "env" -mca ess_base_jobid "3921674240" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_hnp_uri "3921674240.0;usock;tcp://10.193.184.161,102.1.1.161,102.2.2.161:43160" --mca btl "openib,sm,self" -mca plm "rsh" -mca rmaps_base_mapping_policy "node" -mca orte_display_alloc "1" -mca rmaps_base_oversubscribe "1"
Segmentation fault (core dumped)

[root@maneybhanjang ~]# dmesg
mpirun[28532]: segfault at 106ca70 ip 0106ca70 sp 7fffc00a7f28 error 15

The segfault is seen on the other peer too.
[root@tonglu ~]# dmesg
orted[2068]: segfault at 2478500 ip 02478500 sp 7fff521c2e68 error 15

gdb on the core dump points me to orted/pmix/pmix_server_gen.c:80. The backtrace follows.

[root@maneybhanjang ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun core.28532
Program terminated with signal 11, Segmentation fault.
#0  0x0106ca70 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libudev-147-2.57.el6.x86_64
(gdb) bt
#0  0x0106ca70 in ?? ()
#1  0x2b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260) at orted/pmix/pmix_server_gen.c:80
#2  0x2b217fad5a7c in event_process_active_single_queue (base=0xfcc730, flags=1) at event.c:1370
#3  event_process_active (base=0xfcc730, flags=1) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0xfcc730, flags=1) at event.c:1644
#5  0x004014d3 in orterun (argc=16, argv=0x7fffc00a81e8) at orterun.c:192
#6  0x00400f04 in main (argc=16, argv=0x7fffc00a81e8) at main.c:13
(gdb) frame
#0  0x0106ca70 in ?? ()
(gdb) up
#1  0x2b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260) at orted/pmix/pmix_server_gen.c:80
80          cd->cbfunc(OPAL_SUCCESS, cd->cbdata);

Here is the backtrace from the peer machine, pointing to the same line:

[root@tonglu ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/orted core.2068
Program terminated with signal 11, Segmentation fault.
#0  0x02478500 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libudev-147-2.57.el6.x86_64 numactl-2.0.9-2.el6.x86_64
(gdb) bt
#0  0x02478500 in ?? ()
#1  0x2af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260) at orted/pmix/pmix_server_gen.c:80
#2  0x2af451474cac in event_process_active_single_queue (base=0x2408e90, flags=1) at event.c:1370
#3  event_process_active (base=0x2408e90, flags=1) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0x2408e90, flags=1) at event.c:1644
#5  0x2af451123c57 in orte_daemon (argc=33, argv=0x7fff521c33d8) at orted/orted_main.c:859
#6  0x0040081a in main (argc=33, argv=0x7fff521c33d8) at orted.c:60
(gdb) frame
#0  0x02478500 in ?? ()
(gdb) up
#1  0x2af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260) at orted/pmix/pmix_server_gen.c:80
80          cd->cbfunc(OPAL_SUCCESS, cd->cbdata);
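For context, both core dumps share the same signature: the faulting address equals the instruction pointer, the signal code is "Invalid permissions" (SEGV_ACCERR), gdb cannot resolve frame #0, and frame #1 is the indirect call cd->cbfunc(OPAL_SUCCESS, cd->cbdata). That pattern typically means the stored callback pointer held a non-executable (data or stale) address when the deferred event fired, e.g. because the caddy was freed or clobbered earlier. Below is a minimal, hypothetical sketch of that callback-caddy pattern; the struct and function names are illustrative only, not the actual definitions from pmix_server_gen.c.

/*
 * Hypothetical, simplified sketch of the callback-caddy pattern behind
 * pmix_server_gen.c:80 -- names and types here are illustrative only,
 * not the real Open MPI definitions.
 */
#include <stdio.h>
#include <stdlib.h>

#define OPAL_SUCCESS 0

typedef void (*op_cbfunc_t)(int status, void *cbdata);

/* "Caddy" that carries a completion callback across the event-loop hop. */
typedef struct {
    op_cbfunc_t cbfunc;   /* invoked when the deferred event fires */
    void *cbdata;         /* opaque pointer handed back to the caller */
} op_caddy_t;

/* Event handler, analogous to _client_conn(sd, args, cbdata) in frame #1. */
static void client_conn_handler(int sd, short args, void *cbdata)
{
    op_caddy_t *cd = (op_caddy_t *)cbdata;

    /*
     * This is effectively line 80 in the backtrace.  If the caddy was
     * freed or overwritten before the event fired, cd->cbfunc no longer
     * holds a function address; the indirect call then jumps to a data
     * address, which is why gdb shows frame #0 as "?? ()" at the same
     * address reported as the fault address.
     */
    if (NULL != cd->cbfunc) {
        cd->cbfunc(OPAL_SUCCESS, cd->cbdata);
    }
    free(cd);
}

/* Example completion callback supplied by the caller. */
static void my_cbfunc(int status, void *cbdata)
{
    printf("operation completed, status=%d cbdata=%p\n", status, cbdata);
}

int main(void)
{
    op_caddy_t *cd = calloc(1, sizeof(*cd));
    if (NULL == cd) {
        return 1;
    }
    cd->cbfunc = my_cbfunc;
    cd->cbdata = NULL;

    /* In the real code the handler is dispatched through the libevent
     * base (event_process_active_single_queue in frame #2); it is
     * called directly here to keep the sketch self-contained. */
    client_conn_handler(-1, 4, cd);
    return 0;
}

Note that the NULL check only guards a deliberately cleared pointer; if the caddy itself is freed or overwritten (use-after-free), the check does not help, and running mpirun/orted under a memory checker such as valgrind would be a reasonable next step.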
Re: [OMPI devel] Master: Segfault seen while running imb tests
This looks like a segv in mpirun itself -- can you file an issue on github so that we can track this?  Thanks.

> On Jun 28, 2016, at 3:33 AM, Potnuri Bharat Teja wrote:
>
> [...]
[OMPI devel] Open MPI infrastructure moving (over the next few months)
Heads up for those who were not on the webex today: our faculty sponsor at Indiana University (IU), Dr. Andrew Lumsdaine, is moving to a different organization.

IU has been incredibly helpful over the past 12 years (!) of the Open MPI project: they have provided all kinds of hosting and infrastructure to the Open MPI community, completely free of charge. We are deeply grateful for all the time, effort, and resources that have been freely given to the Open MPI project. Thank you, IU!

However, now that Andrew is leaving IU, we will need to move our infrastructure elsewhere -- probably within the next 3 months. Discussions are now occurring about the logistics of how to move this infrastructure. We'll likely get some wiki pages up detailing what needs to move, how, etc. Stay tuned.

Two main notes:

1. As we move our infrastructure to new location(s), some user-noticeable behavior may change (e.g., functionality in the web sites and/or mailing lists). We'll try to give a heads up when this happens, but just be aware that new infrastructure may mean different capabilities.

2. We've been fortunate that all of our costs so far have essentially been $0, but we're now going to start incurring infrastructure hosting costs as a community. Who will pay these costs?

Because of #2, today on the webex we proposed moving to the MPI Forum model of funding: charging a small registration fee for the Open MPI face-to-face developer meetings. All that money will go into some kind of community account somewhere, and we'll use that money to pay community bills (e.g., github and other hosting/infrastructure fees). Unless there is serious objection, we plan to start using this model for the upcoming Open MPI developer meeting (i.e., there will be a registration fee).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] August Open MPI dev meeting: Dallas, TX, USA
On the webex today, we decided that the next face-to-face developer meeting will be:

- When: 9am, Tue Aug 15 - 1pm, Thu Aug 18, 2016
- Where: Dallas, TX, USA (at the same IBM facility that we used in Feb 2016)

*** PLEASE ADD YOUR NAME TO THE WIKI IF YOU PLAN TO ATTEND:
https://github.com/open-mpi/ompi/wiki/Meeting-2016-08

**>> please add your name by COB Fri, 1 July -- see below <<**

Also per discussion on the webex today (and per http://www.open-mpi.org/community/lists/devel/2016/06/19139.php), there will likely be a registration fee. We haven't done the math yet to figure out how much the fee will be (we tossed around $50 on the webex today, but that is with no accounting/math backing it up).

Putting your name on the wiki by the end of this week will greatly help us in calculating how much we need to charge for the registration fee. Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] August Open MPI dev meeting: Dallas, TX, USA
On Jun 28, 2016, at 12:43 PM, Jeff Squyres (jsquyres) wrote:
>
> - When: 9am, Tue Aug 15 - 1pm, Thu Aug 18, 2016

That should be: 9am, Tue Aug ***16*** - 1pm, Thu Aug 18, 2016.

Sorry for the confusion!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/