Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest
        ion = 0x4d51d0
        set_thisproc_membind = 0x3334323832373031
        get_thisproc_membind = 0x7063743b302e3032
        set_thisthread_membind = 0x312e3237312f2f3a
        get_thisthread_membind = 0x2c3032312e302e36
        set_proc_membind = 0x302e38312e323731
        get_proc_membind = 0x3439343a3032312e
        set_area_membind = 0x6568003836
        get_area_membind = (nil)
        alloc = 0x41
        alloc_membind = 0x6361635f6d656d6b
        free_membind = 0x65706f2f706d742f
    }
    support = {
        discovery = 0x7365732d69706d6e
        cpubind = 0x68702d736e6f6973
        membind = 0x7040766f72677261
    }
    userdata_export_cb = 0x5f30322d6a2d7063
    userdata_import_cb = 0x2f30373336312f30
    first_osdist = 0x302f30
    last_osdist = 0xfba56f14
    backends = 0x41
}

On Wed, Dec 17, 2014 at 12:54 PM, Rolf vandeVaart wrote:
>
> I think this has already been fixed by Ralph this morning. I had
> observed the same issue but is now gone.
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Brice Goglin
> Sent: Wednesday, December 17, 2014 3:53 PM
> To: de...@open-mpi.org
> Subject: Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest
>
> On 17/12/2014 21:43, Paul Hargrove wrote:
> >
> > Dbx gives me
> >
> > t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
> > Current function is opal_hwloc172_hwloc_get_obj_by_depth
> >    74       return topology->levels[depth][idx];
> > (dbx) where
> > current thread: t@1
> > =>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0, idx = 0), line 74 in "traversal.c"
> >   [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in "helper.h"
> >   [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in "hwloc_base_util.c"
> >   [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list = 0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in "rmaps_rr_mappers.c"
> >   [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
> >   [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line 277 in "rmaps_base_map_job.c"
> >   [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453afbc
> >   [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453b361
> >   [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453bc79
> >   [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in "orterun.c"
> >   [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
> > (dbx) print depth
> > depth = 0
> > (dbx) print index
> > index = 0xfd7fff19c174
> >
> > Pretty sure that index value is bogus.
>
> I see "idx" instead of "index" in the code above. index may be a pointer
> to the "index()" function in your standard library?
> Anyway, depth=0 and idx=0 is totally valid, especially when called from
> hwloc_get_root_obj(). Something bad happened to the topology object? Can
> you print the contents of topology and topology->nblevels and
> topology->levels?
>
> Brice
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16652.php

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest
I think this has already been fixed by Ralph this morning. I had observed the same issue, but it is now gone.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Brice Goglin
Sent: Wednesday, December 17, 2014 3:53 PM
To: de...@open-mpi.org
Subject: Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

On 17/12/2014 21:43, Paul Hargrove wrote:
>
> Dbx gives me
>
> t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
> Current function is opal_hwloc172_hwloc_get_obj_by_depth
>    74       return topology->levels[depth][idx];
> (dbx) where
> current thread: t@1
> =>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0, idx = 0), line 74 in "traversal.c"
>   [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in "helper.h"
>   [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in "hwloc_base_util.c"
>   [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list = 0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in "rmaps_rr_mappers.c"
>   [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
>   [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line 277 in "rmaps_base_map_job.c"
>   [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453afbc
>   [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453b361
>   [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453bc79
>   [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in "orterun.c"
>   [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
> (dbx) print depth
> depth = 0
> (dbx) print index
> index = 0xfd7fff19c174
>
> Pretty sure that index value is bogus.

I see "idx" instead of "index" in the code above. index may be a pointer to the "index()" function in your standard library?

Anyway, depth=0 and idx=0 is totally valid, especially when called from hwloc_get_root_obj(). Something bad happened to the topology object? Can you print the contents of topology and topology->nblevels and topology->levels?

Brice

---
This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
---
Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest
On 17/12/2014 21:43, Paul Hargrove wrote:
>
> Dbx gives me
>
> t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
> Current function is opal_hwloc172_hwloc_get_obj_by_depth
>    74       return topology->levels[depth][idx];
> (dbx) where
> current thread: t@1
> =>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0, idx = 0), line 74 in "traversal.c"
>   [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in "helper.h"
>   [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in "hwloc_base_util.c"
>   [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list = 0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in "rmaps_rr_mappers.c"
>   [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
>   [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line 277 in "rmaps_base_map_job.c"
>   [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453afbc
>   [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453b361
>   [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453bc79
>   [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in "orterun.c"
>   [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
> (dbx) print depth
> depth = 0
> (dbx) print index
> index = 0xfd7fff19c174
>
> Pretty sure that index value is bogus.
>

I see "idx" instead of "index" in the code above. index may be a pointer to the "index()" function in your standard library?

Anyway, depth=0 and idx=0 is totally valid, especially when called from hwloc_get_root_obj(). Something bad happened to the topology object? Can you print the contents of topology and topology->nblevels and topology->levels?

Brice
[OMPI devel] Solaris/x86-64 SEGV with 1.8-latest
I tried last night's v1.8 tarball (openmpi-v1.8.3-272-g4e4f997.tar.bz2) with the Studio Compilers (v12.3) on a Solaris/x86-64 system.

Configure args (other than prefix) were:
    --enable-debug --with-verbs \
    CC=cc CXX=CC FC=f90 \
    CFLAGS=-m64 --with-wrapper-cflags=-m64 \
    FCFLAGS=-m64 --with-wrapper-fcflags=-m64 \
    CXXFLAGS='-m64 -library=stlport4' --with-wrapper-cxxflags='-m64 -library=stlport4'

When running ring_c I see the following:

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
[pcp-j-20:24250] mca_oob_tcp_accept: accept() failed: Error 0 (0).
[pcp-j-20:24250] *** Process received signal ***
[pcp-j-20:24250] Signal: Segmentation Fault (11)
[pcp-j-20:24250] Signal code: Address not mapped (1)
[pcp-j-20:24250] Failing at address: fd7fe45bf227
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x2d [0xfd7fe450a91d]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0xafd [0xfd7fe450066d]
/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
/lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_hwloc172_hwloc_get_obj_by_depth+0x1d7 [0xfd7fe45bf227] [Signal 11 (SEGV)]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_hwloc172_hwloc_get_root_obj+0x24 [0xfd7fe4560504]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_hwloc_base_get_nbobjs_by_type+0xec [0xfd7fe45653ec]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/openmpi/mca_rmaps_round_robin.so'orte_rmaps_rr_byobj+0x252 [0xfd7fe1c9ddd2]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/openmpi/mca_rmaps_round_robin.so'orte_rmaps_rr_map+0x65e [0xfd7fe1c912be]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-rte.so.7.0.5'orte_rmaps_base_map_job+0xdce [0xfd7fe276aace]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'event_process_active_single_queue+0x1dc [0xfd7fe453afbc]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'event_process_active+0xb1 [0xfd7fe453b361]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_libevent2021_event_base_loop+0x339 [0xfd7fe453bc79]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/bin/orterun'orterun+0x1d0e [0x4101fe]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/bin/orterun'main+0x20 [0x408ca0]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/bin/orterun'0x8b0b [0x408b0b]
[pcp-j-20:24250] *** End of error message ***

Dbx gives me

t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
Current function is opal_hwloc172_hwloc_get_obj_by_depth
   74       return topology->levels[depth][idx];
(dbx) where
current thread: t@1
=>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0, idx = 0), line 74 in "traversal.c"
  [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in "helper.h"
  [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in "hwloc_base_util.c"
  [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list = 0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in "rmaps_rr_mappers.c"
  [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
  [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line 277 in "rmaps_base_map_job.c"
  [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453afbc
  [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453b361
  [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453bc79
  [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in "orterun.c"
  [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
(dbx) print depth
depth = 0
(dbx) print index
index = 0xfd7fff19c174

Pretty sure that index value is bogus.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900