Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

2014-12-17 Thread Paul Hargrove
ion   = 0x4d51d0
set_thisproc_membind = 0x3334323832373031
get_thisproc_membind = 0x7063743b302e3032
set_thisthread_membind   = 0x312e3237312f2f3a
get_thisthread_membind   = 0x2c3032312e302e36
set_proc_membind = 0x302e38312e323731
get_proc_membind = 0x3439343a3032312e
set_area_membind = 0x6568003836
get_area_membind = (nil)
alloc= 0x41
alloc_membind= 0x6361635f6d656d6b
free_membind = 0x65706f2f706d742f
}
support= {
discovery = 0x7365732d69706d6e
cpubind   = 0x68702d736e6f6973
membind   = 0x7040766f72677261
}
userdata_export_cb = 0x5f30322d6a2d7063
userdata_import_cb = 0x2f30373336312f30
first_osdist   = 0x302f30
last_osdist= 0xfba56f14
backends   = 0x41
}
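
Several of the "pointer" fields in the dump above consist entirely of printable ASCII bytes when read in memory order, which may suggest that string data was written over the structure. The following is a minimal sketch for checking that, assuming a little-endian host such as the x86-64 machine in this report. The helper names decode_le_ascii and looks_like_text are hypothetical and are not part of hwloc or Open MPI.

```c
#include <stdint.h>

/* Hypothetical helper: copy the 8 bytes of v, least significant byte first
 * (their in-memory order on little-endian x86-64), into out as text.
 * Bytes outside the printable ASCII range are shown as '.'.
 * out must have room for 8 characters plus the terminating NUL. */
void decode_le_ascii(uint64_t v, char out[9])
{
    for (int i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(v >> (8 * i));
        out[i] = (c >= 0x20 && c <= 0x7e) ? (char)c : '.';
    }
    out[8] = '\0';
}

/* Hypothetical helper: return 1 if all 8 bytes of v are printable ASCII,
 * 0 otherwise. A real user-space pointer on this platform would normally
 * have zero or non-printable bytes somewhere. */
int looks_like_text(uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(v >> (8 * i));
        if (c < 0x20 || c > 0x7e)
            return 0;
    }
    return 1;
}
```

Called on values from the dump, decode_le_ascii turns 0x7063743b302e3032 into "20.0;tcp" and 0x65706f2f706d742f into "/tmp/ope", while looks_like_text returns 0 for 0x4d51d0, which has the shape of an ordinary heap address.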




On Wed, Dec 17, 2014 at 12:54 PM, Rolf vandeVaart 
wrote:
>
>  I think this has already been fixed by Ralph this morning.  I had
> observed the same issue, but it is now gone.
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Brice
> Goglin
> *Sent:* Wednesday, December 17, 2014 3:53 PM
> *To:* de...@open-mpi.org
> *Subject:* Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest
>
>
>
> Le 17/12/2014 21:43, Paul Hargrove a écrit :
>
>
>
> Dbx gives me
>
>   t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
>
> Current function is opal_hwloc172_hwloc_get_obj_by_depth
>
>    74 return topology->levels[depth][idx];
>
> (dbx) where
>
> current thread: t@1
>
> =>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0,
> idx = 0), line 74 in "traversal.c"
>
>   [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in
> "helper.h"
>
>   [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target =
> OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in
> "hwloc_base_util.c"
>
>   [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list =
> 0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target =
> OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in
> "rmaps_rr_mappers.c"
>
>   [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
>
>   [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line
> 277 in "rmaps_base_map_job.c"
>
>   [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
> 0xfd7fe453afbc
>
>   [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
> 0xfd7fe453b361
>
>   [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
> 0xfd7fe453bc79
>
>   [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in
> "orterun.c"
>
>   [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
>
> (dbx) print depth
>
> depth = 0
>
> (dbx) print index
>
> index = 0xfd7fff19c174
>
>
>
> Pretty sure that index value is bogus.
>
>
>
>
> I see "idx" rather than "index" in the code above. Could "index" be
> resolving to the "index()" function in your standard library?
> Anyway, depth=0 and idx=0 are perfectly valid, especially when called from
> hwloc_get_root_obj(). Did something bad happen to the topology object? Can
> you print the contents of topology, topology->nblevels, and
> topology->levels?
>
> Brice
>   --
>  This email message is for the sole use of the intended recipient(s) and
> may contain confidential information.  Any unauthorized review, use,
> disclosure or distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message.
>  --
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16652.php
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

2014-12-17 Thread Rolf vandeVaart
I think this has already been fixed by Ralph this morning.  I had observed the
same issue, but it is now gone.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Brice Goglin
Sent: Wednesday, December 17, 2014 3:53 PM
To: de...@open-mpi.org
Subject: Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

Le 17/12/2014 21:43, Paul Hargrove a écrit :

Dbx gives me
t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
Current function is opal_hwloc172_hwloc_get_obj_by_depth
   74 return topology->levels[depth][idx];
(dbx) where
current thread: t@1
=>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0, idx 
= 0), line 74 in "traversal.c"
  [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in 
"helper.h"
  [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target = 
OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in 
"hwloc_base_util.c"
  [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list = 
0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target = 
OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in "rmaps_rr_mappers.c"
  [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
  [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line 277 
in "rmaps_base_map_job.c"
  [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 
0xfd7fe453afbc
  [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfd7fe453b361
  [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 
0xfd7fe453bc79
  [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in "orterun.c"
  [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
(dbx) print depth
depth = 0
(dbx) print index
index = 0xfd7fff19c174

Pretty sure that index value is bogus.


I see "idx" rather than "index" in the code above. Could "index" be resolving
to the "index()" function in your standard library?
Anyway, depth=0 and idx=0 are perfectly valid, especially when called from
hwloc_get_root_obj(). Did something bad happen to the topology object? Can you
print the contents of topology, topology->nblevels, and topology->levels?

Brice



Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

2014-12-17 Thread Brice Goglin
Le 17/12/2014 21:43, Paul Hargrove a écrit :
>
> Dbx gives me
>
> t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
> Current function is opal_hwloc172_hwloc_get_obj_by_depth
>    74 return topology->levels[depth][idx];
> (dbx) where
> current thread: t@1
> =>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0,
> depth = 0, idx = 0), line 74 in "traversal.c"
>   [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line
> 118 in "helper.h"
>   [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target =
> OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'),
> line 833 in "hwloc_base_util.c"
>   [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0,
> node_list = 0xfd7fffdff4b0, num_slots = 2, num_procs = 2U,
> target = OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495
> in "rmaps_rr_mappers.c"
>   [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
>   [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata =
> 0x4a3300), line 277 in "rmaps_base_map_job.c"
>   [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0), at 0xfd7fe453afbc 
>   [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
> 0xfd7fe453b361 
>   [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0), at 0xfd7fe453bc79 
>   [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in
> "orterun.c"
>   [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
> (dbx) print depth
> depth = 0
> (dbx) print index
> index = 0xfd7fff19c174
>
>
> Pretty sure that index value is bogus.
>

I see "idx" rather than "index" in the code above. Could "index" be
resolving to the "index()" function in your standard library?
Anyway, depth=0 and idx=0 are perfectly valid, especially when called from
hwloc_get_root_obj(). Did something bad happen to the topology object? Can
you print the contents of topology, topology->nblevels, and
topology->levels?
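
To illustrate why depth=0 and idx=0 should be safe on a healthy topology, here is a simplified model of the lookup at line 74 of traversal.c. It is a sketch under stated assumptions, not the hwloc source: the toy_* types and the nbobjs field are hypothetical stand-ins, and only the levels[depth][idx] indexing is taken from the trace.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the hwloc types. */
struct toy_obj {
    int os_index;
};

struct toy_topology {
    unsigned nblevels;         /* number of valid depths */
    unsigned *nbobjs;          /* nbobjs[depth] = objects at that depth */
    struct toy_obj ***levels;  /* levels[depth][idx], as in the trace */
};

/* Bounds-checked version of the lookup. The checks catch a bad depth or
 * idx, but they cannot help if the topology structure itself has been
 * overwritten: nblevels, nbobjs and levels would then be garbage too. */
struct toy_obj *toy_get_obj_by_depth(const struct toy_topology *t,
                                     unsigned depth, unsigned idx)
{
    if (t == NULL || t->levels == NULL || t->nbobjs == NULL)
        return NULL;
    if (depth >= t->nblevels)
        return NULL;
    if (idx >= t->nbobjs[depth])
        return NULL;
    return t->levels[depth][idx];
}
```

With a single root object at depth 0, toy_get_obj_by_depth(&t, 0, 0) returns that root, which is the same call pattern hwloc_get_root_obj() makes in frame [2]. A fault on that access therefore points at the contents of the topology structure rather than at the arguments.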

Brice



[OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

2014-12-17 Thread Paul Hargrove
I tried last night's v1.8 tarball (openmpi-v1.8.3-272-g4e4f997.tar.bz2) with
the Studio Compilers (v12.3) on a Solaris/x86-64 system.
Configure args (other than prefix) were:

--enable-debug --with-verbs \
CC=cc CXX=CC FC=f90 \
CFLAGS=-m64 --with-wrapper-cflags=-m64 \
FCFLAGS=-m64 --with-wrapper-fcflags=-m64 \
CXXFLAGS='-m64 -library=stlport4' --with-wrapper-cxxflags='-m64 -library=stlport4'


When running ring_c I see the following

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
[pcp-j-20:24250] mca_oob_tcp_accept: accept() failed: Error 0 (0).
[pcp-j-20:24250] *** Process received signal ***
[pcp-j-20:24250] Signal: Segmentation Fault (11)
[pcp-j-20:24250] Signal code: Address not mapped (1)
[pcp-j-20:24250] Failing at address: fd7fe45bf227
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x2d
[0xfd7fe450a91d]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0xafd
[0xfd7fe450066d]
/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
/lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_hwloc172_hwloc_get_obj_by_depth+0x1d7
[0xfd7fe45bf227] [Signal 11 (SEGV)]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_hwloc172_hwloc_get_root_obj+0x24
[0xfd7fe4560504]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_hwloc_base_get_nbobjs_by_type+0xec
[0xfd7fe45653ec]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/openmpi/mca_rmaps_round_robin.so'orte_rmaps_rr_byobj+0x252
[0xfd7fe1c9ddd2]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/openmpi/mca_rmaps_round_robin.so'orte_rmaps_rr_map+0x65e
[0xfd7fe1c912be]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-rte.so.7.0.5'orte_rmaps_base_map_job+0xdce
[0xfd7fe276aace]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'event_process_active_single_queue+0x1dc
[0xfd7fe453afbc]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'event_process_active+0xb1
[0xfd7fe453b361]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/lib/libopen-pal.so.6.2.1'opal_libevent2021_event_base_loop+0x339
[0xfd7fe453bc79]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/bin/orterun'orterun+0x1d0e
[0x4101fe]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/bin/orterun'main+0x20
[0x408ca0]
/shared/OMPI/openmpi-1.8-latest-solaris11-x64-ib-ss12u3-nightly/INST/bin/orterun'0x8b0b
[0x408b0b]
[pcp-j-20:24250] *** End of error message ***


Dbx gives me

t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
Current function is opal_hwloc172_hwloc_get_obj_by_depth
   74 return topology->levels[depth][idx];
(dbx) where
current thread: t@1
=>[1] opal_hwloc172_hwloc_get_obj_by_depth(topology = 0x4d49e0, depth = 0,
idx = 0), line 74 in "traversal.c"
  [2] opal_hwloc172_hwloc_get_root_obj(topology = 0x4d49e0), line 118 in
"helper.h"
  [3] opal_hwloc_base_get_nbobjs_by_type(topo = 0x4d49e0, target =
OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0, rtype = '\003'), line 833 in
"hwloc_base_util.c"
  [4] orte_rmaps_rr_byobj(jdata = 0x43c940, app = 0x483fe0, node_list =
0xfd7fffdff4b0, num_slots = 2, num_procs = 2U, target =
OPAL_HWLOC172_hwloc_OBJ_CORE, cache_level = 0), line 495 in
"rmaps_rr_mappers.c"
  [5] orte_rmaps_rr_map(jdata = 0x43c940), line 165 in "rmaps_rr.c"
  [6] orte_rmaps_base_map_job(fd = -1, args = 4, cbdata = 0x4a3300), line
277 in "rmaps_base_map_job.c"
  [7] event_process_active_single_queue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
0xfd7fe453afbc
  [8] event_process_active(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
0xfd7fe453b361
  [9] opal_libevent2021_event_base_loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
0xfd7fe453bc79
  [10] orterun(argc = 9, argv = 0xfd7fffdffa58), line 1081 in
"orterun.c"
  [11] main(argc = 9, argv = 0xfd7fffdffa58), line 13 in "main.c"
(dbx) print depth
depth = 0
(dbx) print index
index = 0xfd7fff19c174


Pretty sure that index value is bogus.

-Paul



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900