Ralph,

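I cannot find a case for the %u format in guess_strlen().
Since the default case does not invoke va_arg(), none of the seven %u
arguments is consumed, so it seems strlen() is invoked on nnuma instead of arch.
That would also explain sarg = 0x2 in Paul's dbx output.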
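
To illustrate what I mean, here is a minimal sketch of a length estimator that does consume the %u arguments. It is not the actual printf.c code: the names sketch_guess_strlen and sketch_guess, the 10-character allowance per %u, and the "(null)" allowance are all my own assumptions for illustration. The point is only that each integer conversion needs its own va_arg() call so the later %s reads the right argument.

```c
#include <stdarg.h>
#include <string.h>

/* Hypothetical, simplified stand-in for guess_strlen(); NOT the Open MPI source.
 * Returns an upper bound on the formatted length, handling only %u and %s. */
static int sketch_guess_strlen(const char *fmt, va_list ap)
{
    size_t n = strlen(fmt);
    int len = (int)n + 1;               /* literal characters plus the NUL */
    size_t i;

    for (i = 0; i < n; ++i) {
        if (fmt[i] != '%' || i + 1 >= n) {
            continue;
        }
        ++i;                            /* look at the conversion character */
        switch (fmt[i]) {
        case 's': {
            const char *sarg = va_arg(ap, const char *);
            /* Assumption: allow 6 characters for "(null)" when sarg is NULL. */
            len += (NULL != sarg) ? (int)strlen(sarg) : 6;
            break;
        }
        case 'u':
            /* Consume the integer so that a later %s reads the right slot. */
            (void)va_arg(ap, unsigned int);
            len += 10;                  /* assumes at most 10 decimal digits (32-bit unsigned) */
            break;
        default:
            /* No va_arg() here: an unhandled conversion leaves its argument
             * unconsumed, which is the failure mode described above. */
            break;
        }
    }
    return len;
}

/* Hypothetical variadic wrapper so the sketch can be called directly. */
static int sketch_guess(const char *fmt, ...)
{
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = sketch_guess_strlen(fmt, ap);
    va_end(ap);
    return len;
}
```

With this sketch, sketch_guess("%uN:%s", 2u, "sun4v") consumes the 2 for %u and only then reads "sun4v" for %s. If the 'u' case is removed, the %s branch would pull the 2 off the argument list and pass it to strlen() as a pointer.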

Makes sense ?

Cheers,

Gilles

Ralph Castain <r...@open-mpi.org> wrote:
>Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
>down there. This is at the beginning of orte_init, so there are no threads 
>running nor has anything much happened.
>
>
>Do you have any suggestions?
>
>
>
>On Dec 12, 2014, at 9:02 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Ralph,
>
>
>The "arch" variable looks fine:
>
>Current function is opal_hwloc_base_get_topo_signature
>
> 2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>
>(dbx) print arch
>
>arch = 0x1001700a0 "sun4v"
>
>
>And so is "fmt":
>
>
>Current function is opal_asprintf
>
>  194       length = opal_vasprintf(ptr, fmt, ap);
>
>(dbx) print fmt
>
>fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>
>
>However, things have gone bad in guess_strlen():
>
>
>Current function is guess_strlen
>
>   71                       len += (int)strlen(sarg);
>
>(dbx) print sarg
>
>sarg = 0x2 "<bad address 0x2>"
>
>
>-Paul
>
>
>On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Hmmm….this is really odd. I actually do have a protection for that arch value 
>being NULL, and you are in the code section when it isn’t.
>
>
>Do you still have the core file around? If so, can you print out the value of 
>the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature 
>level.
>
>
>I’m wondering if that value has been hosed, and the problem is memory 
>corruption somewhere.
>
>
>
>On Dec 11, 2014, at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
>Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>returning an architecture type for some reason, and I didn’t protect against 
>it.
>
>
>
>On Dec 11, 2014, at 7:39 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Backtrace for the Solaris-10/SPARC SEGV appears below.
>
>I've changed the subject line to distinguish this from the earlier report.
>
>
>-Paul
>
>
>program terminated by signal SEGV (no mapping at the fault address)
>
>0xffffffff7d93b634: strlen+0x0014:      lduh     [%o2], %o1
>
>Current function is guess_strlen
>
>   71                       len += (int)strlen(sarg);
>
>(dbx) where
>
>  [1] strlen(0x2, 0x73000000, 0x2, 0x80808080, 0x2, 0x80808080), at 
>0xffffffff7d93b634 
>
>=>[2] guess_strlen(fmt = 0xffffffff7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff058), line 71 in 
>"printf.c"
>
>  [3] opal_vasprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff050), line 218 in 
>"printf.c"
>
>  [4] opal_asprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>"printf.c"
>
>  [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>"hwloc_base_util.c"
>
>  [6] rte_init(), line 205 in "ess_hnp_module.c"
>
>  [7] orte_init(pargc = 0xffffffff7ffff61c, pargv = 0xffffffff7ffff610, flags 
>= 4U), line 148 in "orte_init.c"
>
>  [8] orterun(argc = 7, argv = 0xffffffff7ffff7a8), line 856 in "orterun.c"
>
>  [9] main(argc = 7, argv = 0xffffffff7ffff7a8), line 13 in "main.c"
>
>
>On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>No, that looks different - it’s failing in mpirun itself. Can you get a line 
>number on it?
>
>
>Sorry for delay - I’m generating rc3 now
>
>
>
>On Dec 11, 2014, at 6:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Don't see an rc3 yet.
>
>
>My Solaris-10/SPARC runs fail slightly differently (see below).
>
>It looks sufficiently similar that it MIGHT be the same root cause.
>
>However, lacking an rc3 to test I figured it would be better to report this 
>than to ignore it.
>
>
>The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun 
>compilers.
>
>
>-Paul
>
>
>[niagara1:29881] *** Process received signal ***
>
>[niagara1:29881] Signal: Segmentation Fault (11)
>
>[niagara1:29881] Signal code: Address not mapped (1)
>
>[niagara1:29881] Failing at address: 2
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>
>/lib/libc.so.1:0xc5364
>
>/lib/libc.so.1:0xb9e64
>
>/lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>
>[niagara1:29881] *** End of error message ***
>
>Segmentation Fault - core dumped
>
>
>On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Ah crud - incomplete commit means we didn’t send the topo string. Will roll 
>rc3 in a few minutes.
>
>
>Thanks, Paul
>
>Ralph
>
>
>On Dec 11, 2014, at 3:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting the 
>following crash for both "-m32" and "-m64" builds:
>
>
>$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c'
>
>[pcp-j-19:18762] *** Process received signal ***
>
>[pcp-j-19:18762] Signal: Segmentation Fault (11)
>
>[pcp-j-19:18762] Signal code: Address not mapped (1)
>
>[pcp-j-19:18762] Failing at address: 0
>
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
> [0xfffffd7ffaf237ba]
>
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
> [0xfffffd7ffaf20ba1]
>
>/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7fff202cc6]
>
>/lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfffffd7fff1f648e]
>
>/lib/amd64/libc.so.1'strcmp+0x1a [0xfffffd7fff170fda] [Signal 11 (SEGV)]
>
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90 
>[0x4010b7]
>
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c
> [0x400f2c]
>
>[pcp-j-19:18762] *** End of error message ***
>
>bash: line 1: 18762 Segmentation Fault      (core dumped) 
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca ess 
>"env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca 
>orte_ess_num_procs "2" -mca orte_hnp_uri 
>"911343616.0;tcp://172.16.0.120,172.18.0.120:50362" --tree-spawn -mca btl 
>"sm,self,openib" -mca plm "rsh" -mca shmem_mmap_enable_nfs_warning "0"
>
>
>Running gdb against a core generated by the 32-bit build gives line numbers:
>
>#0  0xfea1cb45 in strcmp () from /lib/libc.so.1
>
>#1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
>
>    at 
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
>
>#2  0x08050fb1 in main (argc=26, argv=0x80479b0)
>
>    at 
>/shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>
>
>-Paul
>
>
>-- 
>
>Paul H. Hargrove                          phhargr...@lbl.gov
>
>Computer Languages & Systems Software (CLaSS) Group
>
>Computer Science Department               Tel: +1-510-495-2352
>
>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16514.php
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16515.php
>
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16521.php
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16522.php
>
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16524.php
>
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16541.php
>
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16552.php
>
>
