Ralph,

I cannot find a case for the %u format in guess_strlen(), and since the default case does not invoke va_arg(), it seems strlen() ends up being invoked on nnuma instead of arch.
Makes sense?

Cheers,
Gilles

Ralph Castain <r...@open-mpi.org> wrote:

> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address
> down there. This is at the beginning of orte_init, so there are no threads
> running nor has anything much happened.
>
> Do you have any suggestions?
>
>
> On Dec 12, 2014, at 9:02 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Ralph,
>
> The "arch" variable looks fine:
>
> Current function is opal_hwloc_base_get_topo_signature
>  2134   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
> (dbx) print arch
> arch = 0x1001700a0 "sun4v"
>
> And so is "fmt":
>
> Current function is opal_asprintf
>   194   length = opal_vasprintf(ptr, fmt, ap);
> (dbx) print fmt
> fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>
> However, things have gone bad in guess_strlen():
>
> Current function is guess_strlen
>    71   len += (int)strlen(sarg);
> (dbx) print sarg
> sarg = 0x2 "<bad address 0x2>"
>
> -Paul
>
>
> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmmm… this is really odd. I actually do have a protection for that arch
> value being NULL, and you are in the code section when it isn’t.
>
> Do you still have the core file around? If so, can you print out the value
> of the “arch” variable? It would be in the
> opal_hwloc_base_get_topo_signature level.
>
> I’m wondering if that value has been hosed, and the problem is memory
> corruption somewhere.
>
>
> On Dec 11, 2014, at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t
> returning an architecture type for some reason, and I didn’t protect
> against it.
>
>
> On Dec 11, 2014, at 7:39 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Backtrace for the Solaris-10/SPARC SEGV appears below.
> I've changed the subject line to distinguish this from the earlier report.
>
> -Paul
>
> program terminated by signal SEGV (no mapping at the fault address)
> 0xffffffff7d93b634: strlen+0x0014: lduh [%o2], %o1
> Current function is guess_strlen
>    71   len += (int)strlen(sarg);
> (dbx) where
>   [1] strlen(0x2, 0x73000000, 0x2, 0x80808080, 0x2, 0x80808080), at 0xffffffff7d93b634
> =>[2] guess_strlen(fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff058), line 71 in "printf.c"
>   [3] opal_vasprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff050), line 218 in "printf.c"
>   [4] opal_asprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in "printf.c"
>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in "hwloc_base_util.c"
>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>   [7] orte_init(pargc = 0xffffffff7ffff61c, pargv = 0xffffffff7ffff610, flags = 4U), line 148 in "orte_init.c"
>   [8] orterun(argc = 7, argv = 0xffffffff7ffff7a8), line 856 in "orterun.c"
>   [9] main(argc = 7, argv = 0xffffffff7ffff7a8), line 13 in "main.c"
>
>
> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> No, that looks different - it’s failing in mpirun itself. Can you get a
> line number on it?
>
> Sorry for delay - I’m generating rc3 now.
>
>
> On Dec 11, 2014, at 6:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Don't see an rc3 yet.
>
> My Solaris-10/SPARC runs fail slightly differently (see below).
> It looks sufficiently similar that it MIGHT be the same root cause.
> However, lacking an rc3 to test I figured it would be better to report
> this than to ignore it.
>
> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
> Sun compilers.
>
> -Paul
>
> [niagara1:29881] *** Process received signal ***
> [niagara1:29881] Signal: Segmentation Fault (11)
> [niagara1:29881] Signal code: Address not mapped (1)
> [niagara1:29881] Failing at address: 2
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
> /lib/libc.so.1:0xc5364
> /lib/libc.so.1:0xb9e64
> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
> [niagara1:29881] *** End of error message ***
> Segmentation Fault - core dumped
>
>
> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Ah crud - incomplete commit means we didn’t send the topo string. Will
> roll rc3 in a few minutes.
>
> Thanks, Paul
> Ralph
>
>
> On Dec 11, 2014, at 3:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting
> the following crash for both "-m32" and "-m64" builds:
>
> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
> [pcp-j-19:18762] *** Process received signal ***
> [pcp-j-19:18762] Signal: Segmentation Fault (11)
> [pcp-j-19:18762] Signal code: Address not mapped (1)
> [pcp-j-19:18762] Failing at address: 0
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26 [0xfffffd7ffaf237ba]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833 [0xfffffd7ffaf20ba1]
> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7fff202cc6]
> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfffffd7fff1f648e]
> /lib/amd64/libc.so.1'strcmp+0x1a [0xfffffd7fff170fda] [Signal 11 (SEGV)]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90 [0x4010b7]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c [0x400f2c]
> [pcp-j-19:18762] *** End of error message ***
> bash: line 1: 18762 Segmentation Fault (core dumped) /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "911343616.0;tcp://172.16.0.120,172.18.0.120:50362" --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh" -mca shmem_mmap_enable_nfs_warning "0"
>
> Running gdb against a core generated by the 32-bit build gives line numbers:
>
> #0  0xfea1cb45 in strcmp () from /lib/libc.so.1
> #1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
>     at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
> #2  0x08050fb1 in main (argc=26, argv=0x80479b0)
>     at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>
> -Paul
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16514.php