Crud - sorry for delayed response. I was out for a bit. I’ll just change it to %d as there is nothing magic about it being unsigned. How bizarre.
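For anyone reading along, here is a minimal sketch of the failure mode described in the quoted thread below, under some assumptions. It is not the actual printf.c code: sketch_guess_strlen() and sketch_guess() are hypothetical names, and the per-conversion digit bounds are just a generous guess. The point it illustrates is that every conversion that takes an argument has to consume one with va_arg(), using the type that was actually passed. Otherwise a later "%s" picks up an earlier integer (such as nnuma) and hands it to strlen().

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a guess_strlen()-style helper: returns an upper
 * bound on the formatted length. Illustration only, not the Open MPI source. */
static int sketch_guess_strlen(const char *fmt, va_list ap)
{
    assert(fmt != NULL);
    int len = (int)strlen(fmt) + 1;          /* literal text plus NUL */

    for (size_t i = 0; fmt[i] != '\0'; ++i) {
        if (fmt[i] != '%' || fmt[i + 1] == '\0') {
            continue;
        }
        ++i;                                  /* step onto the conversion */
        int is_long = 0;
        if (fmt[i] == 'l' && fmt[i + 1] != '\0') {
            is_long = 1;                      /* "%ld", "%lu", ... */
            ++i;
        }
        switch (fmt[i]) {
        case 's': {
            const char *sarg = va_arg(ap, const char *);
            len += (sarg != NULL) ? (int)strlen(sarg) : (int)strlen("(null)");
            break;
        }
        case 'd':
        case 'i':
            /* Consume the argument with the type that was really passed. */
            if (is_long) {
                (void)va_arg(ap, long);
                len += 3 * (int)sizeof(long) + 1;
            } else {
                (void)va_arg(ap, int);
                len += 3 * (int)sizeof(int) + 1;
            }
            break;
        case 'u':
            /* The case the thread found missing: without it, nothing is
             * consumed and the next "%s" reads this integer instead. */
            if (is_long) {
                (void)va_arg(ap, unsigned long);
                len += 3 * (int)sizeof(unsigned long) + 1;
            } else {
                (void)va_arg(ap, unsigned int);
                len += 3 * (int)sizeof(unsigned int) + 1;
            }
            break;
        default:
            /* "%%" and unknown conversions: no argument consumed here. */
            break;
        }
    }
    return len;
}

/* Hypothetical variadic wrapper so the sketch can be called directly. */
static int sketch_guess(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int len = sketch_guess_strlen(fmt, ap);
    va_end(ap);
    return len;
}
```

Calling sketch_guess("%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", 1u, 2u, 3u, 4u, 5u, 6u, 7u, "sun4v") walks all seven unsigned arguments before reaching the string, so strlen() sees "sun4v" rather than a small integer like 0x2.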
> On Dec 12, 2014, at 3:21 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> NOTE:
>
> The existing code for "%l." in guess_strlen() is garbage.
> The va_arg() macro calls all have "int" for the type!!
>
> I am *only* testing a fix for the missing "%u" at the moment.
>
> -Paul
>
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Thanks, Gilles!
>
> I was looking at that same code just now and completely missed the lack of a
> case for '%u' (and '%lu'). I will add one now and see if that resolves the
> problem....
>
> -Paul
>
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Ralph,
>
> I cannot find a case for the %u format in guess_strlen.
> And since the default does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch.
>
> Makes sense?
>
> Cheers,
>
> Gilles
>
> Ralph Castain <r...@open-mpi.org> wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address
> down there. This is at the beginning of orte_init, so there are no threads
> running nor has anything much happened.
>
> Do you have any suggestions?
>
>
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134                   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>>
>> And so is "fmt":
>>
>> Current function is opal_asprintf
>>   194       length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>>
>> However, things have gone bad in guess_strlen():
>>
>> Current function is guess_strlen
>>    71                   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 "<bad address 0x2>"
>>
>> -Paul
>>
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch
>> value being NULL, and you are in the code section when it isn’t.
>>
>> Do you still have the core file around? If so, can you print out the value
>> of the “arch” variable? It would be in the
>> opal_hwloc_base_get_topo_signature level.
>>
>> I’m wondering if that value has been hosed, and the problem is memory
>> corruption somewhere.
>>
>>
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t
>>> returning an architecture type for some reason, and I didn’t protect
>>> against it.
>>>
>>>
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>>
>>>> -Paul
>>>>
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0xffffffff7d93b634: strlen+0x0014:      lduh     [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71                   len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x73000000, 0x2, 0x80808080, 0x2, 0x80808080), at 0xffffffff7d93b634
>>>> =>[2] guess_strlen(fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff058), line 71 in "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff050), line 218 in "printf.c"
>>>>   [4] opal_asprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in "printf.c"
>>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in "hwloc_base_util.c"
>>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>>   [7] orte_init(pargc = 0xffffffff7ffff61c, pargv = 0xffffffff7ffff610, flags = 4U), line 148 in "orte_init.c"
>>>>   [8] orterun(argc = 7, argv = 0xffffffff7ffff7a8), line 856 in "orterun.c"
>>>>   [9] main(argc = 7, argv = 0xffffffff7ffff7a8), line 13 in "main.c"
>>>>
>>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> No, that looks different - it’s failing in mpirun itself. Can you get a
>>>> line number on it?
>>>>
>>>> Sorry for delay - I’m generating rc3 now
>>>>
>>>>
>>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>
>>>>> Don't see an rc3 yet.
>>>>>
>>>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>>>> However, lacking an rc3 to test I figured it would be better to report
>>>>> this than to ignore it.
>>>>>
>>>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>>>>> Sun compilers.
>>>>>
>>>>> -Paul
>>>>>
>>>>> [niagara1:29881] *** Process received signal ***
>>>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>>>> [niagara1:29881] Signal code: Address not mapped (1)
>>>>> [niagara1:29881] Failing at address: 2
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>>>> /lib/libc.so.1:0xc5364
>>>>> /lib/libc.so.1:0xb9e64
>>>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>>>>> [niagara1:29881] *** End of error message ***
>>>>> Segmentation Fault - core dumped
>>>>>
>>>>> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Ah crud - incomplete commit means we didn’t send the topo string. Will
>>>>> roll rc3 in a few minutes.
>>>>>
>>>>> Thanks, Paul
>>>>> Ralph
>>>>>
>>>>>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>>
>>>>>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am
>>>>>> getting the following crash for both "-m32" and "-m64" builds:
>>>>>>
>>>>>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c'
>>>>>> [pcp-j-19:18762] *** Process received signal ***
>>>>>> [pcp-j-19:18762] Signal: Segmentation Fault (11)
>>>>>> [pcp-j-19:18762] Signal code: Address not mapped (1)
>>>>>> [pcp-j-19:18762] Failing at address: 0
>>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26 [0xfffffd7ffaf237ba]
>>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833 [0xfffffd7ffaf20ba1]
>>>>>> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7fff202cc6]
>>>>>> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfffffd7fff1f648e]
>>>>>> /lib/amd64/libc.so.1'strcmp+0x1a [0xfffffd7fff170fda] [Signal 11 (SEGV)]
>>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90 [0x4010b7]
>>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c [0x400f2c]
>>>>>> [pcp-j-19:18762] *** End of error message ***
>>>>>> bash: line 1: 18762 Segmentation Fault      (core dumped)
>>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted
>>>>>> -mca ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca
>>>>>> orte_ess_num_procs "2" -mca orte_hnp_uri "911343616.0;tcp://172.16.0.120,172.18.0.120:50362"
>>>>>> --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh" -mca
>>>>>> shmem_mmap_enable_nfs_warning "0"
>>>>>>
>>>>>> Running gdb against a core generated by the 32-bit build gives line
>>>>>> numbers:
>>>>>> #0  0xfea1cb45 in strcmp () from /lib/libc.so.1
>>>>>> #1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
>>>>>>     at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
>>>>>> #2  0x08050fb1 in main (argc=26, argv=0x80479b0)
>>>>>>     at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16514.php