Hmmm… this is really odd. I actually do have protection against that arch value being NULL, and the crash is in the code path that runs only when it isn’t NULL.
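For illustration only, here is a minimal sketch of the kind of NULL guard being described. It is not the actual Open MPI code: the function name `make_topo_signature`, its parameters, and the "unknown" placeholder are assumptions; only the format string is taken from the backtrace in this thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Format string as it appears in the dbx backtrace. */
#define SIG_FMT "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"

/* Hypothetical helper (not the Open MPI implementation): builds a
 * topology signature string and substitutes a placeholder when no
 * architecture string is available. Caller frees the result. */
static char *make_topo_signature(unsigned nodes, unsigned sockets,
                                 unsigned l3, unsigned l2, unsigned l1,
                                 unsigned cores, unsigned hwthreads,
                                 const char *arch)
{
    int len;
    char *sig;

    /* Passing NULL for a %s conversion is undefined behavior, so
     * replace it before formatting. */
    if (arch == NULL) {
        arch = "unknown";
    }

    /* First pass: measure the required length (C99 snprintf). */
    len = snprintf(NULL, 0, SIG_FMT, nodes, sockets, l3, l2, l1,
                   cores, hwthreads, arch);
    if (len < 0) {
        return NULL;
    }

    sig = malloc((size_t)len + 1);
    if (sig == NULL) {
        return NULL;
    }

    /* Second pass: write the string into the allocated buffer. */
    snprintf(sig, (size_t)len + 1, SIG_FMT, nodes, sockets, l3, l2, l1,
             cores, hwthreads, arch);
    return sig;
}
```

A guard like this only catches a pointer that is exactly NULL. The trace shows strlen() being called on address 0x2, which is non-NULL but unmapped, so a check of this kind would let it through.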
Do you still have the core file around? If so, can you print out the value of the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature frame (frame [5] in the dbx trace below). I’m wondering whether that value has been corrupted, which would point to memory corruption somewhere.

> On Dec 11, 2014, at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t
> returning an architecture type for some reason, and I didn’t protect against
> it.
>
>
>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>> I've changed the subject line to distinguish this from the earlier report.
>>
>> -Paul
>>
>> program terminated by signal SEGV (no mapping at the fault address)
>> 0xffffffff7d93b634: strlen+0x0014: lduh [%o2], %o1
>> Current function is guess_strlen
>> 71 len += (int)strlen(sarg);
>> (dbx) where
>> [1] strlen(0x2, 0x73000000, 0x2, 0x80808080, 0x2, 0x80808080), at 0xffffffff7d93b634
>> =>[2] guess_strlen(fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff058), line 71 in "printf.c"
>> [3] opal_vasprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff050), line 218 in "printf.c"
>> [4] opal_asprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in "printf.c"
>> [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in "hwloc_base_util.c"
>> [6] rte_init(), line 205 in "ess_hnp_module.c"
>> [7] orte_init(pargc = 0xffffffff7ffff61c, pargv = 0xffffffff7ffff610, flags = 4U), line 148 in "orte_init.c"
>> [8] orterun(argc = 7, argv = 0xffffffff7ffff7a8), line 856 in "orterun.c"
>> [9] main(argc = 7, argv = 0xffffffff7ffff7a8), line 13 in "main.c"
>>
>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> No, that looks different - it’s failing in mpirun itself. Can you get a line
>> number on it?
>>
>> Sorry for delay - I’m generating rc3 now
>>
>>
>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> Don't see an rc3 yet.
>>>
>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>> However, lacking an rc3 to test I figured it would be better to report this
>>> than to ignore it.
>>>
>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun
>>> compilers.
>>>
>>> -Paul
>>>
>>> [niagara1:29881] *** Process received signal ***
>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>> [niagara1:29881] Signal code: Address not mapped (1)
>>> [niagara1:29881] Failing at address: 2
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>> /lib/libc.so.1:0xc5364
>>> /lib/libc.so.1:0xb9e64
>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>>> [niagara1:29881] *** End of error message ***
>>> Segmentation Fault - core dumped
>>>
>>> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Ah crud - incomplete commit means we didn’t send the topo string. Will roll
>>> rc3 in a few minutes.
>>>
>>> Thanks, Paul
>>> Ralph
>>>
>>>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting
>>>> the following crash for both "-m32" and "-m64" builds:
>>>>
>>>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c'
>>>> [pcp-j-19:18762] *** Process received signal ***
>>>> [pcp-j-19:18762] Signal: Segmentation Fault (11)
>>>> [pcp-j-19:18762] Signal code: Address not mapped (1)
>>>> [pcp-j-19:18762] Failing at address: 0
>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26 [0xfffffd7ffaf237ba]
>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833 [0xfffffd7ffaf20ba1]
>>>> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7fff202cc6]
>>>> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfffffd7fff1f648e]
>>>> /lib/amd64/libc.so.1'strcmp+0x1a [0xfffffd7fff170fda] [Signal 11 (SEGV)]
>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90 [0x4010b7]
>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c [0x400f2c]
>>>> [pcp-j-19:18762] *** End of error message ***
>>>> bash: line 1: 18762 Segmentation Fault (core dumped)
>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca
>>>> ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca
>>>> orte_ess_num_procs "2" -mca orte_hnp_uri
>>>> "911343616.0;tcp://172.16.0.120,172.18.0.120:50362"
>>>> --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh" -mca
>>>> shmem_mmap_enable_nfs_warning "0"
>>>>
>>>> Running gdb against a core generated by the 32-bit build gives line
>>>> numbers:
>>>> #0 0xfea1cb45 in strcmp () from /lib/libc.so.1
>>>> #1 0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
>>>>    at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
>>>> #2 0x08050fb1 in main (argc=26, argv=0x80479b0)
>>>>    at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>>>>
>>>> -Paul
>>>>
>>>> --
>>>> Paul H. Hargrove phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16514.php
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16515.php
>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16521.php
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16522.php
>>
>>
>>
>> --
>> Paul H. Hargrove phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16524.php
>