OK, applying my attached patch (based on Gilles's observation) resolved the
problem!
So I fully expect Ralph's plan to use "%d" to also resolve this.

HOWEVER, while the patch catches the "%u" case, there are plenty of
other ways to hit the same problem, for instance by using "%zu" for
size_t.  Additionally, I've already noted that the handling of "%ld",
"%lx", "%lX", and "%lf" is currently incorrect.

So, I ask: "Why isn't guess_strlen() just implemented as follows?"

/* From man vsnprintf:
 *            The functions snprintf and vsnprintf do not write more  than
 * size  bytes (including the trailing '\0').  If the output was truncated
 * due to this limit then the return value is  the  number  of  characters
 * (not  including the trailing '\0') which would have been written to the
 * final string if enough space had been available.
 */
static int guess_strlen(const char *fmt, va_list ap)
{
  char dummy[1];
  return 1 + vsnprintf(dummy, 1, fmt, ap);
}
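
A minimal sketch of that idea follows, for illustration only (it is not the
code in opal/util/printf.c).  It assumes a C99-conforming vsnprintf; some
older C libraries return -1 on truncation instead of the would-be length.
It also uses va_copy so that the caller's va_list is left untouched and can
still be handed to the real vsnprintf afterwards.  The names
guess_strlen_sketch and needed_size are hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical variant of guess_strlen(): returns the buffer size needed
 * for the formatted string (including the trailing '\0'), or -1 if
 * vsnprintf reports an error. */
static int guess_strlen_sketch(const char *fmt, va_list ap)
{
    va_list copy;
    char dummy[1];
    int n;

    va_copy(copy, ap);   /* work on a copy; the caller's ap stays usable */
    n = vsnprintf(dummy, sizeof(dummy), fmt, copy);
    va_end(copy);

    return (n < 0) ? -1 : n + 1;
}

/* Hypothetical varargs wrapper, present only to exercise the sketch. */
static int needed_size(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = guess_strlen_sketch(fmt, ap);
    va_end(ap);
    return n;
}
```

Because the C library does the parsing, conversions such as "%u", "%zu" and
"%ld" need no per-format case here, e.g. needed_size("%zu", (size_t)12345).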



BTW: I do see some messages like "select: Interrupted system call" which I
assume are related to the timeout code (and thus the subject of a different
thread).


-Paul

On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Thanks, Gilles!
>
> I was looking at that same code just now and completely missed the lack of
> a case for '%u' (and '%lu').  I will add one now and see if that resolves
> the problem....
>
>
> -Paul
>
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> I cannot find a case for the %u format in guess_strlen.
>> And since the default case does not invoke va_arg(),
>> it seems strlen is invoked on nnuma instead of arch.
>>
>> Makes sense?
>>
>> Cheers,
>>
>> Gilles
>>
>> Ralph Castain <r...@open-mpi.org> wrote:
>> Afraid I'm drawing a blank, Paul - I can't see how we got to a bad
>> address down there. This is at the beginning of orte_init, so there are no
>> threads running nor has anything much happened.
>>
>> Do you have any suggestions?
>>
>>
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>>
>> And so is "fmt":
>>
>> Current function is opal_asprintf
>>   194       length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>>
>> However, things have gone bad in guess_strlen():
>>
>> Current function is guess_strlen
>>    71                       len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 "<bad address 0x2>"
>>
>> -Paul
>>
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Hmmm....this is really odd. I actually do have a protection for that arch
>>> value being NULL, and you are in the code section when it isn't.
>>>
>>> Do you still have the core file around? If so, can you print out the
>>> value of the "arch" variable? It would be in the
>>> opal_hwloc_base_get_topo_signature level.
>>>
>>> I'm wondering if that value has been hosed, and the problem is memory
>>> corruption somewhere.
>>>
>>>
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc
>>> isn't returning an architecture type for some reason, and I didn't protect
>>> against it.
>>>
>>>
>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>> I've changed the subject line to distinguish this from the earlier
>>> report.
>>>
>>> -Paul
>>>
>>> program terminated by signal SEGV (no mapping at the fault address)
>>> 0xffffffff7d93b634: strlen+0x0014:      lduh     [%o2], %o1
>>> Current function is guess_strlen
>>>    71                       len += (int)strlen(sarg);
>>> (dbx) where
>>>   [1] strlen(0x2, 0x73000000, 0x2, 0x80808080, 0x2, 0x80808080), at 0xffffffff7d93b634
>>> =>[2] guess_strlen(fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff058), line 71 in "printf.c"
>>>   [3] opal_vasprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0xffffffff7ffff050), line 218 in "printf.c"
>>>   [4] opal_asprintf(ptr = 0xffffffff7ffff0b8, fmt = 0xffffffff7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in "printf.c"
>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in "hwloc_base_util.c"
>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>   [7] orte_init(pargc = 0xffffffff7ffff61c, pargv = 0xffffffff7ffff610, flags = 4U), line 148 in "orte_init.c"
>>>   [8] orterun(argc = 7, argv = 0xffffffff7ffff7a8), line 856 in "orterun.c"
>>>   [9] main(argc = 7, argv = 0xffffffff7ffff7a8), line 13 in "main.c"
>>>
>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> No, that looks different - it's failing in mpirun itself. Can you get a
>>>> line number on it?
>>>>
>>>> Sorry for delay - I'm generating rc3 now
>>>>
>>>>
>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>> Don't see an rc3 yet.
>>>>
>>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>>> However, lacking an rc3 to test I figured it would be better to report
>>>> this than to ignore it.
>>>>
>>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>>>> Sun compilers.
>>>>
>>>> -Paul
>>>>
>>>> [niagara1:29881] *** Process received signal ***
>>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>>> [niagara1:29881] Signal code: Address not mapped (1)
>>>> [niagara1:29881] Failing at address: 2
>>>>
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>>> /lib/libc.so.1:0xc5364
>>>> /lib/libc.so.1:0xb9e64
>>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>>>> [niagara1:29881] *** End of error message ***
>>>> Segmentation Fault - core dumped
>>>>
>>>> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain <r...@open-mpi.org>
>>>> wrote:
>>>>
>>>>> Ah crud - incomplete commit means we didn't send the topo string. Will
>>>>> roll rc3 in a few minutes.
>>>>>
>>>>> Thanks, Paul
>>>>> Ralph
>>>>>
>>>>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>
>>>>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am
>>>>> getting the following crash for both "-m32" and "-m64" builds:
>>>>>
>>>>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20
>>>>> examples/ring_c'
>>>>> [pcp-j-19:18762] *** Process received signal ***
>>>>> [pcp-j-19:18762] Signal: Segmentation Fault (11)
>>>>> [pcp-j-19:18762] Signal code: Address not mapped (1)
>>>>> [pcp-j-19:18762] Failing at address: 0
>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26 [0xfffffd7ffaf237ba]
>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833 [0xfffffd7ffaf20ba1]
>>>>> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7fff202cc6]
>>>>> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfffffd7fff1f648e]
>>>>> /lib/amd64/libc.so.1'strcmp+0x1a [0xfffffd7fff170fda] [Signal 11 (SEGV)]
>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90 [0x4010b7]
>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c [0x400f2c]
>>>>> [pcp-j-19:18762] *** End of error message ***
>>>>> bash: line 1: 18762 Segmentation Fault      (core dumped)
>>>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca
>>>>> ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca
>>>>> orte_ess_num_procs "2" -mca orte_hnp_uri "911343616.0;tcp://
>>>>> 172.16.0.120,172.18.0.120:50362" --tree-spawn -mca btl
>>>>> "sm,self,openib" -mca plm "rsh" -mca shmem_mmap_enable_nfs_warning "0"
>>>>>
>>>>> Running gdb against a core generated by the 32-bit build gives line
>>>>> numbers:
>>>>> #0  0xfea1cb45 in strcmp () from /lib/libc.so.1
>>>>> #1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
>>>>>     at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
>>>>> #2  0x08050fb1 in main (argc=26, argv=0x80479b0)
>>>>>     at /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>>>>>
>>>>> -Paul
>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>  _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16514.php
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
--- opal/util/printf.c~ Fri Dec 12 17:07:30 2014
+++ opal/util/printf.c  Fri Dec 12 17:43:00 2014
@@ -7,7 +7,7 @@
  *                         reserved.
  * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
  *                         University of Stuttgart.  All rights reserved.
- * Copyright (c) 2004-2005 The Regents of the University of California.
+ * Copyright (c) 2004-2014 The Regents of the University of California.
  *                         All rights reserved.
  * Copyright (c) 2007      Cisco Systems, Inc.  All rights reserved.
  * $COPYRIGHT$
@@ -45,6 +45,7 @@
     float farg;
     size_t i;
     int iarg;
+    unsigned int uiarg;
     int len;
     long larg;
 
@@ -90,6 +91,15 @@
                 } while (0 != iarg);
                 break;
 
+            case 'u':
+                uiarg = va_arg(ap, unsigned int);
+                /* Now get the log10 */
+                do {
+                    ++len;
+                    uiarg /= 10;
+                } while (0 != uiarg);
+                break;
+
             case 'x':
             case 'X':
                 iarg = va_arg(ap, int);
