On Wed, Sep 12, 2007 at 03:30:38AM +0200, Roland Mainz wrote:
> William James wrote:
> > 
> > Suse just reported a problem about ksh 64bit crashes. Does anyone know
> > if the bug affects Solaris on AMD64 or SPARC, too?
> 
> I am not sure... I have to test Werner's script on an AMD64 box with and
> without our patches (see below) applied...
> 
> > ---------- Forwarded message ----------
> > From: Dr. Werner Fink <werner at suse.de>
> > Date: Aug 27, 2007 4:43 PM
> > Subject: [ast-users] Crash on 64bit system with vmalloc from libast
> > To: ast-users at research.att.com
> > 
> > Hi,
> > 
> > just found by a ksh user on s390x, but also valid on x86_64 ...
> > the snippet below crashes not only ksh93r but also ksh93s.
> > Most of the time GDB reports a SIGSEGV at
> > 
> > bestreclaim() in src/lib/libast/vmalloc/vmbest.c:520
> > bestsearch()  in src/lib/libast/vmalloc/vmbest.c:351
> > 
> > but the crash also occurs at line 291 in
> > src/lib/libast/vmalloc/vmbest.c.  Interestingly, the addresses at
> > this point for the left part of the tree are 0xffffffff
> > (which is not -1UL but -1U) or simply -1UL + address.
> [snip]
> 
> AFAIK there are three known memory corruption-related crashers in
> ast-ksh.2007-04-18 we currently have in OS/Net tree... the worst of them
> was timing/signal-related and only happened in rare cases on a small
> subset of our test machines and April hunted-down&&cornered this
> xx@@@!!! - we applied a hot-fix for it (see
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libshell/misc/ERRATA.txt
> , Errata #003) and I hope it's just an incarnation of this bug...

I have already tested the last patch from ERRATA.txt and it does not
help; the first change does not help either.

I've tried to detect what happens with the help of valgrind
and see things like:

 ==17286== Invalid read of size 8
 ==17286==    at 0x4EC8C7: bestsearch (vmbest.c:292)
 ==17286==    by 0x4ECE68: bestreclaim (vmbest.c:438)
 ==17286==    by 0x4ED8C7: bestalloc (vmbest.c:744)
 ==17286==    by 0x4F2CA0: malloc (malloc.c:440)
 ==17286==    by 0x4D2C1D: sfnew (sfnew.c:97)
 ==17286==    by 0x4D4D40: sfvsprintf (sfprintf.c:68)
 ==17286==    by 0x4D4EDD: sfsprintf (sfprintf.c:112)
 ==17286==    by 0x4A198D: fmtbasell (fmtbase.c:44)
 ==17286==    by 0x430B80: special (macro.c:2282)
 ==17286==    by 0x42BFFE: varsub (macro.c:991)
 ==17286==    by 0x42A6DE: copyto (macro.c:556)
 ==17286==    by 0x428BB0: sh_mactrim (macro.c:170)
 ==17286==  Address 0x100000007 is not stack'd, malloc'd or (recently) free'd
 ==17286== 
 ==17286== ---- Attach to debugger ? --- [Return/N/n/Y/y/C/c] ---- 

It seems to me that this always happens within a subshell.
I've set all limits to unlimited with the help of ulimit, but
the wrong address remains; the crash site varies, but in the end
it is always within vmbest.c.

Interestingly, after attaching gdb I can print out the current
values, e.g. around line 292

 287             /* find the right one to delete */
 288             l = r = &link;
 289     l = &link;
 290             if((root = vd->root) ) do
 291             {//     /**/ ASSERT(!ISBITS(size) && !ISBITS(SIZE(root)));
 292                     if(size == (s = SIZE(root)) )
 293                             break;
 294                     if(size < s)
 295                     {       if((t = LEFT(root)) )
 296                             {       if(size <= (s = SIZE(t)) )

(line 289 was added by me for clarification) and see things like

 (gdb) print l
 $1 = (Block_t *) 0x421da70
 (gdb) print r
 $2 = (Block_t *) 0x7feff8770
 (gdb) print &link
 $3 = (Block_t *) 0x7feff8770
 (gdb) print root
 $4 = (Block_t *) 0xffffffff
 (gdb) print vd->root
 $5 = (Block_t *) 0x4308770

Please note that even after line 288 and the added line 289 it happens
that l != r (and l != &link) ... this is weird, isn't it?


 (gdb) print *l 
 $7 = {head = {data = {0 '\0', 208 '?', 33 '!', 4 '\004', 0 '\0', 0 '\0', 0 
'\0', 0 '\0', 96 '`', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0'}, 
     head = {seg = {seg = 0x421d000, link = 0x421d000, pf = 0x421d000, file = 
0x421d000 "??v"}, size = {size = 96, link = 0x60, line = 96}}}, body = {
     data = {0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 
112 'p', 135 '\207', 48 '0', 4 '\004', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 255 '?', 
       255 '?', 255 '?', 255 '?', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 16 '\020', 
187 '?', 79 'O', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0'}, body = {link = 0x0, 
       left = 0x4308770, right = 0xffffffff, self = 0x4fbb10}}}
 (gdb) print *r
 $8 = {head = {data = {0 '\0' <repeats 16 times>}, head = {seg = {seg = 0x0, 
link = 0x0, pf = 0x0, file = 0x0}, size = {size = 0, link = 0x0, line = 0}}}, 
   body = {data = {161 '?', 255 '?', 225 '?', 53 '5', 211 '?', 33 '!', 122 'z', 
223 '?', 161 '?', 255 '?', 143 '\217', 36 '$', 166 '?', 220 '?', 117 'u', 
       223 '?', 112 'p', 218 '?', 33 '!', 4 '\004', 0 '\0', 0 '\0', 0 '\0', 0 
'\0', 196 '?', 19 '\023', 77 'M', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0'}, 
     body = {link = 0xdf7a21d335e1ffa1, left = 0xdf75dca6248fffa1, right = 
0x421da70, self = 0x4d13c4}}}
 (gdb) print *vd->root
 $9 = {head = {data = {0 '\0', 208 '?', 33 '!', 4 '\004', 0 '\0', 0 '\0', 0 
'\0', 0 '\0', 64 '@', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0', 0 '\0'}, 
     head = {seg = {seg = 0x421d000, link = 0x421d000, pf = 0x421d000, file = 
0x421d000 "??v"}, size = {size = 64, link = 0x40, line = 64}}}, body = {
     data = {112 'p', 31 '\037', 35 '#', 4 '\004', 0 '\0', 0 '\0', 0 '\0', 0 
'\0', 80 'P', 97 'a', 45 '-', 4 '\004', 0 '\0' <repeats 20 times>}, body = {
       link = 0x4231f70, left = 0x42d6150, right = 0x0, self = 0x0}}}

In other words, a broken stack.  I also have reports that this error
happens on 32-bit Linux architectures.

Other valgrind output shows things like

 ==18087== Jump to the invalid address stated on the next line
 ==18087==    at 0xFFFFFFFF004CCBDC: ???
 ==18087==    by 0x428D56: sh_macexpand (macro.c:198)
 ==18087==    by 0x47368D: arg_expand (args.c:837)
 ==18087==    by 0x4730AD: sh_argbuild (args.c:706)
 ==18087==    by 0x44A967: sh_exec (xec.c:597)
 ==18087==    by 0x44CB1E: sh_exec (xec.c:1145)
 ==18087==    by 0x44D2B8: sh_exec (xec.c:1274)
 ==18087==    by 0x44726D: sh_subshell (subshell.c:417)
 ==18087==    by 0x42F1EA: comsubst (macro.c:1764)
 ==18087==    by 0x42C0C9: varsub (macro.c:1015)
 ==18087==    by 0x42A6DE: copyto (macro.c:556)
 ==18087==    by 0x428BB0: sh_mactrim (macro.c:170)
 ==18087==  Address 0xFFFFFFFF004CCBDC is not stack'd, malloc'd or (recently) free'd
 ==18087== 
 ==18087== Process terminating with default action of signal 11 (SIGSEGV)
 ==18087==  Bad permissions for mapped region at address 0xFFFFFFFF004CCBDC

or

 ==9765== Jump to the invalid address stated on the next line
 ==9765==    at 0xFFFFFFFF004968D0: ???
 ==9765==  Address 0xFFFFFFFF004968D0 is not stack'd, malloc'd or (recently) free'd
 ==9765== 
 ==9765== Process terminating with default action of signal 11 (SIGSEGV)
 ==9765==  Bad permissions for mapped region at address 0xFFFFFFFF004968D0
 ==9765==    at 0xFFFFFFFF004968D0: ???
 ==9765== 
 ==9765== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1)
 ==9765== 
 ==9765== 1 errors in context 1 of 1:
 ==9765== Jump to the invalid address stated on the next line
 ==9765==    at 0xFFFFFFFF004968D0: ???
 ==9765==  Address 0xFFFFFFFF004968D0 is not stack'd, malloc'd or (recently) free'd

which could indicate a mapping error.

        Werner

-- 
 Dr. Werner Fink <werner at suse.de>
 SuSE LINUX Products GmbH,  Maxfeldstrasse 5,  Nuernberg,  Germany
 GF: Markus Rex,  HRB 16746 (AG Nuernberg)
 phone: +49-911-740-53-0,  fax: +49-911-3206727,  www.opensuse.org
------------------------------------------------------------------
  "Having a smoking section in a restaurant is like having
          a peeing section in a swimming pool." -- Edward Burr
