[NOTE: this was sent to Simon Marlow and Simon Peyton Jones, and erroneously to [EMAIL PROTECTED]]

Great job! I had worked on this problem for some time, but I did not have enough experience with the source code, or with tracing back from assembler through Stg, to find the exact problem. Would you have time to answer a few questions?

(1) How do I obtain the latest 6.4.3 release? It is no longer on CVS; the darcs branch, http://darcs.haskell.org/ghc.ghc-6.4, seems to have been last updated in January; and I had been working on the ghc-6.4.2.tar.bz2 snapshot from http://www.haskell.org/ghc/dist/6.4.2/ghc-6.4.2-src.tar.bz2. Is 6.4.3 a darcs tag?

(2) To get a debug build of ghc-6.4.2 working, I had to modify the file ghc/compiler/nativeGen/RegisterAlloc.hs by adding a deriving clause:

ghc/compiler/nativeGen/RegisterAlloc.hs:158
> data FreeRegs = FreeRegs !Word32 !Word32
+                               deriving (Show)

This fix was in the 6.6 branch.  Is it also now in the 6.4.3 branch?

(3) I cheated and modified the ghc script that invokes the executable ..lib/ghc-6.4.2/ghc-6.4.2 by inserting a gdb invocation into the exec statement. (I was compiling Crypto with the original Cabal setup and didn't want to resort to makefiles.)

 # Mini-driver for GHC
 exec gdb --args $GHCBIN  $TOPDIROPT ${1+"$@"}

Is there a better way to go about this?

(4) Would you please elaborate on the problem and the fix? The problems consistently showed up in ghc/rts/GC.c:threadSqueezeStack, in the variable frame (note: comments *follow* code):

(gdb) disas threadSqueezeStack
...
0x00c24678 <threadSqueezeStack+16>:     mr      r24,r3            ; tso, tso
0x00c2467c <threadSqueezeStack+20>:     addi    r0,r3,56          ; r0 = tso->stack[0]
0x00c24680 <threadSqueezeStack+24>:     lwz     r2,44(r3)         ; <variable>.stack_size
0x00c24684 <threadSqueezeStack+28>:     rlwinm  r2,r2,2,0,29      ; stack_size
0x00c24688 <threadSqueezeStack+32>:     add     r25,r0,r2         ; bottom, tso->stack[0], stack_size
0x00c2468c <threadSqueezeStack+36>:     lwz     r31,52(r3)        ; <variable>.sp (tso->sp)
0x00c24690 <threadSqueezeStack+40>:     cmplw   cr7,r25,r31       ; assert(frame < stack)
0x00c24694 <threadSqueezeStack+44>:     bgt+    cr7,0xc246a8      ; <threadSqueezeStack+64>
0x00c24698 <threadSqueezeStack+48>:     lis     r3,211
0x00c2469c <threadSqueezeStack+52>:     addi    r3,r3,304
0x00c246a0 <threadSqueezeStack+56>:     li      r4,4356
0x00c246a4 <threadSqueezeStack+60>:     bl      0xcb9758          ; <_assertFail>
0x00c246a8 <threadSqueezeStack+64>:     addi    r29,r31,-8        ; gap, <variable>.sp,
0x00c246ac <threadSqueezeStack+68>:     li      r23,0             ; updatee,
0x00c246b0 <threadSqueezeStack+72>:     li      r27,0             ; prev_was_update_frame,
0x00c246b4 <threadSqueezeStack+76>:     li      r28,0             ; current_gap_size,
0x00c246b8 <threadSqueezeStack+80>:     mr      r11,r31
0x00c246bc <threadSqueezeStack+84>:     lwz     r2,0(r31)         ; <variable>.header.info, D.xxxx
0x00c246c0 <threadSqueezeStack+88>:     addi    r9,r2,-12         ; info, <variable>.header.info
0x00c246c4 <threadSqueezeStack+92>:     lhz     r2,8(r9)
; crash point: <variable>.i.type, r9=0xfffffffc, sometimes r9=0x70000100
; 8(r9) overflows to 0x00000004
; NOTE: 8 in 8(r9) derived from:
;       sizeof((StgInt)srt_offset) + sizeof((StgClosureInfo)layout)

Sorry if my comments seem pedantic--I only started really learning assembler in August, and I partly used gcc -S -fverbose-asm to help out. Over many runs of the build under Crypto, the assert (frame < stack) failed a few times, so I was also following the registers (r31, r2 and r9) in the functions used in other threads. (The problem was in r9.) After that it was a matter of tracing back.
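
As a sanity check on the arithmetic in the listing above, here is a minimal C sketch, assuming a 32-bit PowerPC target where addresses wrap modulo 2^32. The function names are made up for illustration and are not part of the RTS; the constants 8, -12 and 56 are taken from the disassembly.

```c
#include <stdint.h>

/* Effective address of a D-form load such as `lhz r2,8(r9)` on a 32-bit
 * target: base register plus signed 16-bit displacement, modulo 2^32. */
uint32_t effective_address(uint32_t base, int16_t displacement)
{
    /* Unsigned arithmetic wraps, which models 32-bit address overflow. */
    return base + (uint32_t)(int32_t)displacement;
}

/* `rlwinm rA,rS,2,0,29` rotates left by 2 and keeps bits 0..29, which is
 * the same as a logical shift left by 2: a word count becomes a byte count. */
uint32_t words_to_bytes(uint32_t words)
{
    return words << 2;
}

/* Mirrors the `addi r0,r3,56` / `add r25,r0,r2` pair: the address just past
 * the stack, assuming the stack array starts 56 bytes into the TSO. */
uint32_t stack_bottom(uint32_t tso, uint32_t stack_size_words)
{
    return tso + 56u + words_to_bytes(stack_size_words);
}
```

With these, effective_address(0xfffffffc, 8) gives 0x00000004, matching the wrapped address at the crash point, and effective_address(0x00000008, -12) gives 0xfffffffc, so r9=0xfffffffc would correspond to a header.info value of 8.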

(5) One other avenue I was exploring was the use of Zero Length Arrays (ZLA's) and potential gcc bugs (a few of this sort have been noticed in gcc-3.3 through 4.0). Why do you use ZLA's in the code? The reasons not to are:

a. ZLA's are largely supported by GNU extensions. As noted in the GCC manual, Section 5.12, at http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html: "A structure containing a flexible array member, or a union containing such a structure (possibly recursively), may not be a member of a structure or an element of an array. (However, these uses are permitted by GCC as extensions.)" You are therefore forced to include structures containing ZLA's as pointers-to-structures, for example:

# 280 "ghc/includes/InfoTables.h"
typedef struct _StgInfoTable {
    StgClosureInfo layout;
    StgHalfWord type;
    StgHalfWord srt_bitmap;
    StgCode code[];
} StgInfoTable;
# 44 "ghc/includes/Closures.h"
typedef struct {
 const struct _StgInfoTable* info;
} StgHeader;


b. The C sizeof() operator does not count a trailing ZLA when reporting the size of a structure, so sizeof(StgInfoTable) reports 8, not 12, although the gcc compiler correctly produces the assembler for manipulating such a structure:

0x00c246c0 <threadSqueezeStack+88>:     addi    r9,r2,-12 
        ; (((StgRetInfoTable *)(((StgClosure *)frame)->header.info) - 1))
        ; the -12 is the size of StgRetInfoTable

So macros such as ghc/includes/TSO.h:TSO_STRUCT_SIZE can't simply be defined as


#define TSO_STRUCT_SIZE sizeof(StgTSO)

and gdb has trouble accessing the members of the structure:

(gdb) p *tso
$30 = {
  header = {
    info = 0xaf5ff0
  },
  link = 0xded718,
  mut_link = 0xded71c,
  global_link = 0x2bce000,
  what_next = 1,
  why_blocked = 11,
  block_info = {
    closure = 0x0,
    tso = 0x0,
    fd = 0,
    target = 0
  },
  blocked_exceptions = 0xded718,
  id = 2,
  saved_errno = 25,
  main = 0x0,
  trec = 0xded714,
  stack_size = 242,
  max_stack_size = 2080754,
  sp = 0x2bddc78
}
// note the lack of member tso.stack
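
On point (b), here is a minimal C sketch of how sizeof and offsetof treat a trailing array, assuming a C99 compiler. In C99 a flexible array member adds nothing to sizeof beyond possible padding, so the fixed header and the trailing storage have to be added up separately. DemoInfoTable and its fixed-width fields are hypothetical stand-ins that mimic a 32-bit layout; they are not GHC's real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an info table that ends in a flexible array
 * member. Field widths are assumptions chosen to mimic a 32-bit layout. */
typedef struct {
    uint32_t layout;      /* stand-in for StgClosureInfo */
    uint16_t type;        /* stand-in for StgHalfWord */
    uint16_t srt_bitmap;  /* stand-in for StgHalfWord */
    uint8_t  code[];      /* flexible array member: contributes no size */
} DemoInfoTable;

/* Size of the fixed part only, which is what sizeof reports. */
size_t demo_header_size(void)
{
    return sizeof(DemoInfoTable);
}

/* Bytes needed for the fixed part plus n trailing code bytes. */
size_t demo_total_size(size_t n)
{
    return offsetof(DemoInfoTable, code) + n * sizeof(uint8_t);
}
```

On common ABIs demo_header_size() is 8 and demo_total_size(4) is 12, so a size macro for such a structure has to add the trailing storage explicitly rather than rely on sizeof alone.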

Finally, if there are alignment issues, wouldn't that be better controlled explicitly through pragmas?

Please don't think I am being critical here: I just don't know enough to understand your reasons.

-Pete

on 2006/10/16 06:50:02 PDT Simon Marlow wrote:


 Modified files:        (Branch: ghc-6-4-branch)
    ghc/rts              Capability.c
  Log:
  Fix crash in the threaded RTS caused by spurious wakeups of
  pthread_cond_wait(). This is certainly affecting the threaded RTS in
  6.4.x on Solaris, and possibly other platforms too.  I'm currently
  testing to see whether there are any further problems on Solaris, but
  with luck this may be the final fix for the threaded RTS problems in
  the 6.4.x branch.

  Does not affect 6.6; the corresponding code in 6.6 is already
  spurious-wakeup-safe.

  Revision  Changes    Path
  1.31.6.2  +32 -7     fptools/ghc/rts/Capability.c
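
For readers unfamiliar with the failure mode named in the log: POSIX allows pthread_cond_wait() to return without a matching signal, so the usual defence is to re-check the awaited condition in a loop. The following is a minimal sketch of that idiom only. Gate and its functions are hypothetical names, and this is not the actual change made to Capability.c, which is not shown in this message.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot gate; nothing here is taken from the RTS sources. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;   /* the predicate, protected by `lock` */
} Gate;

void gate_init(Gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->ready = false;
}

/* Block until the gate has been opened. */
void gate_wait(Gate *g)
{
    pthread_mutex_lock(&g->lock);
    /* `while`, not `if`: a wakeup alone proves nothing, so the predicate
     * is re-checked every time pthread_cond_wait() returns. */
    while (!g->ready) {
        pthread_cond_wait(&g->cond, &g->lock);
    }
    pthread_mutex_unlock(&g->lock);
}

/* Open the gate and wake every waiter. */
void gate_open(Gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->ready = true;
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->lock);
}
```

A waiter written with `if (!g->ready)` instead of the loop could proceed after a spurious wakeup while ready is still false, which is the general class of bug the log describes.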

_______________________________________________
Cvs-ghc mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/cvs-ghc
