Re: [go-nuts] Re: Accessing *[]uint64 from assembly - strange memory corruption under heavy load - any ideas?

Ian Lance Taylor Fri, 22 Mar 2019 12:59:04 -0700

On Fri, Mar 22, 2019 at 10:55 AM Robert Johnstone
<r.w.johnst...@gmail.com> wrote:
>
> I don't see any memory barriers in your assembly.  If you are modifying the 
> backing array while it is being scanned by the GC, there could be some 
> interaction.  I don't know enough about the GC internals to say more than 
> that.  If you look at when memory barriers are inserted by the Go compiler, 
> it might provide more guidance.


If it's just []uint64, that shouldn't be an issue, as write barriers
are not required for uint64.

You are certainly correct if the assembly is manipulating slices that
contain pointers.

Ian


> On Friday, 22 March 2019 00:39:34 UTC-4, Tom wrote:
>>
>> I've been stuck on this for a few days so thought I would ask the brains 
>> trust.
>>
>> TL;DR: When native amd64 instructions mutate a []uint64 (updating its len 
>> and element values), I see spurious, random memory corruption, but only 
>> under heavy load (# runnable goroutines > GOMAXPROCS, all doing the same 
>> thing continuously) and only when the GC is enabled. Any debugging ideas 
>> or things I should look into?
>>
>> Background:
>>
>> I'm calling into go assembly with a few pointers to slices (*[]uint64), and 
>> that assembly is mutating them (reading/writing values, updating len within 
>> capacity). I'm experiencing random memory corruption, but I can only trigger 
>> it in the following scenarios:
>>
>> Heavy load - Doing a zillion things at once (specifically running all my 
>> test cases in parallel) and maxing out my machine.
>> Parallelism - A panic due to memory corruption happens faster if --parallel 
>> is set higher, and never if not in parallel.
>> GC - The panic never happens if the GC is disabled (of course, the test 
>> process eventually runs out of memory).
>>
>> The memory corruption varies, but usually results in an element of an 
>> unrelated slice being zeroed, the len of an unrelated slice being zeroed, or 
>> (less likely) a segfault.
>>
>> Tested on go1.11.2 and go1.12.1. I can only trigger this if I run all my 
>> test cases at once (with --count at 8000 or so & using t.Parallel()). 
>> Running things serially or individually yields the correct behaviour.
>>
>> The assembly in question looks like this:
>>
>> TEXT ·jitcall(SB),NOSPLIT|NOFRAME,$0-24
>>         GO_ARGS
>>         MOVQ asm+0(FP),     AX  // Load the address of the assembly section.
>>         MOVQ stack+8(FP),   R10 // Load the address of the 1st slice.
>>         MOVQ locals+16(FP), R11 // Load the address of the 2nd slice.
>>         MOVQ 0(AX),         AX  // Dereference pointer to native code.
>>         JMP AX                  // Jump to native code.
>>
>> And slice manipulation like this (this is a 'pop'):
>>
>>  MOVQ r13,     [r10+8]       // Load the length of the slice.
>>  DECQ r13                    // Decrement the len (I can guarantee this
>>                              // will never underflow).
>>  MOVQ r12,     [r10]         // Load the 0th element address.
>>  LEAQ r12,     [r12 + r13*8] // Compute the address of the last element.
>>  MOVQ reg,     [r12]         // Load the element to reg.
>>  MOVQ [r10+8], r13           // Write the len back.
>>
>> or 'push' like this (note: cap is always large enough for any pushes) ...
>>
>>  MOVQ r12,     [r10]          // Load the 0th element address.
>>  MOVQ r13,     [r10+8]        // Load the len.
>>  LEAQ r12,     [r12 + r13*8]  // Compute the address of the last element + 1.
>>  INCQ r13                     // Increment the len.
>>  MOVQ [r10+8], r13            // Save the len.
>>  MOVQ [r12],   reg            // Write the new element.
>>
>>
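
For anyone following along: the push and pop sequences above rely on the
amd64 slice header layout, with the data pointer at offset 0, len at
offset 8 and cap at offset 16. Below is a Go reference model of the same
two operations, a sketch that assumes this layout (an implementation
detail, not a language guarantee). sliceHeader, push and pop are names
made up for this example, and the bounds checks are additions that the
assembly does not perform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sliceHeader mirrors the in-memory layout of a slice on 64-bit targets.
// Assumption: data at offset 0, len at offset 8, cap at offset 16.
type sliceHeader struct {
	data unsafe.Pointer
	len  int
	cap  int
}

const wordSize = unsafe.Sizeof(uint64(0)) // 8 bytes per element

// pop follows the same steps as the assembly: load len, decrement it,
// load the element at the new len, then write the len back.
func pop(s *[]uint64) uint64 {
	h := (*sliceHeader)(unsafe.Pointer(s))
	if h.len == 0 {
		panic("pop: empty slice") // guard not present in the assembly
	}
	n := h.len - 1
	v := *(*uint64)(unsafe.Pointer(uintptr(h.data) + uintptr(n)*wordSize))
	h.len = n
	return v
}

// push writes the element at index len and increments len. The caller
// must ensure spare capacity, as in the original description.
func push(s *[]uint64, v uint64) {
	h := (*sliceHeader)(unsafe.Pointer(s))
	if h.len >= h.cap {
		panic("push: capacity exceeded") // guard not present in the assembly
	}
	*(*uint64)(unsafe.Pointer(uintptr(h.data) + uintptr(h.len)*wordSize)) = v
	h.len++
}

func main() {
	var h sliceHeader
	fmt.Println("len offset:", unsafe.Offsetof(h.len), "cap offset:", unsafe.Offsetof(h.cap))

	s := make([]uint64, 0, 4)
	push(&s, 7)
	push(&s, 9)
	fmt.Println(len(s), s)
	fmt.Println(pop(&s), len(s))
}
```

Running the same sequence of operations through this model and through
the native code, then comparing the resulting slices, is one way to rule
out a plain logic error in the generated instructions.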
>> I acknowledge that calling into code like this is unsupported, but I 
>> struggle to understand how such corruption can happen, and having stared at 
>> it for a few days, I am frankly stumped. I mean, even if non-cooperative 
>> preemption were in these versions of Go, I would expect the GC to abort when 
>> it can't find the stack maps for my RIP value. With no GC safe points in my 
>> native assembly, I don't see how the GC could interfere (yet the issue 
>> disappears with the GC off??).
>>
>> Questions:
>>
>> Any ideas what I'm doing wrong?
>> Any ideas how I can trace this from the application side and also the 
>> runtime side? I've tried schedtrace and the like, but the output didn't 
>> appear useful or correlated to the crashes.
>> Any suggestions for assumptions I might have missed and should write tests / 
>> guards for?
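
On the question of guards, one possible approach (a sketch, not something
from the original code) is to snapshot each slice's backing array address
and capacity before the native call and verify them afterwards.
guardedCall, check and base are hypothetical names, and the call
parameter stands in for jitcall. This only catches damage to the two
headers passed in, not to unrelated memory.

```go
package main

import "fmt"

// base returns the address of the first slot of the backing array,
// or nil when the slice has no capacity.
func base(s []uint64) *uint64 {
	if cap(s) == 0 {
		return nil
	}
	return &s[:cap(s)][0]
}

// check verifies that a slice header is still consistent with the
// snapshot taken before the call.
func check(name string, s []uint64, wantBase *uint64, wantCap int) error {
	switch {
	case cap(s) != wantCap:
		return fmt.Errorf("%s: cap changed from %d to %d", name, wantCap, cap(s))
	case len(s) < 0 || len(s) > cap(s):
		return fmt.Errorf("%s: len %d outside [0, %d]", name, len(s), cap(s))
	case base(s) != wantBase:
		return fmt.Errorf("%s: backing array moved", name)
	}
	return nil
}

// guardedCall runs call (a stand-in for the native entry point) and
// reports an error if either slice header violates its invariants.
func guardedCall(stack, locals *[]uint64, call func(stack, locals *[]uint64)) error {
	sb, sc := base(*stack), cap(*stack)
	lb, lc := base(*locals), cap(*locals)

	call(stack, locals)

	if err := check("stack", *stack, sb, sc); err != nil {
		return err
	}
	return check("locals", *locals, lb, lc)
}

func main() {
	stack := make([]uint64, 0, 4)
	locals := make([]uint64, 0, 4)

	// A well-behaved call: pushes within capacity.
	err := guardedCall(&stack, &locals, func(s, l *[]uint64) {
		*s = append(*s, 1)
	})
	fmt.Println("well-behaved:", err)

	// A misbehaving call: swaps in a different backing array.
	err = guardedCall(&stack, &locals, func(s, l *[]uint64) {
		*s = make([]uint64, 0, 4)
	})
	fmt.Println("misbehaving:", err)
}
```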
>>
>> Thanks,
>> Tom
>
> --
> You received this message because you are subscribed to the Google Groups 
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to golang-nuts+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
