I'm not making any function calls in the assembly, just writing to memory
addresses that represent the elements / len of the slice. I've also tried
using LockOSThread() to see if that made any difference; alas, it does not.
On Friday, March 22, 2019 at 4:59:30 AM UTC-7, Robert Engels wrote:
>
> Are you making any calls modifying the len that would allow GC to occur,
> or change stack size? You might need to pin the goroutine so that the
> operation you are performing is “atomic” with respect to those.
>
> This also sounds very scary if the Go runtime ever had a compacting
> collector.
>
> On Mar 22, 2019, at 12:27 AM, Tom wrote:
>
> The allocation is in go, and assembly never modifies the size of the
> backing array. Assembly only ever modifies len, which is the len of the
> slice and not the backing array.
>
> On Thursday, 21 March 2019 22:18:29 UTC-7, Tamás Gulácsi wrote:
>>
>> On Friday, March 22, 2019 at 6:06:06 UTC+1, Tom wrote:
>>>
>>> Still errors I'm afraid :/
>>>
>>> On Thursday, 21 March 2019 21:54:59 UTC-7, Ian Lance Taylor wrote:
On Thu, Mar 21, 2019 at 9:39 PM Tom wrote:
>
> I've been stuck on this for a few days so thought I would ask the
brains trust.
>
> TL;DR: When I have native amd64 instructions mutating a slice (updating the
len + values of a []uint64), I experience spurious & random memory
corruption under heavy load (# runnable goroutines > GOMAXPROCS, doing
the same thing continuously), and only when the GC is enabled. Any
debugging ideas or things I should look into?
>
> Background:
>
> I'm calling into Go assembly with a few pointers to slices
(*[]uint64), and that assembly is mutating them (reading/writing values,
updating len within capacity). I'm experiencing random memory corruption,
but I can only trigger it in the following scenarios:
>
> Heavy load - Doing a zillion things at once (specifically running all
my test cases in parallel) and maxing out my machine.
> Parallelism - A panic due to memory corruption happens faster if
--parallel is set higher, and never if not in parallel.
> GC - The panic never happens if the GC is disabled (of course, the
test process eventually runs out of memory).
>
> The memory corruption varies, but usually results in an element of an
unrelated slice being zeroed, the len of an unrelated slice being zeroed,
or (less likely) a segfault.
>
> Tested on go1.11.2 and go1.12.1. I can only trigger this if I run all
my test cases at once (with --count at 8000 or so & using t.Parallel()).
Running things serially or individually yields the correct behaviour.
>
> The assembly in question looks like this:
>
> TEXT ·jitcall(SB),NOSPLIT|NOFRAME,$0-24
> GO_ARGS
> MOVQ asm+0(FP), AX // Load the address of the assembly section.
> MOVQ stack+8(FP), R10 // Load the address of the 1st slice.
> MOVQ locals+16(FP), R11 // Load the address of the 2nd slice.
> MOVQ 0(AX), AX // Dereference pointer to native code.
> JMP AX // Jump to native code.
>
> And slice manipulation like this (this is a 'pop'):
>
> MOVQ r13, [r10+8] // Load the length of the slice.
> DECQ r13 // Decrement the len (I can guarantee this will never underflow).
> MOVQ r12, [r10] // Load the 0th element address.
> LEAQ r12, [r12 + r13*8] // Compute the address of the last element.
> MOVQ reg, [r12] // Load the element to reg.
> MOVQ [r10+8], r13 // Write the len back.
>
> or 'push' like this (note: cap is always large enough for any pushes)
...
>
> MOVQ r12, [r10] // Load the 0th element address.
> MOVQ r13, [r10+8] // Load the len.
> LEAQ r12, [r12 + r13*8] // Compute the address of the last element + 1.
> INCQ r13 // Increment the len.
> MOVQ [r10+8], r13 // Save the len.
> MOVQ [r12], reg // Write the new element.
>
>
> I acknowledge that calling into code like this is unsupported, but I
struggle to understand how such corruption can happen, and having stared
at it for a few days, I am frankly stumped. I mean, even if non-cooperative
preemption were in these versions of Go, I would expect the GC to abort
when it can't find the stack maps for my RIP value. With no GC safe points
in my native assembly, I don't see how the GC could interfere (yet the issue
disappears with the GC off??).
>
> Questions:
>
> Any ideas what I'm doing wrong?
> Any ideas how I can