Re: current_thread_info() not respecting program order with gcc 4.8.x

Luis Lozano Fri, 22 Nov 2013 00:34:15 -0800

On Fri, Nov 22, 2013 at 12:18 AM, Luis Lozano <[email protected]> wrote:
> On Thu, Nov 21, 2013 at 7:38 PM, Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> ----- Original Message -----
>> > From: "Luis Lozano" <[email protected]>
>> > To: "Mathieu Desnoyers" <[email protected]>
>> > Cc: "Jakub Jelinek" <[email protected]>, "Alexander Holler" 
>> > <[email protected]>, "Linus Torvalds"
>> > <[email protected]>, "Richard Henderson" <[email protected]>, 
>> > "Linux Kernel Mailing List"
>> > <[email protected]>, "Will Deacon" <[email protected]>, 
>> > "Catalin Marinas" <[email protected]>,
>> > "Peter Zijlstra" <[email protected]>, [email protected], 
>> > "Nathan Lynch" <[email protected]>, "Paul
>> > E. McKenney" <[email protected]>, "Andrew Morton" 
>> > <[email protected]>, "Bhaskar Janakiraman"
>> > <[email protected]>, "Han Shen" <[email protected]>
>> > Sent: Thursday, November 21, 2013 9:36:27 PM
>> > Subject: Re: current_thread_info() not respecting program order with gcc 
>> > 4.8.x
>> >
>> > Hi Mathieu,
>> >
>> > Yes the problem we were seeing with GCC bug 58854 is that an interrupt
>> > handler was corrupting the stack of a routine which had an invalid
>> > value of SP (not really on "top" of the stack) for part of the
>> > routine.
>> > But in the assembly code you sent, I don't see where sp is being
>> > modified... or where the access to "below" sp is happening.
>>
>> The following instruction
>>
>>   f1c:      e50b1048        str     r1, [fp, #-72]  ; 0xffffffb8
>>
>> appears in the assembly generated by gcc 4.8.2, but not in the one
>> generated by 4.7.3,
>
>
>
> It is there in 4.7.3 just a few lines difference.
> I don't see any meaningful difference between the 2 assembly snippets...
> Just the instructions ordered a little different?
>


unless there is a dependency between the following 2 instructions?
-     f1c:      e50b1048        str     r1, [fp, #-72]  ; 0xffffffb8
...
      f20:      e5936030        ldr     r6, [r3, #48]   ; 0x30

and moving the str across the ldr makes a difference...


> Luis
>
>>
>> which makes me wonder if it's good or not. With
>> slightly different build output from the diff (probably a different config
>> from Nathan), the first function instructions look like:
>>
>> 00000efc <lttng_event_reserve>:
>>      efc:       e1a0c00d        mov     ip, sp
>>      f00:       e92ddff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, ip, 
>> lr, pc}
>>      f04:       e24cb004        sub     fp, ip, #4
>>      f08:       e24dd03c        sub     sp, sp, #60     ; 0x3c
>>      f0c:       e52de004        push    {lr}            ; (str lr, [sp, 
>> #-4]!)
>>      f10:       ebfffffe        bl      0 <__gnu_mcount_nc>
>>      f14:       e5903000        ldr     r3, [r0]
>>      f18:       e1a04000        mov     r4, r0
>>      f1c:       e1a0000d        mov     r0, sp
>>      f20:       e5936030        ldr     r6, [r3, #48]   ; 0x30
>>      f24:       e3c03d7f        bic     r3, r0, #8128   ; 0x1fc0
>>      f28:       e3c3303f        bic     r3, r3, #63     ; 0x3f
>>      f2c:       e50b104c        str     r1, [fp, #-76]  ; 0xffffffb4
>>      f30:       e5932004        ldr     r2, [r3, #4]
>>      f34:       e2822001        add     r2, r2, #1
>>      f38:       e5832004        str     r2, [r3, #4]
>>      f3c:       ebfffffe        bl      0 <debug_smp_processor_id>
>>      f40:       e59f2e44        ldr     r2, [pc, #3652] ; 1d8c 
>> <lttng_event_reserve+0xe90>
>>      f44:       e59f3e44        ldr     r3, [pc, #3652] ; 1d90 
>> <lttng_event_reserve+0xe94>
>>      f48:       e7921100        ldr     r1, [r2, r0, lsl #2]
>>      f4c:       e1a05000        mov     r5, r0
>>      f50:       e7932001        ldr     r2, [r3, r1]
>>
>> My rusty ARM assembler knowledge analyzes this like:
>>
>> ip = sp            (ip being the scratch register)
>> push 11 registers * 4 bytes = 44 bytes onto the stack
>> fp = ip - 4
>> sp = sp - 60   (sp = ip - 104)
>> push {lr} bl
>> 0 <__gnu_mcount_nc>  (dynamically patched to a pop {lr})
>> [...]
>> store r1 into mem location fp - 76
>>
>> So yes, it does look like the -76 from fp is within the stack.
>>
>>
>> > In GCC bug 58854 it was pretty clear that the restore of the sp
>> > register in the epilogue is moved to somewhere close to the prologue
>> > of the routine.
>> > Are we missing some diffs from the assembly comparison?
>>
>> My next step is to setup my own 4.8.2 ARM cross-compiler. I've tried this 
>> morning,
>> did not manage to get one working even after following 2 howtos, trying with 
>> Debian
>> packages, etc. Nathan told me this diff was the full comparison between the 
>> two
>> functions, but you'll understand that at this point I want to reproduce
>> everything myself, because this is a _weird_ issue.
>>
>> But after a long day of debugging, it's time for some sleep. I will get back 
>> to this
>> tomorrow,
>>
>> Thanks,
>>
>> Mathieu
>>
>> >
>> > Thanks
>> >
>> > Luis
>> >
>> >
>> >
>> >
>> > On Thu, Nov 21, 2013 at 5:57 PM, Mathieu Desnoyers
>> > <[email protected]> wrote:
>> > > ----- Original Message -----
>> > >> From: "Jakub Jelinek" <[email protected]>
>> > >> To: "Luis Lozano" <[email protected]>
>> > >> Cc: "Alexander Holler" <[email protected]>, "Linus Torvalds"
>> > >> <[email protected]>, "Mathieu Desnoyers"
>> > >> <[email protected]>, "Richard Henderson" 
>> > >> <[email protected]>,
>> > >> "Linux Kernel Mailing List"
>> > >> <[email protected]>, "Will Deacon" <[email protected]>,
>> > >> "Catalin Marinas" <[email protected]>,
>> > >> "Peter Zijlstra" <[email protected]>, [email protected],
>> > >> "Nathan Lynch" <[email protected]>, "Paul
>> > >> E. McKenney" <[email protected]>, "Andrew Morton"
>> > >> <[email protected]>, "Bhaskar Janakiraman"
>> > >> <[email protected]>, "Han Shen" <[email protected]>
>> > >> Sent: Thursday, November 21, 2013 7:39:04 PM
>> > >> Subject: Re: current_thread_info() not respecting program order with gcc
>> > >> 4.8.x
>> > >>
>> > >> On Thu, Nov 21, 2013 at 03:45:35PM -0800, Luis Lozano wrote:
>> > >> > I think we need a reproducer. Without this we may all be going on the
>> > >> > wrong path. This whole conversation started on an *assumption* that
>> > >> > some accesses were being reordered.
>> > >> >
>> > >> > evidence of the reorder or reproducer please?
>> > >>
>> > >> Yeah, if a compiler bug is suspected, can anybody please open
>> > >> a bugreport in http://gcc.gnu.org/bugzilla/ with the preprocessed 
>> > >> source,
>> > >> compiler version, flags and how it was configured and some hint in which
>> > >> function to look for what exactly?  We don't necessarily need a runtime
>> > >> small reproducer, but if it can be shown in the assembly what has been
>> > >> reordered and why you think it shouldn't, with the above mentioned input
>> > >> that ought to be sufficient.  Thanks.
>> > >
>> > > OK OK, let me reply on list first so I can share the result of a full day
>> > > of bug hunting. We're not there yet, but many options have been 
>> > > eliminated.
>> > >
>> > > The issue shows up in stress test, when tracing with lttng-modules 
>> > > 2.4-rc1,
>> > > on ARM. It's been reproduced with a Linux kernel 3.12 so far, with
>> > > lttng-modules
>> > > compiled against that kernel.
>> > >
>> > > First, I asked Nathan to compile his kernel with gcc 4.7, and 
>> > > lttng-modules
>> > > with gcc 4.8.x (and vice-versa). The problem only appears when
>> > > lttng-modules
>> > > are compiled with gcc 4.8.x. The compiler version used to compile the 
>> > > rest
>> > > of the kernel does not matter.
>> > >
>> > > Then I looked at gcc 4.8 changelog for ARM, new feature:
>> > > -fno-sched-pressure
>> > > (sched pressure is there by default). Nathan tried compiling 
>> > > lttng-modules
>> > > with
>> > > -fno-sched-pressure. The problem still reproduces.
>> > >
>> > > Knowing that adding barrier() outside of preempt_disable()/enable() was
>> > > fixing the issue, we tried identifying which code location was 
>> > > responsible
>> > > for working around the issue. Skipping a long investigation, here is the
>> > > executive summary:
>> > >
>> > > http://git.lttng.org/?p=lttng-modules.git;a=blob;f=lttng-ring-buffer-client.h;h=50c47b3bf49f6c2dd24e250cf1a9b97808cd8e27;hb=HEAD
>> > >
>> > > Has the following function. We identified that adding a barrier() as 
>> > > shown
>> > > below
>> > > works around the issue:
>> > >
>> > > static
>> > > int lttng_event_reserve(struct lib_ring_buffer_ctx *ctx,
>> > >                       uint32_t event_id)
>> > > {
>> > >         struct lttng_channel *lttng_chan = 
>> > > channel_get_private(ctx->chan);
>> > >         int ret, cpu;
>> > >
>> > >         cpu = lib_ring_buffer_get_cpu(&client_config);
>> > >         if (cpu < 0)
>> > >                 return -EPERM;
>> > >         ctx->cpu = cpu;
>> > >
>> > >         switch (lttng_chan->header_type) {
>> > >         case 1: /* compact */
>> > >                 if (event_id > 30)
>> > >                         ctx->rflags |= LTTNG_RFLAG_EXTENDED;
>> > >                 break;
>> > >         case 2: /* large */
>> > >                 if (event_id > 65534)
>> > >                         ctx->rflags |= LTTNG_RFLAG_EXTENDED;
>> > >                 break;
>> > >         default:
>> > >                 WARN_ON_ONCE(1);
>> > >         }
>> > >
>> > >         ret = lib_ring_buffer_reserve(&client_config, ctx);
>> > >         if (ret)
>> > >                 goto put;
>> > >         lttng_write_event_header(&client_config, ctx, event_id);
>> > >         return 0;
>> > > put:
>> > >         lib_ring_buffer_put_cpu(&client_config);
>> > >         ---------------> barrier() added here;
>> > >         <----------------------------
>> > >         return ret;
>> > > }
>> > >
>> > > Where barrier() is the usual asm volatile with "memory" clobber, nothing
>> > > else.
>> > >
>> > > Nathan gave me the binary diff for the assembly generated for this 
>> > > function
>> > > without
>> > > the barrier and with the barrier:
>> > >
>> > > --- /tmp/lttng_event_reserve-4.8.2.dump 2013-11-21 11:14:14.536495079 
>> > > -0600
>> > > +++ /tmp/lttng_event_reserve-with-barrier-4.8.2.dump    2013-11-21
>> > > 14:12:52.997355907 -0600
>> > > @@ -7,11 +7,11 @@
>> > >       f10:      ebfffffe        bl      0 <__gnu_mcount_nc>
>> > >       f14:      e5903000        ldr     r3, [r0]
>> > >       f18:      e1a04000        mov     r4, r0
>> > > -     f1c:      e50b1048        str     r1, [fp, #-72]  ; 0xffffffb8
>> > > +     f1c:      e1a0000d        mov     r0, sp
>> > >       f20:      e5936030        ldr     r6, [r3, #48]   ; 0x30
>> > > -     f24:      e1a0000d        mov     r0, sp
>> > > -     f28:      e3c03d7f        bic     r3, r0, #8128   ; 0x1fc0
>> > > -     f2c:      e3c3303f        bic     r3, r3, #63     ; 0x3f
>> > > +     f24:      e3c03d7f        bic     r3, r0, #8128   ; 0x1fc0
>> > > +     f28:      e3c3303f        bic     r3, r3, #63     ; 0x3f
>> > > +     f2c:      e50b1048        str     r1, [fp, #-72]  ; 0xffffffb8
>> > >       f30:      e5932004        ldr     r2, [r3, #4]
>> > >       f34:      e2822001        add     r2, r2, #1
>> > >       f38:      e5832004        str     r2, [r3, #4]
>> > > @@ -53,7 +53,7 @@
>> > >       fc8:      e3c3303f        bic     r3, r3, #63     ; 0x3f
>> > >       fcc:      e5933000        ldr     r3, [r3]
>> > >       fd0:      e3130002        tst     r3, #2
>> > > -     fd4:      0a000000        beq     fdc <lttng_event_reserve+0xe0>
>> > > +     fd4:      0a0002be        beq     1ad4 <lttng_event_reserve+0xbd8>
>> > >       fd8:      ebfffffe        bl      0 <preempt_schedule>
>> > >       fdc:      ea0002bc        b       1ad4 <lttng_event_reserve+0xbd8>
>> > >       fe0:      e3500000        cmp     r0, #0
>> > >
>> > > We tried disabling the ftrace function tracing to get mcount out of the
>> > > way,
>> > > and the problem still reproduces.
>> > >
>> > > I'm stopping here in terms of details about the disassembly, because I
>> > > need to double check with Nathan that I get the right disassembly for the
>> > > right
>> > > cases. I also terribly need to setup a 4.8.2 ARM cross-compiler on my
>> > > machine.
>> > >
>> > > I'm attaching Nathan's ARM configuration.
>> > >
>> > > It does look behave a bit like this bug pointed out by Luis:
>> > >
>> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
>> > >
>> > > AFAIU (please correct me if I am wrong), ARM's interrupt handler run
>> > > on top of the thread stack (?). If it's the case, then anything stored
>> > > on the stack below "sp" could be overwritten by an interrupt handler.
>> > > This would fit well the reproduction scenario for this bug: Nathan runs
>> > > LTTng tracing of kmem_cache_free tracepoint with hackbench running.
>> > > A race between a short window of stack use below sp and interrupt 
>> > > handlers
>> > > would trigger with this kind of stress-test.
>> > >
>> > > Thoughts ?
>> > >
>> > > Thanks,
>> > >
>> > > Mathieu
>> > >
>> > >
>> > > --
>> > > Mathieu Desnoyers
>> > > EfficiOS Inc.
>> > > http://www.efficios.com
>> >
>> >
>> >
>> > --
>> >
>> > Luis A. Lozano | Software Engineer | [email protected] | +1 (408)431-5164
>> >
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com



-- 

Luis A. Lozano | Software Engineer | [email protected] | +1 (408)431-5164
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: current_thread_info() not respecting program order with gcc 4.8.x

Reply via email to