On Thu, Nov 13, 2014 at 1:03 PM, David Wohlferd <d...@limegreensocks.com> wrote:
> Sorry for the (very) delayed response.  I'm still looking for feedback here
> so I can fix the docs.
>
> To refresh: The topic of conversation was the (extremely) wrong explanation
> that has been in the docs since forever about how to use memory constraints
> with inline asm to avoid the performance hit of a full memory clobber.
> Trying to understand how this really works has led to some surprising
> results.
>
> Me:
>>> While I really like the idea of using memory constraints to avoid all out
>>> memory clobbers, 16 bytes is a pretty small maximum memory block, and x86
>>> only supports a max of 8.  Unless there's some way to use larger sizes
>>> (say
>>> SSIZE_MAX), this feature hardly seems worth documenting.
>
> Richard:
>> I wonder how you figured out that a 12 byte clobber performs a full
>> memory clobber?
>
> Here's the code (compiled with gcc version 4.9.0 x86_64-win32-seh-rev2,
> using -m64 -O2 -fdump-final-insns):
>
> --------------------
> #include <stdio.h>
>
> #define MYSIZE 3
>
> inline void
> __stosb(unsigned char *Dest, unsigned char Data, size_t Count)
> {
>    struct _reallybigstruct { char x[MYSIZE]; }
>       *p = (struct _reallybigstruct *)Dest;
>
>    __asm__ __volatile__ ("rep stos{b|b}"
>       : "+D" (Dest), "+c" (Count), "=m" (*p)
>       : [Data] "a" (Data)
>       //: "memory"
>    );
> }
>
> int main()
> {
>    unsigned char buff[100];
>    buff[5] = 'A';
>
>    __stosb(buff, 'B', sizeof(buff));
>    printf("%c\n", buff[5]);
> }
> --------------------
>
> In summary:
>
>    1) Create a 100 byte buffer, and set buff[5] to 'A'.
>    2) Call __stosb, which uses inline asm to overwrite all of buff with 'B'.
>    3) Use a memory constraint in __stosb to flush buff.  The size of the
> memory constraint is controlled by a #define.
>
> With this, I have a simple way to test various sizes of memory constraints
> to see if the buffer gets flushed.  If it *is* flushing the buffer, printing
> buff[5] after __stosb will print 'B'.  If it is *not* flushing, it will
> print 'A'.
>
> Results:
>    - Since buff[5] is the 6th byte in the buffer, using memory constraint
> sizes of 1, 2 & 4 (not surprisingly) all print 'A', showing that no flush
> was done.
>    - Sizes of 8 and 16 print 'B', showing that the flush was done. This is
> also the expected result, since I am now flushing enough of buff to include
> buff[5].
>    - The surprise comes from using a size of 3 or 5.  These also print 'B'.
> WTF?  If 4 doesn't flush, why does 3?
>
> I believe the answer comes from reading the RTL.  The difference between
> sizes of 3 and 16 comes here:
>
>   (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
>      (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16
> A128])
>      (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
>
>    (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
>          (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3
> A128])
>          (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [
>
> While I don't actually speak RTL, TI clearly refers to TIMode. Apparently
> when using a size that exactly matches a machine mode, asm memory references
> (on i386) can flush the exact number of bytes.  But for other sizes, gcc
> seems to falls back to BLK mode, which doesn't.
>
> I don't know the exact meaning of BLK on a "set" or "asm_operands." Does it
> cause a full clobber?  Or just a complete clobber of buff? Attempting to
> answer that question leads us to the second bit of code:
>
> --------------------
> #include <stdio.h>
>
> #define MYSIZE 8
>
> inline void
> __stosb(unsigned char *Dest, unsigned char Data, size_t Count)
> {
>    struct _reallybigstruct { char x[MYSIZE]; }
>       *p = (struct _reallybigstruct *)Dest;
>
>    __asm__ __volatile__ ("rep stos{b|b}"
>       : "+D" (Dest), "+c" (Count), "=m" (*p)
>       : [Data] "a" (Data)
>       //: "memory"
>    );
> }
> int main()
> {
>    unsigned char buff[100], buff2[100];
>    buff[5] = 'A';
>    buff2[5] = 'M';
>    asm("#" : : "r" (buff2));
>
>    __stosb(buff, 'B', sizeof(buff));
>    printf("%c %c\n", buff[5], buff2[5]);
> }
> --------------------
>
> Here I've added a buff2, and I set buff2[5] to 'M' (aka ascii 77), which I
> also print.  I still perform the memory constraint against buff, then I
> check to see if it is affecting buff2.
>
> I start by compiling this with a size of 8 and look at the -S output.  If
> this is NOT doing a full clobber, gcc should be able to just print buff2[5]
> by moving 77 into the appropriate register before calling printf.  And
> indeed, that's what we see.
>
> /APP
>  # 17 "mem2.cpp" 1
>         rep stosb
>  # 0 "" 2
> /NO_APP
>         movzbl  37(%rsp), %edx
>         movl    $77, %r8d
>         leaq    .LC0(%rip), %rcx
>         call    printf
>
> If using a size of 3 *is* causing a full memory clobber, we would expect to
> see the value getting read from memory before calling printf.  And indeed,
> that's also what we see.
>
> /APP
>  # 17 "mem2.cpp" 1
>         rep stosb
>  # 0 "" 2
> /NO_APP
>         movzbl  37(%rsp), %edx
>         leaq    .LC0(%rip), %rcx
>         movzbl  149(%rsp), %r8d
>
> I don't know the internals of gcc well enough to understand exactly why this
> is happening.  But from a user's point of view, it sure looks like a memory
> clobber.
>
> As I said before, triggering a full memory clobber for anything over 16
> bytes (and most sizes under 16 bytes) makes this feature all but useless.
> So if that's really what's happening, we need to decide what to do next:
>
> 1) Can this be "fixed?"
> 2) Do we want to doc the current behavior?
> 3) Or do we just remove this section?
>
> I think it could be a nice performance win for inline asm if it could be
> made to work right, but I have no idea what might be involved in that.
> Failing that, I guess if it doesn't work and isn't going to work, I'd
> recommend removing the text for this feature.
>
> Since all 3 suggestions require a doc change, I'll just say that I'm
> prepared to start work on the doc patch as soon as someone lets me know what
> the plan is.
>
> Richard?  Hans-Peter?  Your thoughts?

Just from a very very quick look you miss:

>   (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
>      (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16
> A128])
>      (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
>
>    (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
>          (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3
> A128])
>          (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [

The memory attributes - the first one has 'S16' (size 16 bytes), the
2nd has 'S3' (size 3 bytes).  So the information is clearly there.
It might be that RTL alias analysis / CSE give up too early here
(we don't optimize across asm() on the GIMPLE level at all ... heh).

I didn't look where it gives up (even though appearantly it does).

Richard.

> Thanks,
> dw

Reply via email to