On Tue, Jun 20, 2017 at 12:18 PM, Richard Biener
<[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:03 AM, Uros Bizjak <[email protected]> wrote:
>> On Mon, Jun 19, 2017 at 7:51 PM, Jakub Jelinek <[email protected]> wrote:
>>> On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
>>>> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
>>>> >
>>>> > Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe the
>>>> > stack; while it is shorter, is it actually faster, or as slow as
>>>> > movq $0, (%rsp) or movl $0, (%esp)?
>>>> Florian raised this privately to me as well.  There are a couple of issues.
>>>>
>>>> 1. Is there a performance penalty/gain for sub-word operations? If not,
>>>> we can improve things slightly there.  Even if it's performance
>>>> neutral we can probably do better on code size.
>>>
>>> CCing Uros and Honza here.  I believe that, at least on x86, there are
>>> penalties for 2-byte operand sizes, maybe for 1-byte ones, and sometimes
>>> stalls when you write or read in a different size than a recent write or read.
>>
>> Don't use orq $0, (%rsp), as this is a high latency RMW insn.
>
> Well, but _maybe_ it's optimized, because ORing with 0 never changes anything?
> At least it would be nice if it only triggered the page-fault side effect
> and did not consume other CPU resources.
It doesn't look like it:
--cut here--
/* Read-modify-write: OR an immediate zero into the memory slot.  */
void
__attribute__ ((noinline))
test_or (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("orl $0, %0" : "+m" (a));
}

/* Plain one-byte store of zero into the slot (sub-word access).  */
void
__attribute__ ((noinline))
test_movb (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("movb $0, %0" : "+m" (a));
}

/* Plain four-byte store of zero into the slot.  */
void
__attribute__ ((noinline))
test_movl (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("movl $0, %0" : "+m" (a));
}

int
main (void)
{
  test_or ();
  test_movb ();
  test_movl ();
  return 0;
}
--cut here--
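
The profile below is perf report output; something like
"gcc -O2 test.c && perf record ./a.out && perf report" should reproduce it
(the exact compiler flags are an assumption, they were not recorded here).
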
74,99% a.out a.out [.] test_or
12,50% a.out a.out [.] test_movb
12,50% a.out a.out [.] test_movl
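
For reference, a minimal sketch of the two probe flavors being compared,
assuming a 4 KiB probe step (illustration only, not the exact sequence GCC
emits):

--cut here--
	subq	$4096, %rsp	# move to the next page to be probed
	orl	$0, (%rsp)	# current probe: read-modify-write of the stack slot

	subq	$4096, %rsp
	movl	$0, (%rsp)	# alternative probe: plain store of zero
--cut here--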
Uros.