Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-17 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 3:16 PM, Richard Guenther
 wrote:
> On Thu, Apr 12, 2012 at 1:05 PM, Igor Zamyatin  wrote:
>> On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
>>  wrote:
>>> On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin  wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
 bits).
>>>
>>> 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
>>> why? Why do you disable register renaming?  check_imull requires a function
>>> comment.
>>
>> Sure, enabling unroll for O3 could be the next step.
>> We can't avoid code size increase with unroll - what number do you
>> think will be appropriate?
>> Register renaming was the reason of several degradations during tuning 
>> process
>> Comment for check_imull was added
>>
>>>
>>> This completely looks like a hack for EEMBC2.0, so it's definitely not ok.
>>
>> Why? EEMBC was measured and result provided here just because this
>> benchmark considers to be very relevant for Atom
>
> I'd say that SPEC INT (2000 / 2006) is more relevant for Atom (SPEC FP
> would be irrelevant OTOH).  Similar code size for, say, Mozilla Firefox
> or GCC itself would be important.
>
>>> -O2 is not supposed to give best benchmark results.
>>
>> O2 is wide-used so performance improvement could be important for users.
>
> But not at a 5% size cost.  Please also always check the compile-time effect
> which is important for -O2 as well.

What would be an acceptable number of size cost/compile-time increase
for O2 and O3 on EEMBC, SPEC INT 2000 and Mozilla?

Is it possible in common to put Atom-specific unroll heuristics under
some option which could be mentioned in GCC docs?

>
> Richard.
>
>>>
>>> Thanks,
>>> Richard.
>>>

 Tested for i386 and x86-64, ok for trunk?
>>
>> Updated patch attached
>>

 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  

        * gcc/config/i386/i386.c (check_imul): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.

Thanks,
Igor


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Andi Kleen
> > So would need much more benchmarking on macro workloads first at least.
> 
> Like what, for example? I believe in this case everything also
> strongly depends on test usage model (e.g. it usually compiled with Os
> not O2) and, let's say, internal test structure - whether there are
> hot loops that suitable for unroll.

Normally the compiler doesn't know if a loop is hot unless you use
profile feedback. So worst case on a big code base you may end up
with a lot of unnecessary unrolling. On cold code it's just wasted
bytes, but there could be already icache limited code where it 
would be worse.

How about just a compiler bootstrap on Atom as a "worst case"?
For the benchmark can you use profile feedback?

BTW I know some loops are unrolled at -O3 by default at tree level because
the vectorizer likes it.  I actually have an older patch to dial this
down for some common cases.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Richard Guenther
On Thu, Apr 12, 2012 at 1:05 PM, Igor Zamyatin  wrote:
> On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
>  wrote:
>> On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin  wrote:
>>> Hi All!
>>>
>>> Here is a patch that enables unroll at O2 for Atom.
>>>
>>> This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
>>> bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
>>> bits).
>>
>> 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
>> why? Why do you disable register renaming?  check_imull requires a function
>> comment.
>
> Sure, enabling unroll for O3 could be the next step.
> We can't avoid code size increase with unroll - what number do you
> think will be appropriate?
> Register renaming was the reason of several degradations during tuning process
> Comment for check_imull was added
>
>>
>> This completely looks like a hack for EEMBC2.0, so it's definitely not ok.
>
> Why? EEMBC was measured and result provided here just because this
> benchmark considers to be very relevant for Atom

I'd say that SPEC INT (2000 / 2006) is more relevant for Atom (SPEC FP
would be irrelevant OTOH).  Similar code size for, say, Mozilla Firefox
or GCC itself would be important.

>> -O2 is not supposed to give best benchmark results.
>
> O2 is wide-used so performance improvement could be important for users.

But not at a 5% size cost.  Please also always check the compile-time effect
which is important for -O2 as well.

Richard.

>>
>> Thanks,
>> Richard.
>>
>>>
>>> Tested for i386 and x86-64, ok for trunk?
>
> Updated patch attached
>
>>>
>>> Thanks,
>>> Igor
>>>
>>> ChangeLog:
>>>
>>> 2012-04-10  Yakovlev Vladimir  
>>>
>>>        * gcc/config/i386/i386.c (check_imul): New routine.
>>>        (ix86_loop_unroll_adjust): New target hook.
>>>        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
>>>        (TARGET_LOOP_UNROLL_ADJUST): New define.


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
 wrote:
> On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin  wrote:
>> Hi All!
>>
>> Here is a patch that enables unroll at O2 for Atom.
>>
>> This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
>> bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
>> bits).
>
> 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
> why? Why do you disable register renaming?  check_imull requires a function
> comment.

Sure, enabling unroll for O3 could be the next step.
We can't avoid code size increase with unroll - what number do you
think will be appropriate?
Register renaming was the reason of several degradations during tuning process
Comment for check_imull was added

>
> This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

Why? EEMBC was measured and result provided here just because this
benchmark considers to be very relevant for Atom

>
> -O2 is not supposed to give best benchmark results.

O2 is wide-used so performance improvement could be important for users.

>
> Thanks,
> Richard.
>
>>
>> Tested for i386 and x86-64, ok for trunk?

Updated patch attached

>>
>> Thanks,
>> Igor
>>
>> ChangeLog:
>>
>> 2012-04-10  Yakovlev Vladimir  
>>
>>        * gcc/config/i386/i386.c (check_imul): New routine.
>>        (ix86_loop_unroll_adjust): New target hook.
>>        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
>>        (TARGET_LOOP_UNROLL_ADJUST): New define.


unroll1.patch
Description: Binary data


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 5:34 PM, Andi Kleen  wrote:
> Richard Guenther  writes:
>>
>> 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
>> why? Why do you disable register renaming?  check_imull requires a function
>> comment.
>>
>> This completely looks like a hack for EEMBC2.0, so it's definitely not ok.
>>
>> -O2 is not supposed to give best benchmark results.
>
> Besides it is against the Intel Optimization Manual recommendation
> to prefer small code on Atom to avoid falling out of the predecode hints
> in the cache.

Yes, this is well-known concern for Atom. But in the same time unroll
could help a lot for inorder machines because it could provide more
opportunities to a compiler scheduler. And experiments showed that
unroll could really help.

>
> So would need much more benchmarking on macro workloads first at least.

Like what, for example? I believe in this case everything also
strongly depends on test usage model (e.g. it usually compiled with Os
not O2) and, let's say, internal test structure - whether there are
hot loops that suitable for unroll.

>
> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-11 Thread Andi Kleen
Richard Guenther  writes:
>
> 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
> why? Why do you disable register renaming?  check_imull requires a function
> comment.
>
> This completely looks like a hack for EEMBC2.0, so it's definitely not ok.
>
> -O2 is not supposed to give best benchmark results.

Besides it is against the Intel Optimization Manual recommendation
to prefer small code on Atom to avoid falling out of the predecode hints
in the cache.

So would need much more benchmarking on macro workloads first at least.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-11 Thread Richard Guenther
On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin  wrote:
> Hi All!
>
> Here is a patch that enables unroll at O2 for Atom.
>
> This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
> bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
> bits).

5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
why? Why do you disable register renaming?  check_imull requires a function
comment.

This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

-O2 is not supposed to give best benchmark results.

Thanks,
Richard.

>
> Tested for i386 and x86-64, ok for trunk?
>
> Thanks,
> Igor
>
> ChangeLog:
>
> 2012-04-10  Yakovlev Vladimir  
>
>        * gcc/config/i386/i386.c (check_imull): New routine.
>        (ix86_loop_unroll_adjust): New target hook.
>        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
>        (TARGET_LOOP_UNROLL_ADJUST): New define.


[PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-10 Thread Igor Zamyatin
Hi All!

Here is a patch that enables unroll at O2 for Atom.

This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
bits).

Tested for i386 and x86-64, ok for trunk?

Thanks,
Igor

ChangeLog:

2012-04-10  Yakovlev Vladimir  

       * gcc/config/i386/i386.c (check_imull): New routine.
       (ix86_loop_unroll_adjust): New target hook.
       (ix86_option_override_internal): Enable unrolling on Atom at -O2.
       (TARGET_LOOP_UNROLL_ADJUST): New define.


unroll.patch
Description: Binary data