> The effect on runtime is not correlated to
> either (which means the vectorizer cost model is rather bad), but integer
> code usually does not benefit at all.

The cost model does need some tuning. For instance, the GCC vectorizer
peels aggressively, but in many cases peeling can be avoided while
still getting good performance -- even when the target does not have
efficient unaligned load/store instructions to implement unaligned
accesses. GCC reports too high a cost for unaligned accesses and too
low a cost for peeling overhead.

Example:

#ifndef TYPE
#define TYPE float
#endif
#include <stdlib.h>

__attribute__((noinline)) void
foo (TYPE *a, TYPE* b, TYPE *c, int n)
{
   int i;
   for ( i = 0; i < n; i++)
     a[i] = b[i] * c[i];
}

int g;
int
main()
{
   int i;
   float *a = (float*) malloc (100000 * sizeof (float));
   float *b = (float*) malloc (100000 * sizeof (float));
   float *c = (float*) malloc (100000 * sizeof (float));

   for (i = 0; i < 100000; i++)
      foo(a, b, c, 100000);


   g = a[10];  /* keep the result live */

   return 0;
}


1) By default, GCC's vectorizer peels the loop in foo so that the
access to 'a' is aligned and uses the movaps instruction. The other
accesses use movups when -march=corei7 is used.
2) Same as above, but with -march=x86-64: the access to 'b' is split
into movlps and movhps, and likewise for 'c'.

3) With peeling disabled (via a hack) and -march=corei7, all three
accesses use movups.
4) With peeling disabled and -march=x86-64, all three accesses use
movlps/movhps.

Performance:

1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
1462 bytes vs. 1622 bytes for 1).
2) and 4) and the non-vectorized version -- all very slow -- 4.8s

Observations:
a) If properly tuned for corei7, GCC should pick 3) instead of 1) --
this is not possible today.
b) With -march=x86-64, GCC should figure out that the benefit of
vectorizing the loop is small and bail out.

>> On the other hand, 10% compile time increase due to one pass sounds
>> excessive -- there might be some low hanging fruit to reduce the
>> compile time increase.
>
> I have already spent two man-month speeding up the vectorizer itself,
> I don't think there is any low-hanging fruit left there.  But see above - most
> of the compile-time is due to the cost of processing the extra loop copies.
>

Ok.

I did not notice your patch (from May this year) until recently. Do
you plan to check it in (apart from the part that turns vectorization
on at O2)? The cost model part of the changes is largely independent.
Once it is in, it will serve as a good basis for further tuning.


>>  at full feature set vectorization regresses runtime of quite a number
>> of benchmarks significantly. At reduced feature set - basically trying
>> to vectorize only obvious profitable cases - these regressions can be
>> avoided but progressions only remain on two spec fp cases. As most
>> user applications fall into the spec int category a 10% compile-time
>> and 15% code-size regression for no gain is no good.
>>>
>>
>> Cong's data (especially corei7 and corei7avx) shows more significant
>> performance improvement.   If 10% compile time increase is across the
>> board and happens on benchmarks with no performance improvement, it is
>> certainly bad - but I am not sure if that is the case.
>
> Note that we are talking about -O2 - people that enable -march=corei7 usually
> know to use -O3 or FDO anyway.

Many people use FDO, but not all -- there are still some barriers to
adoption. There are also reasons people may not want to use O3:
1) People feel most comfortable using O2 because it is considered the
most thoroughly tested optimization level; going with the default is
the natural choice. FDO is a different beast, as the performance
benefit can be too great to resist.
2) In a distributed build environment with object file
caching/sharing, building with O3 (different from the default) leads
to longer build times.
3) The size/compile-time cost of O3 can be too high. On the other
hand, the benefit of the vectorizer can be very high for many types of
applications -- image processing, stitching, image detection, DSP,
encoders/decoders -- not just numerical Fortran programs.


> That said, I expect 99% of used software
> (probably rather 99,99999%) is not compiled on the system it runs on but
> compiled to run on generic hardware and thus restricts itself to bare x86_64
> SSE2 features.  So what matters for enabling the vectorizer at -O2 is the
> default architecture features of the given architecture(!) - remember
> to not only
> consider x86 here!
>
>> A couple of points I'd like to make:
>>
>> 1) loop vectorizer passes the quality threshold to be turned on by
>> default at O2 in 4.9; It is already turned on for FDO at O2.
>
> With FDO we have a _much_ better way of reasoning on which loops
> we spend the compile-time and code-size!  Exactly the problem that
> exists without FDO at -O2 (and also at -O3, but -O3 is not said to
> be well-balanced with regard to compile-time and code-size)
>
>> 2) there are still lots of room for improvement for loop vectorizer --
>> there is no doubt about it, and we will need to continue improving it;
>
> I believe we have to first do that.  See the patches regarding to the
> cost model reorg I posted with the proposal for enabling vectorization at -O2.
> One large source of collateral damage of vectorization is if-conversion which
> aggressively if-converts loops regardless of us later vectorizing the result.
> The if-conversion pass needs to be integrated with vectorization.

We noticed some small performance problems with tree if-conversion
when it is turned on with FDO -- because that pass does not have a
cost model (unlike RTL-level if-conversion, which looks at branch
probabilities). What other problems do you see? Is it just a
compile-time concern?

>
>> 3) the only fast way to improve a feature is to get it used widely so
>> that people can file bugs and report problems -- it is hard for
>> developers to find and collect all cases where GCC is weak without GCC
>> community's help; There might be a temporary regression for some
>> users, but it is worth the pain
>
> Well, introducing known regressions at -O2 is not how this works.
> Vectorization is already widely tested and you can look at a plethora of
> bugreports about missed features and vectorizer wrong-doings to improve it.
>
>> 4) Not the most important one, but a practical concern:  without
>> turning it on, GCC will be greatly disadvantaged when people start
>> doing benchmarking latest GCC against other compilers ..
>
> The same argument was done on the fact that GCC does not optimize by default
> but uses -O0.  It's a straw-mans argument.  All "benchmarking" I see uses
> -O3 or -Ofast already.

People can still simply compare performance at -O2.

thanks,

David

> To make vectorization have a bigger impact on day-to-day software GCC would 
> need
> to start versioning for the target sub-architecture - which of course
> increases the
> issue with code-size and compile-time.
>
> Richard.
>
