[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

rguenther at suse dot de Fri, 15 Dec 2017 03:12:38 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008


--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 15 Dec 2017, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #14 from sergey.shalnov at intel dot com ---
> " we have a basic-block vectorizer.  Do you propose to remove it? "
> Definitely not! SLP vectorizer is very good to have!
> 
> “What's the rationale for not using vector registers”
> I just tried " -fno-tree-slp-vectorize" option and found the performance gain
> for different -march= options.
> 
> I see some misunderstanding here. Let me clarify the original question with
> -march=znver1.
> I use " -Ofast -mfpmath=sse -funroll-loops -march=znver1" options set for
> experiments.
> 
> For the basic block we are discussing we have (in vect_analyze_slp_cost() in
> tree-vect-slp.c:1897):
> 
> tmp[i_220][0] = _150;
> tmp[i_220][2] = _147;
> tmp[i_220][1] = _144;
> tmp[i_220][3] = _141;
> 
> tmp[i_139][0] = _447;
> tmp[i_139][2] = _450;
> tmp[i_139][1] = _453;
> tmp[i_139][3] = _456;
> 
> tmp[i_458][0] = _54;
> tmp[i_458][2] = _56;
> tmp[i_458][1] = _58;
> tmp[i_458][3] = _60;
> 
> this is si->stmt printed in the loop with "vect_prologue" calculation.
> 
> I see SLP statistic related to this BB:
> note: Cost model analysis:. 
>   Vector inside of basic block cost: 64 
>   Vector prologue cost: 32 
>   Vector epilogue cost: 0 
>   Scalar cost of basic block: 256 
> note: Basic block will be vectorized using SLP

So it looks like both vector and scalar stores are costed 64 by the
target and the vector construction is costed 32.  This is likely
because

  COSTS_N_INSNS (1),                    /* cost of cheap SSE instruction.  */

and vector construction of 4 elements is thus costed 4 while

  {8, 8, 8},                            /* cost of storing integer
                                           registers.  */
...
  {8, 8, 8, 8, 16},                     /* cost of storing SSE registers
                                           in 32,64,128,256 and 512-bit.  */

That cheap SSE instruction cost is esp. odd when looking at

  2, 3, 6,                              /* cost of moving XMM,YMM,ZMM register.  */

or

  COSTS_N_INSNS (3),                    /* cost of ADDSS/SD SUBSS/SD insns.  */

so if a movq %xmm0, %xmm1 costs 2 I can't imagine anything being cheaper
than that.

Honza?

...
> Please correct me if I am wrong but I think we have to have count=3 in
> prologue_cost_vec.

No, it's one vec_construct operation - it's the task of the target
to turn this into a cost comparable to vector_store and scalar_store
in this case.
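
A minimal sketch of the comparison behind the cost summary quoted
earlier, assuming the decision simply weighs the summed vector costs
(inside, prologue, epilogue) against the scalar cost of the block.
The struct and function names are hypothetical and are not GCC's own
code; only the four totals come from the dump.

```c
#include <stdbool.h>

/* Hypothetical container for the four totals printed in the dump. */
struct bb_slp_costs {
  unsigned vec_inside;    /* "Vector inside of basic block cost" */
  unsigned vec_prologue;  /* "Vector prologue cost" */
  unsigned vec_epilogue;  /* "Vector epilogue cost" */
  unsigned scalar;        /* "Scalar cost of basic block" */
};

/* Sum of everything the vectorized version is charged for. */
static unsigned bb_slp_total_vector_cost(const struct bb_slp_costs *c)
{
  return c->vec_inside + c->vec_prologue + c->vec_epilogue;
}

/* Assumed rule: vectorize only if the vector total is strictly
   below the scalar total. */
static bool bb_slp_profitable(const struct bb_slp_costs *c)
{
  return bb_slp_total_vector_cost(c) < c->scalar;
}
```

With the totals from the dump (64, 32 and 0 against 256) the vector
side sums to 96, so under this assumed rule the prologue cost would
have to reach 192 before the decision changed.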

> And this could slightly change the "Vector prologue cost" and might
> influence the vectorizer's decision.
> 
> Sergey
> PS
> Richard,
> I didn't catch your idea in " but DOM isn't powerful enough " sentence.
> Could you please slightly clarify it?

The BB vectorization prevents eliding 'tmp' from memory to registers.
The CSE pass after vectorization would be responsible for this, but
the memory references look too "complicated" to the respective
implementation used.
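
For illustration, a hypothetical kernel with the shape under
discussion: one loop stores four values per row into a local 'tmp'
array and a second loop reads them back.  This is a sketch, not the
test case attached to the bug, and the function name and arithmetic
are made up.  If the four stores per row are merged into one vector
store while the second loop still reads single elements, keeping
'tmp' out of memory would presumably require a later pass to match
those scalar loads against the vector stores.

```c
#include <stdlib.h>

/* Hypothetical example; only the tmp[i][0..3] store pattern mirrors
   the statements quoted from the dump. */
int sum_abs_transform(int in[4][4])
{
  int tmp[4][4];
  int sum = 0;

  /* Producer loop: four adjacent stores per row, the kind of group
     that basic-block SLP can turn into a single vector store. */
  for (int i = 0; i < 4; i++) {
    int a0 = in[i][0] + in[i][1];
    int a1 = in[i][0] - in[i][1];
    int a2 = in[i][2] + in[i][3];
    int a3 = in[i][2] - in[i][3];
    tmp[i][0] = a0 + a2;
    tmp[i][2] = a0 - a2;
    tmp[i][1] = a1 + a3;
    tmp[i][3] = a1 - a3;
  }

  /* Consumer loop: reads individual elements of tmp column by
     column, so the data passes between the loops through tmp. */
  for (int j = 0; j < 4; j++) {
    int b0 = tmp[0][j] + tmp[1][j];
    int b1 = tmp[0][j] - tmp[1][j];
    int b2 = tmp[2][j] + tmp[3][j];
    int b3 = tmp[2][j] - tmp[3][j];
    sum += abs(b0 + b2) + abs(b0 - b2) + abs(b1 + b3) + abs(b1 - b3);
  }
  return sum;
}
```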
