gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:

> Hello,
> I have one example with two very similar loops. The cunrolli pass unrolls
> one loop completely but not the other, based on slightly different cost
> estimations. The not-unrolled loop gets SLP-vectorized and is then unrolled
> by the "cunroll" pass, whereas the other, unrolled loop cannot be
> vectorized since it is not a loop any more. In the end, there is a big
> difference in performance between the two loops.
>

Here is what I see with the current trunk on x86_64 with -O3 (with the two
loops split into different functions):

The first loop, the one that doesn't get unrolled by cunrolli, gets loop
vectorized with -fno-vect-cost-model. With the cost model enabled, loop
vectorization fails because the number of iterations is not sufficient (the
vectorizer tries to apply loop peeling in order to align the accesses). The
loop is later unrolled by cunroll, and the resulting basic block is
vectorized by SLP.

The second loop, unrolled by cunrolli, also gets vectorized by SLP.
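
For reference, the split into separate functions mentioned above could look
roughly like the sketch below. The function names copy_hist and copy_input
are made up for illustration, and dropping the unused third pointer from each
function is an assumption; the exact test file used for the dumps may differ.

```c
/* Hypothetical split of the original foo() into one function per loop.
   The names copy_hist and copy_input are illustrative only.  */

/* First loop: copies 4 ints into the start of temp_hist_buffer.  */
void copy_hist (int *__restrict__ temp_hist_buffer,
                int *__restrict__ p_hist_buff)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist_buffer[i] = p_hist_buff[i];
}

/* Second loop: copies 4 ints into temp_hist_buffer at offset 4
   (16 bytes when int is 4 bytes wide).  */
void copy_input (int *__restrict__ temp_hist_buffer,
                 int *__restrict__ p_input)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist_buffer[i + 4] = p_input[i];
}
```

Compiling such a file with -O3 -fdump-tree-optimized, with and without
-fno-vect-cost-model, should make it possible to compare the two cases;
the exact dump contents depend on the compiler version and target.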

The *.optimized dumps look similar:


<bb 2>:
  vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
  MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
  return;


<bb 2>:
  vect_var_.7_57 = MEM[(int *)p_input_10(D)];
  MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
  return;


> My question is why SLP vectorization has to be performed on a loop (it is
> a sub-pass under pass_tree_loop). Conceptually, can't it be done on any
> basic block? Our port is still stuck at 4.5, but I checked 4.7 and it
> seems to be the same. I also checked the functions in tree-vect-slp.c.
> They use a lot of loop_vinfo structures, but in some places the code
> checks whether loop_vinfo exists and otherwise uses an alternative. I
> tried to add an extra SLP pass after pass_tree_loop, but it didn't work.
> I wonder how easy it would be to make SLP work for non-loop code.

SLP vectorization works both on loops (in the vectorize pass) and on basic
blocks (in the slp-vectorize pass).
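
As an illustration of the basic-block case, here is a minimal sketch of
straight-line code with no loop at all: four adjacent loads feeding four
adjacent stores. The function name copy4 is hypothetical. Whether a given
GCC version actually vectorizes it depends on the target, the cost model and
the alignment information available, so treat it as a candidate rather than
a guaranteed result.

```c
/* Hypothetical example: a single basic block with four isomorphic
   load/store pairs on adjacent addresses.  There is no loop here, so
   only basic-block SLP (not the loop vectorizer) could combine them.  */
void copy4 (int *__restrict__ dst, int *__restrict__ src)
{
  dst[0] = src[0];
  dst[1] = src[1];
  dst[2] = src[2];
  dst[3] = src[3];
}
```

Compiling with -O3 -fdump-tree-slp-details should show what the basic-block
SLP pass decided for this function.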

Ira

>
> Thanks,
> Bingfeng Mei
>
> Broadcom UK
>
> void foo (int *__restrict__ temp_hist_buffer,
>           int * __restrict__ p_hist_buff,
>           int *__restrict__ p_input)
> {
>   int i;
>   for(i=0;i<4;i++)
>      temp_hist_buffer[i]=p_hist_buff[i];
>
>   for(i=0;i<4;i++)
>      temp_hist_buffer[i+4]=p_input[i];
>
> }
>
>
