http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
Ira Rosen <irar at il dot ibm.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |irar at il dot ibm.com

--- Comment #2 from Ira Rosen <irar at il dot ibm.com> 2011-08-10 06:36:26 UTC ---
(In reply to comment #0)
>
> 1) When tree-vect-data-refs doesn't know the alignment of memory in a loop
> that is vectorized, and the machine has a vec_realign_load_<type> pattern,
> the loop that is generated always uses the unaligned load, even though it
> might be slow.
> On power7, the realign code uses a vector load and the lvsr instruction to
> create a permute mask, and then in the inner loop, after each load, uses
> the permute mask to do the unaligned loads. Thus, in the loop, before doing
> the conversions, we will be doing 4 vector loads and 4 permutes. The vector
> conversion from 32-bit to 64-bit involves two more permutes to split the
> V4SF values into the appropriate registers before doing the float->double
> convert. Thus in the loop we will have 4 permutes for the 4 loads that are
> done, and 8 permutes for the conversions. The power7 has only one permute
> functional unit, and multiple permutes can slow things down. The code has
> one segment with 3 back-to-back permutes, and another with 6 back-to-back
> permutes.
>
> If the vectorizer could clone the loop, test on one side whether the
> pointers are aligned and, if so, do the aligned loads, and on the other
> side do unaligned loads, it would help. I experimented with an option to
> disable the vec_realign_load_<type> pattern, and it helped this particular
> benchmark, but hurt other benchmarks, because the code would do the
> vectorized loop only if the pointers were aligned, and fell back to the
> scalar loop if they were unaligned. I would think falling back to using
> vec_realign_<xxx> would be a win.

Yes, this kind of versioning is a good idea.
I have it implemented on the st/cli-be-vect branch, but it would probably be
hard to extract this patch from there. I'll take a look.

> 2) In looking at the documentation, I discovered that vec_realign_<xxx> is
> not documented in md.texi.
>
> 3) The powerpc backend doesn't realize it could use the Altivec memory
> instruction to load memory (since the Altivec load implicitly ignores the
> bottom bits of the address).
>
> 4) The code in tree-vect-stmts.c, tree-vect-slp.c, and tree-vect-loop.c
> that calls the vectorization cost target hook never passes the actual type
> in the vectype argument, nor sets the misalign argument to non-zero. I
> would imagine that vector systems might have different costs depending on
> the type. Maybe the two arguments should be eliminated if we aren't going
> to pass useful information. In addition, there doesn't seem to be a cost
> for doing vec_realign. There is a cost for unaligned loads (via
> movmisalign), but there doesn't seem to be a cost for realignment.

We pass the type and the misalignment value in vect_get_store_cost and
vect_get_load_cost, the only places where they are relevant. It might be true
that the actual costs depend on the type, but the cost model is only an
estimate and it is based on tuning, so I guess until now nobody thought it
was useful. The type and the misalignment value are important for VSX in the
movmisalign case, so the cost for a data access takes them into account in
rs6000_builtin_vectorization_cost. The cost of realignment is calculated in
vect_get_load_cost under 'case dr_explicit_realign' and
'case dr_explicit_realign_optimized'.
I noticed now that it uses just the vector_stmt type instead of vec_perm, so
it should be fixed like this:

Index: tree-vect-stmts.c
===================================================================
--- tree-vect-stmts.c   (revision 177586)
+++ tree-vect-stmts.c   (working copy)
@@ -1011,7 +1011,7 @@ vect_get_load_cost (struct data_referenc
     case dr_explicit_realign:
       {
         *inside_cost += ncopies * (2 * vect_get_stmt_cost (vector_load)
-                                   + vect_get_stmt_cost (vector_stmt));
+                                   + vect_get_stmt_cost (vec_perm));

         /* FIXME: If the misalignment remains fixed across the iterations
            of the containing loop, the following cost should be added to
            the
@@ -1042,7 +1042,7 @@ vect_get_load_cost (struct data_referenc
          }

        *inside_cost += ncopies * (vect_get_stmt_cost (vector_load)
-                                  + vect_get_stmt_cost (vector_stmt));
+                                  + vect_get_stmt_cost (vec_perm));
        break;
      }

but since these costs are equal (at least on rs6000) it will not change
anything unless the costs are changed.

Ira