On Wed, Jan 16, 2019 at 2:37 PM Alejandro Martinez Vicente
<alejandro.martinezvice...@arm.com> wrote:
>
> Hi,
>
> Current vectorizer doesn't support masked loads for SLP. We should add that,
> to allow things like:
>
> void
> f (int *restrict x, int *restrict y, int *restrict z, int n)
> {
>   for (int i = 0; i < n; i += 2)
>     {
>       x[i] = y[i] ? z[i] : 1;
>       x[i + 1] = y[i + 1] ? z[i + 1] : 2;
>     }
> }
>
> to be vectorized using contiguous loads rather than LD2 and ST2.
>
> This patch was motivated by SVE, but it is completely generic and should apply
> to any architecture with masked loads.
>
> After the patch is applied, the above code generates this output
> (-march=armv8.2-a+sve -O2 -ftree-vectorize):
>
> 0000000000000000 <f>:
>    0:   7100007f        cmp     w3, #0x0
>    4:   540002cd        b.le    5c <f+0x5c>
>    8:   51000464        sub     w4, w3, #0x1
>    c:   d2800003        mov     x3, #0x0                        // #0
>   10:   90000005        adrp    x5, 0 <f>
>   14:   25d8e3e0        ptrue   p0.d
>   18:   53017c84        lsr     w4, w4, #1
>   1c:   910000a5        add     x5, x5, #0x0
>   20:   11000484        add     w4, w4, #0x1
>   24:   85c0e0a1        ld1rd   {z1.d}, p0/z, [x5]
>   28:   2598e3e3        ptrue   p3.s
>   2c:   d37ff884        lsl     x4, x4, #1
>   30:   25a41fe2        whilelo p2.s, xzr, x4
>   34:   d503201f        nop
>   38:   a5434820        ld1w    {z0.s}, p2/z, [x1, x3, lsl #2]
>   3c:   25808c11        cmpne   p1.s, p3/z, z0.s, #0
>   40:   25808810        cmpne   p0.s, p2/z, z0.s, #0
>   44:   a5434040        ld1w    {z0.s}, p0/z, [x2, x3, lsl #2]
>   48:   05a1c400        sel     z0.s, p1, z0.s, z1.s
>   4c:   e5434800        st1w    {z0.s}, p2, [x0, x3, lsl #2]
>   50:   04b0e3e3        incw    x3
>   54:   25a41c62        whilelo p2.s, x3, x4
>   58:   54ffff01        b.ne    38 <f+0x38>  // b.any
>   5c:   d65f03c0        ret
>
>
> I tested this patch in an aarch64 machine bootstrapping the compiler and
> running the checks.

Thanks for implementing this - note this is stage1 material and I will
have a look when time allows unless Richard beats me to it.

It might be interesting to note that "non-SLP" code paths are likely
to go away in GCC 10 to streamline the vectorizer and make further
changes easier (so you'll see group_size == 1 SLP instances).

There are quite a few other cases missing SLP handling.

Richard.

> Alejandro
>
> gcc/Changelog:
>
> 2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>
>
>         * config/aarch64/aarch64-sve.md (copysign<mode>3): New define_expand.
>         (xorsign<mode>3): Likewise.
>         internal-fn.c: Marked mask_load_direct and mask_store_direct as
>         vectorizable.
>         tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
>         tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
>         combined even if masks different.
>         (slp_vect_only_p): New function to detect masked loads that are only
>         vectorizable using SLP.
>         (vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
>         tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
>         dissolve SLP-only vectorizable groups when SLP has been discarded.
>         (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when needed.
>         tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
>         masks.
>         (vect_build_slp_tree_1): Fixed comment typo.
>         (vect_build_slp_tree_2): Include masks from masked loads in SLP tree.
>         tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to get
>         vec_defs for operand with optional SLP and vectype.
>         (vectorizable_load): Allow vectorizaion of masked loads for SLP only.
>         tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
>         vectorizable.
>         tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.
>
> gcc/testsuite/Changelog:
>
> 2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>
>
>         * gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
>         vectorized masked loads.
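
As a reading aid for the thread, the transformation described in the quoted
message can be modelled in plain C. This is only a sketch under stated
assumptions, not what the patch or GCC emits: VL, f_slp_model and
models_agree are hypothetical names, the fixed lane count stands in for
SVE's runtime vector length, and the inner loop is a lane-by-lane emulation
of the predicated load / compare / masked load / select / store sequence in
the disassembly.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical lane count; real SVE vector length is only known at run time. */
enum { VL = 4 };

/* The loop from the quoted message, unchanged. */
void
f_scalar (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      x[i] = y[i] ? z[i] : 1;
      x[i + 1] = y[i + 1] ? z[i + 1] : 2;
    }
}

/* Model of the SLP form: both statements of the group share one
   contiguous access to y, z and x instead of de-interleaving them.  */
void
f_slp_model (int *x, const int *y, const int *z, int n)
{
  /* Else-values of the two group members, repeated across the lanes.  */
  static const int else_vals[2] = { 1, 2 };

  if (n <= 0)
    return;

  /* Number of elements the scalar loop touches (always even).  */
  int total = ((n - 1) / 2 + 1) * 2;

  for (int base = 0; base < total; base += VL)
    for (int lane = 0; lane < VL; lane++)
      {
        int idx = base + lane;
        if (idx >= total)
          continue;                     /* lane disabled by the loop predicate */
        int mask = y[idx] != 0;         /* contiguous load of y, then compare */
        int zv = mask ? z[idx] : 0;     /* masked load: z is read only where mask is set */
        x[idx] = mask ? zv : else_vals[idx & 1];  /* select, then contiguous store */
      }
}

/* Returns 1 when the model and the scalar loop produce the same x.  */
int
models_agree (void)
{
  int y[8] = { 1, 0, 0, 1, 1, 1, 0, 0 };
  int z[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
  int xa[8] = { 0 }, xb[8] = { 0 };

  f_scalar (xa, y, z, 6);
  f_slp_model (xb, y, z, 6);
  return memcmp (xa, xb, sizeof xa) == 0;
}
```

The point of the model is that the mask for the z load comes from both
members of the group at once, which is what lets the loads stay contiguous.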