This patch adds support for vectorising groups of IFN_MASK_LOADs
and IFN_MASK_STOREs using conditional load/store-lanes instructions.
This requires new internal functions to represent the result
(IFN_MASK_{LOAD,STORE}_LANES), as well as associated optabs.
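
For example, a loop like the following (adapted from the new
sve_mask_struct_load_1.c tests below; the types here are illustrative):

    void
    f (int *__restrict dest, int *__restrict src,
       int *__restrict cond, int n)
    {
      for (int i = 0; i < n; ++i)
        if (cond[i])
          dest[i] = src[i * 2] + src[i * 2 + 1];
    }

can now be vectorised for SVE using predicated LD2 instructions, with
the interleaved loads performed only where the mask is true.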

The normal IFN_{LOAD,STORE}_LANES functions are const operations
that logically just perform the permute: the load or store is
encoded as a MEM operand to the call statement.  In contrast,
the IFN_MASK_{LOAD,STORE}_LANES functions use the same kind of
interface as IFN_MASK_{LOAD,STORE}, since the memory is only
conditionally accessed.
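
As a rough sketch (not an actual dump; the operand names are just
placeholders), an unconditional group load is represented as:

    vect_array = LOAD_LANES (MEM_REF[(elt_type *) ptr]);

whereas the masked form takes the pointer, the alias pointer and the
mask as explicit call arguments, in the same way as IFN_MASK_LOAD:

    vect_array = MASK_LOAD_LANES (ptr, alias_ptr, mask);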

The AArch64 patterns were added as part of the main LD[234]/ST[234] patch.

Tested on aarch64-linux-gnu (both with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Thanks,
Richard


2017-11-08  Richard Sandiford  <richard.sandif...@linaro.org>
            Alan Hayward  <alan.hayw...@arm.com>
            David Sherwood  <david.sherw...@arm.com>

gcc/
        * optabs.def (vec_mask_load_lanes_optab): New optab.
        (vec_mask_store_lanes_optab): Likewise.
        * internal-fn.def (MASK_LOAD_LANES): New internal function.
        (MASK_STORE_LANES): Likewise.
        * internal-fn.c (mask_load_lanes_direct): New macro.
        (mask_store_lanes_direct): Likewise.
        (expand_mask_load_optab_fn): Handle masked operations.
        (expand_mask_load_lanes_optab_fn): New macro.
        (expand_mask_store_optab_fn): Handle masked operations.
        (expand_mask_store_lanes_optab_fn): New macro.
        (direct_mask_load_lanes_optab_supported_p): Likewise.
        (direct_mask_store_lanes_optab_supported_p): Likewise.
        * tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p
        parameter.
        (vect_load_lanes_supported): Likewise.
        * tree-vect-data-refs.c (strip_conversion): New function.
        (can_group_stmts_p): Likewise.
        (vect_analyze_data_ref_accesses): Use it instead of checking
        for a pair of assignments.
        (vect_store_lanes_supported): Take a masked_p parameter.
        (vect_load_lanes_supported): Likewise.
        * tree-vect-loop.c (vect_analyze_loop_2): Update calls to
        vect_store_lanes_supported and vect_load_lanes_supported.
        * tree-vect-slp.c (vect_analyze_slp_instance): Likewise.
        * tree-vect-stmts.c (replace_mask_load): New function, split
        out from vectorizable_mask_load_store.  Keep the group information
        up-to-date.
        (get_store_op): New function.
        (get_group_load_store_type): Take a masked_p parameter.  Don't
        allow gaps for masked accesses.  Use get_store_op.  Update calls
        to vect_store_lanes_supported and vect_load_lanes_supported.
        (get_load_store_type): Take a masked_p parameter and update
        call to get_group_load_store_type.
        (init_stored_values, advance_stored_values): New functions,
        split out from vectorizable_store.
        (do_load_lanes, do_store_lanes): New functions.
        (get_masked_group_alias_ptr_type): New function.
        (vectorizable_mask_load_store): Update call to get_load_store_type.
        Handle masked VMAT_LOAD_STORE_LANES.  Update GROUP_STORE_COUNT
        when vectorizing a group of stores and only vectorize when we
        reach the last statement in the group.  Vectorize the first
        statement in a group of loads.  Use an array aggregate type
        rather than a vector type for load/store_lanes.  Use
        init_stored_values, advance_stored_values, do_load_lanes,
        do_store_lanes, get_masked_group_alias_ptr_type and replace_mask_load.
        (vectorizable_store): Update call to get_load_store_type.
        Use init_stored_values, advance_stored_values and do_store_lanes.
        (vectorizable_load): Update call to get_load_store_type.
        Use do_load_lanes.
        (vect_transform_stmt): Set grouped_store for grouped IFN_MASK_STOREs.
        Only set is_store for the last element in the group.

gcc/testsuite/
        * gcc.dg/vect/vect-ooo-group-1.c: New test.
        * gcc.target/aarch64/sve_mask_struct_load_1.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_1_run.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_2.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_2_run.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_3.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_3_run.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_4.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_5.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_6.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_7.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_load_8.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_1.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_1_run.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_2.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_2_run.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_3.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_3_run.c: Likewise.
        * gcc.target/aarch64/sve_mask_struct_store_4.c: Likewise.

Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def      2017-11-08 15:05:55.697852337 +0000
+++ gcc/optabs.def      2017-11-08 16:35:04.763816035 +0000
@@ -80,6 +80,8 @@ OPTAB_CD(ssmsub_widen_optab, "ssmsub$b$a
 OPTAB_CD(usmsub_widen_optab, "usmsub$a$b4")
 OPTAB_CD(vec_load_lanes_optab, "vec_load_lanes$a$b")
 OPTAB_CD(vec_store_lanes_optab, "vec_store_lanes$a$b")
+OPTAB_CD(vec_mask_load_lanes_optab, "vec_mask_load_lanes$a$b")
+OPTAB_CD(vec_mask_store_lanes_optab, "vec_mask_store_lanes$a$b")
 OPTAB_CD(vcond_optab, "vcond$a$b")
 OPTAB_CD(vcondu_optab, "vcondu$a$b")
 OPTAB_CD(vcondeq_optab, "vcondeq$a$b")
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def 2017-11-01 08:07:13.340797708 +0000
+++ gcc/internal-fn.def 2017-11-08 16:35:04.763816035 +0000
@@ -45,9 +45,11 @@ along with GCC; see the file COPYING3.
 
    - mask_load: currently just maskload
    - load_lanes: currently just vec_load_lanes
+   - mask_load_lanes: currently just vec_mask_load_lanes
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
+   - mask_store_lanes: currently just vec_mask_store_lanes
 
    DEF_INTERNAL_FLT_FN is like DEF_INTERNAL_OPTAB_FN, but in addition,
    the function implements the computational part of a built-in math
@@ -92,9 +94,13 @@ along with GCC; see the file COPYING3.
 
 DEF_INTERNAL_OPTAB_FN (MASK_LOAD, ECF_PURE, maskload, mask_load)
 DEF_INTERNAL_OPTAB_FN (LOAD_LANES, ECF_CONST, vec_load_lanes, load_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_LOAD_LANES, ECF_PURE,
+                      vec_mask_load_lanes, mask_load_lanes)
 
 DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
+                      vec_mask_store_lanes, mask_store_lanes)
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c   2017-11-08 15:05:55.618852345 +0000
+++ gcc/internal-fn.c   2017-11-08 16:35:04.763816035 +0000
@@ -79,8 +79,10 @@ #define DEF_INTERNAL_FN(CODE, FLAGS, FNS
 #define not_direct { -2, -2, false }
 #define mask_load_direct { -1, 2, false }
 #define load_lanes_direct { -1, -1, false }
+#define mask_load_lanes_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
+#define mask_store_lanes_direct { 0, 0, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 
@@ -2277,7 +2279,7 @@ expand_LOOP_DIST_ALIAS (internal_fn, gca
   gcc_unreachable ();
 }
 
-/* Expand MASK_LOAD call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2286,6 +2288,7 @@ expand_mask_load_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, target, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   lhs = gimple_call_lhs (stmt);
@@ -2298,6 +2301,12 @@ expand_mask_load_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   rhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_load_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+                                  TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2305,12 +2314,12 @@ expand_mask_load_optab_fn (internal_fn,
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-                                     TYPE_MODE (TREE_TYPE (maskt))),
-              3, ops);
+  expand_insn (icode, 3, ops);
 }
 
-/* Expand MASK_STORE call STMT using optab OPTAB.  */
+#define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+
+/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2319,6 +2328,7 @@ expand_mask_store_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, reg, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   rhs = gimple_call_arg (stmt, 3);
@@ -2329,6 +2339,12 @@ expand_mask_store_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   lhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_store_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+                                  TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2336,11 +2352,11 @@ expand_mask_store_optab_fn (internal_fn,
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-                                     TYPE_MODE (TREE_TYPE (maskt))),
-              3, ops);
+  expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -2732,8 +2748,10 @@ #define direct_unary_optab_supported_p d
 #define direct_binary_optab_supported_p direct_optab_supported_p
 #define direct_mask_load_optab_supported_p direct_optab_supported_p
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 
 /* Return true if FN is supported for the types in TYPES when the
    optimization type is OPT_TYPE.  The types are those associated with
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h       2017-11-08 15:05:33.791822333 +0000
+++ gcc/tree-vectorizer.h       2017-11-08 16:35:04.771159765 +0000
@@ -1284,9 +1284,9 @@ extern tree bump_vector_ptr (tree, gimpl
                             tree);
 extern tree vect_create_destination_var (tree, tree);
 extern bool vect_grouped_store_supported (tree, unsigned HOST_WIDE_INT);
-extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern bool vect_grouped_load_supported (tree, bool, unsigned HOST_WIDE_INT);
-extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern void vect_permute_store_chain (vec<tree> ,unsigned int, gimple *,
                                     gimple_stmt_iterator *, vec<tree> *);
 extern tree vect_setup_realignment (gimple *, gimple_stmt_iterator *, tree *,
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   2017-11-08 15:06:16.087850270 +0000
+++ gcc/tree-vect-data-refs.c   2017-11-08 16:35:04.768405866 +0000
@@ -2791,6 +2791,62 @@ dr_group_sort_cmp (const void *dra_, con
   return cmp;
 }
 
+/* If OP is the result of a conversion, return the unconverted value,
+   otherwise return null.  */
+
+static tree
+strip_conversion (tree op)
+{
+  if (TREE_CODE (op) != SSA_NAME)
+    return NULL_TREE;
+  gimple *stmt = SSA_NAME_DEF_STMT (op);
+  if (!is_gimple_assign (stmt)
+      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
+    return NULL_TREE;
+  return gimple_assign_rhs1 (stmt);
+}
+
+/* Return true if vectorizable_* routines can handle statements STMT1
+   and STMT2 being in a single group.  */
+
+static bool
+can_group_stmts_p (gimple *stmt1, gimple *stmt2)
+{
+  if (gimple_assign_single_p (stmt1))
+    return gimple_assign_single_p (stmt2);
+
+  if (is_gimple_call (stmt1) && gimple_call_internal_p (stmt1))
+    {
+      /* Check for two masked loads or two masked stores.  */
+      if (!is_gimple_call (stmt2) || !gimple_call_internal_p (stmt2))
+       return false;
+      internal_fn ifn = gimple_call_internal_fn (stmt1);
+      if (ifn != IFN_MASK_LOAD && ifn != IFN_MASK_STORE)
+       return false;
+      if (ifn != gimple_call_internal_fn (stmt2))
+       return false;
+
+      /* Check that the masks are the same.  Cope with casts of masks,
+        like those created by build_mask_conversion.  */
+      tree mask1 = gimple_call_arg (stmt1, 2);
+      tree mask2 = gimple_call_arg (stmt2, 2);
+      if (!operand_equal_p (mask1, mask2, 0))
+       {
+         mask1 = strip_conversion (mask1);
+         if (!mask1)
+           return false;
+         mask2 = strip_conversion (mask2);
+         if (!mask2)
+           return false;
+         if (!operand_equal_p (mask1, mask2, 0))
+           return false;
+       }
+      return true;
+    }
+
+  return false;
+}
+
 /* Function vect_analyze_data_ref_accesses.
 
    Analyze the access pattern of all the data references in the loop.
@@ -2857,8 +2913,7 @@ vect_analyze_data_ref_accesses (vec_info
              || data_ref_compare_tree (DR_BASE_ADDRESS (dra),
                                        DR_BASE_ADDRESS (drb)) != 0
              || data_ref_compare_tree (DR_OFFSET (dra), DR_OFFSET (drb)) != 0
-             || !gimple_assign_single_p (DR_STMT (dra))
-             || !gimple_assign_single_p (DR_STMT (drb)))
+             || !can_group_stmts_p (DR_STMT (dra), DR_STMT (drb)))
            break;
 
          /* Check that the data-refs have the same constant size.  */
@@ -4662,15 +4717,21 @@ vect_grouped_store_supported (tree vecty
 }
 
 
-/* Return TRUE if vec_store_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}store_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+                           bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_store_lanes",
-                                      vec_store_lanes_optab,
-                                      vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_store_lanes",
+                                        vec_mask_store_lanes_optab,
+                                        vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_store_lanes",
+                                        vec_store_lanes_optab,
+                                        vectype, count);
 }
 
 
@@ -5238,15 +5299,21 @@ vect_grouped_load_supported (tree vectyp
   return false;
 }
 
-/* Return TRUE if vec_load_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}load_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+                          bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_load_lanes",
-                                      vec_load_lanes_optab,
-                                      vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_load_lanes",
+                                        vec_mask_load_lanes_optab,
+                                        vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_load_lanes",
+                                        vec_load_lanes_optab,
+                                        vectype, count);
 }
 
 /* Function vect_permute_load_chain.
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        2017-11-08 15:05:36.349044117 +0000
+++ gcc/tree-vect-loop.c        2017-11-08 16:35:04.770241799 +0000
@@ -2247,7 +2247,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
       vinfo = vinfo_for_stmt (STMT_VINFO_GROUP_FIRST_ELEMENT (vinfo));
       unsigned int size = STMT_VINFO_GROUP_SIZE (vinfo);
       tree vectype = STMT_VINFO_VECTYPE (vinfo);
-      if (! vect_store_lanes_supported (vectype, size)
+      if (! vect_store_lanes_supported (vectype, size, false)
          && ! vect_grouped_store_supported (vectype, size))
        return false;
       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), j, node)
@@ -2257,7 +2257,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
          bool single_element_p = !STMT_VINFO_GROUP_NEXT_ELEMENT (vinfo);
          size = STMT_VINFO_GROUP_SIZE (vinfo);
          vectype = STMT_VINFO_VECTYPE (vinfo);
-         if (! vect_load_lanes_supported (vectype, size)
+         if (! vect_load_lanes_supported (vectype, size, false)
              && ! vect_grouped_load_supported (vectype, single_element_p,
                                                size))
            return false;
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c 2017-11-08 15:05:34.296308263 +0000
+++ gcc/tree-vect-slp.c 2017-11-08 16:35:04.770241799 +0000
@@ -2175,7 +2175,7 @@ vect_analyze_slp_instance (vec_info *vin
         instructions do not generate this SLP instance.  */
       if (is_a <loop_vec_info> (vinfo)
          && loads_permuted
-         && dr && vect_store_lanes_supported (vectype, group_size))
+         && dr && vect_store_lanes_supported (vectype, group_size, false))
        {
          slp_tree load_node;
          FOR_EACH_VEC_ELT (loads, i, load_node)
@@ -2188,7 +2188,7 @@ vect_analyze_slp_instance (vec_info *vin
              if (STMT_VINFO_STRIDED_P (stmt_vinfo)
                  || ! vect_load_lanes_supported
                        (STMT_VINFO_VECTYPE (stmt_vinfo),
-                        GROUP_SIZE (stmt_vinfo)))
+                        GROUP_SIZE (stmt_vinfo), false))
                break;
            }
          if (i == loads.length ())
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       2017-11-08 15:05:36.350875282 +0000
+++ gcc/tree-vect-stmts.c       2017-11-08 16:35:04.771159765 +0000
@@ -1700,6 +1700,69 @@ vectorizable_internal_function (combined
 static tree permute_vec_elements (tree, tree, tree, gimple *,
                                  gimple_stmt_iterator *);
 
+/* Replace IFN_MASK_LOAD statement STMT with a dummy assignment, to ensure
+   that it won't be expanded even when there's no following DCE pass.  */
+
+static void
+replace_mask_load (gimple *stmt, gimple_stmt_iterator *gsi)
+{
+  /* If this statement is part of a pattern created by the vectorizer,
+     get the original statement.  */
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  if (STMT_VINFO_RELATED_STMT (stmt_info))
+    {
+      stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+      stmt_info = vinfo_for_stmt (stmt);
+    }
+
+  gcc_assert (gsi_stmt (*gsi) == stmt);
+  tree lhs = gimple_call_lhs (stmt);
+  tree zero = build_zero_cst (TREE_TYPE (lhs));
+  gimple *new_stmt = gimple_build_assign (lhs, zero);
+  set_vinfo_for_stmt (new_stmt, stmt_info);
+  set_vinfo_for_stmt (stmt, NULL);
+  STMT_VINFO_STMT (stmt_info) = new_stmt;
+
+  /* If STMT was the first statement in a group, redirect all
+     GROUP_FIRST_ELEMENT pointers to the new statement (which has the
+     same stmt_info as the old statement).  */
+  if (GROUP_FIRST_ELEMENT (stmt_info) == stmt)
+    {
+      gimple *group_stmt = new_stmt;
+      do
+       {
+         GROUP_FIRST_ELEMENT (vinfo_for_stmt (group_stmt)) = new_stmt;
+         group_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt));
+       }
+      while (group_stmt);
+    }
+  else if (GROUP_FIRST_ELEMENT (stmt_info))
+    {
+      /* Otherwise redirect the GROUP_NEXT_ELEMENT.  It would be more
+        efficient if these pointers were to the stmt_vec_info rather
+        than the gimple statements themselves, but this is by no means
+        the only quadratic loop for groups.  */
+      gimple *group_stmt = GROUP_FIRST_ELEMENT (stmt_info);
+      while (GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt)) != stmt)
+       group_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt));
+      GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt)) = new_stmt;
+    }
+  gsi_replace (gsi, new_stmt, true);
+}
+
+/* STMT is either a masked or unconditional store.  Return the value
+   being stored.  */
+
+static tree
+get_store_op (gimple *stmt)
+{
+  if (gimple_assign_single_p (stmt))
+    return gimple_assign_rhs1 (stmt);
+  if (gimple_call_internal_p (stmt, IFN_MASK_STORE))
+    return gimple_call_arg (stmt, 3);
+  gcc_unreachable ();
+}
+
 /* STMT is a non-strided load or store, meaning that it accesses
    elements with a known constant step.  Return -1 if that step
    is negative, 0 if it is zero, and 1 if it is greater than zero.  */
@@ -1744,7 +1807,7 @@ perm_mask_for_reverse (tree vectype)
 
 static bool
 get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
-                          vec_load_store_type vls_type,
+                          bool masked_p, vec_load_store_type vls_type,
                           vect_memory_access_type *memory_access_type)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
@@ -1765,7 +1828,10 @@ get_group_load_store_type (gimple *stmt,
 
   /* True if we can cope with such overrun by peeling for gaps, so that
      there is at least one final scalar iteration after the vector loop.  */
-  bool can_overrun_p = (vls_type == VLS_LOAD && loop_vinfo && !loop->inner);
+  bool can_overrun_p = (!masked_p
+                       && vls_type == VLS_LOAD
+                       && loop_vinfo
+                       && !loop->inner);
 
   /* There can only be a gap at the end of the group if the stride is
      known at compile time.  */
@@ -1828,6 +1894,7 @@ get_group_load_store_type (gimple *stmt,
         and so we are guaranteed to access a non-gap element in the
         same B-sized block.  */
       if (would_overrun_p
+         && !masked_p
          && gap < (vect_known_alignment_in_bytes (first_dr)
                    / vect_get_scalar_dr_size (first_dr)))
        would_overrun_p = false;
@@ -1838,8 +1905,8 @@ get_group_load_store_type (gimple *stmt,
        {
          /* First try using LOAD/STORE_LANES.  */
          if (vls_type == VLS_LOAD
-             ? vect_load_lanes_supported (vectype, group_size)
-             : vect_store_lanes_supported (vectype, group_size))
+             ? vect_load_lanes_supported (vectype, group_size, masked_p)
+             : vect_store_lanes_supported (vectype, group_size, masked_p))
            {
              *memory_access_type = VMAT_LOAD_STORE_LANES;
              overrun_p = would_overrun_p;
@@ -1865,8 +1932,7 @@ get_group_load_store_type (gimple *stmt,
       gimple *next_stmt = GROUP_NEXT_ELEMENT (stmt_info);
       while (next_stmt)
        {
-         gcc_assert (gimple_assign_single_p (next_stmt));
-         tree op = gimple_assign_rhs1 (next_stmt);
+         tree op = get_store_op (next_stmt);
          gimple *def_stmt;
          enum vect_def_type dt;
          if (!vect_is_simple_use (op, vinfo, &def_stmt, &dt))
@@ -1950,11 +2016,12 @@ get_negative_load_store_type (gimple *st
    or scatters, fill in GS_INFO accordingly.
 
    SLP says whether we're performing SLP rather than loop vectorization.
+   MASKED_P is true if the statement is conditional on a vectorized mask.
    VECTYPE is the vector type that the vectorized statements will use.
    NCOPIES is the number of vector statements that will be needed.  */
 
 static bool
-get_load_store_type (gimple *stmt, tree vectype, bool slp,
+get_load_store_type (gimple *stmt, tree vectype, bool slp, bool masked_p,
                     vec_load_store_type vls_type, unsigned int ncopies,
                     vect_memory_access_type *memory_access_type,
                     gather_scatter_info *gs_info)
@@ -1982,7 +2049,7 @@ get_load_store_type (gimple *stmt, tree
     }
   else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
-      if (!get_group_load_store_type (stmt, vectype, slp, vls_type,
+      if (!get_group_load_store_type (stmt, vectype, slp, masked_p, vls_type,
                                      memory_access_type))
        return false;
     }
@@ -2031,6 +2098,174 @@ get_load_store_type (gimple *stmt, tree
   return true;
 }
 
+/* Set up the stored values for the first copy of a vectorized store.
+   GROUP_SIZE is the number of stores in the group (which is 1 for
+   ungrouped stores).  FIRST_STMT is the first statement in the group.
+
+   On return, initialize OPERANDS to a new vector in which element I
+   is the value that the first copy of group member I should store.
+   The caller should free OPERANDS after use.  */
+
+static void
+init_stored_values (unsigned int group_size, gimple *first_stmt,
+                   vec<tree> *operands)
+{
+  operands->create (group_size);
+  gimple *next_stmt = first_stmt;
+  for (unsigned int i = 0; i < group_size; i++)
+    {
+      /* Since gaps are not supported for interleaved stores,
+        GROUP_SIZE is the exact number of stmts in the chain.
+        Therefore, NEXT_STMT can't be NULL_TREE.  In case that
+        there is no interleaving, GROUP_SIZE is 1, and only one
+        iteration of the loop will be executed.  */
+      gcc_assert (next_stmt);
+      tree op = get_store_op (next_stmt);
+      tree vec_op = vect_get_vec_def_for_operand (op, next_stmt);
+      operands->quick_push (vec_op);
+      next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (next_stmt));
+    }
+}
+
+/* OPERANDS is a vector set up by init_stored_values.  Update each element
+   for the next copy of each statement.  GROUP_SIZE and FIRST_STMT are
+   as for init_stored_values.  */
+
+static void
+advance_stored_values (unsigned int group_size, gimple *first_stmt,
+                      vec<tree> operands)
+{
+  vec_info *vinfo = vinfo_for_stmt (first_stmt)->vinfo;
+  for (unsigned int i = 0; i < group_size; i++)
+    {
+      tree op = operands[i];
+      enum vect_def_type dt;
+      gimple *def_stmt;
+      vect_is_simple_use (op, vinfo, &def_stmt, &dt);
+      operands[i] = vect_get_vec_def_for_stmt_copy (dt, op);
+    }
+}
+
+/* Emit one copy of a vectorized LOAD_LANES for STMT.  GROUP_SIZE is
+   the number of vectors being loaded and VECTYPE is the type of each
+   vector.  AGGR_TYPE is the type that should be used to refer to the
+   memory source (which contains the same number of elements as
+   GROUP_SIZE copies of VECTYPE, but in a different order).
+   DATAREF_PTR points to the first element that should be loaded.
+   ALIAS_PTR_TYPE is the type of the accessed elements for aliasing
+   purposes.  MASK, if nonnull, is a mask in which element I is true
+   if element I of each destination vector should be loaded.  */
+
+static void
+do_load_lanes (gimple *stmt, gimple_stmt_iterator *gsi,
+              unsigned int group_size, tree vectype, tree aggr_type,
+              tree dataref_ptr, tree alias_ptr_type, tree mask)
+{
+  tree scalar_dest = gimple_get_lhs (stmt);
+  tree vec_array = create_vector_array (vectype, group_size);
+
+  gcall *new_stmt;
+  if (mask)
+    {
+      /* Emit: VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR, MASK).  */
+      tree alias_ptr = build_int_cst (alias_ptr_type,
+                                     TYPE_ALIGN_UNIT (TREE_TYPE (vectype)));
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
+                                            dataref_ptr, alias_ptr, mask);
+    }
+  else
+    {
+      /* Emit: VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
+      tree data_ref = create_array_ref (aggr_type, dataref_ptr,
+                                       alias_ptr_type);
+      new_stmt = gimple_build_call_internal (IFN_LOAD_LANES, 1, data_ref);
+    }
+  gimple_call_set_lhs (new_stmt, vec_array);
+  gimple_call_set_nothrow (new_stmt, true);
+  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+
+  /* Extract each vector into an SSA_NAME.  */
+  auto_vec<tree, 16> dr_chain;
+  dr_chain.reserve (group_size);
+  for (unsigned int i = 0; i < group_size; i++)
+    {
+      tree new_temp = read_vector_array (stmt, gsi, scalar_dest, vec_array, i);
+      dr_chain.quick_push (new_temp);
+    }
+
+  /* Record the mapping between SSA_NAMEs and statements.  */
+  vect_record_grouped_load_vectors (stmt, dr_chain);
+}
+
+/* Emit one copy of a vectorized STORE_LANES for STMT.  GROUP_SIZE is
+   the number of vectors being stored and OPERANDS[I] is the value
+   that group member I should store.  AGGR_TYPE is the type that should
+   be used to refer to the memory destination (which contains the same
+   number of elements as the source vectors, but in a different order).
+   DATAREF_PTR points to the first store location.  ALIAS_PTR_TYPE is
+   the type of the accessed elements for aliasing purposes.  MASK,
+   if nonnull, is a mask in which element I is true if element I of
+   each source vector should be stored.  */
+
+static gimple *
+do_store_lanes (gimple *stmt, gimple_stmt_iterator *gsi,
+               unsigned int group_size, tree aggr_type, tree dataref_ptr,
+               tree alias_ptr_type, vec<tree> operands, tree mask)
+{
+  /* Combine all the vectors into an array.  */
+  tree vectype = TREE_TYPE (operands[0]);
+  tree vec_array = create_vector_array (vectype, group_size);
+  for (unsigned int i = 0; i < group_size; i++)
+    write_vector_array (stmt, gsi, operands[i], vec_array, i);
+
+  gcall *new_stmt;
+  if (mask)
+    {
+      /* Emit: MASK_STORE_LANES (DATAREF_PTR, ALIAS_PTR, MASK, VEC_ARRAY).  */
+      tree alias_ptr = build_int_cst (alias_ptr_type,
+                                     TYPE_ALIGN_UNIT (TREE_TYPE (vectype)));
+      new_stmt = gimple_build_call_internal (IFN_MASK_STORE_LANES, 4,
+                                            dataref_ptr, alias_ptr,
+                                            mask, vec_array);
+    }
+  else
+    {
+      /* Emit: MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
+      tree data_ref = create_array_ref (aggr_type, dataref_ptr, alias_ptr_type);
+      new_stmt = gimple_build_call_internal (IFN_STORE_LANES, 1, vec_array);
+      gimple_call_set_lhs (new_stmt, data_ref);
+    }
+  gimple_call_set_nothrow (new_stmt, true);
+  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+  return new_stmt;
+}
+
+/* Return the alias pointer type for the group of masked loads or
+   stores starting at FIRST_STMT.  */
+
+static tree
+get_masked_group_alias_ptr_type (gimple *first_stmt)
+{
+  tree type, next_type;
+  gimple *next_stmt;
+
+  type = TREE_TYPE (gimple_call_arg (first_stmt, 1));
+  next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (first_stmt));
+  while (next_stmt)
+    {
+      next_type = TREE_TYPE (gimple_call_arg (next_stmt, 1));
+      if (get_alias_set (type) != get_alias_set (next_type))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "conflicting alias set types.\n");
+         return ptr_type_node;
+       }
+      next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (next_stmt));
+    }
+  return type;
+}
+
 /* Function vectorizable_mask_load_store.
 
    Check if STMT performs a conditional load or store that can be vectorized.
@@ -2053,6 +2288,7 @@ vectorizable_mask_load_store (gimple *st
   tree rhs_vectype = NULL_TREE;
   tree mask_vectype;
   tree elem_type;
+  tree aggr_type;
   gimple *new_stmt;
   tree dummy;
   tree dataref_ptr = NULL_TREE;
@@ -2066,6 +2302,8 @@ vectorizable_mask_load_store (gimple *st
   tree mask;
   gimple *def_stmt;
   enum vect_def_type dt;
+  gimple *first_stmt = stmt;
+  unsigned int group_size = 1;
 
   if (slp_node != NULL)
     return false;
@@ -2127,7 +2365,7 @@ vectorizable_mask_load_store (gimple *st
     vls_type = VLS_LOAD;
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, false, vls_type, ncopies,
+  if (!get_load_store_type (stmt, vectype, false, true, vls_type, ncopies,
                            &memory_access_type, &gs_info))
     return false;
 
@@ -2144,7 +2382,18 @@ vectorizable_mask_load_store (gimple *st
          return false;
        }
     }
-  else if (memory_access_type != VMAT_CONTIGUOUS)
+  else if (rhs_vectype
+          && !useless_type_conversion_p (vectype, rhs_vectype))
+    return false;
+  else if (memory_access_type == VMAT_CONTIGUOUS)
+    {
+      if (!VECTOR_MODE_P (TYPE_MODE (vectype))
+         || !can_vec_mask_load_store_p (TYPE_MODE (vectype),
+                                        TYPE_MODE (mask_vectype),
+                                        vls_type == VLS_LOAD))
+       return false;
+    }
+  else if (memory_access_type != VMAT_LOAD_STORE_LANES)
     {
       if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -2152,13 +2401,6 @@ vectorizable_mask_load_store (gimple *st
                         vls_type == VLS_LOAD ? "load" : "store");
       return false;
     }
-  else if (!VECTOR_MODE_P (TYPE_MODE (vectype))
-          || !can_vec_mask_load_store_p (TYPE_MODE (vectype),
-                                         TYPE_MODE (mask_vectype),
-                                         vls_type == VLS_LOAD)
-          || (rhs_vectype
-              && !useless_type_conversion_p (vectype, rhs_vectype)))
-    return false;
 
   if (!vec_stmt) /* transformation not required.  */
     {
@@ -2176,6 +2418,14 @@ vectorizable_mask_load_store (gimple *st
 
   /* Transform.  */
 
+  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+    {
+      first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
+      group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
+      if (vls_type != VLS_LOAD)
+       GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))++;
+    }
+
   if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       tree vec_oprnd0 = NULL_TREE, op;
@@ -2343,23 +2593,28 @@ vectorizable_mask_load_store (gimple *st
          prev_stmt_info = vinfo_for_stmt (new_stmt);
        }
 
-      /* Ensure that even with -fno-tree-dce the scalar MASK_LOAD is removed
-        from the IL.  */
-      if (STMT_VINFO_RELATED_STMT (stmt_info))
-       {
-         stmt = STMT_VINFO_RELATED_STMT (stmt_info);
-         stmt_info = vinfo_for_stmt (stmt);
-       }
-      tree lhs = gimple_call_lhs (stmt);
-      new_stmt = gimple_build_assign (lhs, build_zero_cst (TREE_TYPE (lhs)));
-      set_vinfo_for_stmt (new_stmt, stmt_info);
-      set_vinfo_for_stmt (stmt, NULL);
-      STMT_VINFO_STMT (stmt_info) = new_stmt;
-      gsi_replace (gsi, new_stmt, true);
+      replace_mask_load (stmt, gsi);
       return true;
     }
-  else if (vls_type != VLS_LOAD)
+
+  if (memory_access_type == VMAT_LOAD_STORE_LANES)
+    aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
+  else
+    aggr_type = vectype;
+
+  if (vls_type != VLS_LOAD)
     {
+      /* Vectorize the whole group when we reach the final statement.
+        Replace all other statements with an empty sequence.  */
+      if (STMT_VINFO_GROUPED_ACCESS (stmt_info)
+         && (GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))
+             < GROUP_SIZE (vinfo_for_stmt (first_stmt))))
+       {
+         *vec_stmt = NULL;
+         return true;
+       }
+
+      auto_vec<tree, 16> operands;
       tree vec_rhs = NULL_TREE, vec_mask = NULL_TREE;
       prev_stmt_info = NULL;
       LOOP_VINFO_HAS_MASK_STORE (loop_vinfo) = true;
@@ -2369,48 +2624,62 @@ vectorizable_mask_load_store (gimple *st
 
          if (i == 0)
            {
-             tree rhs = gimple_call_arg (stmt, 3);
-             vec_rhs = vect_get_vec_def_for_operand (rhs, stmt);
+             init_stored_values (group_size, first_stmt, &operands);
+             vec_rhs = operands[0];
              vec_mask = vect_get_vec_def_for_operand (mask, stmt,
                                                       mask_vectype);
-             /* We should have catched mismatched types earlier.  */
+             /* We should have caught mismatched types earlier.  */
              gcc_assert (useless_type_conversion_p (vectype,
                                                     TREE_TYPE (vec_rhs)));
-             dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
-                                                     NULL_TREE, &dummy, gsi,
-                                                     &ptr_incr, false, &inv_p);
+             dataref_ptr = vect_create_data_ref_ptr (first_stmt, aggr_type,
+                                                     NULL, NULL_TREE, &dummy,
+                                                     gsi, &ptr_incr, false,
+                                                     &inv_p);
              gcc_assert (!inv_p);
            }
          else
            {
-             vect_is_simple_use (vec_rhs, loop_vinfo, &def_stmt, &dt);
-             vec_rhs = vect_get_vec_def_for_stmt_copy (dt, vec_rhs);
+             advance_stored_values (group_size, first_stmt, operands);
+             vec_rhs = operands[0];
              vect_is_simple_use (vec_mask, loop_vinfo, &def_stmt, &dt);
              vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
-             dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
-                                            TYPE_SIZE_UNIT (vectype));
+             dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr,
+                                            gsi, first_stmt,
+                                            TYPE_SIZE_UNIT (aggr_type));
            }
 
-         align = DR_TARGET_ALIGNMENT (dr);
-         if (aligned_access_p (dr))
-           misalign = 0;
-         else if (DR_MISALIGNMENT (dr) == -1)
+         if (memory_access_type == VMAT_LOAD_STORE_LANES)
            {
-             align = TYPE_ALIGN_UNIT (elem_type);
-             misalign = 0;
+             tree ref_type = get_masked_group_alias_ptr_type (first_stmt);
+             new_stmt = do_store_lanes (stmt, gsi, group_size, aggr_type,
+                                        dataref_ptr, ref_type, operands,
+                                        vec_mask);
            }
          else
-           misalign = DR_MISALIGNMENT (dr);
-         set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
-                                 misalign);
-         tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
-                                   misalign ? least_bit_hwi (misalign) : align);
-         gcall *call
-           = gimple_build_call_internal (IFN_MASK_STORE, 4, dataref_ptr,
-                                         ptr, vec_mask, vec_rhs);
-         gimple_call_set_nothrow (call, true);
-         new_stmt = call;
-         vect_finish_stmt_generation (stmt, new_stmt, gsi);
+           {
+             align = DR_TARGET_ALIGNMENT (dr);
+             if (aligned_access_p (dr))
+               misalign = 0;
+             else if (DR_MISALIGNMENT (dr) == -1)
+               {
+                 align = TYPE_ALIGN_UNIT (elem_type);
+                 misalign = 0;
+               }
+             else
+               misalign = DR_MISALIGNMENT (dr);
+             set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
+                                     misalign);
+             tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
+                                       misalign
+                                       ? least_bit_hwi (misalign)
+                                       : align);
+             gcall *call
+               = gimple_build_call_internal (IFN_MASK_STORE, 4, dataref_ptr,
+                                             ptr, vec_mask, vec_rhs);
+             gimple_call_set_nothrow (call, true);
+             new_stmt = call;
+             vect_finish_stmt_generation (stmt, new_stmt, gsi);
+           }
          if (i == 0)
            STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
          else
@@ -2420,73 +2689,88 @@ vectorizable_mask_load_store (gimple *st
     }
   else
     {
+      /* Vectorize the whole group when we reach the first statement.
+        For later statements we just need to return the cached
+        replacement.  */
+      if (group_size > 1
+         && STMT_VINFO_VEC_STMT (vinfo_for_stmt (first_stmt)))
+       {
+         *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
+         replace_mask_load (stmt, gsi);
+         return true;
+       }
+
       tree vec_mask = NULL_TREE;
       prev_stmt_info = NULL;
-      vec_dest = vect_create_destination_var (gimple_call_lhs (stmt), vectype);
+      if (memory_access_type == VMAT_LOAD_STORE_LANES)
+       vec_dest = NULL_TREE;
+      else
+       vec_dest = vect_create_destination_var (gimple_call_lhs (stmt),
+                                               vectype);
       for (i = 0; i < ncopies; i++)
        {
          unsigned align, misalign;
 
          if (i == 0)
            {
+             gcc_assert (mask == gimple_call_arg (first_stmt, 2));
              vec_mask = vect_get_vec_def_for_operand (mask, stmt,
                                                       mask_vectype);
-             dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
-                                                     NULL_TREE, &dummy, gsi,
-                                                     &ptr_incr, false, &inv_p);
+             dataref_ptr = vect_create_data_ref_ptr (first_stmt, aggr_type,
+                                                     NULL, NULL_TREE, &dummy,
+                                                     gsi, &ptr_incr, false,
+                                                     &inv_p);
              gcc_assert (!inv_p);
            }
          else
            {
              vect_is_simple_use (vec_mask, loop_vinfo, &def_stmt, &dt);
              vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
-             dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
-                                            TYPE_SIZE_UNIT (vectype));
+             dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr,
+                                            gsi, first_stmt,
+                                            TYPE_SIZE_UNIT (aggr_type));
            }
 
-         align = DR_TARGET_ALIGNMENT (dr);
-         if (aligned_access_p (dr))
-           misalign = 0;
-         else if (DR_MISALIGNMENT (dr) == -1)
+         if (memory_access_type == VMAT_LOAD_STORE_LANES)
            {
-             align = TYPE_ALIGN_UNIT (elem_type);
-             misalign = 0;
+             tree ref_type = get_masked_group_alias_ptr_type (first_stmt);
+             do_load_lanes (stmt, gsi, group_size, vectype,
+                            aggr_type, dataref_ptr, ref_type, vec_mask);
+             *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
            }
          else
-           misalign = DR_MISALIGNMENT (dr);
-         set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
-                                 misalign);
-         tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
-                                   misalign ? least_bit_hwi (misalign) : align);
-         gcall *call
-           = gimple_build_call_internal (IFN_MASK_LOAD, 3, dataref_ptr,
-                                         ptr, vec_mask);
-         gimple_call_set_lhs (call, make_ssa_name (vec_dest));
-         gimple_call_set_nothrow (call, true);
-         vect_finish_stmt_generation (stmt, call, gsi);
-         if (i == 0)
-           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = call;
-         else
-           STMT_VINFO_RELATED_STMT (prev_stmt_info) = call;
-         prev_stmt_info = vinfo_for_stmt (call);
+           {
+             align = DR_TARGET_ALIGNMENT (dr);
+             if (aligned_access_p (dr))
+               misalign = 0;
+             else if (DR_MISALIGNMENT (dr) == -1)
+               {
+                 align = TYPE_ALIGN_UNIT (elem_type);
+                 misalign = 0;
+               }
+             else
+               misalign = DR_MISALIGNMENT (dr);
+             set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
+                                     misalign);
+             tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
+                                       misalign
+                                       ? least_bit_hwi (misalign)
+                                       : align);
+             gcall *call
+               = gimple_build_call_internal (IFN_MASK_LOAD, 3, dataref_ptr,
+                                             ptr, vec_mask);
+             gimple_call_set_lhs (call, make_ssa_name (vec_dest));
+             gimple_call_set_nothrow (call, true);
+             vect_finish_stmt_generation (stmt, call, gsi);
+             if (i == 0)
+               STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = call;
+             else
+               STMT_VINFO_RELATED_STMT (prev_stmt_info) = call;
+             prev_stmt_info = vinfo_for_stmt (call);
+           }
        }
-    }
 
-  if (vls_type == VLS_LOAD)
-    {
-      /* Ensure that even with -fno-tree-dce the scalar MASK_LOAD is removed
-        from the IL.  */
-      if (STMT_VINFO_RELATED_STMT (stmt_info))
-       {
-         stmt = STMT_VINFO_RELATED_STMT (stmt_info);
-         stmt_info = vinfo_for_stmt (stmt);
-       }
-      tree lhs = gimple_call_lhs (stmt);
-      new_stmt = gimple_build_assign (lhs, build_zero_cst (TREE_TYPE (lhs)));
-      set_vinfo_for_stmt (new_stmt, stmt_info);
-      set_vinfo_for_stmt (stmt, NULL);
-      STMT_VINFO_STMT (stmt_info) = new_stmt;
-      gsi_replace (gsi, new_stmt, true);
+      replace_mask_load (stmt, gsi);
     }
 
   return true;
@@ -5818,7 +6102,7 @@ vectorizable_store (gimple *stmt, gimple
     return false;
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, vls_type, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, false, vls_type, ncopies,
                            &memory_access_type, &gs_info))
     return false;
 
@@ -6353,34 +6637,21 @@ vectorizable_store (gimple *stmt, gimple
               vec_oprnd = vec_oprnds[0];
             }
           else
-            {
-             /* For interleaved stores we collect vectorized defs for all the
-                stores in the group in DR_CHAIN and OPRNDS. DR_CHAIN is then
-                used as an input to vect_permute_store_chain(), and OPRNDS as
-                an input to vect_get_vec_def_for_stmt_copy() for the next copy.
-
-                If the store is not grouped, GROUP_SIZE is 1, and DR_CHAIN and
-                OPRNDS are of size 1.  */
-             next_stmt = first_stmt;
-             for (i = 0; i < group_size; i++)
-               {
-                 /* Since gaps are not supported for interleaved stores,
-                    GROUP_SIZE is the exact number of stmts in the chain.
-                    Therefore, NEXT_STMT can't be NULL_TREE.  In case that
-                    there is no interleaving, GROUP_SIZE is 1, and only one
-                    iteration of the loop will be executed.  */
-                 gcc_assert (next_stmt
-                             && gimple_assign_single_p (next_stmt));
-                 op = gimple_assign_rhs1 (next_stmt);
-
-                 vec_oprnd = vect_get_vec_def_for_operand (op, next_stmt);
-                 dr_chain.quick_push (vec_oprnd);
-                 oprnds.quick_push (vec_oprnd);
-                 next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (next_stmt));
-               }
+           {
+             /* For interleaved stores we collect vectorized defs
+                for all the stores in the group in DR_CHAIN and OPRNDS.
+                DR_CHAIN is then used as an input to
+                vect_permute_store_chain(), and OPRNDS as an input to
+                vect_get_vec_def_for_stmt_copy() for the next copy.
+
+                If the store is not grouped, GROUP_SIZE is 1, and DR_CHAIN
+                and OPRNDS are of size 1.  */
+             init_stored_values (group_size, first_stmt, &oprnds);
+             dr_chain.safe_splice (oprnds);
+             vec_oprnd = oprnds[0];
            }
 
-         /* We should have catched mismatched types earlier.  */
+         /* We should have caught mismatched types earlier.  */
          gcc_assert (useless_type_conversion_p (vectype,
                                                 TREE_TYPE (vec_oprnd)));
          bool simd_lane_access_p
@@ -6414,14 +6685,10 @@ vectorizable_store (gimple *stmt, gimple
             next copy.
             If the store is not grouped, GROUP_SIZE is 1, and DR_CHAIN and
             OPRNDS are of size 1.  */
-         for (i = 0; i < group_size; i++)
-           {
-             op = oprnds[i];
-             vect_is_simple_use (op, vinfo, &def_stmt, &dt);
-             vec_oprnd = vect_get_vec_def_for_stmt_copy (dt, op);
-             dr_chain[i] = vec_oprnd;
-             oprnds[i] = vec_oprnd;
-           }
+         advance_stored_values (group_size, first_stmt, oprnds);
+         dr_chain.truncate (0);
+         dr_chain.splice (oprnds);
+         vec_oprnd = oprnds[0];
          if (dataref_offset)
            dataref_offset
              = int_const_binop (PLUS_EXPR, dataref_offset,
@@ -6432,27 +6699,8 @@ vectorizable_store (gimple *stmt, gimple
        }
 
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-       {
-         tree vec_array;
-
-         /* Combine all the vectors into an array.  */
-         vec_array = create_vector_array (vectype, vec_num);
-         for (i = 0; i < vec_num; i++)
-           {
-             vec_oprnd = dr_chain[i];
-             write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
-           }
-
-         /* Emit:
-              MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
-         data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-         gcall *call = gimple_build_call_internal (IFN_STORE_LANES, 1,
-                                                   vec_array);
-         gimple_call_set_lhs (call, data_ref);
-         gimple_call_set_nothrow (call, true);
-         new_stmt = call;
-         vect_finish_stmt_generation (stmt, new_stmt, gsi);
-       }
+       new_stmt = do_store_lanes (stmt, gsi, vec_num, aggr_type,
+                                  dataref_ptr, ref_type, dr_chain, NULL_TREE);
       else
        {
          new_stmt = NULL;
@@ -6859,7 +7107,7 @@ vectorizable_load (gimple *stmt, gimple_
     }
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, VLS_LOAD, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, false, VLS_LOAD, ncopies,
                            &memory_access_type, &gs_info))
     return false;
 
@@ -7553,32 +7801,8 @@ vectorizable_load (gimple *stmt, gimple_
        dr_chain.create (vec_num);
 
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-       {
-         tree vec_array;
-
-         vec_array = create_vector_array (vectype, vec_num);
-
-         /* Emit:
-              VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
-         data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-         gcall *call = gimple_build_call_internal (IFN_LOAD_LANES, 1,
-                                                   data_ref);
-         gimple_call_set_lhs (call, vec_array);
-         gimple_call_set_nothrow (call, true);
-         new_stmt = call;
-         vect_finish_stmt_generation (stmt, new_stmt, gsi);
-
-         /* Extract each vector into an SSA_NAME.  */
-         for (i = 0; i < vec_num; i++)
-           {
-             new_temp = read_vector_array (stmt, gsi, scalar_dest,
-                                           vec_array, i);
-             dr_chain.quick_push (new_temp);
-           }
-
-         /* Record the mapping between SSA_NAMEs and statements.  */
-         vect_record_grouped_load_vectors (stmt, dr_chain);
-       }
+       do_load_lanes (stmt, gsi, group_size, vectype, aggr_type,
+                      dataref_ptr, ref_type, NULL_TREE);
       else
        {
          for (i = 0; i < vec_num; i++)
@@ -8907,7 +9131,16 @@ vect_transform_stmt (gimple *stmt, gimpl
       done = vectorizable_call (stmt, gsi, &vec_stmt, slp_node);
       stmt = gsi_stmt (*gsi);
       if (gimple_call_internal_p (stmt, IFN_MASK_STORE))
-       is_store = true;
+       {
+         gcc_assert (!slp_node);
+         /* As with normal stores, we vectorize the whole group when
+            we reach the last call in the group.  The other calls in
+            the group are left with a null VEC_STMT.  */
+         if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+           *grouped_store = true;
+         if (STMT_VINFO_VEC_STMT (stmt_info))
+           is_store = true;
+       }
       break;
 
     case call_simd_clone_vec_info_type:
Index: gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c        2017-11-08 16:35:04.763816035 +0000
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+
+void
+f (int *restrict a, int *restrict b, int *restrict c)
+{
+  for (int i = 0; i < 100; ++i)
+    if (c[i])
+      {
+       a[i * 2] = b[i * 5 + 2];
+       a[i * 2 + 1] = b[i * 5];
+      }
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1.c   2017-11-08 16:35:04.763816035 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = src[i * 2] + src[i * 2 + 1];                  \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1_run.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1_run.c       2017-11-08 16:35:04.763816035 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)     \
+  {                                                    \
+    OUTTYPE out[N];                                    \
+    INTYPE in[N * 2];                                  \
+    MASKTYPE mask[N];                                  \
+    for (int i = 0; i < N; ++i)                                \
+      {                                                        \
+       out[i] = i * 7 / 2;                             \
+       mask[i] = i % 5 <= i % 3;                       \
+       asm volatile ("" ::: "memory");                 \
+      }                                                        \
+    for (int i = 0; i < N * 2; ++i)                    \
+      in[i] = i * 9 / 2;                               \
+    NAME##_2 (out, in, mask, N);                       \
+    for (int i = 0; i < N; ++i)                                \
+      {                                                        \
+       OUTTYPE if_true = in[i * 2] + in[i * 2 + 1];    \
+       OUTTYPE if_false = i * 7 / 2;                   \
+       if (out[i] != (mask[i] ? if_true : if_false))   \
+         __builtin_abort ();                           \
+       asm volatile ("" ::: "memory");                 \
+      }                                                        \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2.c   2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,69 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = (src[i * 3]                                   \
+                  + src[i * 3 + 1]                             \
+                  + src[i * 3 + 2]);                           \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2_run.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2_run.c       2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,40 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)     \
+  {                                                    \
+    OUTTYPE out[N];                                    \
+    INTYPE in[N * 3];                                  \
+    MASKTYPE mask[N];                                  \
+    for (int i = 0; i < N; ++i)                                \
+      {                                                        \
+       out[i] = i * 7 / 2;                             \
+       mask[i] = i % 5 <= i % 3;                       \
+       asm volatile ("" ::: "memory");                 \
+      }                                                        \
+    for (int i = 0; i < N * 3; ++i)                    \
+      in[i] = i * 9 / 2;                               \
+    NAME##_3 (out, in, mask, N);                       \
+    for (int i = 0; i < N; ++i)                                \
+      {                                                        \
+       OUTTYPE if_true = (in[i * 3]                    \
+                          + in[i * 3 + 1]              \
+                          + in[i * 3 + 2]);            \
+       OUTTYPE if_false = i * 7 / 2;                   \
+       if (out[i] != (mask[i] ? if_true : if_false))   \
+         __builtin_abort ();                           \
+       asm volatile ("" ::: "memory");                 \
+      }                                                        \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3.c   2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = (src[i * 4]                                   \
+                  + src[i * 4 + 1]                             \
+                  + src[i * 4 + 2]                             \
+                  + src[i * 4 + 3]);                           \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3_run.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3_run.c       2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,41 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)     \
+  {                                                    \
+    OUTTYPE out[N];                                    \
+    INTYPE in[N * 4];                                  \
+    MASKTYPE mask[N];                                  \
+    for (int i = 0; i < N; ++i)                                \
+      {                                                        \
+       out[i] = i * 7 / 2;                             \
+       mask[i] = i % 5 <= i % 3;                       \
+       asm volatile ("" ::: "memory");                 \
+      }                                                        \
+    for (int i = 0; i < N * 4; ++i)                    \
+      in[i] = i * 9 / 2;                               \
+    NAME##_4 (out, in, mask, N);                       \
+    for (int i = 0; i < N; ++i)                                \
+      {                                                        \
+       OUTTYPE if_true = (in[i * 4]                    \
+                          + in[i * 4 + 1]              \
+                          + in[i * 4 + 2]              \
+                          + in[i * 4 + 3]);            \
+       OUTTYPE if_false = i * 7 / 2;                   \
+       if (out[i] != (mask[i] ? if_true : if_false))   \
+         __builtin_abort ();                           \
+       asm volatile ("" ::: "memory");                 \
+      }                                                        \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_4.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_4.c   2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = src[i * 3] + src[i * 3 + 2];                  \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_5.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_5.c   2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = src[i * 4] + src[i * 4 + 3];                  \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_6.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_6.c   2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = src[i * 2];                                   \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld2b\t} } } */
+/* { dg-final { scan-assembler-not {\tld2h\t} } } */
+/* { dg-final { scan-assembler-not {\tld2w\t} } } */
+/* { dg-final { scan-assembler-not {\tld2d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_7.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_7.c   2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = src[i * 3] + src[i * 3 + 1];                  \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld3b\t} } } */
+/* { dg-final { scan-assembler-not {\tld3h\t} } } */
+/* { dg-final { scan-assembler-not {\tld3w\t} } } */
+/* { dg-final { scan-assembler-not {\tld3d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_8.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_8.c   2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       dest[i] = src[i * 4] + src[i * 4 + 2];                  \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld4b\t} } } */
+/* { dg-final { scan-assembler-not {\tld4h\t} } } */
+/* { dg-final { scan-assembler-not {\tld4w\t} } } */
+/* { dg-final { scan-assembler-not {\tld4d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1.c  2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       {                                                       \
+         dest[i * 2] = src[i];                                 \
+         dest[i * 2 + 1] = src[i];                             \
+       }                                                       \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1_run.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1_run.c      2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  {                                                            \
+    OUTTYPE out[N * 2];                                                \
+    INTYPE in[N];                                              \
+    MASKTYPE mask[N];                                          \
+    for (int i = 0; i < N; ++i)                                        \
+      {                                                                \
+       in[i] = i * 7 / 2;                                      \
+       mask[i] = i % 5 <= i % 3;                               \
+       asm volatile ("" ::: "memory");                         \
+      }                                                                \
+    for (int i = 0; i < N * 2; ++i)                            \
+      out[i] = i * 9 / 2;                                      \
+    NAME##_2 (out, in, mask, N);                               \
+    for (int i = 0; i < N * 2; ++i)                            \
+      {                                                                \
+       OUTTYPE if_true = in[i / 2];                            \
+       OUTTYPE if_false = i * 9 / 2;                           \
+       if (out[i] != (mask[i / 2] ? if_true : if_false))       \
+         __builtin_abort ();                                   \
+       asm volatile ("" ::: "memory");                         \
+      }                                                                \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2.c  2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,71 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       {                                                       \
+         dest[i * 3] = src[i];                                 \
+         dest[i * 3 + 1] = src[i];                             \
+         dest[i * 3 + 2] = src[i];                             \
+       }                                                       \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2_run.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2_run.c      2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  {                                                            \
+    OUTTYPE out[N * 3];                                                \
+    INTYPE in[N];                                              \
+    MASKTYPE mask[N];                                          \
+    for (int i = 0; i < N; ++i)                                        \
+      {                                                                \
+       in[i] = i * 7 / 2;                                      \
+       mask[i] = i % 5 <= i % 3;                               \
+       asm volatile ("" ::: "memory");                         \
+      }                                                                \
+    for (int i = 0; i < N * 3; ++i)                            \
+      out[i] = i * 9 / 2;                                      \
+    NAME##_3 (out, in, mask, N);                               \
+    for (int i = 0; i < N * 3; ++i)                            \
+      {                                                                \
+       OUTTYPE if_true = in[i / 3];                            \
+       OUTTYPE if_false = i * 9 / 2;                           \
+       if (out[i] != (mask[i / 3] ? if_true : if_false))       \
+         __builtin_abort ();                                   \
+       asm volatile ("" ::: "memory");                         \
+      }                                                                \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3.c  2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,72 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-vect-cost-model -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      if (cond[i])                                             \
+       {                                                       \
+         dest[i * 4] = src[i];                                 \
+         dest[i * 4 + 1] = src[i];                             \
+         dest[i * 4 + 2] = src[i];                             \
+         dest[i * 4 + 3] = src[i];                             \
+       }                                                       \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3_run.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3_run.c      2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  {                                                            \
+    OUTTYPE out[N * 4];                                                \
+    INTYPE in[N];                                              \
+    MASKTYPE mask[N];                                          \
+    for (int i = 0; i < N; ++i)                                        \
+      {                                                                \
+       in[i] = i * 7 / 2;                                      \
+       mask[i] = i % 5 <= i % 3;                               \
+       asm volatile ("" ::: "memory");                         \
+      }                                                                \
+    for (int i = 0; i < N * 4; ++i)                            \
+      out[i] = i * 9 / 2;                                      \
+    NAME##_4 (out, in, mask, N);                               \
+    for (int i = 0; i < N * 4; ++i)                            \
+      {                                                                \
+       OUTTYPE if_true = in[i / 4];                            \
+       OUTTYPE if_false = i * 9 / 2;                           \
+       if (out[i] != (mask[i / 4] ? if_true : if_false))       \
+         __builtin_abort ();                                   \
+       asm volatile ("" ::: "memory");                         \
+      }                                                                \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_4.c
===================================================================
--- /dev/null   2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_4.c  2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)             \
+  void __attribute__ ((noinline, noclone))                     \
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,  \
+           MASKTYPE *__restrict cond, int n)                   \
+  {                                                            \
+    for (int i = 0; i < n; ++i)                                        \
+      {                                                                \
+       if (cond[i] < 8)                                        \
+         dest[i * 2] = src[i];                                 \
+       if (cond[i] > 2)                                        \
+         dest[i * 2 + 1] = src[i];                             \
+       }                                                       \
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tst2b\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2h\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2w\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2d\t.z[0-9]} } } */
