Re: [Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-14 Thread Chema Casanova
On 14/06/18 at 02:46, Jason Ekstrand wrote:
> On Wed, Jun 13, 2018 at 5:07 PM, Chema Casanova wrote:
> 
> On 13/06/18 22:46, Jason Ekstrand wrote:
> > On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com> wrote:
> >
> >     This new function takes care of shuffle/unshuffle components of a
> >     particular bit-size in components with a different bit-size.
> >
> >     If source type size is smaller than destination type size the
> operation
> >     needed is a component shuffle. The opposite case would be an
> unshuffle.
> >
> >     The operation allows to skip first_component number of
> components from
> >     the source.
> >
> >     Shuffle MOVs are retyped using integer types avoiding problems
> with
> >     denorms
> >     and float types. This allows to simplify uses of shuffle functions
> >     that are
> >     dealing with these retypes individually.
> >
> >     Now there is a new restriction so source and destination can
> not overlap
> >     anymore when calling this shuffle function. Following patches that
> >     migrate
> >     to use this new function will take care individually of
> avoiding source
> >     and destination overlaps.
> >     ---
> >      src/intel/compiler/brw_fs_nir.cpp | 92
> +++
> >      1 file changed, 92 insertions(+)
> >
> >     diff --git a/src/intel/compiler/brw_fs_nir.cpp
> >     b/src/intel/compiler/brw_fs_nir.cpp
> >     index 166da0aa6d7..1a9d3c41d1d 100644
> >     --- a/src/intel/compiler/brw_fs_nir.cpp
> >     +++ b/src/intel/compiler/brw_fs_nir.cpp
> >     @@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const
> >     fs_builder &bld,
> >         }
> >      }
> >
> >     +/*
> >     + * This helper takes a source register and un/shuffles it
> into the
> >     destination
> >     + * register.
> >     + *
> >     + * If source type size is smaller than destination type size the
> >     operation
> >     + * needed is a component shuffle. The opposite case would be an
> >     unshuffle. If
> >     + * source/destination type size is equal a shuffle is done that
> >     would be
> >     + * equivalent to a simple MOV.
> >
> >
> > There's a sticky bit here if we want this to work with 64-bit types on
> > gen7 and earlier because we only have DF there and not Q so the
> > brw_reg_type_from_bit_size below doesn't work.  If we care about that
> > case (and I'm not convinced we do), it should be easy enough to add a
> > type_sz(src.type) == type_sz(dst.type) case which just does MOVs from
> > source to dest.
> 
> At the moment, all current uses of this function either read from or
> write to 32-bit types. But I think that for completeness it would be nice
> to have all cases covered. Doing plain MOVs when the sizes are equal
> (which would be quite common) saves us the shuffle calculation in the
> simple case. So I'm going for it.
> 
> >     + *
> >     + * For example, if source is a 16-bit type and destination is
> >     32-bit. A 3
> >     + * components .xyz 16-bit vector on SIMD8 would be.
> >     + *
> >     + *    |x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> >     + *    |z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> >     + *
> >     + * This helper will return the following 2 32-bit components with
> >     the 16-bit
> >     + * values shuffled:
> >     + *
> >     + *    |x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> >     + *    |z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> >     + *
> >     + * For unshuffle, the example would be the opposite, a 64-bit
> type
> >     source
> >     + * and a 32-bit destination. A 2 component .xy 64-bit vector
> on SIMD8
> >     + * would be:
> >     + *
> >     + *    | x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> >     + *    | x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> >     + *    | y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
> >     + *    | y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
> >     + *
> >     + * The returned result would be the following 4 32-bit components
> >     unshuffled:
> >     + *
> >     + *    | x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
> >     + *    | x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
> >     + *    | y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
> >     + *    | y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
> >     + *
> >     + * - Source and destination register must not be overlapped.
> >     + * - 

Re: [Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-13 Thread Jason Ekstrand
On Wed, Jun 13, 2018 at 5:07 PM, Chema Casanova 
wrote:

> On 13/06/18 22:46, Jason Ekstrand wrote:
> > On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com> wrote:
> >
> > This new function takes care of shuffle/unshuffle components of a
> > particular bit-size in components with a different bit-size.
> >
> > If source type size is smaller than destination type size the
> operation
> > needed is a component shuffle. The opposite case would be an
> unshuffle.
> >
> > The operation allows to skip first_component number of components
> from
> > the source.
> >
> > Shuffle MOVs are retyped using integer types avoiding problems with
> > denorms
> > and float types. This allows to simplify uses of shuffle functions
> > that are
> > dealing with these retypes individually.
> >
> > Now there is a new restriction so source and destination can not
> overlap
> > anymore when calling this shuffle function. Following patches that
> > migrate
> > to use this new function will take care individually of avoiding
> source
> > and destination overlaps.
> > ---
> >  src/intel/compiler/brw_fs_nir.cpp | 92
> +++
> >  1 file changed, 92 insertions(+)
> >
> > diff --git a/src/intel/compiler/brw_fs_nir.cpp
> > b/src/intel/compiler/brw_fs_nir.cpp
> > index 166da0aa6d7..1a9d3c41d1d 100644
> > --- a/src/intel/compiler/brw_fs_nir.cpp
> > +++ b/src/intel/compiler/brw_fs_nir.cpp
> > @@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const
> > fs_builder &bld,
> > }
> >  }
> >
> > +/*
> > + * This helper takes a source register and un/shuffles it into the
> > destination
> > + * register.
> > + *
> > + * If source type size is smaller than destination type size the
> > operation
> > + * needed is a component shuffle. The opposite case would be an
> > unshuffle. If
> > + * source/destination type size is equal a shuffle is done that
> > would be
> > + * equivalent to a simple MOV.
> >
> >
> > There's a sticky bit here if we want this to work with 64-bit types on
> > gen7 and earlier because we only have DF there and not Q so the
> > brw_reg_type_from_bit_size below doesn't work.  If we care about that
> > case (and I'm not convinced we do), it should be easy enough to add a
> > type_sz(src.type) == type_sz(dst.type) case which just does MOVs from
> > source to dest.
>
> At the moment, all current uses of this function either read from or
> write to 32-bit types. But I think that for completeness it would be nice
> to have all cases covered. Doing plain MOVs when the sizes are equal
> (which would be quite common) saves us the shuffle calculation in the
> simple case. So I'm going for it.
>
> > + *
> > + * For example, if source is a 16-bit type and destination is
> > 32-bit. A 3
> > + * components .xyz 16-bit vector on SIMD8 would be.
> > + *
> > + *|x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> > + *|z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> > + *
> > + * This helper will return the following 2 32-bit components with
> > the 16-bit
> > + * values shuffled:
> > + *
> > + *|x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> > + *|z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> > + *
> > + * For unshuffle, the example would be the opposite, a 64-bit type
> > source
> > + * and a 32-bit destination. A 2 component .xy 64-bit vector on
> SIMD8
> > + * would be:
> > + *
> > + *| x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> > + *| x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> > + *| y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
> > + *| y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
> > + *
> > + * The returned result would be the following 4 32-bit components
> > unshuffled:
> > + *
> > + *| x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
> > + *| x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
> > + *| y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
> > + *| y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
> > + *
> > + * - Source and destination register must not be overlapped.
> > + * - first_component parameter allows skipping source components.
> > + */
> > +void
> > +shuffle_src_to_dst(const fs_builder &bld,
> > +   const fs_reg &dst,
> > +   const fs_reg &src,
> > +   uint32_t first_component,
> > +   uint32_t components)
> > +{
> > +   if (type_sz(src.type) <= type_sz(dst.type)) {
> > +  /* Source is shuffled into destination */
> > +  unsigned size_ratio = type_sz(dst.type) / type_sz(src.type);
> > +#ifndef NDEBUG
> > +  boolean src_dst_overlap = 

Re: [Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-13 Thread Chema Casanova
On 13/06/18 22:46, Jason Ekstrand wrote:
> On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com> wrote:
> 
> This new function takes care of shuffle/unshuffle components of a
> particular bit-size in components with a different bit-size.
> 
> If source type size is smaller than destination type size the operation
> needed is a component shuffle. The opposite case would be an unshuffle.
> 
> The operation allows to skip first_component number of components from
> the source.
> 
> Shuffle MOVs are retyped using integer types avoiding problems with
> denorms
> and float types. This allows to simplify uses of shuffle functions
> that are
> dealing with these retypes individually.
> 
> Now there is a new restriction so source and destination can not overlap
> anymore when calling this shuffle function. Following patches that
> migrate
> to use this new function will take care individually of avoiding source
> and destination overlaps.
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 92 +++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 166da0aa6d7..1a9d3c41d1d 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const
> fs_builder &bld,
>     }
>  }
> 
> +/*
> + * This helper takes a source register and un/shuffles it into the
> destination
> + * register.
> + *
> + * If source type size is smaller than destination type size the
> operation
> + * needed is a component shuffle. The opposite case would be an
> unshuffle. If
> + * source/destination type size is equal a shuffle is done that
> would be
> + * equivalent to a simple MOV.
> 
> 
> There's a sticky bit here if we want this to work with 64-bit types on
> gen7 and earlier because we only have DF there and not Q so the
> brw_reg_type_from_bit_size below doesn't work.  If we care about that
> case (and I'm not convinced we do), it should be easy enough to add a
> type_sz(src.type) == type_sz(dst.type) case which just does MOVs from
> source to dest.

At the moment, all current uses of this function either read from or
write to 32-bit types. But I think that for completeness it would be nice
to have all cases covered. Doing plain MOVs when the sizes are equal
(which would be quite common) saves us the shuffle calculation in the
simple case. So I'm going for it.
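
As a rough illustration of the three-way split discussed above, here is a standalone sketch. It is not the Mesa code: `Op`, `classify` and `copy_components` are hypothetical names, registers are modeled as flat arrays of lanes, and the equal-size case is shown as plain per-component copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical classification of the work needed for a pair of type
// sizes (in bytes). Not part of the patch.
enum class Op { Move, Shuffle, Unshuffle };

Op classify(unsigned src_type_size, unsigned dst_type_size)
{
    if (src_type_size == dst_type_size)
        return Op::Move;  // equal sizes: plain per-component copies
    return src_type_size < dst_type_size ? Op::Shuffle : Op::Unshuffle;
}

// Equal-size case on a flat model: component k occupies lanes
// [k * simd_width, (k + 1) * simd_width).
std::vector<std::uint32_t> copy_components(const std::vector<std::uint32_t> &src,
                                           std::size_t simd_width,
                                           std::size_t first_component,
                                           std::size_t components)
{
    assert(src.size() >= (first_component + components) * simd_width);
    std::vector<std::uint32_t> dst(components * simd_width);
    for (std::size_t i = 0; i < components; i++)
        for (std::size_t lane = 0; lane < simd_width; lane++)
            dst[i * simd_width + lane] =
                src[(i + first_component) * simd_width + lane];
    return dst;
}
```

With this split, the equal-size path never has to compute a size ratio or a subscript, which is the saving mentioned above.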

> + *
> + * For example, if source is a 16-bit type and destination is
> 32-bit. A 3
> + * components .xyz 16-bit vector on SIMD8 would be.
> + *
> + *    |x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> + *    |z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> + *
> + * This helper will return the following 2 32-bit components with
> the 16-bit
> + * values shuffled:
> + *
> + *    |x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> + *    |z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> + *
> + * For unshuffle, the example would be the opposite, a 64-bit type
> source
> + * and a 32-bit destination. A 2 component .xy 64-bit vector on SIMD8
> + * would be:
> + *
> + *    | x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> + *    | x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> + *    | y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
> + *    | y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
> + *
> + * The returned result would be the following 4 32-bit components
> unshuffled:
> + *
> + *    | x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
> + *    | x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
> + *    | y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
> + *    | y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
> + *
> + * - Source and destination register must not be overlapped.
> + * - first_component parameter allows skipping source components.
> + */
> +void
> +shuffle_src_to_dst(const fs_builder &bld,
> +                   const fs_reg &dst,
> +                   const fs_reg &src,
> +                   uint32_t first_component,
> +                   uint32_t components)
> +{
> +   if (type_sz(src.type) <= type_sz(dst.type)) {
> +      /* Source is shuffled into destination */
> +      unsigned size_ratio = type_sz(dst.type) / type_sz(src.type);
> +#ifndef NDEBUG
> +      boolean src_dst_overlap = regions_overlap(dst,
> +         type_sz(dst.type) * bld.dispatch_width() * components,
> +         offset(src, bld, first_component * size_ratio),
> 
> 
> Why do you need to multiply first_component by size_ratio?  It's already
> in units of source components.

Yes, that's wrong. I forgot to 

Re: [Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-13 Thread Jason Ekstrand
On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo <
jmcasan...@igalia.com> wrote:

> This new function takes care of shuffle/unshuffle components of a
> particular bit-size in components with a different bit-size.
>
> If source type size is smaller than destination type size the operation
> needed is a component shuffle. The opposite case would be an unshuffle.
>
> The operation allows to skip first_component number of components from
> the source.
>
> Shuffle MOVs are retyped using integer types avoiding problems with denorms
> and float types. This allows to simplify uses of shuffle functions that are
> dealing with these retypes individually.
>
> Now there is a new restriction so source and destination can not overlap
> anymore when calling this shuffle function. Following patches that migrate
> to use this new function will take care individually of avoiding source
> and destination overlaps.
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 92 +++
>  1 file changed, 92 insertions(+)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 166da0aa6d7..1a9d3c41d1d 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const
> fs_builder &bld,
> }
>  }
>
> +/*
> + * This helper takes a source register and un/shuffles it into the
> destination
> + * register.
> + *
> + * If source type size is smaller than destination type size the operation
> + * needed is a component shuffle. The opposite case would be an
> unshuffle. If
> + * source/destination type size is equal a shuffle is done that would be
> + * equivalent to a simple MOV.
>

There's a sticky bit here if we want this to work with 64-bit types on gen7
and earlier because we only have DF there and not Q so the
brw_reg_type_from_bit_size below doesn't work.  If we care about that case
(and I'm not convinced we do), it should be easy enough to add a
type_sz(src.type) == type_sz(dst.type) case which just does MOVs from
source to dest.


> + *
> + * For example, if source is a 16-bit type and destination is 32-bit. A 3
> + * components .xyz 16-bit vector on SIMD8 would be.
> + *
> + *|x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> + *|z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> + *
> + * This helper will return the following 2 32-bit components with the
> 16-bit
> + * values shuffled:
> + *
> + *|x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> + *|z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> + *
> + * For unshuffle, the example would be the opposite, a 64-bit type source
> + * and a 32-bit destination. A 2 component .xy 64-bit vector on SIMD8
> + * would be:
> + *
> + *| x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> + *| x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> + *| y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
> + *| y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
> + *
> + * The returned result would be the following 4 32-bit components
> unshuffled:
> + *
> + *| x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
> + *| x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
> + *| y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
> + *| y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
> + *
> + * - Source and destination register must not be overlapped.
> + * - first_component parameter allows skipping source components.
> + */
> +void
> +shuffle_src_to_dst(const fs_builder &bld,
> +   const fs_reg &dst,
> +   const fs_reg &src,
> +   uint32_t first_component,
> +   uint32_t components)
> +{
> +   if (type_sz(src.type) <= type_sz(dst.type)) {
> +  /* Source is shuffled into destination */
> +  unsigned size_ratio = type_sz(dst.type) / type_sz(src.type);
> +#ifndef NDEBUG
> +  boolean src_dst_overlap = regions_overlap(dst,
> + type_sz(dst.type) * bld.dispatch_width() * components,
> + offset(src, bld, first_component * size_ratio),
>

Why do you need to multiply first_component by size_ratio?  It's already in
units of source components.
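
To make the units concrete, here is a small standalone sketch of the source byte range that the shuffle loop reads. It is only a model: `ByteRange` and `shuffle_source_range` are hypothetical names, and it assumes that advancing by one component means advancing by type size times SIMD width bytes.

```cpp
#include <cstddef>

// Hypothetical half-open byte range [begin, begin + size).
struct ByteRange {
    std::size_t begin;
    std::size_t size;
};

// Bytes of the source read when shuffling `components` source components,
// skipping the first `first_component` of them. Both counts are in units
// of source components, so no size ratio is involved.
ByteRange shuffle_source_range(std::size_t src_type_size,
                               std::size_t simd_width,
                               std::size_t first_component,
                               std::size_t components)
{
    const std::size_t component_bytes = src_type_size * simd_width;
    return {first_component * component_bytes, components * component_bytes};
}
```

For a 16-bit source on SIMD8, skipping one component and reading two gives a range starting at byte 16 and spanning 32 bytes. Scaling the skip count by the size ratio would move the start to byte 32, which is not where the loop reads.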


> + type_sz(src.type) * bld.dispatch_width() * components *
> size_ratio);
> +#endif
> +  assert(!src_dst_overlap);
>

If the only thing you're doing with src_dst_overlap is to assert on it, you
may as well put the regions_overlap call inside the assert and drop the
#ifndef.
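
A minimal sketch of that suggestion, using a standalone model instead of the compiler's types: `byte_ranges_overlap` is a hypothetical stand-in for `regions_overlap`, and `copy_disjoint` is a hypothetical caller. Because the call sits inside `assert()`, nothing is evaluated when `NDEBUG` is defined and no temporary variable is left unused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for regions_overlap(): true if the half-open byte
// ranges [a_begin, a_begin + a_size) and [b_begin, b_begin + b_size)
// share at least one byte.
constexpr bool byte_ranges_overlap(std::size_t a_begin, std::size_t a_size,
                                   std::size_t b_begin, std::size_t b_size)
{
    return a_begin < b_begin + b_size && b_begin < a_begin + a_size;
}

// Copy `size` bytes inside one buffer; source and destination must not
// overlap. The check is written directly in the assert, so no #ifndef
// NDEBUG block is needed around it.
void copy_disjoint(std::vector<std::uint8_t> &regs,
                   std::size_t dst_begin, std::size_t src_begin,
                   std::size_t size)
{
    assert(!byte_ranges_overlap(dst_begin, size, src_begin, size));
    std::memcpy(regs.data() + dst_begin, regs.data() + src_begin, size);
}
```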


> +
> +  brw_reg_type shuffle_type =
> + brw_reg_type_from_bit_size(8 * type_sz(src.type),
> +BRW_REGISTER_TYPE_D);
> +  for (unsigned i = 0; i < components; i++) {
> + fs_reg shuffle_component_i =
> +subscript(offset(dst, bld, i / size_ratio),
> +  shuffle_type, i % size_ratio);
> + bld.MOV(shuffle_component_i,
> + retype(offset(src, bld, i + first_component),
> shuffle_type));
> +  }
> +   } else {
> 

[Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-09 Thread Jose Maria Casanova Crespo
This new function takes care of shuffling/unshuffling components of a
particular bit-size into components with a different bit-size.

If the source type size is smaller than the destination type size, the
operation needed is a component shuffle. The opposite case is an unshuffle.

The operation can skip the first first_component components of the
source.
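
The data movement described above can be modeled outside the compiler. The following is only a sketch under stated assumptions: registers are flat arrays of lanes, the SIMD width is fixed at 8, the low part of a wide lane holds the first narrow component, and the names `shuffle_16_to_32` and `unshuffle_64_to_32` are hypothetical. For the unshuffle, first_component is counted in destination-sized (32-bit) components.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSimdWidth = 8;  // SIMD8, as in the patch's examples

// Shuffle: pack `components` 16-bit source components, starting at
// `first_component`, into 32-bit destination lanes (two per lane).
std::vector<std::uint32_t> shuffle_16_to_32(const std::vector<std::uint16_t> &src,
                                            std::size_t first_component,
                                            std::size_t components)
{
    const std::size_t ratio = 2;  // 32-bit / 16-bit
    assert(src.size() >= (first_component + components) * kSimdWidth);
    std::vector<std::uint32_t> dst((components + ratio - 1) / ratio * kSimdWidth, 0);
    for (std::size_t i = 0; i < components; i++) {
        for (std::size_t lane = 0; lane < kSimdWidth; lane++) {
            std::uint32_t v = src[(i + first_component) * kSimdWidth + lane];
            dst[(i / ratio) * kSimdWidth + lane] |= v << (16 * (i % ratio));
        }
    }
    return dst;
}

// Unshuffle: split 64-bit source lanes into 32-bit destination components
// (low half first, then high half).
std::vector<std::uint32_t> unshuffle_64_to_32(const std::vector<std::uint64_t> &src,
                                              std::size_t first_component,
                                              std::size_t components)
{
    const std::size_t ratio = 2;  // 64-bit / 32-bit
    assert(src.size() >=
           (first_component + components + ratio - 1) / ratio * kSimdWidth);
    std::vector<std::uint32_t> dst(components * kSimdWidth);
    for (std::size_t i = 0; i < components; i++) {
        const std::size_t c = first_component + i;
        for (std::size_t lane = 0; lane < kSimdWidth; lane++) {
            std::uint64_t v = src[(c / ratio) * kSimdWidth + lane];
            dst[i * kSimdWidth + lane] =
                static_cast<std::uint32_t>(v >> (32 * (c % ratio)));
        }
    }
    return dst;
}
```

Shuffling a 3-component 16-bit vector this way yields two 32-bit components, with the second one only half filled. Unshuffling a 2-component 64-bit vector yields four 32-bit components.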

Shuffle MOVs are retyped using integer types, avoiding problems with denorms
and float types. This simplifies the callers of the shuffle functions that
currently handle these retypes individually.
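
As a loose analogy in plain C++ (not the hardware behaviour itself), moving a value through an integer view of its storage copies the bit pattern exactly, denormal or not. `copy_lane_as_integer` is a hypothetical helper used only for this illustration.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Copy one 32-bit lane by treating its storage as an integer, so that no
// floating-point operation ever touches the value.
inline void copy_lane_as_integer(float *dst, const float *src)
{
    std::uint32_t bits;
    std::memcpy(&bits, src, sizeof bits);
    std::memcpy(dst, &bits, sizeof bits);
}
```

The destination ends up with the same 32 bits as the source, which is the property the integer retype is meant to guarantee for the shuffle MOVs.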

There is now a new restriction: source and destination can no longer
overlap when calling this shuffle function. The following patches that
migrate to this new function will each take care of avoiding source
and destination overlaps.
---
 src/intel/compiler/brw_fs_nir.cpp | 92 +++
 1 file changed, 92 insertions(+)

diff --git a/src/intel/compiler/brw_fs_nir.cpp 
b/src/intel/compiler/brw_fs_nir.cpp
index 166da0aa6d7..1a9d3c41d1d 100644
--- a/src/intel/compiler/brw_fs_nir.cpp
+++ b/src/intel/compiler/brw_fs_nir.cpp
@@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const fs_builder &bld,
    }
 }
 
+/*
+ * This helper takes a source register and un/shuffles it into the destination
+ * register.
+ *
+ * If source type size is smaller than destination type size the operation
+ * needed is a component shuffle. The opposite case would be an unshuffle. If
+ * source/destination type size is equal a shuffle is done that would be
+ * equivalent to a simple MOV.
+ *
+ * For example, if source is a 16-bit type and destination is 32-bit. A 3
+ * components .xyz 16-bit vector on SIMD8 would be.
+ *
+ *    |x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
+ *    |z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
+ *
+ * This helper will return the following 2 32-bit components with the 16-bit
+ * values shuffled:
+ *
+ *    |x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
+ *    |z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
+ *
+ * For unshuffle, the example would be the opposite, a 64-bit type source
+ * and a 32-bit destination. A 2 component .xy 64-bit vector on SIMD8
+ * would be:
+ *
+ *    | x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
+ *    | x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
+ *    | y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
+ *    | y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
+ *
+ * The returned result would be the following 4 32-bit components unshuffled:
+ *
+ *    | x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
+ *    | x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
+ *    | y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
+ *    | y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
+ *
+ * - Source and destination register must not be overlapped.
+ * - first_component parameter allows skipping source components.
+ */
+void
+shuffle_src_to_dst(const fs_builder &bld,
+                   const fs_reg &dst,
+                   const fs_reg &src,
+                   uint32_t first_component,
+                   uint32_t components)
+{
+   if (type_sz(src.type) <= type_sz(dst.type)) {
+      /* Source is shuffled into destination */
+      unsigned size_ratio = type_sz(dst.type) / type_sz(src.type);
+#ifndef NDEBUG
+      boolean src_dst_overlap = regions_overlap(dst,
+         type_sz(dst.type) * bld.dispatch_width() * components,
+         offset(src, bld, first_component * size_ratio),
+         type_sz(src.type) * bld.dispatch_width() * components * size_ratio);
+#endif
+      assert(!src_dst_overlap);
+
+      brw_reg_type shuffle_type =
+         brw_reg_type_from_bit_size(8 * type_sz(src.type),
+                                    BRW_REGISTER_TYPE_D);
+      for (unsigned i = 0; i < components; i++) {
+         fs_reg shuffle_component_i =
+            subscript(offset(dst, bld, i / size_ratio),
+                      shuffle_type, i % size_ratio);
+         bld.MOV(shuffle_component_i,
+                 retype(offset(src, bld, i + first_component), shuffle_type));
+      }
+   } else {
+      /* Source is unshuffled into destination */
+      unsigned size_ratio = type_sz(src.type) / type_sz(dst.type);
+#ifndef NDEBUG
+      boolean src_dst_overlap = regions_overlap(dst,
+         type_sz(dst.type) * bld.dispatch_width() * components,
+         offset(src, bld, first_component / size_ratio),
+         type_sz(src.type) * bld.dispatch_width() *
+         DIV_ROUND_UP(components + first_component % size_ratio, size_ratio));
+#endif
+      assert(!src_dst_overlap);
+      brw_reg_type shuffle_type =
+         brw_reg_type_from_bit_size(8 * type_sz(dst.type),
+                                    BRW_REGISTER_TYPE_D);
+      for (unsigned i = 0; i < components; i++) {
+         fs_reg shuffle_component_i =
+            subscript(offset(src, bld, (first_component + i) / size_ratio),
+                      shuffle_type, (first_component + i) % size_ratio);
+         bld.MOV(retype(offset(dst, bld, i),