Re: [RFC] ldist: Recognize rawmemchr loop patterns

On Mon, Jan 31, 2022 at 2:16 PM Tom de Vries wrote:
> On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:
> > On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus wrote:
> > > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus wrote:
> > > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > > > [...]
> > > > > >
> > > > > > > +  /* Handle strlen like loops.  */
> > > > > > > +  if (store_dr == NULL
> > > > > > > +      && integer_zerop (pattern)
> > > > > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > > +      && integer_onep (reduction_iv.step)
> > > > > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > > > > +          || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > > > > +    {
> > > > > >
> > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > > The iteration only stops when you load a NUL and the increments just
> > > > > > wrap along (you're using the pointer IVs to compute the strlen
> > > > > > result).  Can't you simply truncate?
> > > > >
> > > > > I think truncation is enough as long as no overflow occurs in strlen
> > > > > or strlen_using_rawmemchr.
> > > > >
> > > > > > For larger than size_type_node (actually larger than ptr_type_node
> > > > > > would matter, I guess), the argument is that since pointer wrapping
> > > > > > would be undefined anyway the IV cannot wrap either.  Now, the
> > > > > > correct check here would IMHO be
> > > > > >
> > > > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION (ptr_type_node)
> > > > > >   || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > >
> > > > > > ?
> > > > >
> > > > > Regarding the implementation which makes use of rawmemchr:
> > > > >
> > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.
> > > > > Thus, the maximal length we can determine of a string, where each
> > > > > character has size S, is PTRDIFF_MAX / S without an overflow.  Since
> > > > > an overflow for the ptrdiff type is undefined, we have to make sure
> > > > > that if an overflow occurs, then an overflow occurs for the reduction
> > > > > variable, too, and that this is undefined, too.  However, I'm not
> > > > > sure anymore whether we want to respect overflows in all cases.  If
> > > > > TYPE_PRECISION (ptr_type_node) equals TYPE_PRECISION (ptrdiff_type_node)
> > > > > and an overflow occurs, then this would mean that a single string
> > > > > consumes more than half of the virtual addressable memory.  At least
> > > > > for architectures where TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > holds, I think it is reasonable to neglect the case where computing
> > > > > the pointer difference may overflow.  Otherwise we are talking about
> > > > > strings with lengths of multiple pebibytes.  For other architectures
> > > > > we might have to be more precise and make sure that the reduction
> > > > > variable overflows first and that this is undefined.
> > > > >
> > > > > Thus a conservative condition would be (I assumed that the size of
> > > > > any integral type is a power of two, which I'm not sure really holds;
> > > > > IIRC the C standard requires only that the alignment is a power of
> > > > > two, but not necessarily the size, so I might need to change this):
> > > > >
> > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1
> > > > >    - log2 (sizeof (load_type))), or in other words return true if the
> > > > >    reduction variable overflows first and false otherwise.  */
> > > > >
> > > > > static bool
> > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > {
> > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > > > }
> > > > >
> > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > >     && reduction_var_overflows_first (reduction_var, load_type))
> > > > >
> > > > > Regarding the implementation which makes use of strlen:
> > > > >
> > > > > I'm not sure what it means if strlen is called for a string with a
> > > > > length greater than SIZE_MAX.  Therefore, similar to the
> > > > > implementation using rawmemchr, where we neglect the case of an
> > > > > overflow for 64-bit architectures, a conservative condition would be:
> > > > >
> > > > > TYPE_PRECISION (size_type_node) == 64
> > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > > > >
> > > > > I still included the overflow-undefined check for the reduction
> > > > > variable in order to rule out situations where the reduction variable
> > > > > is unsigned and overflows enough times that strlen(,_using_rawmemchr)
> > > > > overflows, too.  Maybe this is all theoretical nonsense, but I'm
> > > > > afraid of uncommon architectures.  Anyhow, while writing this down it
> > > > > becomes clear that this deserves a comment, which I will add once it
> > > > > becomes clear which way to go.
> > > >
> > > > I think all the arguments about objects bigger than half of the
> > > > address space are also valid for 32-bit targets and thus 32-bit
> > > > size_type_node (or 32-bit pointer size).  I'm not actually sure what's
> > > > the canonical type to check against, whether it's size_type_node (C's
> > > > size_t), ptr_type_node (C's void *) or sizetype (the middle-end
> > > > "offset" type used for all address computations).  For weird reasons
> > > > I'd lean towards 'sizetype' (for example some embedded targets have
> > > > [...]
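To make the loop shape under discussion concrete, here is a minimal sketch in plain C (not GCC-internal code; the function names are made up for illustration) of a strlen-like loop the pattern matcher looks for, and of the pointer-difference formulation whose overflow behavior the thread is debating:

```c
#include <assert.h>
#include <stddef.h>

/* A strlen-like loop as matched by the ldist pattern: the reduction
   variable starts at a constant, steps by 1, there is no store, and
   the loop exits when the loaded element equals the pattern (NUL).  */
static size_t
strlen_like_loop (const char *s)
{
  size_t n = 0;          /* reduction_var: base 0, step 1 */
  while (s[n] != '\0')   /* pattern is the zero byte */
    n++;
  return n;
}

/* What a rawmemchr-based replacement amounts to: scan for the pattern
   byte and derive the count from a pointer difference.  The ptrdiff_t
   subtraction and the truncation to the reduction variable's type are
   exactly where the overflow questions in the thread arise.  */
static size_t
strlen_via_pointer_diff (const char *s)
{
  const char *end = s;
  while (*end != '\0')
    end++;
  return (size_t) (end - s);  /* ptrdiff_t result, then truncation */
}
```

Both functions compute the same value for any ordinary string; they only differ once the length approaches the limits of ptrdiff_t or of the reduction variable's type, which is the case the precision checks below try to rule out.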
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Fri, Sep 17, 2021 at 10:08:27AM +0200, Richard Biener wrote: > On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus > wrote: > > > > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote: > > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus > > > wrote: > > > > > > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote: > > > > [...] > > > > > > > > > > > > > > + /* Handle strlen like loops. */ > > > > > > > + if (store_dr == NULL > > > > > > > + && integer_zerop (pattern) > > > > > > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > > > > > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > > > > > > + && integer_onep (reduction_iv.step) > > > > > > > + && (types_compatible_p (TREE_TYPE (reduction_var), > > > > > > > size_type_node) > > > > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var > > > > > > > +{ > > > > > > > > > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV > > > > > > > type? > > > > > > > The iteration > > > > > > > only stops when you load a NUL and the increments just wrap along > > > > > > > (you're > > > > > > > using the pointer IVs to compute the strlen result). Can't you > > > > > > > simply truncate? > > > > > > > > > > > > I think truncation is enough as long as no overflow occurs in > > > > > > strlen or > > > > > > strlen_using_rawmemchr. > > > > > > > > > > > > > For larger than size_type_node (actually larger than > > > > > > > ptr_type_node would matter > > > > > > > I guess), the argument is that since pointer wrapping would be > > > > > > > undefined anyway > > > > > > > the IV cannot wrap either. Now, the correct check here would > > > > > > > IMHO be > > > > > > > > > > > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > > > > > > (ptr_type_node) > > > > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > > > > > > > > > > > ? 
> > > > > > > > > > > > Regarding the implementation which makes use of rawmemchr: > > > > > > > > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow. > > > > > > Thus, > > > > > > the maximal length we can determine of a string where each > > > > > > character has > > > > > > size S is PTRDIFF_MAX / S without an overflow. Since an overflow > > > > > > for > > > > > > ptrdiff type is undefined we have to make sure that if an overflow > > > > > > occurs, then an overflow occurs for reduction variable, too, and > > > > > > that > > > > > > this is undefined, too. However, I'm not sure anymore whether we > > > > > > want > > > > > > to respect overflows in all cases. If TYPE_PRECISION > > > > > > (ptr_type_node) > > > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, > > > > > > then > > > > > > this would mean that a single string consumes more than half of the > > > > > > virtual addressable memory. At least for architectures where > > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is > > > > > > reasonable > > > > > > to neglect the case where computing pointer difference may overflow. > > > > > > Otherwise we are talking about strings with lenghts of multiple > > > > > > pebibytes. For other architectures we might have to be more precise > > > > > > and make sure that reduction variable overflows first and that this > > > > > > is > > > > > > undefined. 
> > > > > > > > > > > > Thus a conservative condition would be (I assumed that the size of > > > > > > any > > > > > > integral type is a power of two which I'm not sure if this really > > > > > > holds; > > > > > > IIRC the C standard requires only that the alignment is a power of > > > > > > two > > > > > > but not necessarily the size so I might need to change this): > > > > > > > > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - > > > > > > 1 - log2 (sizeof (load_type)) > > > > > >or in other words return true if reduction variable overflows > > > > > > first > > > > > >and false otherwise. */ > > > > > > > > > > > > static bool > > > > > > reduction_var_overflows_first (tree reduction_var, tree load_type) > > > > > > { > > > > > > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > > > > > > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE > > > > > > (reduction_var)); > > > > > > unsigned size_exponent = wi::exact_log2 (wi::to_wide > > > > > > (TYPE_SIZE_UNIT (load_type))); > > > > > > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 > > > > > > - size_exponent); > > > > > > } > > > > > > > > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 > > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > > > > && reduction_var_overflows_first (reduction_var, load_type) > > > > > > > > > > > > Regarding the implementation which makes use of strlen: > > > > > > > > > > > > I'm not sure what it means if strlen is called
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus wrote: > > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote: > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus > > wrote: > > > > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote: > > > [...] > > > > > > > > > > > > + /* Handle strlen like loops. */ > > > > > > + if (store_dr == NULL > > > > > > + && integer_zerop (pattern) > > > > > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > > > > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > > > > > + && integer_onep (reduction_iv.step) > > > > > > + && (types_compatible_p (TREE_TYPE (reduction_var), > > > > > > size_type_node) > > > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var > > > > > > +{ > > > > > > > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type? > > > > > > The iteration > > > > > > only stops when you load a NUL and the increments just wrap along > > > > > > (you're > > > > > > using the pointer IVs to compute the strlen result). Can't you > > > > > > simply truncate? > > > > > > > > > > I think truncation is enough as long as no overflow occurs in strlen > > > > > or > > > > > strlen_using_rawmemchr. > > > > > > > > > > > For larger than size_type_node (actually larger than ptr_type_node > > > > > > would matter > > > > > > I guess), the argument is that since pointer wrapping would be > > > > > > undefined anyway > > > > > > the IV cannot wrap either. Now, the correct check here would IMHO > > > > > > be > > > > > > > > > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > > > > > (ptr_type_node) > > > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > > > > > > > > > ? > > > > > > > > > > Regarding the implementation which makes use of rawmemchr: > > > > > > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow. 
> > > > > Thus, > > > > > the maximal length we can determine of a string where each character > > > > > has > > > > > size S is PTRDIFF_MAX / S without an overflow. Since an overflow for > > > > > ptrdiff type is undefined we have to make sure that if an overflow > > > > > occurs, then an overflow occurs for reduction variable, too, and that > > > > > this is undefined, too. However, I'm not sure anymore whether we want > > > > > to respect overflows in all cases. If TYPE_PRECISION (ptr_type_node) > > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > > > > > this would mean that a single string consumes more than half of the > > > > > virtual addressable memory. At least for architectures where > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is > > > > > reasonable > > > > > to neglect the case where computing pointer difference may overflow. > > > > > Otherwise we are talking about strings with lenghts of multiple > > > > > pebibytes. For other architectures we might have to be more precise > > > > > and make sure that reduction variable overflows first and that this is > > > > > undefined. > > > > > > > > > > Thus a conservative condition would be (I assumed that the size of any > > > > > integral type is a power of two which I'm not sure if this really > > > > > holds; > > > > > IIRC the C standard requires only that the alignment is a power of two > > > > > but not necessarily the size so I might need to change this): > > > > > > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 > > > > > - log2 (sizeof (load_type)) > > > > >or in other words return true if reduction variable overflows first > > > > >and false otherwise. 
*/ > > > > > > > > > > static bool > > > > > reduction_var_overflows_first (tree reduction_var, tree load_type) > > > > > { > > > > > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > > > > > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE > > > > > (reduction_var)); > > > > > unsigned size_exponent = wi::exact_log2 (wi::to_wide > > > > > (TYPE_SIZE_UNIT (load_type))); > > > > > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - > > > > > size_exponent); > > > > > } > > > > > > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > > > && reduction_var_overflows_first (reduction_var, load_type) > > > > > > > > > > Regarding the implementation which makes use of strlen: > > > > > > > > > > I'm not sure what it means if strlen is called for a string with a > > > > > length greater than SIZE_MAX. Therefore, similar to the > > > > > implementation > > > > > using rawmemchr where we neglect the case of an overflow for 64bit > > > > > architectures, a conservative condition would be: > > > > > > > > > > TYPE_PRECISION (size_type_node) == 64 > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > > > &&
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote: > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus > wrote: > > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote: > > [...] > > > > > > > > > > + /* Handle strlen like loops. */ > > > > > + if (store_dr == NULL > > > > > + && integer_zerop (pattern) > > > > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > > > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > > > > + && integer_onep (reduction_iv.step) > > > > > + && (types_compatible_p (TREE_TYPE (reduction_var), > > > > > size_type_node) > > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var > > > > > +{ > > > > > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type? > > > > > The iteration > > > > > only stops when you load a NUL and the increments just wrap along > > > > > (you're > > > > > using the pointer IVs to compute the strlen result). Can't you > > > > > simply truncate? > > > > > > > > I think truncation is enough as long as no overflow occurs in strlen or > > > > strlen_using_rawmemchr. > > > > > > > > > For larger than size_type_node (actually larger than ptr_type_node > > > > > would matter > > > > > I guess), the argument is that since pointer wrapping would be > > > > > undefined anyway > > > > > the IV cannot wrap either. Now, the correct check here would IMHO be > > > > > > > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > > > > (ptr_type_node) > > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > > > > > > > ? > > > > > > > > Regarding the implementation which makes use of rawmemchr: > > > > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow. Thus, > > > > the maximal length we can determine of a string where each character has > > > > size S is PTRDIFF_MAX / S without an overflow. 
Since an overflow for > > > > ptrdiff type is undefined we have to make sure that if an overflow > > > > occurs, then an overflow occurs for reduction variable, too, and that > > > > this is undefined, too. However, I'm not sure anymore whether we want > > > > to respect overflows in all cases. If TYPE_PRECISION (ptr_type_node) > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > > > > this would mean that a single string consumes more than half of the > > > > virtual addressable memory. At least for architectures where > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable > > > > to neglect the case where computing pointer difference may overflow. > > > > Otherwise we are talking about strings with lenghts of multiple > > > > pebibytes. For other architectures we might have to be more precise > > > > and make sure that reduction variable overflows first and that this is > > > > undefined. > > > > > > > > Thus a conservative condition would be (I assumed that the size of any > > > > integral type is a power of two which I'm not sure if this really holds; > > > > IIRC the C standard requires only that the alignment is a power of two > > > > but not necessarily the size so I might need to change this): > > > > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - > > > > log2 (sizeof (load_type)) > > > >or in other words return true if reduction variable overflows first > > > >and false otherwise. 
*/ > > > > > > > > static bool > > > > reduction_var_overflows_first (tree reduction_var, tree load_type) > > > > { > > > > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > > > > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE > > > > (reduction_var)); > > > > unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT > > > > (load_type))); > > > > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - > > > > size_exponent); > > > > } > > > > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > > && reduction_var_overflows_first (reduction_var, load_type) > > > > > > > > Regarding the implementation which makes use of strlen: > > > > > > > > I'm not sure what it means if strlen is called for a string with a > > > > length greater than SIZE_MAX. Therefore, similar to the implementation > > > > using rawmemchr where we neglect the case of an overflow for 64bit > > > > architectures, a conservative condition would be: > > > > > > > > TYPE_PRECISION (size_type_node) == 64 > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION > > > > (size_type_node)) > > > > > > > > I still included the overflow undefined check for reduction variable in > > > > order to rule out situations where the reduction variable is unsigned > > > > and overflows as many times until strlen(,_using_rawmemchr) overflows, > > > > too. May
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus wrote: > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote: > [...] > > > > > > > > + /* Handle strlen like loops. */ > > > > + if (store_dr == NULL > > > > + && integer_zerop (pattern) > > > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > > > + && integer_onep (reduction_iv.step) > > > > + && (types_compatible_p (TREE_TYPE (reduction_var), > > > > size_type_node) > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var > > > > +{ > > > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type? > > > > The iteration > > > > only stops when you load a NUL and the increments just wrap along > > > > (you're > > > > using the pointer IVs to compute the strlen result). Can't you simply > > > > truncate? > > > > > > I think truncation is enough as long as no overflow occurs in strlen or > > > strlen_using_rawmemchr. > > > > > > > For larger than size_type_node (actually larger than ptr_type_node > > > > would matter > > > > I guess), the argument is that since pointer wrapping would be > > > > undefined anyway > > > > the IV cannot wrap either. Now, the correct check here would IMHO be > > > > > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > > > (ptr_type_node) > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > > > > > ? > > > > > > Regarding the implementation which makes use of rawmemchr: > > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow. Thus, > > > the maximal length we can determine of a string where each character has > > > size S is PTRDIFF_MAX / S without an overflow. Since an overflow for > > > ptrdiff type is undefined we have to make sure that if an overflow > > > occurs, then an overflow occurs for reduction variable, too, and that > > > this is undefined, too. 
However, I'm not sure anymore whether we want > > > to respect overflows in all cases. If TYPE_PRECISION (ptr_type_node) > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > > > this would mean that a single string consumes more than half of the > > > virtual addressable memory. At least for architectures where > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable > > > to neglect the case where computing pointer difference may overflow. > > > Otherwise we are talking about strings with lenghts of multiple > > > pebibytes. For other architectures we might have to be more precise > > > and make sure that reduction variable overflows first and that this is > > > undefined. > > > > > > Thus a conservative condition would be (I assumed that the size of any > > > integral type is a power of two which I'm not sure if this really holds; > > > IIRC the C standard requires only that the alignment is a power of two > > > but not necessarily the size so I might need to change this): > > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - > > > log2 (sizeof (load_type)) > > >or in other words return true if reduction variable overflows first > > >and false otherwise. 
*/ > > > > > > static bool > > > reduction_var_overflows_first (tree reduction_var, tree load_type) > > > { > > > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > > > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE > > > (reduction_var)); > > > unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT > > > (load_type))); > > > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - > > > size_exponent); > > > } > > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > && reduction_var_overflows_first (reduction_var, load_type) > > > > > > Regarding the implementation which makes use of strlen: > > > > > > I'm not sure what it means if strlen is called for a string with a > > > length greater than SIZE_MAX. Therefore, similar to the implementation > > > using rawmemchr where we neglect the case of an overflow for 64bit > > > architectures, a conservative condition would be: > > > > > > TYPE_PRECISION (size_type_node) == 64 > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node)) > > > > > > I still included the overflow undefined check for reduction variable in > > > order to rule out situations where the reduction variable is unsigned > > > and overflows as many times until strlen(,_using_rawmemchr) overflows, > > > too. Maybe this is all theoretical nonsense but I'm afraid of uncommon > > > architectures. Anyhow, while writing this down it becomes clear that > > > this deserves a comment which I will add once it becomes clear which way > > > to go. > > > > I think all the arguments about
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote: [...] > > > > > > + /* Handle strlen like loops. */ > > > + if (store_dr == NULL > > > + && integer_zerop (pattern) > > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > > + && integer_onep (reduction_iv.step) > > > + && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node) > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var > > > +{ > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type? > > > The iteration > > > only stops when you load a NUL and the increments just wrap along (you're > > > using the pointer IVs to compute the strlen result). Can't you simply > > > truncate? > > > > I think truncation is enough as long as no overflow occurs in strlen or > > strlen_using_rawmemchr. > > > > > For larger than size_type_node (actually larger than ptr_type_node would > > > matter > > > I guess), the argument is that since pointer wrapping would be undefined > > > anyway > > > the IV cannot wrap either. Now, the correct check here would IMHO be > > > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > > (ptr_type_node) > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > > > ? > > > > Regarding the implementation which makes use of rawmemchr: > > > > We can count at most PTRDIFF_MAX many bytes without an overflow. Thus, > > the maximal length we can determine of a string where each character has > > size S is PTRDIFF_MAX / S without an overflow. Since an overflow for > > ptrdiff type is undefined we have to make sure that if an overflow > > occurs, then an overflow occurs for reduction variable, too, and that > > this is undefined, too. However, I'm not sure anymore whether we want > > to respect overflows in all cases. 
If TYPE_PRECISION (ptr_type_node) > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > > this would mean that a single string consumes more than half of the > > virtual addressable memory. At least for architectures where > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable > > to neglect the case where computing pointer difference may overflow. > > Otherwise we are talking about strings with lenghts of multiple > > pebibytes. For other architectures we might have to be more precise > > and make sure that reduction variable overflows first and that this is > > undefined. > > > > Thus a conservative condition would be (I assumed that the size of any > > integral type is a power of two which I'm not sure if this really holds; > > IIRC the C standard requires only that the alignment is a power of two > > but not necessarily the size so I might need to change this): > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 > > (sizeof (load_type)) > >or in other words return true if reduction variable overflows first > >and false otherwise. */ > > > > static bool > > reduction_var_overflows_first (tree reduction_var, tree load_type) > > { > > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE > > (reduction_var)); > > unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT > > (load_type))); > > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - > > size_exponent); > > } > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > && reduction_var_overflows_first (reduction_var, load_type) > > > > Regarding the implementation which makes use of strlen: > > > > I'm not sure what it means if strlen is called for a string with a > > length greater than SIZE_MAX. 
Therefore, similar to the implementation > > using rawmemchr where we neglect the case of an overflow for 64bit > > architectures, a conservative condition would be: > > > > TYPE_PRECISION (size_type_node) == 64 > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node)) > > > > I still included the overflow-undefined check for the reduction variable in > > order to rule out situations where the reduction variable is unsigned > > and wraps so many times that strlen(,_using_rawmemchr) overflows, > > too. Maybe this is all theoretical nonsense but I'm afraid of uncommon > > architectures. Anyhow, while writing this down it becomes clear that > > this deserves a comment which I will add once it becomes clear which way > > to go. > > I think all the arguments about objects bigger than half of the address-space > also are valid for 32bit targets and thus 32bit size_type_node (or > 32bit pointer size). > I'm not actually sure what's the canonical type to check against, whether > it's size_type_node (C's size_t), ptr_type_node (C's void *) or sizetyp
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Fri, Jun 25, 2021 at 12:23 PM Stefan Schulze Frielinghaus wrote: > > On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote: > > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus > > wrote: > > > > > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus > > > wrote: > > > [...] > > > > > but we won't ever arrive here because of the niters condition. But > > > > > yes, doing the pattern matching in the innermost loop processing code > > > > > looks good to me - for the specific case it would be > > > > > > > > > > /* Don't distribute loop if niters is unknown. */ > > > > > tree niters = number_of_latch_executions (loop); > > > > > if (niters == NULL_TREE || niters == chrec_dont_know) > > > > > ---> here? > > > > > continue; > > > > > > > > Right, please find attached a new version of the patch where everything > > > > is included in the loop distribution pass. I will do a bootstrap and > > > > regtest on IBM Z over night. If you give me green light I will also do > > > > the same on x86_64. > > > > > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at > > > least the ldist-strlen testcase). If you are Ok with the patch, then I > > > would rebase and run the testsuites again and post a patch series > > > including the rawmemchr implementation for IBM Z. > > > > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop > > *loop, vec *work_list) > >return work_list->length () > 0; > > } > > > > +static void > > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, > > + data_reference_p store_dr, tree base, tree > > pattern, > > + location_t loc) > > +{ > > > > this new function needs a comment. Applies to all of the new ones, btw. > > Done. > > > + gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)) > > + && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE > > (pattern)); > > > > this looks fragile and is probably unnecessary as well. 
> > > > + gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base)); > > > > in general you want types_compatible_p () checks which for pointers means > > all pointers are compatible ... > > True, I removed both asserts. > > > (skipping stuff) > > > > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun) > > && !optimize_loop_for_speed_p (loop))) > > continue; > > > > - /* Don't distribute loop if niters is unknown. */ > > + /* If niters is unknown don't distribute loop but rather try to > > transform > > +it to a call to a builtin. */ > >tree niters = number_of_latch_executions (loop); > >if (niters == NULL_TREE || niters == chrec_dont_know) > > - continue; > > + { > > + if (transform_reduction_loop (loop)) > > + { > > + changed = true; > > + loops_to_be_destroyed.safe_push (loop); > > + if (dump_file) > > + fprintf (dump_file, "Loop %d transformed into a > > builtin.\n", loop->num); > > + } > > + continue; > > + } > > > > please look at > > > > if (nb_generated_loops + nb_generated_calls > 0) > > { > > changed = true; > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > > loc, "Loop%s %d distributed: split to > > %d loops " > > "and %d library calls.\n", str, loop->num, > > nb_generated_loops, nb_generated_calls); > > > > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the > > transforms are reported with -fopt-info-loop > > Done. > > > + > > + return transform_reduction_loop_1 (loop, load_dr, store_dr, > > reduction_var); > > +} > > > > what's the point in tail-calling here and visually splitting the > > function in half? > > In the first place I thought that this is more pleasant since in > transform_reduction_loop_1 it is settled that we have a single load, > store, and reduction variable. After refactoring this isn't true > anymore and I inlined the function and made this clear via a comment. 
> > > > > (sorry for picking random pieces now ;)) > > > > + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); > > + gsi_next (&bsi), ++ninsns) > > + { > > > > this counts debug insns, I guess you want gsi_next_nondebug at least. > > not sure why you are counting PHIs at all btw - for the loops you match > > you are expecting at most two, one IV and eventually one for the virtual > > operand of the store? > > Yes, I removed the counting for the phi loop and changed to > gsi_next_nondebug for both loops. > > > > > + if (gimple_has_volatile_ops (phi)) > > + return false; > > > > PHIs never have volatile ops. > > > > + if (gimple_clobber_p (phi)) > > +
Re: [RFC] ldist: Recognize rawmemchr loop patterns
ping On Fri, Jun 25, 2021 at 12:23:32PM +0200, Stefan Schulze Frielinghaus wrote: > [...]
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote: > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus > wrote: > > > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote: > > [...] > > > > but we won't ever arrive here because of the niters condition. But > > > > yes, doing the pattern matching in the innermost loop processing code > > > > looks good to me - for the specific case it would be > > > > > > > > /* Don't distribute loop if niters is unknown. */ > > > > tree niters = number_of_latch_executions (loop); > > > > if (niters == NULL_TREE || niters == chrec_dont_know) > > > > ---> here? > > > > continue; > > > > > > Right, please find attached a new version of the patch where everything > > > is included in the loop distribution pass. I will do a bootstrap and > > > regtest on IBM Z over night. If you give me green light I will also do > > > the same on x86_64. > > > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at > > least the ldist-strlen testcase). If you are Ok with the patch, then I > > would rebase and run the testsuites again and post a patch series > > including the rawmemchr implementation for IBM Z. > > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop > *loop, vec *work_list) >return work_list->length () > 0; > } > > +static void > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, > + data_reference_p store_dr, tree base, tree > pattern, > + location_t loc) > +{ > > this new function needs a comment. Applies to all of the new ones, btw. Done. > + gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)) > + && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE > (pattern)); > > this looks fragile and is probably unnecessary as well. > > + gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base)); > > in general you want types_compatible_p () checks which for pointers means > all pointers are compatible ... True, I removed both asserts. 
> (skipping stuff) > > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun) > && !optimize_loop_for_speed_p (loop))) > continue; > > - /* Don't distribute loop if niters is unknown. */ > + /* If niters is unknown don't distribute loop but rather try to > transform > +it to a call to a builtin. */ >tree niters = number_of_latch_executions (loop); >if (niters == NULL_TREE || niters == chrec_dont_know) > - continue; > + { > + if (transform_reduction_loop (loop)) > + { > + changed = true; > + loops_to_be_destroyed.safe_push (loop); > + if (dump_file) > + fprintf (dump_file, "Loop %d transformed into a > builtin.\n", loop->num); > + } > + continue; > + } > > please look at > > if (nb_generated_loops + nb_generated_calls > 0) > { > changed = true; > if (dump_enabled_p ()) > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > loc, "Loop%s %d distributed: split to > %d loops " > "and %d library calls.\n", str, loop->num, > nb_generated_loops, nb_generated_calls); > > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the > transforms are reported with -fopt-info-loop Done. > + > + return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var); > +} > > what's the point in tail-calling here and visually splitting the > function in half? In the first place I thought that this is more pleasant since in transform_reduction_loop_1 it is settled that we have a single load, store, and reduction variable. After refactoring this isn't true anymore and I inlined the function and made this clear via a comment. > > (sorry for picking random pieces now ;)) > > + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); > + gsi_next (&bsi), ++ninsns) > + { > > this counts debug insns, I guess you want gsi_next_nondebug at least. > not sure why you are counting PHIs at all btw - for the loops you match > you are expecting at most two, one IV and eventually one for the virtual > operand of the store? 
Yes, I removed the counting for the phi loop and changed to gsi_next_nondebug for both loops. > > + if (gimple_has_volatile_ops (phi)) > + return false; > > PHIs never have volatile ops. > > + if (gimple_clobber_p (phi)) > + continue; > > or are clobbers. Removed both. > Btw, can you factor out a helper from find_single_drs working on a > stmt to reduce code duplication? Ahh sorry for that. I've already done this in one of my first patches but didn't copy that over. Although my changes do not require a RDG the whole pass is base
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus wrote: > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote: > [...] > > > but we won't ever arrive here because of the niters condition. But > > > yes, doing the pattern matching in the innermost loop processing code > > > looks good to me - for the specific case it would be > > > > > > /* Don't distribute loop if niters is unknown. */ > > > tree niters = number_of_latch_executions (loop); > > > if (niters == NULL_TREE || niters == chrec_dont_know) > > > ---> here? > > > continue; > > > > Right, please find attached a new version of the patch where everything > > is included in the loop distribution pass. I will do a bootstrap and > > regtest on IBM Z over night. If you give me green light I will also do > > the same on x86_64. > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at > least the ldist-strlen testcase). If you are Ok with the patch, then I > would rebase and run the testsuites again and post a patch series > including the rawmemchr implementation for IBM Z. @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop *loop, vec *work_list) return work_list->length () > 0; } +static void +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, + data_reference_p store_dr, tree base, tree pattern, + location_t loc) +{ this new function needs a comment. Applies to all of the new ones, btw. + gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)) + && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern)); this looks fragile and is probably unnecessary as well. + gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base)); in general you want types_compatible_p () checks which for pointers means all pointers are compatible ... (skipping stuff) @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun) && !optimize_loop_for_speed_p (loop))) continue; - /* Don't distribute loop if niters is unknown. 
*/ + /* If niters is unknown don't distribute loop but rather try to transform +it to a call to a builtin. */ tree niters = number_of_latch_executions (loop); if (niters == NULL_TREE || niters == chrec_dont_know) - continue; + { + if (transform_reduction_loop (loop)) + { + changed = true; + loops_to_be_destroyed.safe_push (loop); + if (dump_file) + fprintf (dump_file, "Loop %d transformed into a builtin.\n", loop->num); + } + continue; + } please look at if (nb_generated_loops + nb_generated_calls > 0) { changed = true; if (dump_enabled_p ()) dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, loc, "Loop%s %d distributed: split to %d loops " "and %d library calls.\n", str, loop->num, nb_generated_loops, nb_generated_calls); and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the transforms are reported with -fopt-info-loop + + return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var); +} what's the point in tail-calling here and visually splitting the function in half? (sorry for picking random pieces now ;)) + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); + gsi_next (&bsi), ++ninsns) + { this counts debug insns, I guess you want gsi_next_nondebug at least. not sure why you are counting PHIs at all btw - for the loops you match you are expecting at most two, one IV and eventually one for the virtual operand of the store? + if (gimple_has_volatile_ops (phi)) + return false; PHIs never have volatile ops. + if (gimple_clobber_p (phi)) + continue; or are clobbers. Btw, can you factor out a helper from find_single_drs working on a stmt to reduce code duplication? + tree reduction_var; + switch (gimple_code (reduction_stmt)) +{ +case GIMPLE_PHI: + reduction_var = gimple_phi_result (reduction_stmt); + break; +case GIMPLE_ASSIGN: + reduction_var = gimple_assign_lhs (reduction_stmt); + break; +default: + /* Bail out e.g. for GIMPLE_CALL. */ + return false; gimple_get_lhs (reduction_stmt); would work for both PHIs and assigns. 
+ if (reduction_var == NULL) +return false; it can never be NULL here. + /* Bail out if this is a bitfield memory reference. */ + if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1))) +return false; ... I see this is again quite some code copied from find_single_drs, please see how to avoid this much duplication by splitting out helpers. +static bool +transform_reduction_loop_1 (loop_p loop, +
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote: [...] > > but we won't ever arrive here because of the niters condition. But > > yes, doing the pattern matching in the innermost loop processing code > > looks good to me - for the specific case it would be > > > > /* Don't distribute loop if niters is unknown. */ > > tree niters = number_of_latch_executions (loop); > > if (niters == NULL_TREE || niters == chrec_dont_know) > > ---> here? > > continue; > > Right, please find attached a new version of the patch where everything > is included in the loop distribution pass. I will do a bootstrap and > regtest on IBM Z over night. If you give me green light I will also do > the same on x86_64. Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at least the ldist-strlen testcase). If you are Ok with the patch, then I would rebase and run the testsuites again and post a patch series including the rawmemchr implementation for IBM Z. Cheers, Stefan
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Thu, May 20, 2021 at 11:24:57AM +0200, Richard Biener wrote: > On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus > wrote: > > > > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote: > > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus > > > wrote: > > > > > > > > [snip] > > > > > > > > Please find attached a new version of the patch. A major change > > > > compared to > > > > the previous patch is that I created a separate pass which hopefully > > > > makes > > > > reviewing also easier since it is almost self-contained. After > > > > realizing that > > > > detecting loops which mimic the behavior of rawmemchr/strlen functions > > > > does not > > > > really fit into the topic of loop distribution, I created a separate > > > > pass. > > > > > > It's true that these reduction-like patterns are more difficult than > > > the existing > > > memcpy/memset cases. > > > > > > > Due > > > > to this I was also able to play around a bit and schedule the pass at > > > > different > > > > times. Currently it is scheduled right before loop distribution where > > > > loop > > > > header copying already took place which leads to the following effect. > > > > > > In fact I'd schedule it after loop distribution so there's the chance > > > that loop > > > distribution can expose a loop that fits the new pattern. 
> > > > > > > Running > > > > this setup over > > > > > > > > char *t (char *p) > > > > { > > > > for (; *p; ++p); > > > > return p; > > > > } > > > > > > > > the new pass transforms > > > > > > > > char * t (char * p) > > > > { > > > > char _1; > > > > char _7; > > > > > > > >[local count: 118111600]: > > > > _7 = *p_3(D); > > > > if (_7 != 0) > > > > goto ; [89.00%] > > > > else > > > > goto ; [11.00%] > > > > > > > >[local count: 105119324]: > > > > > > > >[local count: 955630225]: > > > > # p_8 = PHI > > > > p_6 = p_8 + 1; > > > > _1 = *p_6; > > > > if (_1 != 0) > > > > goto ; [89.00%] > > > > else > > > > goto ; [11.00%] > > > > > > > >[local count: 105119324]: > > > > # p_2 = PHI > > > > goto ; [100.00%] > > > > > > > >[local count: 850510901]: > > > > goto ; [100.00%] > > > > > > > >[local count: 12992276]: > > > > > > > >[local count: 118111600]: > > > > # p_9 = PHI > > > > return p_9; > > > > > > > > } > > > > > > > > into > > > > > > > > char * t (char * p) > > > > { > > > > char * _5; > > > > char _7; > > > > > > > >[local count: 118111600]: > > > > _7 = *p_3(D); > > > > if (_7 != 0) > > > > goto ; [89.00%] > > > > else > > > > goto ; [11.00%] > > > > > > > >[local count: 105119324]: > > > > _5 = p_3(D) + 1; > > > > p_10 = .RAWMEMCHR (_5, 0); > > > > > > > >[local count: 118111600]: > > > > # p_9 = PHI > > > > return p_9; > > > > > > > > } > > > > > > > > which is fine so far. However, I haven't made up my mind so far > > > > whether it is > > > > worthwhile to spend more time in order to also eliminate the "first > > > > unrolling" > > > > of the loop. > > > > > > Might be a phiopt transform ;) Might apply to quite some set of > > > builtins. I wonder how the strlen case looks like though. 
> > > > I gave it a shot by scheduling the pass prior to the pass copy header > > > > and ended up with: > > > > > > > > char * t (char * p) > > > > { > > > >[local count: 118111600]: > > > > p_5 = .RAWMEMCHR (p_3(D), 0); > > > > return p_5; > > > > > > > > } > > > > > > > > which seems optimal to me. The downside of this is that I have to > > > > initialize > > > > scalar evolution analysis which might be undesired that early. > > > > > > > > All this brings me to the question where do you see this piece of code > > > > running? > > > > If in a separate pass when would you schedule it? If in an existing > > > > pass, > > > > which one would you choose? > > > > > > I think it still fits loop distribution. If you manage to detect it > > > with your pass > > > standalone then you should be able to detect it in loop distribution. > > > > If a loop is distributed only because one of the partitions matches a > > rawmemchr/strlen-like loop pattern, then we have at least two partitions > > which walk over the same memory region. Since a rawmemchr/strlen-like > > loop has no body (neglecting expression-3 of a for-loop where just an > > increment happens) it is governed by the memory accesses in the loop > > condition. Therefore, in such a case loop distribution would result in > > performance degradation. This is why I think that it does not fit > > conceptually into the ldist pass. However, since I make use of a couple of > > helper functions from the ldist pass, it may still fit technically. > > > > Since currently all ldist optimizations operate over loops where niters > > is known and for rawmemchr/strlen-like loops this is not the case, it is > > not possible that those optimizations expose a loop which is suitable > > for rawmemchr/strlen
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus wrote: > > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote: > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus > > wrote: > > > > > > [snip] > > > > > > Please find attached a new version of the patch. A major change compared > > > to > > > the previous patch is that I created a separate pass which hopefully makes > > > reviewing also easier since it is almost self-contained. After realizing > > > that > > > detecting loops which mimic the behavior of rawmemchr/strlen functions > > > does not > > > really fit into the topic of loop distribution, I created a separate pass. > > > > It's true that these reduction-like patterns are more difficult than > > the existing > > memcpy/memset cases. > > > > > Due > > > to this I was also able to play around a bit and schedule the pass at > > > different > > > times. Currently it is scheduled right before loop distribution where > > > loop > > > header copying already took place which leads to the following effect. > > > > In fact I'd schedule it after loop distribution so there's the chance that > > loop > > distribution can expose a loop that fits the new pattern. 
> > > Running > > > this setup over > > > > > > char *t (char *p) > > > { > > > for (; *p; ++p); > > > return p; > > > } > > > > > > the new pass transforms > > > > > > char * t (char * p) > > > { > > > char _1; > > > char _7; > > > > > >[local count: 118111600]: > > > _7 = *p_3(D); > > > if (_7 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > >[local count: 105119324]: > > > > > >[local count: 955630225]: > > > # p_8 = PHI > > > p_6 = p_8 + 1; > > > _1 = *p_6; > > > if (_1 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > >[local count: 105119324]: > > > # p_2 = PHI > > > goto ; [100.00%] > > > > > >[local count: 850510901]: > > > goto ; [100.00%] > > > > > >[local count: 12992276]: > > > > > >[local count: 118111600]: > > > # p_9 = PHI > > > return p_9; > > > > > > } > > > > > > into > > > > > > char * t (char * p) > > > { > > > char * _5; > > > char _7; > > > > > >[local count: 118111600]: > > > _7 = *p_3(D); > > > if (_7 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > >[local count: 105119324]: > > > _5 = p_3(D) + 1; > > > p_10 = .RAWMEMCHR (_5, 0); > > > > > >[local count: 118111600]: > > > # p_9 = PHI > > > return p_9; > > > > > > } > > > > > > which is fine so far. However, I haven't made up my mind so far whether > > > it is > > > worthwhile to spend more time in order to also eliminate the "first > > > unrolling" > > > of the loop. > > > > Might be a phiopt transform ;) Might apply to quite some set of > > builtins. I wonder how the strlen case looks like though. > > > > > I gave it a shot by scheduling the pass prior to the pass copy header > > > and ended up with: > > > > > > char * t (char * p) > > > { > > >[local count: 118111600]: > > > p_5 = .RAWMEMCHR (p_3(D), 0); > > > return p_5; > > > > > > } > > > > > > which seems optimal to me. The downside of this is that I have to > > > initialize > > > scalar evolution analysis which might be undesired that early. 
> > > All this brings me to the question where do you see this piece of code > > > running? > > > If in a separate pass when would you schedule it? If in an existing pass, > > > which one would you choose? > > > > I think it still fits loop distribution. If you manage to detect it > > with your pass > > standalone then you should be able to detect it in loop distribution. > > If a loop is distributed only because one of the partitions matches a > rawmemchr/strlen-like loop pattern, then we have at least two partitions > which walk over the same memory region. Since a rawmemchr/strlen-like > loop has no body (neglecting expression-3 of a for-loop where just an > increment happens) it is governed by the memory accesses in the loop > condition. Therefore, in such a case loop distribution would result in > performance degradation. This is why I think that it does not fit > conceptually into the ldist pass. However, since I make use of a couple of > helper functions from the ldist pass, it may still fit technically. > > Since currently all ldist optimizations operate over loops where niters > is known and for rawmemchr/strlen-like loops this is not the case, it is > not possible that those optimizations expose a loop which is suitable > for rawmemchr/strlen optimization. True - though that seems to be an unnecessary restriction. > Therefore, what do you think about > scheduling rawmemchr/strlen optimization right between those > if-statements of function loop_distribution::execute? > > if (nb_generated_loops + nb_generated_calls > 0) > { > changed = true; > if (dump_enabled_p ()) > dump_printf_loc (MSG_OPTIMIZED_
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote: > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus > wrote: > > > > [snip] > > > > Please find attached a new version of the patch. A major change compared to > > the previous patch is that I created a separate pass which hopefully makes > > reviewing also easier since it is almost self-contained. After realizing > > that > > detecting loops which mimic the behavior of rawmemchr/strlen functions does > > not > > really fit into the topic of loop distribution, I created a separate pass. > > It's true that these reduction-like patterns are more difficult than > the existing > memcpy/memset cases. > > > Due > > to this I was also able to play around a bit and schedule the pass at > > different > > times. Currently it is scheduled right before loop distribution where loop > > header copying already took place which leads to the following effect. > > In fact I'd schedule it after loop distribution so there's the chance that > loop > distribution can expose a loop that fits the new pattern. 
> > Running > > this setup over > > > > char *t (char *p) > > { > > for (; *p; ++p); > > return p; > > } > > > > the new pass transforms > > > > char * t (char * p) > > { > > char _1; > > char _7; > > > >[local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > > >[local count: 955630225]: > > # p_8 = PHI > > p_6 = p_8 + 1; > > _1 = *p_6; > > if (_1 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > # p_2 = PHI > > goto ; [100.00%] > > > >[local count: 850510901]: > > goto ; [100.00%] > > > >[local count: 12992276]: > > > >[local count: 118111600]: > > # p_9 = PHI > > return p_9; > > > > } > > > > into > > > > char * t (char * p) > > { > > char * _5; > > char _7; > > > >[local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > _5 = p_3(D) + 1; > > p_10 = .RAWMEMCHR (_5, 0); > > > >[local count: 118111600]: > > # p_9 = PHI > > return p_9; > > > > } > > > > which is fine so far. However, I haven't made up my mind so far whether it > > is > > worthwhile to spend more time in order to also eliminate the "first > > unrolling" > > of the loop. > > Might be a phiopt transform ;) Might apply to quite some set of > builtins. I wonder how the strlen case looks like though. > > > I gave it a shot by scheduling the pass prior to the pass copy header > > and ended up with: > > > > char * t (char * p) > > { > >[local count: 118111600]: > > p_5 = .RAWMEMCHR (p_3(D), 0); > > return p_5; > > > > } > > > > which seems optimal to me. The downside of this is that I have to > > initialize > > scalar evolution analysis which might be undesired that early. > > > > All this brings me to the question where do you see this piece of code > > running? > > If in a separate pass when would you schedule it? If in an existing pass, > > which one would you choose? 
> > I think it still fits loop distribution. If you manage to detect it > with your pass > standalone then you should be able to detect it in loop distribution. If a loop is distributed only because one of the partitions matches a rawmemchr/strlen-like loop pattern, then we have at least two partitions which walk over the same memory region. Since a rawmemchr/strlen-like loop has no body (neglecting expression-3 of a for-loop where just an increment happens), it is governed by the memory accesses in the loop condition. Therefore, in such a case loop distribution would result in performance degradation. This is why I think that it does not fit conceptually into the ldist pass. However, since I make use of a couple of helper functions from the ldist pass, it may still fit technically. Since currently all ldist optimizations operate over loops whose niters is known, and for rawmemchr/strlen-like loops this is not the case, those optimizations cannot expose a loop which is suitable for the rawmemchr/strlen optimization. Therefore, what do you think about scheduling the rawmemchr/strlen optimization right between those if-statements of function loop_distribution::execute? if (nb_generated_loops + nb_generated_calls > 0) { changed = true; if (dump_enabled_p ()) dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, loc, "Loop%s %d distributed: split to %d loops " "and %d library calls.\n", str, loop->num, nb_generated_loops, nb_generated_calls); break; } // rawmemchr/strlen like loops if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num); > Can you > explain what part is "easier" as
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Wed, May 5, 2021 at 11:36 AM Richard Biener wrote: > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus > wrote: > > > > [snip] > > > > Please find attached a new version of the patch. A major change compared to > > the previous patch is that I created a separate pass which hopefully makes > > reviewing also easier since it is almost self-contained. After realizing > > that > > detecting loops which mimic the behavior of rawmemchr/strlen functions does > > not > > really fit into the topic of loop distribution, I created a separate pass. > > It's true that these reduction-like patterns are more difficult than > the existing > memcpy/memset cases. > > > Due > > to this I was also able to play around a bit and schedule the pass at > > different > > times. Currently it is scheduled right before loop distribution where loop > > header copying already took place which leads to the following effect. > > In fact I'd schedule it after loop distribution so there's the chance that > loop > distribution can expose a loop that fits the new pattern. 
> > > Running > > this setup over > > > > char *t (char *p) > > { > > for (; *p; ++p); > > return p; > > } > > > > the new pass transforms > > > > char * t (char * p) > > { > > char _1; > > char _7; > > > >[local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > > >[local count: 955630225]: > > # p_8 = PHI > > p_6 = p_8 + 1; > > _1 = *p_6; > > if (_1 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > # p_2 = PHI > > goto ; [100.00%] > > > >[local count: 850510901]: > > goto ; [100.00%] > > > >[local count: 12992276]: > > > >[local count: 118111600]: > > # p_9 = PHI > > return p_9; > > > > } > > > > into > > > > char * t (char * p) > > { > > char * _5; > > char _7; > > > >[local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > _5 = p_3(D) + 1; > > p_10 = .RAWMEMCHR (_5, 0); > > > >[local count: 118111600]: > > # p_9 = PHI > > return p_9; > > > > } > > > > which is fine so far. However, I haven't made up my mind so far whether it > > is > > worthwhile to spend more time in order to also eliminate the "first > > unrolling" > > of the loop. > > Might be a phiopt transform ;) Might apply to quite some set of > builtins. I wonder how the strlen case looks like though. > > > I gave it a shot by scheduling the pass prior pass copy header > > and ended up with: > > > > char * t (char * p) > > { > >[local count: 118111600]: > > p_5 = .RAWMEMCHR (p_3(D), 0); > > return p_5; > > > > } > > > > which seems optimal to me. The downside of this is that I have to > > initialize > > scalar evolution analysis which might be undesired that early. > > > > All this brings me to the question where do you see this peace of code > > running? > > If in a separate pass when would you schedule it? If in an existing pass, > > which one would you choose? 
> > I think it still fits loop distribution. If you manage to detect it > with your pass > standalone then you should be able to detect it in loop distribution. Can you > explain what part is "easier" as standalone pass? Btw, another "fitting" pass would be final value replacement (pass_scev_cprop) since what these patterns provide is a builtin call to compute the value of one of the loop PHIs on exit. Note this pass leaves removal of in-loop computations to followup DCE which means that in some cases it does unprofitable transforms. There's a bug somewhere where I worked on doing final value replacement on-demand when DCE figures out the loop is otherwise dead but I never finished this (loop distribution could also use such mechanism to get rid of unwanted PHIs). > > Another topic which came up is whether there exists a more elegant solution > > to > > my current implementation in order to deal with stores (I'm speaking of the > > `if > > (store_dr)` statement inside of function transform_loop_1). For example, > > > > extern char *p; > > char *t () > > { > > for (; *p; ++p); > > return p; > > } > > > > ends up as > > > > char * t () > > { > > char * _1; > > char * _2; > > char _3; > > char * p.1_8; > > char _9; > > char * p.1_10; > > char * p.1_11; > > > >[local count: 118111600]: > > p.1_8 = p; > > _9 = *p.1_8; > > if (_9 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > > >[local count: 955630225]: > > # p.1_10 = PHI <_1(6), p.1_8(5)> > > _1 = p.1_10 + 1; > > p = _1; > > _3 = *_1; > > if (_3 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > # _2 = PHI <_1(3)> > > goto ; [100.00%] > > > >[local count: 8505109
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus wrote: > > [snip] > > Please find attached a new version of the patch. A major change compared to > the previous patch is that I created a separate pass which hopefully makes > reviewing also easier since it is almost self-contained. After realizing that > detecting loops which mimic the behavior of rawmemchr/strlen functions does > not > really fit into the topic of loop distribution, I created a separate pass. It's true that these reduction-like patterns are more difficult than the existing memcpy/memset cases. > Due > to this I was also able to play around a bit and schedule the pass at > different > times. Currently it is scheduled right before loop distribution where loop > header copying already took place which leads to the following effect. In fact I'd schedule it after loop distribution so there's the chance that loop distribution can expose a loop that fits the new pattern. > Running > this setup over > > char *t (char *p) > { > for (; *p; ++p); > return p; > } > > the new pass transforms > > char * t (char * p) > { > char _1; > char _7; > >[local count: 118111600]: > _7 = *p_3(D); > if (_7 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > >[local count: 955630225]: > # p_8 = PHI > p_6 = p_8 + 1; > _1 = *p_6; > if (_1 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > # p_2 = PHI > goto ; [100.00%] > >[local count: 850510901]: > goto ; [100.00%] > >[local count: 12992276]: > >[local count: 118111600]: > # p_9 = PHI > return p_9; > > } > > into > > char * t (char * p) > { > char * _5; > char _7; > >[local count: 118111600]: > _7 = *p_3(D); > if (_7 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > _5 = p_3(D) + 1; > p_10 = .RAWMEMCHR (_5, 0); > >[local count: 118111600]: > # p_9 = PHI > return p_9; > > } > > which is fine so far. 
However, I haven't made up my mind so far whether it is > worthwhile to spend more time in order to also eliminate the "first unrolling" > of the loop. Might be a phiopt transform ;) Might apply to quite some set of builtins. I wonder how the strlen case looks, though. > I gave it a shot by scheduling the pass prior to the pass copy header > and ended up with: > > char * t (char * p) > { >[local count: 118111600]: > p_5 = .RAWMEMCHR (p_3(D), 0); > return p_5; > > } > > which seems optimal to me. The downside of this is that I have to initialize > scalar evolution analysis which might be undesired that early. > > All this brings me to the question where do you see this piece of code > running? > If in a separate pass, when would you schedule it? If in an existing pass, > which one would you choose? I think it still fits loop distribution. If you manage to detect it with your pass standalone then you should be able to detect it in loop distribution. Can you > explain what part is "easier" as a standalone pass? > Another topic which came up is whether there exists a more elegant solution to > my current implementation in order to deal with stores (I'm speaking of the > `if > (store_dr)` statement inside of function transform_loop_1).
For example, > > extern char *p; > char *t () > { > for (; *p; ++p); > return p; > } > > ends up as > > char * t () > { > char * _1; > char * _2; > char _3; > char * p.1_8; > char _9; > char * p.1_10; > char * p.1_11; > >[local count: 118111600]: > p.1_8 = p; > _9 = *p.1_8; > if (_9 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > >[local count: 955630225]: > # p.1_10 = PHI <_1(6), p.1_8(5)> > _1 = p.1_10 + 1; > p = _1; > _3 = *_1; > if (_3 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > # _2 = PHI <_1(3)> > goto ; [100.00%] > >[local count: 850510901]: > goto ; [100.00%] > >[local count: 12992276]: > >[local count: 118111600]: > # p.1_11 = PHI <_2(8), p.1_8(7)> > return p.1_11; > > } > > where inside the loop a load and a store occur. For a rawmemchr-like loop I > have to show that we never load from a memory location to which we write. > Currently I solve this by hard coding those facts which are not generic at > all. > I gave compute_data_dependences_for_loop a try which failed to determine the > fact that stores only happen to p[0] and loads from p[i] where i>0. Maybe > there are more generic solutions to express this in contrast to my current > one? So the example loop is not valid to be transformed to rawmemchr since it's valid to call it with p = &p; - but sure, once you pass the first *p != 0 check things become fishy but GCC isn't able to turn that into a non-dependence. Why is the case of stores inside the loop important? In fact that you think it is makes a case for integrating this with
Re: [RFC] ldist: Recognize rawmemchr loop patterns
ping On Thu, Apr 08, 2021 at 10:23:31AM +0200, Stefan Schulze Frielinghaus wrote: > ping > > On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote: > > [snip] > > > > Please find attached a new version of the patch. A major change compared to > > the previous patch is that I created a separate pass which hopefully makes > > reviewing also easier since it is almost self-contained. After realizing > > that > > detecting loops which mimic the behavior of rawmemchr/strlen functions does > > not > > really fit into the topic of loop distribution, I created a separate pass. > > Due > > to this I was also able to play around a bit and schedule the pass at > > different > > times. Currently it is scheduled right before loop distribution where loop > > header copying already took place which leads to the following effect. > > Running > > this setup over > > > > char *t (char *p) > > { > > for (; *p; ++p); > > return p; > > } > > > > the new pass transforms > > > > char * t (char * p) > > { > > char _1; > > char _7; > > > >[local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > > >[local count: 955630225]: > > # p_8 = PHI > > p_6 = p_8 + 1; > > _1 = *p_6; > > if (_1 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > # p_2 = PHI > > goto ; [100.00%] > > > >[local count: 850510901]: > > goto ; [100.00%] > > > >[local count: 12992276]: > > > >[local count: 118111600]: > > # p_9 = PHI > > return p_9; > > > > } > > > > into > > > > char * t (char * p) > > { > > char * _5; > > char _7; > > > >[local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > _5 = p_3(D) + 1; > > p_10 = .RAWMEMCHR (_5, 0); > > > >[local count: 118111600]: > > # p_9 = PHI > > return p_9; > > > > } > > > > which is fine so far. 
However, I haven't made up my mind so far whether it > > is > > worthwhile to spend more time in order to also eliminate the "first > > unrolling" > > of the loop. I gave it a shot by scheduling the pass prior pass copy header > > and ended up with: > > > > char * t (char * p) > > { > >[local count: 118111600]: > > p_5 = .RAWMEMCHR (p_3(D), 0); > > return p_5; > > > > } > > > > which seems optimal to me. The downside of this is that I have to > > initialize > > scalar evolution analysis which might be undesired that early. > > > > All this brings me to the question where do you see this peace of code > > running? > > If in a separate pass when would you schedule it? If in an existing pass, > > which one would you choose? > > > > Another topic which came up is whether there exists a more elegant solution > > to > > my current implementation in order to deal with stores (I'm speaking of the > > `if > > (store_dr)` statement inside of function transform_loop_1). For example, > > > > extern char *p; > > char *t () > > { > > for (; *p; ++p); > > return p; > > } > > > > ends up as > > > > char * t () > > { > > char * _1; > > char * _2; > > char _3; > > char * p.1_8; > > char _9; > > char * p.1_10; > > char * p.1_11; > > > >[local count: 118111600]: > > p.1_8 = p; > > _9 = *p.1_8; > > if (_9 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > > >[local count: 955630225]: > > # p.1_10 = PHI <_1(6), p.1_8(5)> > > _1 = p.1_10 + 1; > > p = _1; > > _3 = *_1; > > if (_3 != 0) > > goto ; [89.00%] > > else > > goto ; [11.00%] > > > >[local count: 105119324]: > > # _2 = PHI <_1(3)> > > goto ; [100.00%] > > > >[local count: 850510901]: > > goto ; [100.00%] > > > >[local count: 12992276]: > > > >[local count: 118111600]: > > # p.1_11 = PHI <_2(8), p.1_8(7)> > > return p.1_11; > > > > } > > > > where inside the loop a load and store occurs. For a rawmemchr like loop I > > have to show that we never load from a memory location to which we write. 
> > Currently I solve this by hard coding those facts which are not generic at > > all. > > I gave compute_data_dependences_for_loop a try which failed to determine the > > fact that stores only happen to p[0] and loads from p[i] where i>0. Maybe > > there are more generic solutions to express this in contrast to my current > > one? > > > > Thanks again for your input so far. Really appreciated. > > > > Cheers, > > Stefan > > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in > > index 8a5fb3fd99c..7b2d7405277 100644 > > --- a/gcc/Makefile.in > > +++ b/gcc/Makefile.in > > @@ -1608,6 +1608,7 @@ OBJS = \ > > tree-into-ssa.o \ > > tree-iterator.o \ > > tree-loop-distribution.o \ > > + tree-loop-pattern.o \ > > tree-nested.o \ > >
Re: [RFC] ldist: Recognize rawmemchr loop patterns
ping On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote: > [snip] > > Please find attached a new version of the patch. A major change compared to > the previous patch is that I created a separate pass which hopefully makes > reviewing also easier since it is almost self-contained. After realizing that > detecting loops which mimic the behavior of rawmemchr/strlen functions does > not > really fit into the topic of loop distribution, I created a separate pass. > Due > to this I was also able to play around a bit and schedule the pass at > different > times. Currently it is scheduled right before loop distribution where loop > header copying already took place which leads to the following effect. > Running > this setup over > > char *t (char *p) > { > for (; *p; ++p); > return p; > } > > the new pass transforms > > char * t (char * p) > { > char _1; > char _7; > >[local count: 118111600]: > _7 = *p_3(D); > if (_7 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > >[local count: 955630225]: > # p_8 = PHI > p_6 = p_8 + 1; > _1 = *p_6; > if (_1 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > # p_2 = PHI > goto ; [100.00%] > >[local count: 850510901]: > goto ; [100.00%] > >[local count: 12992276]: > >[local count: 118111600]: > # p_9 = PHI > return p_9; > > } > > into > > char * t (char * p) > { > char * _5; > char _7; > >[local count: 118111600]: > _7 = *p_3(D); > if (_7 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > _5 = p_3(D) + 1; > p_10 = .RAWMEMCHR (_5, 0); > >[local count: 118111600]: > # p_9 = PHI > return p_9; > > } > > which is fine so far. However, I haven't made up my mind so far whether it is > worthwhile to spend more time in order to also eliminate the "first unrolling" > of the loop. 
I gave it a shot by scheduling the pass prior pass copy header > and ended up with: > > char * t (char * p) > { >[local count: 118111600]: > p_5 = .RAWMEMCHR (p_3(D), 0); > return p_5; > > } > > which seems optimal to me. The downside of this is that I have to initialize > scalar evolution analysis which might be undesired that early. > > All this brings me to the question where do you see this peace of code > running? > If in a separate pass when would you schedule it? If in an existing pass, > which one would you choose? > > Another topic which came up is whether there exists a more elegant solution to > my current implementation in order to deal with stores (I'm speaking of the > `if > (store_dr)` statement inside of function transform_loop_1). For example, > > extern char *p; > char *t () > { > for (; *p; ++p); > return p; > } > > ends up as > > char * t () > { > char * _1; > char * _2; > char _3; > char * p.1_8; > char _9; > char * p.1_10; > char * p.1_11; > >[local count: 118111600]: > p.1_8 = p; > _9 = *p.1_8; > if (_9 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > >[local count: 955630225]: > # p.1_10 = PHI <_1(6), p.1_8(5)> > _1 = p.1_10 + 1; > p = _1; > _3 = *_1; > if (_3 != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > # _2 = PHI <_1(3)> > goto ; [100.00%] > >[local count: 850510901]: > goto ; [100.00%] > >[local count: 12992276]: > >[local count: 118111600]: > # p.1_11 = PHI <_2(8), p.1_8(7)> > return p.1_11; > > } > > where inside the loop a load and store occurs. For a rawmemchr like loop I > have to show that we never load from a memory location to which we write. > Currently I solve this by hard coding those facts which are not generic at > all. > I gave compute_data_dependences_for_loop a try which failed to determine the > fact that stores only happen to p[0] and loads from p[i] where i>0. Maybe > there are more generic solutions to express this in contrast to my current > one? 
> > Thanks again for your input so far. Really appreciated. > > Cheers, > Stefan > diff --git a/gcc/Makefile.in b/gcc/Makefile.in > index 8a5fb3fd99c..7b2d7405277 100644 > --- a/gcc/Makefile.in > +++ b/gcc/Makefile.in > @@ -1608,6 +1608,7 @@ OBJS = \ > tree-into-ssa.o \ > tree-iterator.o \ > tree-loop-distribution.o \ > + tree-loop-pattern.o \ > tree-nested.o \ > tree-nrv.o \ > tree-object-size.o \ > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > index dd7173126fb..957e96a46a4 100644 > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *) >gcc_unreachable (); > } > > +void > +expand_RAWMEMCHR (internal_fn, gcall *stmt) > +{ > + expand_operand ops[3]; > + > + tree lhs = gimple_call_lhs (stmt); > + if (!lhs) > +return; > + tree lhs_type = TREE_TYPE (l
Re: [RFC] ldist: Recognize rawmemchr loop patterns
[snip] Please find attached a new version of the patch. A major change compared to the previous patch is that I created a separate pass which hopefully makes reviewing also easier since it is almost self-contained. After realizing that detecting loops which mimic the behavior of rawmemchr/strlen functions does not really fit into the topic of loop distribution, I created a separate pass. Due to this I was also able to play around a bit and schedule the pass at different times. Currently it is scheduled right before loop distribution where loop header copying already took place which leads to the following effect. Running this setup over char *t (char *p) { for (; *p; ++p); return p; } the new pass transforms char * t (char * p) { char _1; char _7; [local count: 118111600]: _7 = *p_3(D); if (_7 != 0) goto ; [89.00%] else goto ; [11.00%] [local count: 105119324]: [local count: 955630225]: # p_8 = PHI p_6 = p_8 + 1; _1 = *p_6; if (_1 != 0) goto ; [89.00%] else goto ; [11.00%] [local count: 105119324]: # p_2 = PHI goto ; [100.00%] [local count: 850510901]: goto ; [100.00%] [local count: 12992276]: [local count: 118111600]: # p_9 = PHI return p_9; } into char * t (char * p) { char * _5; char _7; [local count: 118111600]: _7 = *p_3(D); if (_7 != 0) goto ; [89.00%] else goto ; [11.00%] [local count: 105119324]: _5 = p_3(D) + 1; p_10 = .RAWMEMCHR (_5, 0); [local count: 118111600]: # p_9 = PHI return p_9; } which is fine so far. However, I haven't made up my mind so far whether it is worthwhile to spend more time in order to also eliminate the "first unrolling" of the loop. I gave it a shot by scheduling the pass prior to the pass copy header and ended up with: char * t (char * p) { [local count: 118111600]: p_5 = .RAWMEMCHR (p_3(D), 0); return p_5; } which seems optimal to me. The downside of this is that I have to initialize scalar evolution analysis which might be undesired that early. All this brings me to the question where do you see this piece of code running? 
If in a separate pass when would you schedule it? If in an existing pass, which one would you choose? Another topic which came up is whether there exists a more elegant solution to my current implementation in order to deal with stores (I'm speaking of the `if (store_dr)` statement inside of function transform_loop_1). For example, extern char *p; char *t () { for (; *p; ++p); return p; } ends up as char * t () { char * _1; char * _2; char _3; char * p.1_8; char _9; char * p.1_10; char * p.1_11; [local count: 118111600]: p.1_8 = p; _9 = *p.1_8; if (_9 != 0) goto ; [89.00%] else goto ; [11.00%] [local count: 105119324]: [local count: 955630225]: # p.1_10 = PHI <_1(6), p.1_8(5)> _1 = p.1_10 + 1; p = _1; _3 = *_1; if (_3 != 0) goto ; [89.00%] else goto ; [11.00%] [local count: 105119324]: # _2 = PHI <_1(3)> goto ; [100.00%] [local count: 850510901]: goto ; [100.00%] [local count: 12992276]: [local count: 118111600]: # p.1_11 = PHI <_2(8), p.1_8(7)> return p.1_11; } where inside the loop a load and store occurs. For a rawmemchr like loop I have to show that we never load from a memory location to which we write. Currently I solve this by hard coding those facts which are not generic at all. I gave compute_data_dependences_for_loop a try which failed to determine the fact that stores only happen to p[0] and loads from p[i] where i>0. Maybe there are more generic solutions to express this in contrast to my current one? Thanks again for your input so far. Really appreciated. 
Cheers, Stefan diff --git a/gcc/Makefile.in b/gcc/Makefile.in index 8a5fb3fd99c..7b2d7405277 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1608,6 +1608,7 @@ OBJS = \ tree-into-ssa.o \ tree-iterator.o \ tree-loop-distribution.o \ + tree-loop-pattern.o \ tree-nested.o \ tree-nrv.o \ tree-object-size.o \ diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c index dd7173126fb..957e96a46a4 100644 --- a/gcc/internal-fn.c +++ b/gcc/internal-fn.c @@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *) gcc_unreachable (); } +void +expand_RAWMEMCHR (internal_fn, gcall *stmt) +{ + expand_operand ops[3]; + + tree lhs = gimple_call_lhs (stmt); + if (!lhs) +return; + tree lhs_type = TREE_TYPE (lhs); + rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); + create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type)); + + for (unsigned int i = 0; i < 2; ++i) +{ + tree rhs = gimple_call_arg (stmt, i); + tree rhs_type = TREE_TYPE (rhs); + rtx rhs_rtx = expand_normal (rhs); + create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type)); +} + + insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode); + + expand_insn (i
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Tue, Mar 02, 2021 at 01:29:59PM +0100, Richard Biener wrote: > On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus > wrote: > > > > On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote: > > > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via > > > Gcc-patches wrote: > > > > > > > > This patch adds support for recognizing loops which mimic the behaviour > > > > of function rawmemchr, and replaces those with an internal function call > > > > in case a target provides them. In contrast to the original rawmemchr > > > > function, this patch also supports different instances where the memory > > > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized, > > > > respectively. > > > > > > > > This patch is not final and I'm looking for some feedback: > > > > > > > > Previously, only loops which mimic the behaviours of functions memset, > > > > memcpy, and memmove have been detected and replaced by corresponding > > > > function calls. One characteristic of those loops/partitions is that > > > > they don't have a reduction. In contrast, loops which mimic the > > > > behaviour of rawmemchr compute a result and therefore have a reduction. > > > > My current attempt is to ensure that the reduction statement is not used > > > > in any other partition and only in that case ignore the reduction and > > > > replace the loop by a function call. We then only need to replace the > > > > reduction variable of the loop which contained the loop result by the > > > > variable of the lhs of the internal function call. This should ensure > > > > that the transformation is correct independently of how partitions are > > > > fused/distributed in the end. Any thoughts about this? > > > > > > Currently we're forcing reduction partitions last (and force to have a > > > single > > > one by fusing all partitions containing a reduction) because > > > code-generation > > > does not properly update SSA form for the reduction results. 
ISTR that > > > might be just because we do not copy the LC PHI nodes or do not adjust > > > them when copying. That might not be an issue in case you replace the > > > partition with a call. I guess you can try to have a testcase with > > > two rawmemchr patterns and a regular loop part that has to be scheduled > > > inbetween both for correctness. > > > > Ah ok, in that case I updated my patch by removing the constraint that > > the reduction statement must be in precisely one partition. Please find > > attached the testcases I came up so far. Since transforming a loop into > > a rawmemchr function call is backend dependend, I planned to include > > those only in my backend patch. I wasn't able to come up with any > > testcase where a loop is distributed into multiple partitions and where > > one is classified as a rawmemchr builtin. The latter boils down to a > > for loop with an empty body only in which case I suspect that loop > > distribution shouldn't be done anyway. > > > > > > Furthermore, I simply added two new members (pattern, fn) to structure > > > > builtin_info which I consider rather hacky. For the long run I thought > > > > about to split up structure builtin_info into a union where each member > > > > is a structure for a particular builtin of a partition, i.e., something > > > > like this: > > > > > > > > union builtin_info > > > > { > > > > struct binfo_memset *memset; > > > > struct binfo_memcpymove *memcpymove; > > > > struct binfo_rawmemchr *rawmemchr; > > > > }; > > > > > > > > Such that a structure for one builtin does not get "polluted" by a > > > > different one. Any thoughts about this? > > > > > > Probably makes sense if the list of recognized patterns grow further. > > > > > > I see you use internal functions rather than builtin functions. I guess > > > that's OK. 
But you use new target hooks for expansion where I think > > > new optab entries similar to cmpmem would be more appropriate > > > where the distinction between 8, 16 or 32 bits can be encoded in > > > the modes. > > > > The optab implementation is really nice which allows me to use iterators > > in the backend which in the end saves me some boiler plate code compared > > to the previous implementation :) > > > > While using optabs now, I only require one additional member (pattern) > > in the builtin_info struct. Thus I didn't want to overcomplicate things > > and kept the single struct approach as is. > > > > For the long run, should I resubmit this patch once stage 1 opens or how > > would you propose to proceed? > > Yes, and sorry for the delay. Few comments on the patch given I had a > quick look: > > +void > +expand_RAWMEMCHR (internal_fn, gcall *stmt) > +{ > + expand_operand ops[3]; > + > + tree lhs = gimple_call_lhs (stmt); > > I think that give we have people testing with -fno-tree-dce you > should try to handle a NULL LHS gracefully. I suppose by > simply doing nothing. > > + tree result_old = build_fold_addr_expr (DR_REF (d
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus wrote: > > On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote: > > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via > > Gcc-patches wrote: > > > > > > This patch adds support for recognizing loops which mimic the behaviour > > > of function rawmemchr, and replaces those with an internal function call > > > in case a target provides them. In contrast to the original rawmemchr > > > function, this patch also supports different instances where the memory > > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized, > > > respectively. > > > > > > This patch is not final and I'm looking for some feedback: > > > > > > Previously, only loops which mimic the behaviours of functions memset, > > > memcpy, and memmove have been detected and replaced by corresponding > > > function calls. One characteristic of those loops/partitions is that > > > they don't have a reduction. In contrast, loops which mimic the > > > behaviour of rawmemchr compute a result and therefore have a reduction. > > > My current attempt is to ensure that the reduction statement is not used > > > in any other partition and only in that case ignore the reduction and > > > replace the loop by a function call. We then only need to replace the > > > reduction variable of the loop which contained the loop result by the > > > variable of the lhs of the internal function call. This should ensure > > > that the transformation is correct independently of how partitions are > > > fused/distributed in the end. Any thoughts about this? > > > > Currently we're forcing reduction partitions last (and force to have a > > single > > one by fusing all partitions containing a reduction) because code-generation > > does not properly update SSA form for the reduction results. ISTR that > > might be just because we do not copy the LC PHI nodes or do not adjust > > them when copying. 
That might not be an issue in case you replace the > > partition with a call. I guess you can try to have a testcase with > > two rawmemchr patterns and a regular loop part that has to be scheduled > > in between both for correctness. > > Ah ok, in that case I updated my patch by removing the constraint that > the reduction statement must be in precisely one partition. Please find > attached the testcases I came up with so far. Since transforming a loop into > a rawmemchr function call is backend dependent, I planned to include > those only in my backend patch. I wasn't able to come up with any > testcase where a loop is distributed into multiple partitions and where > one is classified as a rawmemchr builtin. The latter boils down to a > for loop with an empty body only, in which case I suspect that loop > distribution shouldn't be done anyway. > > > > Furthermore, I simply added two new members (pattern, fn) to structure > > > builtin_info which I consider rather hacky. For the long run I thought > > > about splitting up structure builtin_info into a union where each member > > > is a structure for a particular builtin of a partition, i.e., something > > > like this: > > > > > > union builtin_info > > > { > > > struct binfo_memset *memset; > > > struct binfo_memcpymove *memcpymove; > > > struct binfo_rawmemchr *rawmemchr; > > > }; > > > > > > Such that a structure for one builtin does not get "polluted" by a > > > different one. Any thoughts about this? > > > > Probably makes sense if the list of recognized patterns grows further. > > > > I see you use internal functions rather than builtin functions. I guess > > that's OK. But you use new target hooks for expansion where I think > > new optab entries similar to cmpmem would be more appropriate > > where the distinction between 8, 16 or 32 bits can be encoded in > > the modes.
> > The optab implementation is really nice which allows me to use iterators > in the backend which in the end saves me some boilerplate code compared > to the previous implementation :) > > While using optabs now, I only require one additional member (pattern) > in the builtin_info struct. Thus I didn't want to overcomplicate things > and kept the single struct approach as is. > > For the long run, should I resubmit this patch once stage 1 opens or how > would you propose to proceed? Yes, and sorry for the delay. Few comments on the patch given I had a quick look: +void +expand_RAWMEMCHR (internal_fn, gcall *stmt) +{ + expand_operand ops[3]; + + tree lhs = gimple_call_lhs (stmt); I think that given we have people testing with -fno-tree-dce you should try to handle a NULL LHS gracefully. I suppose by simply doing nothing. + tree result_old = build_fold_addr_expr (DR_REF (dr)); + tree result_new = copy_ssa_name (result_old); I think you simply want tree result = make_ssa_name (ptr_type_node); most definitely + imm_use_iterator iter; + gimple *stmt; + use_operand_p use_p; + FOR_EACH_IMM_USE_STMT (stmt, iter, result_old) +{
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On 2/25/21 10:01 AM, Stefan Schulze Frielinghaus via Gcc-patches wrote: > Ping Given this is not a regression, it needs to wait for gcc-12. Jeff
Re: [RFC] ldist: Recognize rawmemchr loop patterns
Ping On Sun, Feb 14, 2021 at 11:27:40AM +0100, Stefan Schulze Frielinghaus wrote: > On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote: > > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via > > Gcc-patches wrote: > > > > > > This patch adds support for recognizing loops which mimic the behaviour > > > of function rawmemchr, and replaces those with an internal function call > > > in case a target provides them. In contrast to the original rawmemchr > > > function, this patch also supports different instances where the memory > > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized, > > > respectively. > > > > > > This patch is not final and I'm looking for some feedback: > > > > > > Previously, only loops which mimic the behaviours of functions memset, > > > memcpy, and memmove have been detected and replaced by corresponding > > > function calls. One characteristic of those loops/partitions is that > > > they don't have a reduction. In contrast, loops which mimic the > > > behaviour of rawmemchr compute a result and therefore have a reduction. > > > My current attempt is to ensure that the reduction statement is not used > > > in any other partition and only in that case ignore the reduction and > > > replace the loop by a function call. We then only need to replace the > > > reduction variable of the loop which contained the loop result by the > > > variable of the lhs of the internal function call. This should ensure > > > that the transformation is correct independently of how partitions are > > > fused/distributed in the end. Any thoughts about this? > > > > Currently we're forcing reduction partitions last (and force to have a > > single > > one by fusing all partitions containing a reduction) because code-generation > > does not properly update SSA form for the reduction results. ISTR that > > might be just because we do not copy the LC PHI nodes or do not adjust > > them when copying. 
That might not be an issue in case you replace the > > partition with a call. I guess you can try to have a testcase with > > two rawmemchr patterns and a regular loop part that has to be scheduled > > in between both for correctness. > > Ah ok, in that case I updated my patch by removing the constraint that > the reduction statement must be in precisely one partition. Please find > attached the testcases I came up with so far. Since transforming a loop into > a rawmemchr function call is backend dependent, I planned to include > those only in my backend patch. I wasn't able to come up with any > testcase where a loop is distributed into multiple partitions and where > one is classified as a rawmemchr builtin. The latter boils down to a > for loop with an empty body only, in which case I suspect that loop > distribution shouldn't be done anyway. > > > > Furthermore, I simply added two new members (pattern, fn) to structure > > > builtin_info which I consider rather hacky. For the long run I thought > > > about splitting up structure builtin_info into a union where each member > > > is a structure for a particular builtin of a partition, i.e., something > > > like this: > > > > > > union builtin_info > > > { > > > struct binfo_memset *memset; > > > struct binfo_memcpymove *memcpymove; > > > struct binfo_rawmemchr *rawmemchr; > > > }; > > > > > > Such that a structure for one builtin does not get "polluted" by a > > > different one. Any thoughts about this? > > > > Probably makes sense if the list of recognized patterns grows further. > > > > I see you use internal functions rather than builtin functions. I guess > > that's OK. But you use new target hooks for expansion where I think > > new optab entries similar to cmpmem would be more appropriate > > where the distinction between 8, 16 or 32 bits can be encoded in > > the modes.
> > The optab implementation is really nice which allows me to use iterators > in the backend which in the end saves me some boilerplate code compared > to the previous implementation :) > > While using optabs now, I only require one additional member (pattern) > in the builtin_info struct. Thus I didn't want to overcomplicate things > and kept the single struct approach as is. > > For the long run, should I resubmit this patch once stage 1 opens or how > would you propose to proceed? > > Thanks for your review so far! > > Cheers, > Stefan > > > > > Richard. > > > > > Cheers, > > > Stefan > > > --- > > > gcc/internal-fn.c| 42 ++ > > > gcc/internal-fn.def | 3 + > > > gcc/target-insns.def | 3 + > > > gcc/tree-loop-distribution.c | 257 ++- > > > 4 files changed, 272 insertions(+), 33 deletions(-) > > > > > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > > > index dd7173126fb..9cd62544a1a 100644 > > > --- a/gcc/internal-fn.c > > > +++ b/gcc/internal-fn.c > > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *) > >gcc_unreac
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote: > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via > Gcc-patches wrote: > > > > This patch adds support for recognizing loops which mimic the behaviour > > of function rawmemchr, and replaces those with an internal function call > > in case a target provides them. In contrast to the original rawmemchr > > function, this patch also supports different instances where the memory > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized, > > respectively. > > > > This patch is not final and I'm looking for some feedback: > > > > Previously, only loops which mimic the behaviours of functions memset, > > memcpy, and memmove have been detected and replaced by corresponding > > function calls. One characteristic of those loops/partitions is that > > they don't have a reduction. In contrast, loops which mimic the > > behaviour of rawmemchr compute a result and therefore have a reduction. > > My current attempt is to ensure that the reduction statement is not used > > in any other partition and only in that case ignore the reduction and > > replace the loop by a function call. We then only need to replace the > > reduction variable of the loop which contained the loop result by the > > variable of the lhs of the internal function call. This should ensure > > that the transformation is correct independently of how partitions are > > fused/distributed in the end. Any thoughts about this? > > Currently we're forcing reduction partitions last (and force to have a single > one by fusing all partitions containing a reduction) because code-generation > does not properly update SSA form for the reduction results. ISTR that > might be just because we do not copy the LC PHI nodes or do not adjust > them when copying. That might not be an issue in case you replace the > partition with a call. 
I guess you can try to have a testcase with > two rawmemchr patterns and a regular loop part that has to be scheduled > in between both for correctness. Ah ok, in that case I updated my patch by removing the constraint that the reduction statement must be in precisely one partition. Please find attached the testcases I came up with so far. Since transforming a loop into a rawmemchr function call is backend dependent, I planned to include those only in my backend patch. I wasn't able to come up with any testcase where a loop is distributed into multiple partitions and where one is classified as a rawmemchr builtin. The latter boils down to a for loop with an empty body only, in which case I suspect that loop distribution shouldn't be done anyway. > > Furthermore, I simply added two new members (pattern, fn) to structure > > builtin_info which I consider rather hacky. For the long run I thought > > about splitting up structure builtin_info into a union where each member > > is a structure for a particular builtin of a partition, i.e., something > > like this: > > > > union builtin_info > > { > > struct binfo_memset *memset; > > struct binfo_memcpymove *memcpymove; > > struct binfo_rawmemchr *rawmemchr; > > }; > > > > Such that a structure for one builtin does not get "polluted" by a > > different one. Any thoughts about this? > > Probably makes sense if the list of recognized patterns grows further. > > I see you use internal functions rather than builtin functions. I guess > that's OK. But you use new target hooks for expansion where I think > new optab entries similar to cmpmem would be more appropriate > where the distinction between 8, 16 or 32 bits can be encoded in > the modes. The optab implementation is really nice which allows me to use iterators in the backend which in the end saves me some boilerplate code compared to the previous implementation :) While using optabs now, I only require one additional member (pattern) in the builtin_info struct.
Thus I didn't want to overcomplicate things and kept the single struct approach as is. For the long run, should I resubmit this patch once stage 1 opens or how would you propose to proceed? Thanks for your review so far! Cheers, Stefan > > Richard. > > > Cheers, > > Stefan > > --- > > gcc/internal-fn.c| 42 ++ > > gcc/internal-fn.def | 3 + > > gcc/target-insns.def | 3 + > > gcc/tree-loop-distribution.c | 257 ++- > > 4 files changed, 272 insertions(+), 33 deletions(-) > > > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > > index dd7173126fb..9cd62544a1a 100644 > > --- a/gcc/internal-fn.c > > +++ b/gcc/internal-fn.c > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *) > >gcc_unreachable (); > > } > > > > +static void > > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt) > > +{ > > + if (targetm.have_rawmemchr8 ()) > > +{ > > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, > > VOIDmode, EXPAND_WRITE); > > + rtx start = expand_normal (gimple_cal
Re: [RFC] ldist: Recognize rawmemchr loop patterns
On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via Gcc-patches wrote: > > This patch adds support for recognizing loops which mimic the behaviour > of function rawmemchr, and replaces those with an internal function call > in case a target provides them. In contrast to the original rawmemchr > function, this patch also supports different instances where the memory > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized, > respectively. > > This patch is not final and I'm looking for some feedback: > > Previously, only loops which mimic the behaviours of functions memset, > memcpy, and memmove have been detected and replaced by corresponding > function calls. One characteristic of those loops/partitions is that > they don't have a reduction. In contrast, loops which mimic the > behaviour of rawmemchr compute a result and therefore have a reduction. > My current attempt is to ensure that the reduction statement is not used > in any other partition and only in that case ignore the reduction and > replace the loop by a function call. We then only need to replace the > reduction variable of the loop which contained the loop result by the > variable of the lhs of the internal function call. This should ensure > that the transformation is correct independently of how partitions are > fused/distributed in the end. Any thoughts about this? Currently we're forcing reduction partitions last (and force to have a single one by fusing all partitions containing a reduction) because code-generation does not properly update SSA form for the reduction results. ISTR that might be just because we do not copy the LC PHI nodes or do not adjust them when copying. That might not be an issue in case you replace the partition with a call. I guess you can try to have a testcase with two rawmemchr patterns and a regular loop part that has to be scheduled in between both for correctness.
> Furthermore, I simply added two new members (pattern, fn) to structure > builtin_info which I consider rather hacky. For the long run I thought > about splitting up structure builtin_info into a union where each member > is a structure for a particular builtin of a partition, i.e., something > like this: > > union builtin_info > { > struct binfo_memset *memset; > struct binfo_memcpymove *memcpymove; > struct binfo_rawmemchr *rawmemchr; > }; > > Such that a structure for one builtin does not get "polluted" by a > different one. Any thoughts about this? Probably makes sense if the list of recognized patterns grows further. I see you use internal functions rather than builtin functions. I guess that's OK. But you use new target hooks for expansion where I think new optab entries similar to cmpmem would be more appropriate where the distinction between 8, 16 or 32 bits can be encoded in the modes. Richard. > Cheers, > Stefan > --- > gcc/internal-fn.c| 42 ++ > gcc/internal-fn.def | 3 + > gcc/target-insns.def | 3 + > gcc/tree-loop-distribution.c | 257 ++- > 4 files changed, 272 insertions(+), 33 deletions(-) > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > index dd7173126fb..9cd62544a1a 100644 > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *) >gcc_unreachable (); > } > > +static void > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt) > +{ > + if (targetm.have_rawmemchr8 ()) > +{ > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, > EXPAND_WRITE); > + rtx start = expand_normal (gimple_call_arg (stmt, 0)); > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1)); > + emit_insn (targetm.gen_rawmemchr8 (result, start, pattern)); > +} > + else > +gcc_unreachable(); > +} > + > +static void > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt) > +{ > + if (targetm.have_rawmemchr16 ()) > +{ > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, > EXPAND_WRITE); > + 
rtx start = expand_normal (gimple_call_arg (stmt, 0)); > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1)); > + emit_insn (targetm.gen_rawmemchr16 (result, start, pattern)); > +} > + else > +gcc_unreachable(); > +} > + > +static void > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt) > +{ > + if (targetm.have_rawmemchr32 ()) > +{ > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, > EXPAND_WRITE); > + rtx start = expand_normal (gimple_call_arg (stmt, 0)); > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1)); > + emit_insn (targetm.gen_rawmemchr32 (result, start, pattern)); > +} > + else > +gcc_unreachable(); > +} > + > /* Expand the IFN_UNIQUE function according to its first argument. */ > > static void > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > index daeace7a34e..34247859704 100644 > --- a/gcc/internal-fn.