Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Tuesday, September 2, 2025 1:44 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
>> Subject: Re: [PATCH 1/3]middle-end: clear the user unroll flag if the cost model has overriden it
>> 
>> On Tue, 2 Sep 2025, Tamar Christina wrote:
>> 
>> > > What was it that made you propose this change?
>> >
>> > When we have a loop over, say, int and a #pragma unroll 4:
>> >
>> > If the vectorizer picks V4SI as the mode, the requested unroll ends up
>> > exactly matching the VF. As such the requested unroll is 1 and we don't
>> > clear the pragma.
>> >
>> > So it did honor the requested unroll factor. However since we didn't set
>> > the unroll amount back and left it at 4 the rtl unroller won't use the
>> > rtl cost model at all and just unroll the vector loop 4 times.
>> 
>> Ah, OK.
>> 
>> > This change isn't to bypass the rtl cost model, it's to allow it to be
>> > used rather than overriding it after vectorization.
>> 
>> OK, fine.  But still, consider
>> 
>> #pragma unroll 4
>>  for (int i = 0; i < 64; ++i)
>>   {
>>     a[4*i+0] = i;
>>     a[4*i+1] = i;
>>     a[4*i+2] = i;
>>     a[4*i+3] = i;
>>   }
>> 
>> so VF == 1, suggested_unroll_factor == 4.  If we don't up VF to 4
>> should we still claim we did any unrolling?  If the target suggested
>> an unroll factor of two, should we instead change ->unroll to 2?
>> Should the user unroll factor override the vector target one?
>> 
>
> I think the target unroll factor should always win out, primarily because
> of throughput-based costing.  For the loop above on a 4 VX system, the
> vectorizer should already be using VF = 4, suggested_unroll_factor == 4.
>
> We also don't ever force unrolling for predicated SVE because for
> predicated SVE we have to balance predicate throughput limitations
> of any given CPU.  Having the user unroll factor be able to override
> the cost model one will almost certainly lead to worse performance
> in this case.

FWIW, cause and effect are kind of the other way around: we request an
unroll factor for SVE in the normal way, but doing so disables predication,
thanks to:

      /* For partial-vector-usage=1, try to push the handling of partial
         vectors to the epilogue, with the main loop continuing to operate
         on full vectors.

         If we are unrolling we also do not want to use partial vectors. This
         is to avoid the overhead of generating multiple masks and also to
         avoid having to execute entire iterations of FALSE masked instructions
         when dealing with one or less full iterations.

         ??? We could then end up failing to use partial vectors if we
         decide to peel iterations into a prologue, and if the main loop
         then ends up processing fewer than VF iterations.  */
      if ((param_vect_partial_vector_usage == 1
           || loop_vinfo->suggested_unroll_factor > 1)
          && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
          && !vect_known_niters_smaller_than_vf (loop_vinfo))
        LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
      else
        LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;

In other words, the choice of unroll factor is an input to the
predication decision, rather than the predication decision being an
input to the choice of unroll factor.
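
As a rough illustration of that direction of dependence, here is a minimal
standalone sketch of the condition quoted above.  It is not GCC code: the
struct, the enum and the function name are hypothetical stand-ins for the
loop_vinfo fields and predicates the real condition reads, and it assumes
we are already on the path where partial vectors are usable at all.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few loop_vinfo properties that the
   quoted condition reads.  Not GCC's real data structure.  */
struct toy_loop_info
{
  int partial_vector_usage;          /* models param_vect_partial_vector_usage */
  unsigned suggested_unroll_factor;  /* models loop_vinfo->suggested_unroll_factor */
  bool is_epilogue;                  /* models LOOP_VINFO_EPILOGUE_P */
  bool niters_smaller_than_vf;       /* models vect_known_niters_smaller_than_vf */
};

/* Hypothetical names for the two outcomes of the quoted if/else.  */
enum toy_partial_choice
{
  TOY_MAIN_LOOP_PARTIAL,  /* main loop itself uses partial vectors */
  TOY_EPILOGUE_PARTIAL    /* main loop uses full vectors; partial vectors
                             are pushed to the epilogue */
};

/* Mirror of the quoted condition: an unroll factor greater than 1 is
   one of the inputs that moves partial vectors out of the main loop.  */
static enum toy_partial_choice
toy_choose_partial_vectors (const struct toy_loop_info *info)
{
  if ((info->partial_vector_usage == 1
       || info->suggested_unroll_factor > 1)
      && !info->is_epilogue
      && !info->niters_smaller_than_vf)
    return TOY_EPILOGUE_PARTIAL;
  return TOY_MAIN_LOOP_PARTIAL;
}
```

With partial_vector_usage set to 2 and the other flags false, changing
suggested_unroll_factor from 1 to 4 flips the result from
TOY_MAIN_LOOP_PARTIAL to TOY_EPILOGUE_PARTIAL, which is the sense in which
the unroll factor feeds the predication decision.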

Thanks,
Richard
