On 1/11/24 11:20, juzhe.zh...@rivai.ai wrote:
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the 
> loop we have these 2 scalar_to_vec:
> 
> 1. MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 1 in prologue
> 
>    This scalar_to_vec cost should be 0 or 1 since it only generates a single 
> instruction: vmv.v.i v16,15
> 
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> 
>    This cost should be higher since it takes 3 instructions:
>     li a4,-32768
>     addiw a4,a4,104
>     vmv.v.x v16,a4
> 
> Am I correct ?
> 
> I guess if we cost case 1 as 1 and case 2 as 3, then we will be 
> good.

That would be the general idea, yes.  As Richard mentioned, it doesn't
always work well but for this case here it could help a bit.
(My question of why we shouldn't vectorize this at 256b
and above still stands, though.)

As mentioned before, the other thing that needs to be considered
is register-move costs (or the respective cost structure).  On
some uarchs the vfmv.v.f might be more expensive than vmv.v.x and
so on - in addition to the instructions needed to synthesize the
constant.
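
To make the two components concrete, here is a rough sketch of how such a
cost could be composed: the scalar instructions needed to synthesize the
constant plus a per-uarch broadcast cost. This is not the actual GCC code;
BroadcastCosts, scalar_insns_for_const and the two *_scalar_to_vec_cost
helpers are made-up names, and the constant-synthesis model is deliberately
simplified to 32-bit values (a single addi, or lui plus an optional addiw).

```cpp
#include <cstdint>

// Hypothetical per-uarch costs for broadcasting a scalar register into a
// vector register. The names are illustrative, not GCC's actual fields.
struct BroadcastCosts {
  int gpr_to_vec;  // e.g. vmv.v.x
  int fpr_to_vec;  // e.g. vfmv.v.f
};

// Scalar instructions needed to materialize a 32-bit constant in a GPR.
// Simplified model: one addi for 12-bit immediates, otherwise lui plus an
// optional addiw for the low 12 bits.
inline int scalar_insns_for_const(int32_t value) {
  if (value >= -2048 && value <= 2047)
    return 1;
  return (value & 0xfff) == 0 ? 1 : 2;
}

// Cost of broadcasting an integer constant (scalar_to_vec).
inline int int_scalar_to_vec_cost(int32_t value, const BroadcastCosts &c) {
  if (value >= -16 && value <= 15)
    return 1;  // fits the 5-bit immediate of a single vmv.v.i
  return scalar_insns_for_const(value) + c.gpr_to_vec;
}

// Cost of broadcasting a floating-point constant. The caller supplies the
// (assumed) number of instructions needed to get the value into an FPR.
inline int fp_scalar_to_vec_cost(int insns_to_load_fpr,
                                 const BroadcastCosts &c) {
  return insns_to_load_fpr + c.fpr_to_vec;
}
```

With gpr_to_vec set to 1, this gives 1 for the constant 15 and 3 for 32872
(which is -32664 as a signed 16-bit element), matching the instruction
counts in the quoted mail. A uarch with a slower FPR-to-vector move would
only need a larger fpr_to_vec.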

Regards
 Robin
