"juzhe.zh...@rivai.ai" <juzhe.zh...@rivai.ai> writes:
>>> But the issue is the same in the reverse with WHILE_LEN, no?
>>>WHILE_LEN just computes a scalar value - you seem to suggest
>>>there's a hidden side-effect of "coalescing" the result with
>>>a hardware vector length register?  I don't think that's good design.
> No, I don't plan to suggest there's a hidden side-effect of "coalescing"
> the result with a hardware vector length register.
>
> Today, I read RVV ISA deeply again. I realize that this patch is not absolute 
> correct for 
> any RVV hardward.
>
> According to RVV ISA, the vsetvl definition:
> an vsetvli instruction which is vsetvli vl, avl, vtype
> vl = AVL if AVL ≤ VLMAX
> ceil(AVL / 2) ≤ vl ≤ VLMAX if AVL < (2 * VLMAX)
> vl = VLMAX if AVL ≥ (2 * VLMAX)
> Deterministic on any given implementation for same input AVL and VLMAX values
> The second constraint make the result of vsetvli is not necessary to be VLMAX 
> (the maximum number of elements will be updated of specific vector-length RVV 
> CPU).
>
> So for a vsetvli instruction (vsetvli vl,avl,vtype). The "vl" value can be 
> various among different RVV CPU depending on the implementation of the 
> downstream RVV hardware.
>
>
>  Now I think I should fix this patch since this patch is not always suitable 
> for all hardware.
>
> So according to RVV ISA:
> For example, this permits an implementation to set vl = ceil(AVL / 2) for 
> VLMAX < AVL < 2*VLMAX in order to evenly distribute work over the last two 
> iterations of a stripmine loop.
>
> We can have these 2 following different RVV CPU:
>
> Suppose  the maximum number of the elements needs to be updated is 10 element 
> (int32_t), and the vector length = 256 bit (update 8 INT32 elements in max).
>
> So there are 2 iterations we need, the number elements of each iteration 
> depending on hardware implementation.
>
> So we can have these 2 following hardware implementation are both legal for 
> RVV standard:
>
> RVV CPU 1:
> 1st iteration update 5 element (it satisfy the constraint ceil (AVL/2) <= vl 
> <= VLMAX), set vl = ceil (AVL/2) = 5
> 2nd iteration update 5 elements too.
>
> RVV CPU 2:
> 1st iteration update 8 elements. set vl = VLMAX = 8.
> 2nd iteration update 3 elments.
>
> These 2 RVV CPU are both legal according to RVV specification standard.
> It's obvious this patch is correct for RVV CPU 2 but incorrect for RVV CPU 1.

Ah, OK.  In that case, I guess a new ifn like WHILE_LEN will be needed
after all.

> Since the current flow of this patch is as follows:
>
> +     <bb 3>
> +     _19 = (unsigned long) n_5(D);
> +     ...
> +
> +     <bb 4>:
> +     ...
> +     # ivtmp_20 = PHI <ivtmp_21(4), _19(3)>
> +     ...
> +     _22 = .WHILE_LEN (ivtmp_20, vf);
> +     ...
> +     LEN_LOAD (addr, _22);...        addr = addr + vf;+      ivtmp_21 = 
> ivtmp_20 - _22;
> +     ...
> +     if (ivtmp_21 != 0)
> +       goto <bb 4>; [75.00%]
> +     else
> +       goto <bb 5>; [25.00%]
> +
> +     <bb 5>
> +     return;
>
> Here the _22 which is the output of WHILE_LEN is only used in ivtmp_21 = 
> ivtmp_20 - _22;
> which serves the saturating-to-zero subtraction. 
> And "addr = addr + vf;" 
> The address is calculated in the loop just keep add vf. 
> Such sequence is Ok for most of the RVV CPU so far I think.
> However, for future compatibility, we should make WHILE_LEN output as the 
> address IV adding value too.
>
> So, we should abandon the current the address loop way which is just keeping 
> add vf.
>
> Replace "addr = addr + vf".
>
> Instead, we should do like that:
>
> _22 = .WHILE_LEN (ivtmp_20, vf);
> ....
> LEN_LOAD (addr, _22);tmp = _22 * 4; (Assume it is INT32 calculation, make _22 
> which is INT32 align into BYTE align for address counting) addr = addr + 
> tmp;....
> Makeing the result of WHILE_LEN is not only used to do the remain = remain - 
> len,But also used in addressing calculating: tmp = len * (element bytesize) ; 
> addr = addr + tmp;
> Then this flow is the correct flow for all RVV CPU.
> This flow is totally same as example in RVV ISA define:
> https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s
> I think I need to change this patch as described above to make it to be 
> global suitable for all RVV CPU in the word. But I am not sure whether GCC 
> community accept this flow. So I propose it now before I do it. 
> I didn't realize that since my downstream RVV hardware and the open-source 
> simulator generate "vl" = VLMAX. (Sorry about that)

Sounds OK to me FWIW.  Supporting non-(poly-)constant VFs seems useful.

I had a local patch (not submitted) that needed non-constant VFs for
a different reason: to clamp the VF at runtime to avoid hazards.
E.g. if we know that vectorisation is safe up to VF=N but not
beyond that, the patch would limit the VF to N on targets with
larger vectors, rather than punt to the scalar fallback loop.

There's still a key difference between that use case and yours,
since in "my" case the VF would still be invariant (for SVE),
whereas with yours the VF would vary between iterations.
But the general principle is the same.

Thanks,
Richard

Reply via email to