Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread Robin Dapp via Gcc-patches
> OK, just sent V2. I will adjust the comment and send V3 again :)

Sorry, was too slow.

Regards
 Robin



Re: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread juzhe.zh...@rivai.ai
OK, just sent V2. I will adjust the comment and send V3 again :)



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-06-20 16:55
To: juzhe.zh...@rivai.ai; gcc-patches
CC: rdapp.gcc; kito.cheng; Kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP
 
> +   /* Step 2: VID AND -NPATTERNS:
> +  { 0 & -4, 1 & -4, 2 & -4, 3 & -4, 4 & -4, 5 & -4, 6 & -4, 7 & -4, ... }
> +   */
Before that, just add something simple like:
 
We want to create a pattern where value[ix] = floor (ix / NPATTERNS) * NPATTERNS.
As NPATTERNS is always a power of two we can rewrite this as
value[ix] = ix & -NPATTERNS.
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread Robin Dapp via Gcc-patches


> +   /* Step 2: VID AND -NPATTERNS:
> +  { 0 & -4, 1 & -4, 2 & -4, 3 & -4, 4 & -4, 5 & -4, 6 & -4, 7 & -4, ... }
> +   */
Before that, just add something simple like:

We want to create a pattern where value[ix] = floor (ix / NPATTERNS) * NPATTERNS.
As NPATTERNS is always a power of two we can rewrite this as
value[ix] = ix & -NPATTERNS.
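
For illustration, the identity can be checked with a minimal standalone C
program (a sketch, not part of the patch):

#include <assert.h>

int
main (void)
{
  /* For any power-of-two n, rounding ix down to a multiple of n
     is a single AND: (ix / n) * n == (ix & -n).  */
  for (int n = 1; n <= 64; n <<= 1)
    for (int ix = 0; ix < 256; ix++)
      assert ((ix / n) * n == (ix & -n));
  return 0;
}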

Regards
 Robin



Re: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread juzhe.zh...@rivai.ai
Hi Robin. Could you give me more details on the comment?

What about:

  /* As NPATTERNS is always a power of two, we can optimize the codegen of
 VLA const vectors based on this property.  */

Is this comment OK? I always struggle with writing English comments and
descriptions.

Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-06-20 16:03
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP
> This is a nice improvement.  Even though we're in the SLP realm I would
> still add an assert that documents that we're indeed operating with
> pow2_p (NPATTERNS) and some comment as to why we can use AND.
> Sure we're doing exact_log2 et al later anyway, just to make things
> clearer.
 
Actually no assert necessary, just a comment like:
 
/* As NPATTERNS is always a power of two, we can ...  */
 
Regards
Robin
 
 


Re: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread juzhe.zh...@rivai.ai
Hi, Robin.

>> Actually no assert necessary, just a comment like:
>> /* As NPATTERNS is always a power of two, we can ...  */
Ok.

>> My immediate idea would have been to fall back to the first
>> approach, i.e. create the "0x00030002..." constant
>> and then
>>   vid.v v4
>>   vand.vi v4, v4, -4
>>   vadd.vv v4, v4, v6

>> It's one more vector instruction though so possibly worse from a latency
>> standpoint.

>> Rest looks good to me.

I have tried this solution and discussed it with my colleague in depth; it
clearly consumes one more vector insn.
One could argue that this patch needs one more scalar insn (slli), but that
is relatively cheaper than a vector insn.
So I prefer the approach in this patch.

I will address your comments and send a V2 with the comment added.
Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-06-20 15:55
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP
Hi Juzhe,
 
> Case 1:
> void
> f (uint8_t *restrict a, uint8_t *restrict b)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[i * 8] = b[i * 8 + 37] + 1;
>       a[i * 8 + 1] = b[i * 8 + 37] + 2;
>       a[i * 8 + 2] = b[i * 8 + 37] + 3;
>       a[i * 8 + 3] = b[i * 8 + 37] + 4;
>       a[i * 8 + 4] = b[i * 8 + 37] + 5;
>       a[i * 8 + 5] = b[i * 8 + 37] + 6;
>       a[i * 8 + 6] = b[i * 8 + 37] + 7;
>       a[i * 8 + 7] = b[i * 8 + 37] + 8;
>     }
> }
> 
> We need to generate the stepped vector:
> NPATTERNS = 8.
> { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 }
> 
> Before this patch:
> vid.v    v4         ;; {0,1,2,3,4,5,6,7,...}
> vsrl.vi  v4,v4,3    ;; {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...}
> li       a3,8       ;; {8}
> vmul.vx  v4,v4,a3   ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
> 
> After this patch:
> vid.v    v4                    ;; {0,1,2,3,4,5,6,7,...}
> vand.vi  v4,v4,-8 (-NPATTERNS) ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
 
This is a nice improvement.  Even though we're in the SLP realm I would
still add an assert that documents that we're indeed operating with
pow2_p (NPATTERNS) and some comment as to why we can use AND.
Sure we're doing exact_log2 et al later anyway, just to make things
clearer.
 
> Before this patch:
> li       a6,134221824
> slli     a6,a6,5
> addi     a6,a6,3    ;; 64-bit: 0x0000000100020003
> vmv.v.x  v6,a6      ;; {3, 2, 1, 0, ... }
> vid.v    v4         ;; {0, 1, 2, 3, 4, 5, 6, 7, ... }
> vsrl.vi  v4,v4,2    ;; {0, 0, 0, 0, 1, 1, 1, 1, ... }
> li       a3,4       ;; {4}
> vmul.vx  v4,v4,a3   ;; {0, 0, 0, 0, 4, 4, 4, 4, ... }
> vadd.vv  v4,v4,v6   ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
> 
> After this patch:
> li       a3,-536875008
> slli     a3,a3,4
> addi     a3,a3,1
> slli     a3,a3,16
> vmv.v.x  v2,a3      ;; {3, 1, -1, -3, ... }
> vid.v    v4         ;; {0, 1, 2, 3, 4, 5, 6, 7, ... }
> vadd.vv  v4,v4,v2   ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
 
My immediate idea would have been to fall back to the first
approach, i.e. create the "0x00030002..." constant
 
> li       a6,134221824
> slli     a6,a6,5
> addi     a6,a6,3    ;; 64-bit: 0x0000000100020003
> vmv.v.x  v6,a6      ;; {3, 2, 1, 0, ... }
 
and then
  vid.v v4
  vand.vi v4, v4, -4
  vadd.vv v4, v4, v6
 
It's one more vector instruction though so possibly worse from a latency
standpoint.
 
Rest looks good to me.
 
Regards
Robin
 


Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread Robin Dapp via Gcc-patches
> This is a nice improvement.  Even though we're in the SLP realm I would
> still add an assert that documents that we're indeed operating with
> pow2_p (NPATTERNS) and some comment as to why we can use AND.
> Sure we're doing exact_log2 et al later anyway, just to make things
> clearer.

Actually no assert necessary, just a comment like:

/* As NPATTERNS is always a power of two, we can ...  */

Regards
 Robin



Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

2023-06-20 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

> Case 1:
> void
> f (uint8_t *restrict a, uint8_t *restrict b)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[i * 8] = b[i * 8 + 37] + 1;
>       a[i * 8 + 1] = b[i * 8 + 37] + 2;
>       a[i * 8 + 2] = b[i * 8 + 37] + 3;
>       a[i * 8 + 3] = b[i * 8 + 37] + 4;
>       a[i * 8 + 4] = b[i * 8 + 37] + 5;
>       a[i * 8 + 5] = b[i * 8 + 37] + 6;
>       a[i * 8 + 6] = b[i * 8 + 37] + 7;
>       a[i * 8 + 7] = b[i * 8 + 37] + 8;
>     }
> }
> 
> We need to generate the stepped vector:
> NPATTERNS = 8.
> { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 }
> 
> Before this patch:
> vid.v    v4         ;; {0,1,2,3,4,5,6,7,...}
> vsrl.vi  v4,v4,3    ;; {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...}
> li       a3,8       ;; {8}
> vmul.vx  v4,v4,a3   ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
> 
> After this patch:
> vid.v    v4                    ;; {0,1,2,3,4,5,6,7,...}
> vand.vi  v4,v4,-8 (-NPATTERNS) ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
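
As a cross-check, a scalar model (a sketch, not from the patch) shows the
old and new sequences agree element-wise:

#include <assert.h>

#define NPATTERNS 8

int
main (void)
{
  for (int ix = 0; ix < 64; ix++)
    {
      int old_val = (ix >> 3) * NPATTERNS; /* vsrl.vi + li + vmul.vx */
      int new_val = ix & -NPATTERNS;       /* single vand.vi         */
      assert (old_val == new_val);         /* {0,..,0,8,..,8,16,..}  */
    }
  return 0;
}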

This is a nice improvement.  Even though we're in the SLP realm I would
still add an assert that documents that we're indeed operating with
pow2_p (NPATTERNS) and some comment as to why we can use AND.
Sure we're doing exact_log2 et al later anyway, just to make things
clearer.

> Before this patch:
> li       a6,134221824
> slli     a6,a6,5
> addi     a6,a6,3    ;; 64-bit: 0x0000000100020003
> vmv.v.x  v6,a6      ;; {3, 2, 1, 0, ... }
> vid.v    v4         ;; {0, 1, 2, 3, 4, 5, 6, 7, ... }
> vsrl.vi  v4,v4,2    ;; {0, 0, 0, 0, 1, 1, 1, 1, ... }
> li       a3,4       ;; {4}
> vmul.vx  v4,v4,a3   ;; {0, 0, 0, 0, 4, 4, 4, 4, ... }
> vadd.vv  v4,v4,v6   ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
> 
> After this patch:
> li       a3,-536875008
> slli     a3,a3,4
> addi     a3,a3,1
> slli     a3,a3,16
> vmv.v.x  v2,a3      ;; {3, 1, -1, -3, ... }
> vid.v    v4         ;; {0, 1, 2, 3, 4, 5, 6, 7, ... }
> vadd.vv  v4,v4,v2   ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
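
To see why a single vadd suffices, here is a scalar model (hypothetical,
assuming 16-bit elements and NPATTERNS == 4) of broadcasting {3, 1, -1, -3}
and adding vid:

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  /* v2: broadcast of the 16-bit pattern {3, 1, -1, -3}.  */
  static const int16_t step[4] = { 3, 1, -1, -3 };

  for (int ix = 0; ix < 16; ix++)
    {
      int val = ix + step[ix % 4];             /* vid.v + vadd.vv         */
      int expected = (ix & -4) + (3 - ix % 4); /* {3,2,1,0, 7,6,5,4, ...} */
      assert (val == expected);
    }
  return 0;
}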

My immediate idea would have been to fall back to the first
approach, i.e. create the "0x00030002..." constant

> li       a6,134221824
> slli     a6,a6,5
> addi     a6,a6,3    ;; 64-bit: 0x0000000100020003
> vmv.v.x  v6,a6      ;; {3, 2, 1, 0, ... }

and then
  vid.v v4
  vand.vi v4, v4, -4
  vadd.vv v4, v4, v6
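
Element-wise this fallback computes the same vector (again a hypothetical
scalar sketch, assuming 16-bit elements and NPATTERNS == 4):

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  static const int16_t base[4] = { 3, 2, 1, 0 };   /* v6, the constant   */
  static const int16_t step[4] = { 3, 1, -1, -3 }; /* v2, from the patch */

  for (int ix = 0; ix < 16; ix++)
    {
      int fallback = (ix & -4) + base[ix % 4]; /* vid + vand + vadd */
      int patched  = ix + step[ix % 4];        /* vid + vadd        */
      assert (fallback == patched);
    }
  return 0;
}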

It's one more vector instruction though so possibly worse from a latency
standpoint.

Rest looks good to me.

Regards
 Robin