Re: [PATCH v2] RISC-V: Improve bswap8 when zbb is enabled

Jeff Law Mon, 21 Jul 2025 11:55:53 -0700



On 7/17/25 7:45 AM, Dusan Stojkovic wrote:

Hi Jeff,

So I'm not sure this transformation is correct.

Let's consider the case where a5 has the value 0xffff000000000000 at the
"li" instruction.

a5 = 0xff00000000000000

li a4, -65536 // a4 = 0xffffffffffff0000
srai a5,a5,32 // a5 = 0xffffffffff000000
and a5,a5,a4 // a5 = 0xffffffffff000000
roriw a5,a5,16 // a5 = 0x000000000000ff00

So with your change:
a5 = 0xff00000000000000
li a4, -65536 // a4 = 0xffffffffffff0000
srai a5,a5,48 // a5 = 0xfffffffffffff000

Am I missing something?

Jeff


Thank you for the review.

You are correct, I mistakenly used an arithmetic shift instead of a logical
right shift. There were also testsuite failures with rv32 but the test should
be executed on rv64 only.

This was addressed. The peephole pattern transforms:
bswap8:
         rev8    a5,a0
         -> li      a4,-65536
         -> srai    a5,a5,32
         -> and     a5,a5,a4
         -> roriw   a5,a5,16
         and     a0,a0,a4
         or      a0,a0,a5
         sext.w  a0,a0
         ret

And emits this assembly:
bswap8:
         rev8    a5,a0
         -> li      a4,-65536
         -> srli    a5,a5,48
         and     a0,a0,a4
         or      a0,a0,a5
         sext.w  a0,a0
         ret

Basically if after rev8 a5 has it's MSB set to 1 then the arithmetic right
shift will set all other bits to the value of MSB as per your example.

The logical shift here ensures that this is not the case.
# after rev8:
a5 = 0xff00000000000000
...
srli a5,a5,48  # a5 := 0x00000000ff00
...

Thanks. So the next (and probably most important) is to try and avoidusing a peephole2. After staring at the .combine dumps some more Ithink I see the path forward.



So the li+sria+and+roriw sequence corresponds to this in the .combine dump:

Trying 12, 10, 13 -> 14:
   12: r148:DI=0xffffffffffff0000
   10: r135:DI=r142:DI>>0x20
      REG_DEAD r142:DI
   13: r147:DI=r135:DI&r148:DI

REG_DEAD r135:DIREG_EQUAL r135:DI&0xffffffffffff0000

   14: r149:SI=r147:DI#0>->0x10
      REG_DEAD r147:DI

So we know that r148 does not die and thus we must produce its value inany combination, hence the attempts with PARALLEL constructs:

Failed to match this instruction:
(parallel [
        (set (reg:SI 149 [ _10 ])
            (zero_extend:SI (subreg:HI (rotatert:SI (subreg:SI (zero_extract:DI 
(reg:DI 142)
                                (const_int 32 [0x20])
                                (const_int 32 [0x20])) 0)
                        (const_int 16 [0x10])) 0)))
        (set (reg:DI 148)
            (const_int -65536 [0xffffffffffff0000]))
    ])

[ ... And other forms ... ]

One of the things combine will do is try to emit those two components ofthe PARALLEL as individual insns:

Successfully matched this instruction:
(set (reg:DI 148)
    (const_int -65536 [0xffffffffffff0000]))
Failed to match this instruction:
(set (reg:SI 149 [ _10 ])
    (zero_extend:SI (subreg:HI (rotatert:SI (subreg:SI (zero_extract:DI (reg:DI 
142)
                        (const_int 32 [0x20])
                        (const_int 32 [0x20])) 0)
                (const_int 16 [0x10])) 0)))

So the simple constant assignment matches (as it should). If we stareat that second hunk that failed to match long enough, it's just alogical shift right by 48 positions (which matches your assembly output).

So one path forward would be to create a matching pattern for that.Possibly generalizing it a bit. That would obviously be a RISC-Vspecific solution.

The other (possibly more complex) approach would be to optimize it insimplify-rtx.cc which would work for every target. The trick would beto realize the zero_extraction step is just a logical left shift by 32bits. The rotate + HI extension effectively shifts by 16 more bits.That could get ugly.


So here's another equivalent form combine tries:

(set (reg:SI 149 [ _10 ])
    (and:SI (rotatert:SI (subreg:SI (lshiftrt:DI (reg:DI 142)
                    (const_int 32 [0x20])) 0)
            (const_int 16 [0x10]))
        (const_int 65535 [0xffff])))


Near the end of the AND case in simpify_binary_operation_1 we have this:

     tem = simplify_with_subreg_not (code, mode, op0, op1);
      if (tem)
        return tem;


We could handle this new case just before the subreg_not case.

I'd probably start with recognizing that the rotate+and is just a shiftand rewrite it into that form and try to simplify it again. If thatdoesn't work, then I'd recognize the rotate+and is a shift, then lookinside the subreg to simplify via brute force.

If that turns out to be too ugly to deal with in simplify-rtx, that'sprobably a better form for a define_insn as it's noticably simpler thanthe one with the zero-extract.


Does all that make sense?

jeff

Re: [PATCH v2] RISC-V: Improve bswap8 when zbb is enabled

Reply via email to