On 7/17/25 7:45 AM, Dusan Stojkovic wrote:
Hi Jeff,

So I'm not sure this transformation is correct.

Let's consider the case where a5 has the value 0xffff000000000000 at the
"li" instruction.

a5 = 0xff00000000000000

li a4, -65536 // a4 = 0xffffffffffff0000
srai a5,a5,32 // a5 = 0xffffffffff000000
and a5,a5,a4 // a5 = 0xffffffffff000000
roriw a5,a5,16 // a5 = 0x000000000000ff00

So with your change:
a5 = 0xff00000000000000
li a4, -65536 // a4 = 0xffffffffffff0000
srai a5,a5,48 // a5 = 0xfffffffffffff000

Am I missing something?

Jeff

Thank you for the review.

You are correct, I mistakenly used an arithmetic shift instead of a logical
right shift. There were also testsuite failures with rv32 but the test should
be executed on rv64 only.

This was addressed. The peephole pattern transforms:
bswap8:
         rev8    a5,a0
         -> li      a4,-65536
         -> srai    a5,a5,32
         -> and     a5,a5,a4
         -> roriw   a5,a5,16
         and     a0,a0,a4
         or      a0,a0,a5
         sext.w  a0,a0
         ret

And emits this assembly:
bswap8:
         rev8    a5,a0
         -> li      a4,-65536
         -> srli    a5,a5,48
         and     a0,a0,a4
         or      a0,a0,a5
         sext.w  a0,a0
         ret

Basically if after rev8 a5 has it's MSB set to 1 then the arithmetic right
shift will set all other bits to the value of MSB as per your example.

The logical shift here ensures that this is not the case.
# after rev8:
a5 = 0xff00000000000000
...
srli a5,a5,48  # a5 := 0x00000000ff00
...
Thanks. So the next (and probably most important) is to try and avoid using a peephole2. After staring at the .combine dumps some more I think I see the path forward.


So the li+sria+and+roriw sequence corresponds to this in the .combine dump:

Trying 12, 10, 13 -> 14:
   12: r148:DI=0xffffffffffff0000
   10: r135:DI=r142:DI>>0x20
      REG_DEAD r142:DI
   13: r147:DI=r135:DI&r148:DI
REG_DEAD r135:DI REG_EQUAL r135:DI&0xffffffffffff0000
   14: r149:SI=r147:DI#0>->0x10
      REG_DEAD r147:DI

So we know that r148 does not die and thus we must produce its value in any combination, hence the attempts with PARALLEL constructs:

Failed to match this instruction:
(parallel [
        (set (reg:SI 149 [ _10 ])
            (zero_extend:SI (subreg:HI (rotatert:SI (subreg:SI (zero_extract:DI 
(reg:DI 142)
                                (const_int 32 [0x20])
                                (const_int 32 [0x20])) 0)
                        (const_int 16 [0x10])) 0)))
        (set (reg:DI 148)
            (const_int -65536 [0xffffffffffff0000]))
    ])
[ ... And other forms ... ]

One of the things combine will do is try to emit those two components of the PARALLEL as individual insns:

Successfully matched this instruction:
(set (reg:DI 148)
    (const_int -65536 [0xffffffffffff0000]))
Failed to match this instruction:
(set (reg:SI 149 [ _10 ])
    (zero_extend:SI (subreg:HI (rotatert:SI (subreg:SI (zero_extract:DI (reg:DI 
142)
                        (const_int 32 [0x20])
                        (const_int 32 [0x20])) 0)
                (const_int 16 [0x10])) 0)))

So the simple constant assignment matches (as it should). If we stare at that second hunk that failed to match long enough, it's just a logical shift right by 48 positions (which matches your assembly output).

So one path forward would be to create a matching pattern for that. Possibly generalizing it a bit. That would obviously be a RISC-V specific solution.

The other (possibly more complex) approach would be to optimize it in simplify-rtx.cc which would work for every target. The trick would be to realize the zero_extraction step is just a logical left shift by 32 bits. The rotate + HI extension effectively shifts by 16 more bits. That could get ugly.

So here's another equivalent form combine tries:

(set (reg:SI 149 [ _10 ])
    (and:SI (rotatert:SI (subreg:SI (lshiftrt:DI (reg:DI 142)
                    (const_int 32 [0x20])) 0)
            (const_int 16 [0x10]))
        (const_int 65535 [0xffff])))

Near the end of the AND case in simpify_binary_operation_1 we have this:

     tem = simplify_with_subreg_not (code, mode, op0, op1);
      if (tem)
        return tem;


We could handle this new case just before the subreg_not case.

I'd probably start with recognizing that the rotate+and is just a shift and rewrite it into that form and try to simplify it again. If that doesn't work, then I'd recognize the rotate+and is a shift, then look inside the subreg to simplify via brute force.

If that turns out to be too ugly to deal with in simplify-rtx, that's probably a better form for a define_insn as it's noticably simpler than the one with the zero-extract.

Does all that make sense?

jeff






Reply via email to