On 7/17/25 7:45 AM, Dusan Stojkovic wrote:
Hi Jeff,
So I'm not sure this transformation is correct.
Let's consider the case where a5 has the value 0xffff000000000000 at the
"li" instruction.
a5 = 0xff00000000000000
li a4, -65536 // a4 = 0xffffffffffff0000
srai a5,a5,32 // a5 = 0xffffffffff000000
and a5,a5,a4 // a5 = 0xffffffffff000000
roriw a5,a5,16 // a5 = 0x000000000000ff00
So with your change:
a5 = 0xff00000000000000
li a4, -65536 // a4 = 0xffffffffffff0000
srai a5,a5,48 // a5 = 0xfffffffffffff000
Am I missing something?
Jeff
Thank you for the review.
You are correct, I mistakenly used an arithmetic shift instead of a logical
right shift. There were also testsuite failures with rv32 but the test should
be executed on rv64 only.
This was addressed. The peephole pattern transforms:
bswap8:
rev8 a5,a0
-> li a4,-65536
-> srai a5,a5,32
-> and a5,a5,a4
-> roriw a5,a5,16
and a0,a0,a4
or a0,a0,a5
sext.w a0,a0
ret
And emits this assembly:
bswap8:
rev8 a5,a0
-> li a4,-65536
-> srli a5,a5,48
and a0,a0,a4
or a0,a0,a5
sext.w a0,a0
ret
Basically if after rev8 a5 has it's MSB set to 1 then the arithmetic right
shift will set all other bits to the value of MSB as per your example.
The logical shift here ensures that this is not the case.
# after rev8:
a5 = 0xff00000000000000
...
srli a5,a5,48 # a5 := 0x00000000ff00
...
Thanks. So the next (and probably most important) is to try and avoid
using a peephole2. After staring at the .combine dumps some more I
think I see the path forward.
So the li+sria+and+roriw sequence corresponds to this in the .combine dump:
Trying 12, 10, 13 -> 14:
12: r148:DI=0xffffffffffff0000
10: r135:DI=r142:DI>>0x20
REG_DEAD r142:DI
13: r147:DI=r135:DI&r148:DI
REG_DEAD r135:DI
REG_EQUAL r135:DI&0xffffffffffff0000
14: r149:SI=r147:DI#0>->0x10
REG_DEAD r147:DI
So we know that r148 does not die and thus we must produce its value in
any combination, hence the attempts with PARALLEL constructs:
Failed to match this instruction:
(parallel [
(set (reg:SI 149 [ _10 ])
(zero_extend:SI (subreg:HI (rotatert:SI (subreg:SI (zero_extract:DI
(reg:DI 142)
(const_int 32 [0x20])
(const_int 32 [0x20])) 0)
(const_int 16 [0x10])) 0)))
(set (reg:DI 148)
(const_int -65536 [0xffffffffffff0000]))
])
[ ... And other forms ... ]
One of the things combine will do is try to emit those two components of
the PARALLEL as individual insns:
Successfully matched this instruction:
(set (reg:DI 148)
(const_int -65536 [0xffffffffffff0000]))
Failed to match this instruction:
(set (reg:SI 149 [ _10 ])
(zero_extend:SI (subreg:HI (rotatert:SI (subreg:SI (zero_extract:DI (reg:DI
142)
(const_int 32 [0x20])
(const_int 32 [0x20])) 0)
(const_int 16 [0x10])) 0)))
So the simple constant assignment matches (as it should). If we stare
at that second hunk that failed to match long enough, it's just a
logical shift right by 48 positions (which matches your assembly output).
So one path forward would be to create a matching pattern for that.
Possibly generalizing it a bit. That would obviously be a RISC-V
specific solution.
The other (possibly more complex) approach would be to optimize it in
simplify-rtx.cc which would work for every target. The trick would be
to realize the zero_extraction step is just a logical left shift by 32
bits. The rotate + HI extension effectively shifts by 16 more bits.
That could get ugly.
So here's another equivalent form combine tries:
(set (reg:SI 149 [ _10 ])
(and:SI (rotatert:SI (subreg:SI (lshiftrt:DI (reg:DI 142)
(const_int 32 [0x20])) 0)
(const_int 16 [0x10]))
(const_int 65535 [0xffff])))
Near the end of the AND case in simpify_binary_operation_1 we have this:
tem = simplify_with_subreg_not (code, mode, op0, op1);
if (tem)
return tem;
We could handle this new case just before the subreg_not case.
I'd probably start with recognizing that the rotate+and is just a shift
and rewrite it into that form and try to simplify it again. If that
doesn't work, then I'd recognize the rotate+and is a shift, then look
inside the subreg to simplify via brute force.
If that turns out to be too ugly to deal with in simplify-rtx, that's
probably a better form for a define_insn as it's noticably simpler than
the one with the zero-extract.
Does all that make sense?
jeff