This is really Jim's code, but it's been sitting around in Bugzilla for a while so I've picked it up. All I really did here is add a target hook and mangle some comments, but I think I understand enough about what's going on to try to get things moving forward. So I'm writing up a pretty big cover letter to try to summarize what I think is going on here, as it's definitely not something I fully understand yet.
We've got a quirk in the RISC-V ABI where DF arguments on rv32 get split into an X register and a 32-bit-aligned stack slot. The middle-end prologue code just stores out the X register and treats the argument as if it were entirely passed on the stack. This can result in a misaligned load, and those are still slow on a bunch of RISC-V systems.

This patch set adds a target hook that essentially biases the middle end the other way: load the stack part of the argument and then merge it with the register part via subword moves. That's essentially handling these via register-register operations, but for the specific case that trips up as a misaligned-access bug on RISC-V, the generated code ends up with more memory ops. More specifically, the included test case is essentially

    double foo(..., double split) { return split; }

with the arguments set up so "split" has 32 bits in a7 (an integer register used for arguments) and 32 bits on the stack. The return goes into a floating-point register, as they're 64 bits on rv32ifd (even when integer registers are only 32 bits). Without this patch (and with this patch on targets with fast misaligned accesses) that generates

    sw      a7,12(sp)
    fld     fa0,12(sp)

and with this patch (on a subtarget with slow misaligned accesses) it ends up as

    lw      a5,16(sp)
    sw      a7,8(sp)
    sw      a5,12(sp)
    fld     fa0,8(sp)

That looks a little odd, but I think it's actually good code -- the only way to get a double into a register on rv32 is to load it from memory, so without misaligned loads we're sort of just stuck there.

While playing around writing this cover letter I came up with another case that's essentially

    long long foo(..., long long split) { return split; }

that used to generate

    sw      a7,12(sp)
    lw      a0,12(sp)
    lw      a1,16(sp)

and now generates

    lw      a1,0(sp)
    mv      a0,a7

so I do think we've at least got some room for new optimizations here, maybe even on other targets. The target hook will need some adjustment, but ultimately I'm not even sure a target hook is the way to go here.
It was just an easy way to flip the behavior so I could play around with some of Jim's code. It kind of feels like the load/subword-merge version would result in better code in general, but I'm not sure on that one. That said, I figured I'd just send it out so others could see this. It's very much out of my wheelhouse, so I'd be shocked if this doesn't cause any failures...