On Tue, Oct 18, 2022 at 8:35 PM Palmer Dabbelt <pal...@dabbelt.com> wrote: > > On Tue, 18 Oct 2022 08:57:37 PDT (-0700), j...@ventanamicro.com wrote: > > > > Just a couple more comments in-line. > > > > On 10/18/22 09:18, Manolis Tsamis wrote: > >> > >>>> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS. */ > >>>> + > >>>> +static sbitmap > >>>> +riscv_get_separate_components (void) > >>>> +{ > >>>> + HOST_WIDE_INT offset; > >>>> + sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER); > >>>> + bitmap_clear (components); > >>>> + > >>>> + if (riscv_use_save_libcall (&cfun->machine->frame) > >>>> + || cfun->machine->interrupt_handler_p) > >>> riscv_use_save_libcall() already checks interrupt_handler_p, so that's > >>> redundant. That said, I'm not sure riscv_use_save_libcall() is the > >>> right check here as unless I'm missing something we don't have all those > >>> other constraints when shrink-wrapping. > >>> > >> riscv_use_save_libcall returns false when interrupt_handler_p is true, so > >> the > >> check for interrupt_handler_p in the branch is not redundant in this case. > >> > >> I encountered some issues when shrink wrapping and libcall was used in the > >> same > >> function. Thinking that libcall replaces the prologue/epilogue I didn't > >> see a > >> reason to have both at the same time and hence I opted to disable > >> shrink wrapping in that case. From my understanding this should be > >> harmless? > > > > I would have expected things to work fine with libcalls, perhaps with > > the exception of the save/restore libcalls. So that needs deeper > > investigation. > > The save/restore libcalls only support saving/restoring a handful of > register configurations (just the saved X registers in the order they're > usually saved in by GCC). It should be OK for correctness to over-save > registers, but it kind of just un-does the shrink wrapping so not sure > it's worth worrying about at that point. > > There's also some oddness around the save/restore libcall ABI, it's not > the standard function ABI but instead a GCC-internal one. IIRC it just > uses the alternate link register (ie, t0 instead of ra) but I may have > forgotten something else. > > >>> It seems kind of clunky to have two copies of all these loops (and we'll > >>> need a third to make this work with the V stuff), but we've got that > >>> issue elsewhere in the port so I don't think you need to fix it here > >>> (though the V stuff will be there for the v2, so you'll need the third > >>> copy of each loop). > >>> > >> Indeed, I was following the other ports here. Do you think it would be > >> better to refactor this when the code for the V extension is added? > >> By taking into account what code will be needed for V, a proper refactored > >> function could be made to handle all cases. > > > > I think refactoring when V gets added would be fine. While we could > > probably refactor it correctly now (it isn't terribly complex code after > > all), but we're more likely to get it right with the least amount of > > work if we do it when V is submitted. > > Some of the V register blocks are already there, but ya I agree we can > just wait. There's going to be a bunch of V-related churn for a bit, > juggling those patches is already enough of a headache ;) > > >>> Either way, this deserves a test case. I think it should be possible to > >>> write one by introducing some register pressure around a > >>> shrink-wrappable block that needs a long stack offset and making sure > >>> in-flight registers don't get trashed. > >>> > >> I tried to think of some way to introduce a test like that but couldn't and > >> I don't see how it would be done. Shrink wrapping only affects saved > >> registers > >> so there are always available temporaries that are not affected by > >> shrink wrapping. > >> (Register pressure should be irrelevant in this case if I understand > >> correctly). > >> Also the implementation checks for SMALL_OPERAND (offset) shrink wrapping > >> should be unaffected from long stack offsets. If you see some way to write > >> a test for that based on what I explained please explain how I could do > >> that. > > > > I think the register pressure was just to ensure that some saves were > > needed to trigger an attempt to shrink wrap something. You'd also need > > something to eat stack space (local array which gets referenced as an > > asm operand, but where the asm doesn't generate any code perhaps)? > > Whether or not that works depends on stack layout though which I don't > > know well enough for riscv. > > Sorry for being a bit vague, but it's because I always find it takes a > bit of time to write up tests like this. I think something like this > might do it, but that almost certainly won't work as-is: > > // Some extern bits to try and trip up the optimizer. > extern long helper(long *sa, long a, long b, long c, ...); > extern long glob_array[1024]; > > // The function takes a bunch of arguments to fill up the A > // registers. > long func(long a, long b, long c, ...) { > long stack_array[1024]; // should be big enough to make SP offsets > // take a temporary > > // Here I'm loading up all the T registers with something that > // must be saved to the stack, since those helper calls below may > // change the value of glob_array. We probably need to fill the S > // registers too? > long t0 = glob_array[arg + 0]; > long t1 = glob_array[arg + 1]; > long t2 = ... > > // This should trigger shrink-wrapping, as the function calls will > // need a bunch of register state saved. The idea is to make sure > // all the other registers are somehow in flight at this point and > // could only be recovered via their saved copies, but due to the > // large stack we'll need a temporary to construct the slot > // addresses. > if (a) { > t0 = helper(stack_array, a, b, ...); > } > > // Go re-use those temporaries, again indirect to avoid > // optimization. > a += glob_array[t0]; > b += glob_array[t1]; > c ...; > return helper(stack_array, a, b, c, ...); > } > > At that point you should end up with the save/restore code in that if > block needing to do something like > > lui TMP, IMM_HI > addi TMP, sp, IMM_LO > sd REG0, 0(t0) > sd REG1, 8(t0) > ... > > and if one of those registers it's trying to save is TMP then we're in > trouble. >
Thanks for the additional details and rough sketch of the testcase. I'll try to make something based on that. Manolis > > Generally looks pretty good though. > > Ya, and a useful optimization so thanks for the help ;) > > > > > > > Jeff