On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote: > On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw > <richard.earns...@foss.arm.com> wrote: >> >> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote: >>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: >>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw >>>> <richard.earns...@foss.arm.com> wrote: >>>>> >>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: >>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw >>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>> >>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: >>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw >>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>> >>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw >>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: >>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw >>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot >>>>>>>>>>>>>>>>>>> load delta >>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a >>>>>>>>>>>>>>>>>>> series of >>>>>>>>>>>>>>>>>>> movs/adds/lsls. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function >>>>>>>>>>>>>>>>>>> address and delta >>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't >>>>>>>>>>>>>>>>>>> use them >>>>>>>>>>>>>>>>>>> with -mpure-code. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified >>>>>>>>>>>>>>>>>>> by >>>>>>>>>>>>>>>>>>> eg. pr46287-3.C >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> gcc/ >>>>>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build >>>>>>>>>>>>>>>>>>> mi_delta in r3 and >>>>>>>>>>>>>>>>>>> do not emit function address and delta when >>>>>>>>>>>>>>>>>>> -mpure-code is used. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Richard, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for your comments. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it >>>>>>>>>>>>>>>>>> would be better >>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. >>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't >>>>>>>>>>>>>>>>> noticed >>>>>>>>>>>>>>>>> before, so I can write something similar indeed. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the >>>>>>>>>>>>>>>>> split >>>>>>>>>>>>>>>>> when -mpure-code is NOT used. >>>>>>>>>>>>>>>>> Consider: >>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; } >>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; } >>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: >>>>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>>>> ldr r0, .L4 >>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for >>>>>>>>>>>>>>>>> this by >>>>>>>>>>>>>>>>> not having to load the constant? >>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be >>>>>>>>>>>>>>>>> taken into >>>>>>>>>>>>>>>>> account when comparing code size, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into >>>>>>>>>>>>>>>> account. >>>>>>>>>>>>>>>> It might happen that the entry could be shared with another >>>>>>>>>>>>>>>> use of that >>>>>>>>>>>>>>>> literal, but in general that's rare. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below >>>>>>>>>>>>>>>>> three >>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to >>>>>>>>>>>>>>>>> loading >>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools >>>>>>>>>>>>>>>> are not >>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite >>>>>>>>>>>>>>>> small due >>>>>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance we >>>>>>>>>>>>>>>> probably >>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for >>>>>>>>>>>>>>>> size. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> This is also true for >>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled >>>>>>>>>>>>>>>>>> before all >>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above >>>>>>>>>>>>>>>>> sample is compiled as: >>>>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>>>> movs r0, #1 >>>>>>>>>>>>>>>>> lsls r0, r0, #8 >>>>>>>>>>>>>>>>> adds r0, r0, #255 >>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without >>>>>>>>>>>>>>>>> -mpure-code. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you >>>>>>>>>>>>>>>> seem to be >>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL. >>>>>>>>>>>>>>>> The examples >>>>>>>>>>>>>>>> you show here are not thunks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> OK thanks for the clarification. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully >>>>>>>>>>>>>>> make >>>>>>>>>>>>>>> review easier. >>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 >>>>>>>>>>>>>>> and PR96770: >>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to >>>>>>>>>>>>>>> understand. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance >>>>>>>>>>>>>>> easier, >>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate >>>>>>>>>>>>>>> either RTL or >>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is >>>>>>>>>>>>>>> in the >>>>>>>>>>>>>>> same place, >>>>>>>>>>>>>>> in case we need to improve/fix it later. We now generate >>>>>>>>>>>>>>> shorter sequences in >>>>>>>>>>>>>>> several cases matching your comments. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn >>>>>>>>>>>>>>> pattern and >>>>>>>>>>>>>>> calls thumb1_gen_const_int. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls >>>>>>>>>>>>>>> thumb1_gen_const_int. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler >>>>>>>>>>>>>> can use >>>>>>>>>>>>>> C++ now to do much better than that, for example by making that >>>>>>>>>>>>>> function >>>>>>>>>>>>>> a template. For example (and this is just a sketch): >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of >>>>>>>>>>>>> templates in arm.c yet. >>>>>>>>>>>>> I'll send an update soon. >>>>>>>>>>>>> >>>>>>>>>>>>> Other than that, does the approach look OK to you? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Yes, I think this is heading in the right direction. Bringing the >>>>>>>>>>>> two >>>>>>>>>>>> immediate generating operations into a single function can only be >>>>>>>>>>>> a >>>>>>>>>>>> good thing. >>>>>>>>>>>> >>>>>>>>>>>> Looking again at your example constant sequences, I see: >>>>>>>>>>>> >>>>>>>>>>>> 0x1000010: >>>>>>>>>>>> movs r3, #16 >>>>>>>>>>>> lsls r3, #16 >>>>>>>>>>>> adds r3, #1 >>>>>>>>>>>> lsls r3, #4 >>>>>>>>>>>> 0x1000011: >>>>>>>>>>>> movs r3, #1 >>>>>>>>>>>> lsls r3, #24 >>>>>>>>>>>> adds r3, #17 >>>>>>>>>>>> >>>>>>>>>>>> The first of these looks odd, given the second sequence. Why >>>>>>>>>>>> doesn't >>>>>>>>>>>> the first expand to >>>>>>>>>>>> >>>>>>>>>>>> 0x1000010: >>>>>>>>>>>> movs r3, #16 >>>>>>>>>>>> lsls r3, #16 >>>>>>>>>>>> adds r3, #16 >>>>>>>>>>>> >>>>>>>>>>> Err, I mean to: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 0x1000010: >>>>>>>>>>> movs r3, #1 >>>>>>>>>>> lsls r3, #24 >>>>>>>>>>> adds r3, #16 >>>>>>>>>>> >>>>>>>>>>> ? >>>>>>>>>> >>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its >>>>>>>>>> range and need less instructions to build the higher part, then >>>>>>>>>> left-shift back. >>>>>>>>>> >>>>>>>>>> In this particular case, we'd need to realize that there are many >>>>>>>>>> zeros "inside" the constant. >>>>>>>>>> >>>>>>>>>> If I remove the part that tries to reduce the range, I do get that >>>>>>>>>> sequence, but for 764 I now generate >>>>>>>>>> movs r3, #2 >>>>>>>>>> lsls r3, #8 >>>>>>>>>> adds r3, #252 >>>>>>>>>> instead of >>>>>>>>>> movs r3, #191 >>>>>>>>>> lsls r3, #2 >>>>>>>>>> >>>>>>>>>> A possibility would be to try both approaches and keep the shortest >>>>>>>>>> one. >>>>>>>>> >>>>>>>>> Lets leave that for now, it's not important to fixing the main issue; >>>>>>>>> but we should remember we need to come back to it at some point. >>>>>>>>> >>>>>>>> Thanks, that's what I was thinking too. >>>>>>>> >>>>>>>>> There are other tricks as well, such as >>>>>>>>> >>>>>>>>> 0xffffff >>>>>>>>> >>>>>>>>> can be done as >>>>>>>>> >>>>>>>>> 0x1000000 - 1 >>>>>>>>> >>>>>>>>> and >>>>>>>>> >>>>>>>>> 0xfffffd >>>>>>>>> >>>>>>>>> as >>>>>>>>> >>>>>>>>> 0x1000000 - 3 >>>>>>>>> >>>>>>>>> but these can wait as well. >>>>>>>>> >>>>>>>> >>>>>>>> Didn't we already need to handle such tricks? I'm surprised this >>>>>>>> wasn't needed earlier. >>>>>>>> >>>>>>> >>>>>>> I don't think we ever worried about them. Most of them need at least 3 >>>>>>> instructions so aren't a code size saving over using a literal pool >>>>>>> entry. >>>>>>> >>>>>> OK, this will also help when using -mslow-flash-data. >>>>>> >>>>>> Here are updated patches, now using a template as you suggested. >>>>> >>>>> Looking better, but when I try to apply this to my local tree patch 2 >>>>> fails (I'm not exactly sure why, what was your baseline for these >>>>> patches?) >>>> I have the tree patches in this thread on top of these other two: >>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html >>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html >>>> >>>> They have gradual improvements to thumb1_movsi_insn. >>>> >>>>> -- that patch looks suspicious anyway, you're replacing code >>>>> that prints out assembly with code that generates RTL. >>>> Right! I took me a while to understand how I could miss this, sorry. >>>> That was caused by improper testing, as this part of the code >>>> isn't used when targetting cortex-m0. I have added a testcase >>>> for cortex-m23 which crashes with the previous version of patch 2, >>>> and succeeds now. >>>> >>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and >>>>> thumb1_const_rtl. I think the names as they stand are likely to be too >>>>> generic. >>>> OK, done. >>>> >>>> How about this new version? >>>> I'm not yet sure about the most appropriate naming for: >>>> thumb1_gen_const_int >>>> thumb1_gen_const_int_asm >>>> should they be >>>> thumb1_gen_const_int_rtl >>>> thumb1_gen_const_int_print >>>> to be consistent with the new classes? >>> >>> It would probably be better, yes. >>> >>> More detailed comments below. >>> >>> R. >>> >>>> >>>> Thanks, >>>> >>>> Christophe >>>> >>>>> R. >>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Christophe >>>>>> >>>>>>> R. >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> R. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> R. >>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> >>>>>>>>>>>>> Christophe >>>>>>>>>>>>> >>>>>>>>>>>>>> class t1_rtl >>>>>>>>>>>>>> { >>>>>>>>>>>>>> public: >>>>>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } >>>>>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } >>>>>>>>>>>>>> }; >>>>>>>>>>>>>> >>>>>>>>>>>>>> class t1_print >>>>>>>>>>>>>> { >>>>>>>>>>>>>> public: >>>>>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} >>>>>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >>>>>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >>>>>>>>>>>>>> private: >>>>>>>>>>>>>> FILE *t_file; >>>>>>>>>>>>>> }; >>>>>>>>>>>>>> >>>>>>>>>>>>>> template <class T> >>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f) >>>>>>>>>>>>>> { >>>>>>>>>>>>>> // Expansion of thumb1_gen_const_int ... >>>>>>>>>>>>>> t.ashift(f); >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> // Usage... >>>>>>>>>>>>>> void f1() >>>>>>>>>>>>>> { >>>>>>>>>>>>>> // Use the RTL expander >>>>>>>>>>>>>> t1_rtl g; >>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> void f2() >>>>>>>>>>>>>> { >>>>>>>>>>>>>> // Use the printf expander writing to stdout >>>>>>>>>>>>>> t1_print g(stdout); >>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to >>>>>>>>>>>>>> worry >>>>>>>>>>>>>> about which expander is being used in each instance and the >>>>>>>>>>>>>> template >>>>>>>>>>>>>> expansion will use the right version. >>>>>>>>>>>>>> >>>>>>>>>>>>>> R. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I also suspect (but haven't check) that the base adjustment >>>>>>>>>>>>>>>>>> will >>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). >>>>>>>>>>>>>>>>>> If that is >>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left >>>>>>>>>>>>>>>>>> by 2 for an >>>>>>>>>>>>>>>>>> even greater range of literals. >>>>>>>>>>>>>>>>> I can see there is provision for this in the >>>>>>>>>>>>>>>>> !TARGET_THUMB1_ONLY case, >>>>>>>>>>>>>>>>> I'll update my patch. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> More generally, any sequence of up to >>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as >>>>>>>>>>>>>>>>>> fast as the >>>>>>>>>>>>>>>>>> existing literal pool fall back. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), >>>>>>>>>>>>>>>>>> your code will >>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two >>>>>>>>>>>>>>>>>> shifts could >>>>>>>>>>>>>>>>>> be merged into a single shift-by-16. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for >>>>>>>>>>>>>>>>>> this, if at >>>>>>>>>>>>>>>>>> all possible. >>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing >>>>>>>>>>>>>>>>> testcase >>>>>>>>>>>>>>>>> that showed the problem. There are also: >>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C >>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C >>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C >>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add >>>>>>>>>>>>>>>>> -mpure-code in dg-options? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more >>>>>>>>>>>>>>>> complicated >>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions >>>>>>>>>>>>>>>> with other >>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from >>>>>>>>>>>>>>>> that). If >>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full >>>>>>>>>>>>>>>> testsuite runs >>>>>>>>>>>>>>>> with it enabled and that should be sufficient. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I >>>>>>>>>>>>>>> use a >>>>>>>>>>>>>>> modified simulator. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 >>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- >>>>>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 >>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, >>>>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>> if (mi_delta > 255) >>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>>>> - fputs ("+4\n", file); >>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the >>>>>>>>>>>>>>>>>>> constant >>>>>>>>>>>>>>>>>>> + pool: we build it explicitly. */ >>>>>>>>>>>>>>>>>>> + if (target_pure_code) >>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>> + bool mov_done_p = false; >>>>>>>>>>>>>>>>>>> + int i; >>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) >>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & >>>>>>>>>>>>>>>>>>> 0xff; >>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>> + if (byte) >>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, >>>>>>>>>>>>>>>>>>> #%d\n", byte); >>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, >>>>>>>>>>>>>>>>>>> #%d\n", byte); >>>>>>>>>>>>>>>>>>> + mov_done_p = true; >>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ >>>>>>>>>>>>>>>>>>> + if (!mov_done_p) >>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", >>>>>>>>>>>>>>>>>>> mi_delta & 0xff); >>>>>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) >>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", >>>>>>>>>>>>>>>>>>> mi_delta & 0xff); >>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>>>> + fputs ("+4\n", file); >>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); >>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, >>>>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>>>> - fputs (":\n", file); >>>>>>>>>>>>>>>>>>> - if (flag_pic) >>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals >>>>>>>>>>>>>>>>>>> for the >>>>>>>>>>>>>>>>>>> + function address and delta since we emitted code to >>>>>>>>>>>>>>>>>>> build >>>>>>>>>>>>>>>>>>> + them. */ >>>>>>>>>>>>>>>>>>> + if (!target_pure_code) >>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb >>>>>>>>>>>>>>>>>>> mode, so the PC >>>>>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. >>>>>>>>>>>>>>>>>>> Adjust the offset >>>>>>>>>>>>>>>>>>> - accordingly. */ >>>>>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>>>> - tem, >>>>>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>>>> - ggc_strdup >>>>>>>>>>>>>>>>>>> (labelpc))); >>>>>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>>>>>> - else >>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>>>> + fputs (":\n", file); >>>>>>>>>>>>>>>>>>> + if (flag_pic) >>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb >>>>>>>>>>>>>>>>>>> mode, so the PC >>>>>>>>>>>>>>>>>>> + pipeline offset is four rather than eight. >>>>>>>>>>>>>>>>>>> Adjust the offset >>>>>>>>>>>>>>>>>>> + accordingly. */ >>>>>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : >>>>>>>>>>>>>>>>>>> -7); >>>>>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>>>> + tem, >>>>>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>> ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), >>>>>>>>>>>>>>>>>>> 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, >>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, >>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >>> +class thumb1_const_rtl >>> ... >>> + void mov (int val) >>> >>> This should take a HOST_WIDE_INT. Similarly for add and shift. The >>> same applies to the asm version as well. >>> >>> + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); >>> >>> Should be using reg_names[dst_regno] in all cases. In fact, you might >>> want to move that lookup to the constructor and just save a pointer to >>> the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED >> >> Correction, for a (signed) HOST_WIDE_INT, this should be >> HOST_WIDE_INT_PRINT_DEC. >> > > Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the > two classes > can have unsigned HOST_WIDE_INT parameters, and thus use > HOST_WIDE_INT_PRINT_UNSIGNED? >
It's generally safer to print it as a signed value. We probably won't have cases where the top bit of a 32-bit word are set here; but if you do, and the value is unsigned, you end up with 16 digit hex numbers rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT is at least 64 bits in size. R. > >>> rather than "%d" for the immediate. >>> >>> +template <class T> >>> +void >>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) >>> +{ >>> + bool mov_done_p = false; >>> + int val = op1; >>> >>> This potentially silently loses precision. In fact, I think you really >>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so >>> that the right shifts aren't undefined if dealing with negative numbers. >>> >>> For safety, you should also have an assertion in here that >>> >>> op1 == trunc_int_for_mode (op1, SImode) >>> >>> + int shift = 0; >>> + int i; >>> + >>> + if (val == 0) >>> >>> You can short-circuit 0..255 here for a quick exit. >>> >>> + { >>> + dst.mov (val); >>> + return; >>> + } >>> >>> Another trick: if the top nine bits of the 32-bit value are all set, >>> you're probably going to be better off (and certainly not worse off) by >>> generating -op1 and then negating the result in a final step - you can >>> do that via recursion. >>> >>> + >>> + /* In the general case, we need 7 instructions to build >>> + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can >>> + do better if VAL is small enough, or >>> + right-shiftable by a suitable amount. If the >>> + right-shift enables to encode at least one less byte, >>> + it's worth it: we save a adds and a lsls at the >>> + expense of a final lsls. */ >>> + int final_shift = number_of_first_bit_set (val); >>> + >>> + int leading_zeroes = clz_hwi (val); >>> + int number_of_bytes_needed >>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) >>> + / BITS_PER_UNIT) + 1; >>> + int number_of_bytes_needed2 >>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) >>> + / BITS_PER_UNIT) + 1; >>> + >>> + if (number_of_bytes_needed2 < number_of_bytes_needed) >>> + val >>= final_shift; >>> + else >>> + final_shift = 0; >>> + >>> + /* If we are in a very small range, we can use either a single movs >>> + or movs+adds. */ >>> + if ((val >= 0) && (val <= 510)) >>> >>> if val is made unsigned HWI as I suggest, the lower bounds test is not >>> needed. >>> >>> + { >>> + if (val > 255) >>> + { >>> + int high = val - 255; >>> >>> Again, watch your types. >>> >>> + >>> + dst.mov (high); >>> + dst.add (255); >>> + } >>> + else >>> + dst.mov (val); >>> + >>> + if (final_shift > 0) >>> + dst.ashift (final_shift); >>> + } >>> + else >>> + { >>> + /* General case, emit upper 3 bytes as needed. */ >>> + for (i = 0; i < 3; i++) >>> + { >>> + int byte = (val >> (8 * (3 - i))) & 0xff; >>> >>> and here. >>> >>> + >>> + if (byte) >>> + { >>> + /* We are about to emit new bits, stop accumulating a >>> + shift amount, and left-shift only if we have already >>> + emitted some upper bits. */ >>> + if (mov_done_p) >>> + { >>> + dst.ashift (shift); >>> + dst.add (byte); >>> + } >>> + else >>> + dst.mov (byte); >>> + >>> + /* Stop accumulating shift amount since we've just >>> + emitted some bits. */ >>> + shift = 0; >>> + >>> + mov_done_p = true; >>> + } >>> + >>> + if (mov_done_p) >>> + shift += 8; >>> + } >>> + >>> + /* Emit lower byte. */ >>> + if (!mov_done_p) >>> + dst.mov (val & 0xff); >>> + else >>> + { >>> + dst.ashift (shift); >>> + if (val & 0xff) >>> + dst.add (val & 0xff); >>> + } >>> + >>> + if (final_shift > 0) >>> + dst.ashift (final_shift); >>> + } >>> +} >>> + >>> >>