On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw <richard.earns...@foss.arm.com> wrote: > > On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote: > > On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: > >> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw > >> <richard.earns...@foss.arm.com> wrote: > >>> > >>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: > >>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw > >>>> <richard.earns...@foss.arm.com> wrote: > >>>>> > >>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: > >>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw > >>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>> > >>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: > >>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw > >>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>> > >>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: > >>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw > >>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw > >>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw > >>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot > >>>>>>>>>>>>>>>>> load delta > >>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a > >>>>>>>>>>>>>>>>> series of > >>>>>>>>>>>>>>>>> movs/adds/lsls. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function > >>>>>>>>>>>>>>>>> address and delta > >>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't > >>>>>>>>>>>>>>>>> use them > >>>>>>>>>>>>>>>>> with -mpure-code. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified > >>>>>>>>>>>>>>>>> by > >>>>>>>>>>>>>>>>> eg. pr46287-3.C > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> gcc/ > >>>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build > >>>>>>>>>>>>>>>>> mi_delta in r3 and > >>>>>>>>>>>>>>>>> do not emit function address and delta when > >>>>>>>>>>>>>>>>> -mpure-code is used. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi Richard, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks for your comments. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> There are some optimizations you can make to this code. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it > >>>>>>>>>>>>>>>> would be better > >>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. > >>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't > >>>>>>>>>>>>>>> noticed > >>>>>>>>>>>>>>> before, so I can write something similar indeed. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the > >>>>>>>>>>>>>>> split > >>>>>>>>>>>>>>> when -mpure-code is NOT used. > >>>>>>>>>>>>>>> Consider: > >>>>>>>>>>>>>>> int f3_1 (void) { return 510; } > >>>>>>>>>>>>>>> int f3_2 (void) { return 511; } > >>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: > >>>>>>>>>>>>>>> f3_1: > >>>>>>>>>>>>>>> movs r0, #255 > >>>>>>>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>> f3_2: > >>>>>>>>>>>>>>> ldr r0, .L4 > >>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for > >>>>>>>>>>>>>>> this by > >>>>>>>>>>>>>>> not having to load the constant? > >>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be > >>>>>>>>>>>>>>> taken into > >>>>>>>>>>>>>>> account when comparing code size, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into > >>>>>>>>>>>>>> account. > >>>>>>>>>>>>>> It might happen that the entry could be shared with another > >>>>>>>>>>>>>> use of that > >>>>>>>>>>>>>> literal, but in general that's rare. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below > >>>>>>>>>>>>>>> three > >>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to > >>>>>>>>>>>>>>> loading > >>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools > >>>>>>>>>>>>>> are not > >>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite > >>>>>>>>>>>>>> small due > >>>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance we > >>>>>>>>>>>>>> probably > >>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for > >>>>>>>>>>>>>> size. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> This is also true for > >>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled > >>>>>>>>>>>>>>>> before all > >>>>>>>>>>>>>>>> this. > >>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above > >>>>>>>>>>>>>>> sample is compiled as: > >>>>>>>>>>>>>>> f3_1: > >>>>>>>>>>>>>>> movs r0, #255 > >>>>>>>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>> f3_2: > >>>>>>>>>>>>>>> movs r0, #1 > >>>>>>>>>>>>>>> lsls r0, r0, #8 > >>>>>>>>>>>>>>> adds r0, r0, #255 > >>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> so the "return 510" case is already handled as without > >>>>>>>>>>>>>>> -mpure-code. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you > >>>>>>>>>>>>>> seem to be > >>>>>>>>>>>>>> emitting instructions directly rather than generating RTL. > >>>>>>>>>>>>>> The examples > >>>>>>>>>>>>>> you show here are not thunks. > >>>>>>>>>>>>>> > >>>>>>>>>>>>> OK thanks for the clarification. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully > >>>>>>>>>>>>> make > >>>>>>>>>>>>> review easier. > >>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 > >>>>>>>>>>>>> and PR96770: > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html > >>>>>>>>>>>>> > >>>>>>>>>>>>> I kept it this way to make incremental changes easier to > >>>>>>>>>>>>> understand. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance > >>>>>>>>>>>>> easier, > >>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate > >>>>>>>>>>>>> either RTL or > >>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is > >>>>>>>>>>>>> in the > >>>>>>>>>>>>> same place, > >>>>>>>>>>>>> in case we need to improve/fix it later. We now generate > >>>>>>>>>>>>> shorter sequences in > >>>>>>>>>>>>> several cases matching your comments. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn > >>>>>>>>>>>>> pattern and > >>>>>>>>>>>>> calls thumb1_gen_const_int. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls > >>>>>>>>>>>>> thumb1_gen_const_int. > >>>>>>>>>>>> > >>>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. > >>>>>>>>>>>> > >>>>>>>>>>>> I think we should be able to leverage the fact that the compiler > >>>>>>>>>>>> can use > >>>>>>>>>>>> C++ now to do much better than that, for example by making that > >>>>>>>>>>>> function > >>>>>>>>>>>> a template. For example (and this is just a sketch): > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Indeed! I didn't think about it since there is no other use of > >>>>>>>>>>> templates in arm.c yet. > >>>>>>>>>>> I'll send an update soon. > >>>>>>>>>>> > >>>>>>>>>>> Other than that, does the approach look OK to you? > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Yes, I think this is heading in the right direction. Bringing the > >>>>>>>>>> two > >>>>>>>>>> immediate generating operations into a single function can only be > >>>>>>>>>> a > >>>>>>>>>> good thing. > >>>>>>>>>> > >>>>>>>>>> Looking again at your example constant sequences, I see: > >>>>>>>>>> > >>>>>>>>>> 0x1000010: > >>>>>>>>>> movs r3, #16 > >>>>>>>>>> lsls r3, #16 > >>>>>>>>>> adds r3, #1 > >>>>>>>>>> lsls r3, #4 > >>>>>>>>>> 0x1000011: > >>>>>>>>>> movs r3, #1 > >>>>>>>>>> lsls r3, #24 > >>>>>>>>>> adds r3, #17 > >>>>>>>>>> > >>>>>>>>>> The first of these looks odd, given the second sequence. Why > >>>>>>>>>> doesn't > >>>>>>>>>> the first expand to > >>>>>>>>>> > >>>>>>>>>> 0x1000010: > >>>>>>>>>> movs r3, #16 > >>>>>>>>>> lsls r3, #16 > >>>>>>>>>> adds r3, #16 > >>>>>>>>>> > >>>>>>>>> Err, I mean to: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> 0x1000010: > >>>>>>>>> movs r3, #1 > >>>>>>>>> lsls r3, #24 > >>>>>>>>> adds r3, #16 > >>>>>>>>> > >>>>>>>>> ? > >>>>>>>> > >>>>>>>> Because I first try to right-shift the constant, hoping to reduce its > >>>>>>>> range and need less instructions to build the higher part, then > >>>>>>>> left-shift back. > >>>>>>>> > >>>>>>>> In this particular case, we'd need to realize that there are many > >>>>>>>> zeros "inside" the constant. > >>>>>>>> > >>>>>>>> If I remove the part that tries to reduce the range, I do get that > >>>>>>>> sequence, but for 764 I now generate > >>>>>>>> movs r3, #2 > >>>>>>>> lsls r3, #8 > >>>>>>>> adds r3, #252 > >>>>>>>> instead of > >>>>>>>> movs r3, #191 > >>>>>>>> lsls r3, #2 > >>>>>>>> > >>>>>>>> A possibility would be to try both approaches and keep the shortest > >>>>>>>> one. > >>>>>>> > >>>>>>> Lets leave that for now, it's not important to fixing the main issue; > >>>>>>> but we should remember we need to come back to it at some point. > >>>>>>> > >>>>>> Thanks, that's what I was thinking too. > >>>>>> > >>>>>>> There are other tricks as well, such as > >>>>>>> > >>>>>>> 0xffffff > >>>>>>> > >>>>>>> can be done as > >>>>>>> > >>>>>>> 0x1000000 - 1 > >>>>>>> > >>>>>>> and > >>>>>>> > >>>>>>> 0xfffffd > >>>>>>> > >>>>>>> as > >>>>>>> > >>>>>>> 0x1000000 - 3 > >>>>>>> > >>>>>>> but these can wait as well. > >>>>>>> > >>>>>> > >>>>>> Didn't we already need to handle such tricks? I'm surprised this > >>>>>> wasn't needed earlier. > >>>>>> > >>>>> > >>>>> I don't think we ever worried about them. Most of them need at least 3 > >>>>> instructions so aren't a code size saving over using a literal pool > >>>>> entry. > >>>>> > >>>> OK, this will also help when using -mslow-flash-data. > >>>> > >>>> Here are updated patches, now using a template as you suggested. > >>> > >>> Looking better, but when I try to apply this to my local tree patch 2 > >>> fails (I'm not exactly sure why, what was your baseline for these > >>> patches?) > >> I have the tree patches in this thread on top of these other two: > >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html > >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html > >> > >> They have gradual improvements to thumb1_movsi_insn. > >> > >>> -- that patch looks suspicious anyway, you're replacing code > >>> that prints out assembly with code that generates RTL. > >> Right! I took me a while to understand how I could miss this, sorry. > >> That was caused by improper testing, as this part of the code > >> isn't used when targetting cortex-m0. I have added a testcase > >> for cortex-m23 which crashes with the previous version of patch 2, > >> and succeeds now. > >> > >>> Could you also rename t1_print and t1_rtl to thumb1_const_print and > >>> thumb1_const_rtl. I think the names as they stand are likely to be too > >>> generic. > >> OK, done. > >> > >> How about this new version? > >> I'm not yet sure about the most appropriate naming for: > >> thumb1_gen_const_int > >> thumb1_gen_const_int_asm > >> should they be > >> thumb1_gen_const_int_rtl > >> thumb1_gen_const_int_print > >> to be consistent with the new classes? > > > > It would probably be better, yes. > > > > More detailed comments below. > > > > R. > > > >> > >> Thanks, > >> > >> Christophe > >> > >>> R. > >>> > >>>> > >>>> Thanks, > >>>> > >>>> Christophe > >>>> > >>>>> R. > >>>>> > >>>>>> > >>>>>>> > >>>>>>> R. > >>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>>> > >>>>>>>>>> R. > >>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> > >>>>>>>>>>> Christophe > >>>>>>>>>>> > >>>>>>>>>>>> class t1_rtl > >>>>>>>>>>>> { > >>>>>>>>>>>> public: > >>>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } > >>>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } > >>>>>>>>>>>> }; > >>>>>>>>>>>> > >>>>>>>>>>>> class t1_print > >>>>>>>>>>>> { > >>>>>>>>>>>> public: > >>>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} > >>>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } > >>>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } > >>>>>>>>>>>> private: > >>>>>>>>>>>> FILE *t_file; > >>>>>>>>>>>> }; > >>>>>>>>>>>> > >>>>>>>>>>>> template <class T> > >>>>>>>>>>>> void thumb1_gen_const_int(T t, int f) > >>>>>>>>>>>> { > >>>>>>>>>>>> // Expansion of thumb1_gen_const_int ... > >>>>>>>>>>>> t.ashift(f); > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> // Usage... > >>>>>>>>>>>> void f1() > >>>>>>>>>>>> { > >>>>>>>>>>>> // Use the RTL expander > >>>>>>>>>>>> t1_rtl g; > >>>>>>>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> void f2() > >>>>>>>>>>>> { > >>>>>>>>>>>> // Use the printf expander writing to stdout > >>>>>>>>>>>> t1_print g(stdout); > >>>>>>>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to > >>>>>>>>>>>> worry > >>>>>>>>>>>> about which expander is being used in each instance and the > >>>>>>>>>>>> template > >>>>>>>>>>>> expansion will use the right version. > >>>>>>>>>>>> > >>>>>>>>>>>> R. > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I also suspect (but haven't check) that the base adjustment > >>>>>>>>>>>>>>>> will > >>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). > >>>>>>>>>>>>>>>> If that is > >>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left > >>>>>>>>>>>>>>>> by 2 for an > >>>>>>>>>>>>>>>> even greater range of literals. > >>>>>>>>>>>>>>> I can see there is provision for this in the > >>>>>>>>>>>>>>> !TARGET_THUMB1_ONLY case, > >>>>>>>>>>>>>>> I'll update my patch. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> More generally, any sequence of up to > >>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as > >>>>>>>>>>>>>>>> fast as the > >>>>>>>>>>>>>>>> existing literal pool fall back. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), > >>>>>>>>>>>>>>>> your code will > >>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two > >>>>>>>>>>>>>>>> shifts could > >>>>>>>>>>>>>>>> be merged into a single shift-by-16. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for > >>>>>>>>>>>>>>>> this, if at > >>>>>>>>>>>>>>>> all possible. > >>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing > >>>>>>>>>>>>>>> testcase > >>>>>>>>>>>>>>> that showed the problem. There are also: > >>>>>>>>>>>>>>> g++.dg/opt/thunk1.C > >>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C > >>>>>>>>>>>>>>> g++.dg/torture/pr46287.C > >>>>>>>>>>>>>>> g++.dg/torture/pr45699.C > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add > >>>>>>>>>>>>>>> -mpure-code in dg-options? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On reflection, probably not - that just makes things more > >>>>>>>>>>>>>> complicated > >>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions > >>>>>>>>>>>>>> with other > >>>>>>>>>>>>>> sets of options on the command line and the fall-out from > >>>>>>>>>>>>>> that). If > >>>>>>>>>>>>>> someone cares about pure-code they should be doing full > >>>>>>>>>>>>>> testsuite runs > >>>>>>>>>>>>>> with it enabled and that should be sufficient. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I > >>>>>>>>>>>>> use a > >>>>>>>>>>>>> modified simulator. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Christophe > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> R. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Christophe > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> R. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) > >>>>>>>>>>>>>>>>> --- > >>>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 > >>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- > >>>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c > >>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 > >>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c > >>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c > >>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, > >>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, > >>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>> if (mi_delta > 255) > >>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>>>>>>> - fputs ("+4\n", file); > >>>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the > >>>>>>>>>>>>>>>>> constant > >>>>>>>>>>>>>>>>> + pool: we build it explicitly. */ > >>>>>>>>>>>>>>>>> + if (target_pure_code) > >>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>> + bool mov_done_p = false; > >>>>>>>>>>>>>>>>> + int i; > >>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ > >>>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) > >>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & > >>>>>>>>>>>>>>>>> 0xff; > >>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>> + if (byte) > >>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, > >>>>>>>>>>>>>>>>> #%d\n", byte); > >>>>>>>>>>>>>>>>> + else > >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, > >>>>>>>>>>>>>>>>> #%d\n", byte); > >>>>>>>>>>>>>>>>> + mov_done_p = true; > >>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); > >>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ > >>>>>>>>>>>>>>>>> + if (!mov_done_p) > >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", > >>>>>>>>>>>>>>>>> mi_delta & 0xff); > >>>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) > >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", > >>>>>>>>>>>>>>>>> mi_delta & 0xff); > >>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>> + else > >>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>>>>>>> + fputs ("+4\n", file); > >>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", > >>>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); > >>>>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, > >>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, > >>>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); > >>>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>>>>>>> - fputs (":\n", file); > >>>>>>>>>>>>>>>>> - if (flag_pic) > >>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals > >>>>>>>>>>>>>>>>> for the > >>>>>>>>>>>>>>>>> + function address and delta since we emitted code to > >>>>>>>>>>>>>>>>> build > >>>>>>>>>>>>>>>>> + them. */ > >>>>>>>>>>>>>>>>> + if (!target_pure_code) > >>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb > >>>>>>>>>>>>>>>>> mode, so the PC > >>>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. > >>>>>>>>>>>>>>>>> Adjust the offset > >>>>>>>>>>>>>>>>> - accordingly. */ > >>>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); > >>>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>>>>>>> - tem, > >>>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>>>>>>> - ggc_strdup > >>>>>>>>>>>>>>>>> (labelpc))); > >>>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>> - } > >>>>>>>>>>>>>>>>> - else > >>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, > >>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>>>>>>> + fputs (":\n", file); > >>>>>>>>>>>>>>>>> + if (flag_pic) > >>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb > >>>>>>>>>>>>>>>>> mode, so the PC > >>>>>>>>>>>>>>>>> + pipeline offset is four rather than eight. > >>>>>>>>>>>>>>>>> Adjust the offset > >>>>>>>>>>>>>>>>> + accordingly. */ > >>>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : > >>>>>>>>>>>>>>>>> -7); > >>>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>>>>>>> + tem, > >>>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>> ggc_strdup (labelpc))); > >>>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>> + else > >>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), > >>>>>>>>>>>>>>>>> 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, > >>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, > >>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>>>> else > >>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>> > >>> > > > > +class thumb1_const_rtl > > ... > > + void mov (int val) > > > > This should take a HOST_WIDE_INT. Similarly for add and shift. The > > same applies to the asm version as well. > > > > + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); > > > > Should be using reg_names[dst_regno] in all cases. In fact, you might > > want to move that lookup to the constructor and just save a pointer to > > the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED > > Correction, for a (signed) HOST_WIDE_INT, this should be > HOST_WIDE_INT_PRINT_DEC. >
Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the two classes can have unsigned HOST_WIDE_INT parameters, and thus use HOST_WIDE_INT_PRINT_UNSIGNED? > > rather than "%d" for the immediate. > > > > +template <class T> > > +void > > +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) > > +{ > > + bool mov_done_p = false; > > + int val = op1; > > > > This potentially silently loses precision. In fact, I think you really > > want to use "unsigned HOST_WIDE_INT" throughout the following code, so > > that the right shifts aren't undefined if dealing with negative numbers. > > > > For safety, you should also have an assertion in here that > > > > op1 == trunc_int_for_mode (op1, SImode) > > > > + int shift = 0; > > + int i; > > + > > + if (val == 0) > > > > You can short-circuit 0..255 here for a quick exit. > > > > + { > > + dst.mov (val); > > + return; > > + } > > > > Another trick: if the top nine bits of the 32-bit value are all set, > > you're probably going to be better off (and certainly not worse off) by > > generating -op1 and then negating the result in a final step - you can > > do that via recursion. > > > > + > > + /* In the general case, we need 7 instructions to build > > + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can > > + do better if VAL is small enough, or > > + right-shiftable by a suitable amount. If the > > + right-shift enables to encode at least one less byte, > > + it's worth it: we save a adds and a lsls at the > > + expense of a final lsls. */ > > + int final_shift = number_of_first_bit_set (val); > > + > > + int leading_zeroes = clz_hwi (val); > > + int number_of_bytes_needed > > + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) > > + / BITS_PER_UNIT) + 1; > > + int number_of_bytes_needed2 > > + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) > > + / BITS_PER_UNIT) + 1; > > + > > + if (number_of_bytes_needed2 < number_of_bytes_needed) > > + val >>= final_shift; > > + else > > + final_shift = 0; > > + > > + /* If we are in a very small range, we can use either a single movs > > + or movs+adds. */ > > + if ((val >= 0) && (val <= 510)) > > > > if val is made unsigned HWI as I suggest, the lower bounds test is not > > needed. > > > > + { > > + if (val > 255) > > + { > > + int high = val - 255; > > > > Again, watch your types. > > > > + > > + dst.mov (high); > > + dst.add (255); > > + } > > + else > > + dst.mov (val); > > + > > + if (final_shift > 0) > > + dst.ashift (final_shift); > > + } > > + else > > + { > > + /* General case, emit upper 3 bytes as needed. */ > > + for (i = 0; i < 3; i++) > > + { > > + int byte = (val >> (8 * (3 - i))) & 0xff; > > > > and here. > > > > + > > + if (byte) > > + { > > + /* We are about to emit new bits, stop accumulating a > > + shift amount, and left-shift only if we have already > > + emitted some upper bits. */ > > + if (mov_done_p) > > + { > > + dst.ashift (shift); > > + dst.add (byte); > > + } > > + else > > + dst.mov (byte); > > + > > + /* Stop accumulating shift amount since we've just > > + emitted some bits. */ > > + shift = 0; > > + > > + mov_done_p = true; > > + } > > + > > + if (mov_done_p) > > + shift += 8; > > + } > > + > > + /* Emit lower byte. */ > > + if (!mov_done_p) > > + dst.mov (val & 0xff); > > + else > > + { > > + dst.ashift (shift); > > + if (val & 0xff) > > + dst.add (val & 0xff); > > + } > > + > > + if (final_shift > 0) > > + dst.ashift (final_shift); > > + } > > +} > > + > > >