Re: [PATCH] rl78 anddi3 improvement

2018-01-19 Thread DJ Delorie

Jeff Law  writes:
> So I think you're ultimately far better off determining why GCC does not
> generate efficient code for 64bit logicals on the rl78 target.

In thinking about this more, one possible reason is that rl78 has an
8-bit WORD_MODE.  Which means DImode operations are not reduced to
SImode, they're reduced to QImode.  If you want SImode instead, you need
to intervene.


Re: [PATCH] rl78 anddi3 improvement

2017-12-19 Thread DJ Delorie

Right, when doing 64-bit operations on an 8-bit mcu with limited
registers, a hand-written assembler routine that you call as needed will
beat anything gcc spits out - for size-per-call.  And I had a lot of
trouble getting gcc to deal with the rl78's limited register set and
addressing modes - compiling libgcc/libstdc++ is a torture test for
alloc/reload on small mcus.  That forced me to be conservative, and add
the virtual ISA that gcc can work with.

So I'm OK with your approach, and if you come up with something even
better, I'm OK with that too :-)

(pending reviews, of course, that's next ;)


RE: [PATCH] rl78 anddi3 improvement

2017-12-11 Thread Sebastian Perta
Hello Jeff,

Thank you for your comments.

>>So I think you're ultimately far better off determining why GCC does not
>>generate efficient code for 64bit logicals on the rl78 target.
I totally agree with you, this is why:
1. I have another patch: I define_expand movdi in which I instruct GCC to use 
16 bit movw instead of movw, with this patch applied on the latest revision I 
reduce the code size of this function (my_anddi3) from 323 bytes to 245 bytes. 
I'm just waiting on the regression to finish I will post this patch.
2. I am working very hard for quite some time to improve/replace the 
"devirtualization" pass (see pass_rl78_devirt in rl78.c). I am working on a 
solution which will generate very similar code what I wrote in ___adddi3 and 
will also allow me to also change the calling convention to be efficient 
(similar to what the RL78 commercial compilers use) but unfortunately I still a 
long way from being finished (as it is quite difficult). I think DJ can explain 
much better why he needed to do things this way (add this pass and the *_virt 
and *_real variants for the instructions) in the first place.

However if you look closely at the patch you will see the that I put the 
following condition for the availability of the expand:
+  "optimize_size"
The idea behind this is the following:
Compared to the commercial RL78 compilers GCC is quite far behind (even 2x-3x 
bigger). When comparing the output code I observed the commercial compilers I 
saw they use quite extensively code merging techniques.
For GCC I found some work on this on a branch which didn't make to master 
(https://www.gnu.org/software/gcc/projects/cfo.html). I have ported this a 
while back to 4.9.2 but I didn't get significant code size improvement (I think 
I will give this another try after I finish point 2 above)
So I decided then to continue doing things this way which finally gave me some 
really good results (improved the code size by 30% on average, even more than 
50% in some cases).
So even if/when I finish with point 2 above I think I will still like to have 
things done this way as they improve code size significantly.

I hope this explanation is satisfactory to you  as I have other patches (not 
only for 64 bit operations) which make use of the same idea.

Best Regards,
Sebastian


[https://www2.renesas.eu/media/email/unicef_2017.jpg]

This Christmas, instead of sending out cards, Renesas Electronics Europe have 
decided to support Unicef with a donation. For further details click 
here to find out about the valuable work they do, 
helping children all over the world.
We would like to take this opportunity to wish you a Merry Christmas and a 
prosperous New Year.



Renesas Electronics Europe Ltd, Dukes Meadow, Millboard Road, Bourne End, 
Buckinghamshire, SL8 5FH, UK. Registered in England & Wales under Registered 
No. 04586709.


Re: [PATCH] rl78 anddi3 improvement

2017-12-08 Thread Jeff Law
On 12/08/2017 07:50 AM, Sebastian Perta wrote:
> Hello,
> 
> The following patch improves code size for 64 bit and for RL78:
> it emits a library function call instead of emitting code for  the 64 bit
> min for every single time.
> The and function which was added in libgcc is hand written, so more optimal
> than what GCC generates.
> 
> The change can easily be seen on the following test case:
> unsigned long long my_anddi3(unsigned long long x, unsigned long long y){ 
> return x & y;
> }
> I did not add this to the regression as it very simple and there are test
> cases in the regression which test this, for example
> gcc.dg/torture/stackalign/alloca-1.c  and
> gcc.dg/torture/stackalign/vararg-2.c.
> Regression test is OK, tested with the following command:
> make -k check-gcc RUNTESTFLAGS=--target_board=rl78-sim
> 
> Please let me know if this is OK, Thank you!
> Sebastian
> 
> Index: gcc/ChangeLog
> ===
> --- gcc/ChangeLog (revision 255511)
> +++ gcc/ChangeLog (working copy)
> @@ -1,3 +1,7 @@
> +2017-12-08  Sebastian Perta  
> +
> + * config/rl78/rl78.md: New define_expand "anddi3".
So I think you're ultimately far better off determining why GCC does not
generate efficient code for 64bit logicals on the rl78 target.

Once you start defining these kind of operations you'll often end up
defining more and more rather than addressing the core underlying issues.

Furthermore, defining multi-word operations will inhibit optimizations
that would have applied had the multi-word operation been broken down
into word-sized operations.  I've seen this time and time again when
32bit targets define DImode operations.  So not only will you end up
with a lot of these patterns, you then end up defining lots of special
cases for them to catch some of hte missed optimizations.

DJ has the final call here, but I think you'd be better off looking
deeper and trying to understand how to get better code for multi-word
operations without defining multi-word libcall expanders like this.

jeff



[PATCH] rl78 anddi3 improvement

2017-12-08 Thread Sebastian Perta
Hello,

The following patch improves code size for 64 bit and for RL78:
it emits a library function call instead of emitting code for  the 64 bit
min for every single time.
The and function which was added in libgcc is hand written, so more optimal
than what GCC generates.

The change can easily be seen on the following test case:
unsigned long long my_anddi3(unsigned long long x, unsigned long long y){ 
return x & y;
}
I did not add this to the regression as it very simple and there are test
cases in the regression which test this, for example
gcc.dg/torture/stackalign/alloca-1.c  and
gcc.dg/torture/stackalign/vararg-2.c.
Regression test is OK, tested with the following command:
make -k check-gcc RUNTESTFLAGS=--target_board=rl78-sim

Please let me know if this is OK, Thank you!
Sebastian

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 255511)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,7 @@
+2017-12-08  Sebastian Perta  
+
+   * config/rl78/rl78.md: New define_expand "anddi3".
+   
 2017-12-08  Martin Jambor  
 
PR tree-optimization/83141
Index: gcc/config/rl78/rl78.md
===
--- gcc/config/rl78/rl78.md (revision 255511)
+++ gcc/config/rl78/rl78.md (working copy)
@@ -718,3 +718,13 @@
   [(set_attr "valloc" "macax")
(set_attr "is_g13_muldiv_insn" "yes")]
 )
+
+(define_expand "anddi3"
+ [(set (match_operand:DI  0 "nonimmediate_operand" "")
+   (and:DI (match_operand:DI 1 "general_operand"  "")
+(match_operand:DI2 "general_operand"  "")))
+   ]
+  "optimize_size"
+  "rl78_emit_libcall (\"__anddi3\", AND, DImode, DImode, 3, operands);
+   DONE;"
+)
Index: libgcc/ChangeLog
===
--- libgcc/ChangeLog(revision 255511)
+++ libgcc/ChangeLog(working copy)
@@ -1,3 +1,8 @@
+2017-12-08  Sebastian Perta  
+
+   * config/rl78/anddi3.S: New assembly file.
+   * config/rl78/t-rl78: Added anddi3.S to LIB2ADD.
+   
 2017-11-30  Michael Meissner  
 
* config/rs6000/_mulkc3.c (__mulkc3): Add forward declaration.
Index: libgcc/config/rl78/anddi3.S
===
--- libgcc/config/rl78/anddi3.S (nonexistent)
+++ libgcc/config/rl78/anddi3.S (working copy)
@@ -0,0 +1,66 @@
+;   Copyright (C) 2017 Free Software Foundation, Inc.
+;   Contributed by Sebastian Perta.
+; 
+; This file is free software; you can redistribute it and/or modify it
+; under the terms of the GNU General Public License as published by the
+; Free Software Foundation; either version 3, or (at your option) any
+; later version.
+; 
+; This file is distributed in the hope that it will be useful, but
+; WITHOUT ANY WARRANTY; without even the implied warranty of
+; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+; General Public License for more details.
+; 
+; Under Section 7 of GPL version 3, you are granted additional
+; permissions described in the GCC Runtime Library Exception, version
+; 3.1, as published by the Free Software Foundation.
+;
+; You should have received a copy of the GNU General Public License and
+; a copy of the GCC Runtime Library Exception along with this program;
+; see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+; .
+
+
+#include "vregs.h"
+
+.text
+
+START_FUNC ___anddi3
+
+movw  hl, sp
+
+mov   a, [hl+4]
+and   a, [hl+12]
+mov   r8, a
+
+mov   a, [hl+5]
+and   a, [hl+13]
+mov   r9, a
+
+mov   a, [hl+6]
+and   a, [hl+14]
+mov   r10, a
+
+mov   a, [hl+7]
+and   a, [hl+15]
+mov   r11, a
+
+mov   a, [hl+8]
+and   a, [hl+16]
+mov   r12, a
+
+mov   a, [hl+9]
+and   a, [hl+17]
+mov   r13, a
+
+mov   a, [hl+10]
+and   a, [hl+18]
+mov   r14, a
+
+mov   a, [hl+11]
+and   a, [hl+19]
+mov   r15, a
+
+ret
+
+END_FUNC ___anddi3
Index: libgcc/config/rl78/t-rl78
===
--- libgcc/config/rl78/t-rl78   (revision 255511)
+++ libgcc/config/rl78/t-rl78   (working copy)
@@ -32,7 +32,8 @@
$(srcdir)/config/rl78/fpmath-sf.S \
$(srcdir)/config/rl78/cmpsi2.S \
$(srcdir)/config/rl78/adddi3.S \
-   $(srcdir)/config/rl78/subdi3.S
+   $(srcdir)/config/rl78/subdi3.S \
+   $(srcdir)/config/rl78/anddi3.S
 
 LIB2FUNCS_EXCLUDE = _clzhi2 _clzsi2 _ctzhi2 _ctzsi2 \
   _popcounthi2 _popcountsi2 \