Re: [PATCH] improved const shifts for AVR targets

2022-10-28 Thread Jeff Law via Gcc-patches



On 10/15/22 06:08, A. Binzberger wrote:

Re: [PATCH] improved const shifts for AVR targets
On 12.10.22 19:57, Jeff Law wrote:


On 10/4/22 11:06, Alexander Binzberger via Gcc-patches wrote:

Hi,
recently I used some arduino uno for a project and realized some areas
which do not output optimal asm code. Especially around shifts and 
function

calls.
With this as motivation and hacktoberfest I started patching things.
Since patch files do not provide a good overview and I hope for a
"hacktoberfest-accepted" label on the PR on github I also opened it 
there:

https://github.com/gcc-mirror/gcc/pull/73

This patch improves shifts with const right hand operand. While 8bit 
and

16bit shifts where mostly fine 24bit and 32bit where not handled well.

Testing
I checked output with a local installation of compiler explorer in 
asm and

a tiny unit test comparing shifts with mul/div by 2.
I however did not write any testcases in gcc for it.

Target
This patch is only targeting atmel avr family of chips.

Changelog
improved const shifts for AVR targets


It would be helpful if you could show the before/after code for the 
cases you're changing.  Extra credit if you include cycles & size 
information for those cases.  That would help someone like me who 
knows GCC well, but isn't particularly well versed in the AVR target 
evaluate the overarching goal of the patch (ie, better code).


about the avr family targets:

* consider every branch instruction = 1/2 cycles

* consider every 2byte/word instruction (besides move word if 
available) = 2 cycles


* consider multiplication (if available) = 2 cycles

* consider every load (beside load immediate "ldi" 1cylce) = 2cycles 
(+1 for prog mem)


* pop and jump mostly = 2 cycles

* call is basically = 2-4 cycles

* ret is about =  4/5 cycles

* consider every instruction (bit/bit-test, most compare, arithmetic, 
logic, some other) = 1 cycle


* division does not exist

or as a summary for this patch: branches and such are 2 cycles the 
rest is 1 cycle


note that shifts are 1bit per cycle and the instructions are at least 
mostly byte based.


also note that operations using immediate do only work with the upper 
half of registers.


All useful, but you should be giving me the summary for the things 
you're changing, not asking me to do it :-)  Presumably you've already 
done the analysis to ensure your changes are an improvement.  I'm asking 
you to provide that analysis for review and archival purposes.



A quick table like


Mode    Shift count    Shift type    original cycles (or size) new 
cycles (or size)



That will make it very clear for me and anyone doing historical work in 
the future what was expected here.  It's OK if the cycle counts aren't 
100% accurate.



Including a testcase would be awesome as well, but isn't strictly required.



a description for the code before my change and what changed:

* shifts on 8bit (beside arithmetic shifts right) were optimized and 
always unrolled (only aligned with the rest of the code without actual 
change)


* arithmetic shift 8bit and 16bit shifts were mostly optimized and 
mostly unrolled - depending on registers and Os (I added the missing 
cases there)


* 24bit and 32bit shifts were basically not optimized at all and never 
unrolled (I added those cases and aligned the optimizer logic with the 
others. They also reuse the other shift code since they may reduce to 
those cases after a move for bigger shifts.)


* out_shift_with_cnt provides a fallback implementation as a loop over 
shifts which may get unrolled. in case of Os to about inner_len + 3,4 
or 5 and in other cases of optimizer e.g. O2 it gets unrolled if size 
is smaller 10. see max_len (basically unchanged)


* did not touch non const cases in this patch but may in a future 
patch for O2 and O3


note that in case of Os the smaller code is picked which is the loop 
at least in some cases but other optimizer cases profit a lot.


also note that it is debatable if Os needs to be that strict with size 
since the compute overhead of the loop is high with 5 per loop 
iteration/cycle- so per bit shift. A lot more cases could be covered 
with +1 or +2 more instructions.



about plen:

If plen is NULL the asm code gets returned.

If plen is a pointer the code does count the instruction count which I 
guess is used (or could be used) as a rough estimate of cycles as well 
as byte code size.


Some of the functions named this len. The 24bit functions mainly named 
this plen and used it like it is now in all functions. This is mostly 
a readability improvement.


I am not sure how this works together with the optimizer or the rest.

To my understanding however the functions may get called once by the 
optimizer with a length given, then to output code and potentially 
again with a len given over avr_adjust_length to return the size.


I may be wrong about this part but as far as I can tell I did not 
change the way it operates.



[PATCH] improved const shifts for AVR targets

2022-10-15 Thread Georg Johann Lay

Hi,
recently I used some arduino uno for a project and realized some areas
which do not output optimal asm code. Especially around shifts and function
calls.
With this as motivation and hacktoberfest I started patching things.
Since patch files do not provide a good overview and I hope for a
"hacktoberfest-accepted" label on the PR on github I also opened it there:
https://github.com/gcc-mirror/gcc/pull/73

This patch improves shifts with const right hand operand. While 8bit and
16bit shifts where mostly fine 24bit and 32bit where not handled well.

Testing
I checked output with a local installation of compiler explorer in asm and
a tiny unit test comparing shifts with mul/div by 2.
I however did not write any testcases in gcc for it.


Hi, for such large changes, IMO it's a good idea to run the testsuite 
against the changes and make sure that there are no regressions.  Maybe 
even add new runtime tests in gcc.target/avr/torture to cover 
significant amount of the changes?


For example a test could go like:

__attribute__((__always_inline__))
static inline void shr (long x, int off)
{
long y = x >> off;
__asm ("" : "+r" (x));
if (x >> off != y)
__builtin_abort();
}

void test_shr (void)
{
long x = 0x76543215;
shr (x, 13);
shr (x, 14);
shr (x, 15);
shr (x, 16);
}

One shift is folded away by the compiler, and the other one has to be 
carried out.


However, the insn output also depends on available register classes like 
"ldi_ok" and whether a "d" class scratch is available, so it will be 
hard to achieve full coverage.  As it appears, testing for the lower 
registers can be forced by, where this won't work for AVR_TINY, of course:


static inline void shr (long x, int off)
{
long y = x >> off;
__asm ("" : "+l" (x));
x >>= off;
__asm ("" : "+l" (x));
if (x != y)
__builtin_abort();
}


Target
This patch is only targeting atmel avr family of chips.

Changelog
improved const shifts for AVR targets


You can have a look at existing ChangeLog files to see the format and style.



Patch
-
diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
index 4ed390e4cf9..c7b70812d5c 100644
--- a/gcc/config/avr/avr.cc
+++ b/gcc/config/avr/avr.cc
@@ -6043,9 +6043,6 @@ out_shift_with_cnt (const char *templ, rtx_insn
*insn, rtx operands[],
   op[2] = operands[2];
   op[3] = operands[3];

-  if (plen)
-*plen = 0;
-


This looks wrong.  These functions are used in two different contexts:

One is computing the instructions lengths (in words) which is needed for 
jump offset computations for relative jumps that are crossing the insn. 
This is done for plen != NULL, and the length must be returned in *plen.


Second is actual output of the instruction sequence rest. return 
respective sting (depending on context), which must have a length no 
longer than computed.  This is performed if plen == NULL.


Not initializing *plen means that you get garbage for instruction 
lengths.  Runtime errors will occur but just not very frequently, e.g. 
if an instruction sequence is longer than anticipated, a jump target 
might be out of reach which results in a linker error.



   if (CONST_INT_P (operands[2]))
 {
   /* Operand 3 is a scratch register if this is a
@@ -6150,96 +6147,68 @@ out_shift_with_cnt (const char *templ, rtx_insn
*insn, rtx operands[],
 /* 8bit shift left ((char)x << i)   */

 const char *
-ashlqi3_out (rtx_insn *insn, rtx operands[], int *len)
+ashlqi3_out (rtx_insn *insn, rtx operands[], int *plen)
 {
   if (CONST_INT_P (operands[2]))
 {
-  int k;
-
-  if (!len)
- len = 
-
   switch (INTVAL (operands[2]))
  {
  default:
   if (INTVAL (operands[2]) < 8)
 break;

-  *len = 1;
-  return "clr %0";
-
- case 1:
-  *len = 1;
-  return "lsl %0";
-
- case 2:
-  *len = 2;
-  return ("lsl %0" CR_TAB
-  "lsl %0");
-
- case 3:
-  *len = 3;
-  return ("lsl %0" CR_TAB
-  "lsl %0" CR_TAB
-  "lsl %0");
+return avr_asm_len ("clr %0", operands, plen, 1);


I don't get it.  This prints *one* CLR instruction for all shift offsets 
1...3?




  case 4:
   if (test_hard_reg_class (LD_REGS, operands[0]))
 {
-  *len = 2;
-  return ("swap %0" CR_TAB
-  "andi %0,0xf0");
+return avr_asm_len ("swap %0" CR_TAB
+  "andi %0,0xf0", operands, plen, 2);


Glitch of coding-rules (GNU style it is), similar in many placed down 
the line which seem to have incorrect indentations.  It's not always 
easy to tell this just from looking at a patch, so better double-check 
your indentations.



 }
-  *len = 4;
-  return ("lsl %0" CR_TAB
+return avr_asm_len ("lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
-  "lsl %0");
+  "lsl %0", operands, plen, 4);

  case 5:
   if (test_hard_reg_class (LD_REGS, operands[0]))
 {
-  *len = 3;
-  return ("swap %0" CR_TAB
+return avr_asm_len ("swap %0" CR_TAB
   "lsl %0"  CR_TAB
-  "andi %0,0xe0");
+  "andi %0,0xe0", operands, plen, 

Re: [PATCH] improved const shifts for AVR targets

2022-10-15 Thread A. Binzberger via Gcc-patches

On 12.10.22 19:57, Jeff Law wrote:


On 10/4/22 11:06, Alexander Binzberger via Gcc-patches wrote:

Hi,
recently I used some arduino uno for a project and realized some areas
which do not output optimal asm code. Especially around shifts and 
function

calls.
With this as motivation and hacktoberfest I started patching things.
Since patch files do not provide a good overview and I hope for a
"hacktoberfest-accepted" label on the PR on github I also opened it 
there:

https://github.com/gcc-mirror/gcc/pull/73

This patch improves shifts with const right hand operand. While 8bit and
16bit shifts where mostly fine 24bit and 32bit where not handled well.

Testing
I checked output with a local installation of compiler explorer in 
asm and

a tiny unit test comparing shifts with mul/div by 2.
I however did not write any testcases in gcc for it.

Target
This patch is only targeting atmel avr family of chips.

Changelog
improved const shifts for AVR targets


It would be helpful if you could show the before/after code for the 
cases you're changing.  Extra credit if you include cycles & size 
information for those cases.  That would help someone like me who 
knows GCC well, but isn't particularly well versed in the AVR target 
evaluate the overarching goal of the patch (ie, better code).


about the avr family targets:

* consider every branch instruction = 1/2 cycles

* consider every 2byte/word instruction (besides move word if available) 
= 2 cycles


* consider multiplication (if available) = 2 cycles

* consider every load (beside load immediate "ldi" 1cylce) = 2cycles (+1 
for prog mem)


* pop and jump mostly = 2 cycles

* call is basically = 2-4 cycles

* ret is about =  4/5 cycles

* consider every instruction (bit/bit-test, most compare, arithmetic, 
logic, some other) = 1 cycle


* division does not exist

or as a summary for this patch: branches and such are 2 cycles the rest 
is 1 cycle


note that shifts are 1bit per cycle and the instructions are at least 
mostly byte based.


also note that operations using immediate do only work with the upper 
half of registers.



a description for the code before my change and what changed:

* shifts on 8bit (beside arithmetic shifts right) were optimized and 
always unrolled (only aligned with the rest of the code without actual 
change)


* arithmetic shift 8bit and 16bit shifts were mostly optimized and 
mostly unrolled - depending on registers and Os (I added the missing 
cases there)


* 24bit and 32bit shifts were basically not optimized at all and never 
unrolled (I added those cases and aligned the optimizer logic with the 
others. They also reuse the other shift code since they may reduce to 
those cases after a move for bigger shifts.)


* out_shift_with_cnt provides a fallback implementation as a loop over 
shifts which may get unrolled. in case of Os to about inner_len + 3,4 or 
5 and in other cases of optimizer e.g. O2 it gets unrolled if size is 
smaller 10. see max_len (basically unchanged)


* did not touch non const cases in this patch but may in a future patch 
for O2 and O3


note that in case of Os the smaller code is picked which is the loop at 
least in some cases but other optimizer cases profit a lot.


also note that it is debatable if Os needs to be that strict with size 
since the compute overhead of the loop is high with 5 per loop 
iteration/cycle- so per bit shift. A lot more cases could be covered 
with +1 or +2 more instructions.



about plen:

If plen is NULL the asm code gets returned.

If plen is a pointer the code does count the instruction count which I 
guess is used (or could be used) as a rough estimate of cycles as well 
as byte code size.


Some of the functions named this len. The 24bit functions mainly named 
this plen and used it like it is now in all functions. This is mostly a 
readability improvement.


I am not sure how this works together with the optimizer or the rest.

To my understanding however the functions may get called once by the 
optimizer with a length given, then to output code and potentially again 
with a len given over avr_adjust_length to return the size.


I may be wrong about this part but as far as I can tell I did not change 
the way it operates.



size and cycles summary:

The asm instruction count is used as size and cycle estimate. This gets 
close to the real deal for the shifts since the instructions are all 1 
cylce anyway and similar in byte code size.


8bit gets always optimized and unrolled to get max performance and less 
code size (beside shift of 6 with lower half registers used which is the 
worst case with +1 instruction).


16bit, 24bit and 32bit gets unrolled depending on optimizer setting - 
and registers used (see out_shift_with_cnt:max_len). So 16bit gets 
mostly optimized and unrolled in Os (see comments for plen/max_len) and 
always in O2 and such (max_len=10). Shift optimization and unroll for 
24bit and 32bit is mostly only relevant when not optimizing for size.



Re: [PATCH] improved const shifts for AVR targets

2022-10-12 Thread Jeff Law via Gcc-patches



On 10/4/22 11:06, Alexander Binzberger via Gcc-patches wrote:

Hi,
recently I used some arduino uno for a project and realized some areas
which do not output optimal asm code. Especially around shifts and function
calls.
With this as motivation and hacktoberfest I started patching things.
Since patch files do not provide a good overview and I hope for a
"hacktoberfest-accepted" label on the PR on github I also opened it there:
https://github.com/gcc-mirror/gcc/pull/73

This patch improves shifts with const right hand operand. While 8bit and
16bit shifts where mostly fine 24bit and 32bit where not handled well.

Testing
I checked output with a local installation of compiler explorer in asm and
a tiny unit test comparing shifts with mul/div by 2.
I however did not write any testcases in gcc for it.

Target
This patch is only targeting atmel avr family of chips.

Changelog
improved const shifts for AVR targets


It would be helpful if you could show the before/after code for the 
cases you're changing.  Extra credit if you include cycles & size 
information for those cases.  That would help someone like me who knows 
GCC well, but isn't particularly well versed in the AVR target evaluate 
the overarching goal of the patch (ie, better code).


Changes should include a ChangeLog which indicates what changed. If you 
look at git log you will see examples of what a ChangeLog should look like.


The is large enough that you need either a  copyright assignment or DCO 
certification.


See this page for details:

https://gcc.gnu.org/contribute.html




Patch
-
diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
index 4ed390e4cf9..c7b70812d5c 100644
--- a/gcc/config/avr/avr.cc
+++ b/gcc/config/avr/avr.cc
@@ -6043,9 +6043,6 @@ out_shift_with_cnt (const char *templ, rtx_insn
*insn, rtx operands[],
op[2] = operands[2];
op[3] = operands[3];

-  if (plen)
-*plen = 0;
-


Doesn't this leave *plen uninitialized for the case where the shift 
count is held in memory or a register or is an out of range constant?  
Is this really safe?





if (CONST_INT_P (operands[2]))
  {
/* Operand 3 is a scratch register if this is a
@@ -6150,96 +6147,68 @@ out_shift_with_cnt (const char *templ, rtx_insn
*insn, rtx operands[],
  /* 8bit shift left ((char)x << i)   */

  const char *
-ashlqi3_out (rtx_insn *insn, rtx operands[], int *len)
+ashlqi3_out (rtx_insn *insn, rtx operands[], int *plen)
  {
if (CONST_INT_P (operands[2]))
  {
-  int k;
-
-  if (!len)
- len = 
-


Isn't this wrong for the call to ashlqi3_out from avr.md?  In that call 
len/plen will be zero, which we then pass down.  So the question is why 
did you remove this code?



The patch as-is is relatively large and can easily be broken down into 
more manageable chunks.  I would suggest a patch for each mode.  ie, one 
which changes QImode shifts, another for HImode shifts, another for 
PSImode shifts. etc.  It may seem like more work, but by breaking it 
down reviewers can take action on each patch individually.  So for 
example its relatively easy to work through the QImode changes and those 
could go in fairly quick while the PSImode changes will require 
considerably more time to review.




switch (INTVAL (operands[2]))
   {
   default:
if (INTVAL (operands[2]) < 8)
  break;

-  *len = 1;
-  return "clr %0";
-
- case 1:
-  *len = 1;
-  return "lsl %0";
-
- case 2:
-  *len = 2;
-  return ("lsl %0" CR_TAB
-  "lsl %0");
-
- case 3:
-  *len = 3;
-  return ("lsl %0" CR_TAB
-  "lsl %0" CR_TAB
-  "lsl %0");
+return avr_asm_len ("clr %0", operands, plen, 1);


You've probably got a whitespace problem here.  I think the return 
should line up in the came column as the IF statement. Conceptually this 
seems reasonable as cases 1, 2 and 3 can be trivially handled by 
out_shift_with_cnt.  Tough routing more code through out_shift_with_cnt 
means the comment might need to change since we're routing more cases 
through it that were trivially handled in ashlqi3_out before.





   case 4:
if (test_hard_reg_class (LD_REGS, operands[0]))
  {
-  *len = 2;
-  return ("swap %0" CR_TAB
-  "andi %0,0xf0");
+return avr_asm_len ("swap %0" CR_TAB
+  "andi %0,0xf0", operands, plen, 2);
More indention problems here.  THe return should line up two spaces 
inside the open curly brace.  Otherwise this case seems reasonable since 
it's generating the same code as before.

  }
-  *len = 4;
-  return ("lsl %0" CR_TAB
+return avr_asm_len ("lsl %0" CR_TAB
"lsl %0" CR_TAB
"lsl %0" CR_TAB
-  "lsl %0");
+  "lsl %0", operands, plen, 4);


Gratuitous indentation changes.  Please don't do that unless you're 
fixing cases where the indentation is wrong according to GNU/project 
standards.





   case 5:
if (test_hard_reg_class (LD_REGS, operands[0]))
  {
-  *len = 3;
-  return ("swap %0" CR_TAB
+return avr_asm_len ("swap %0" CR_TAB
"lsl %0"  

[PATCH] improved const shifts for AVR targets

2022-10-04 Thread Alexander Binzberger via Gcc-patches
Hi,
recently I used some arduino uno for a project and realized some areas
which do not output optimal asm code. Especially around shifts and function
calls.
With this as motivation and hacktoberfest I started patching things.
Since patch files do not provide a good overview and I hope for a
"hacktoberfest-accepted" label on the PR on github I also opened it there:
https://github.com/gcc-mirror/gcc/pull/73

This patch improves shifts with const right hand operand. While 8bit and
16bit shifts where mostly fine 24bit and 32bit where not handled well.

Testing
I checked output with a local installation of compiler explorer in asm and
a tiny unit test comparing shifts with mul/div by 2.
I however did not write any testcases in gcc for it.

Target
This patch is only targeting atmel avr family of chips.

Changelog
improved const shifts for AVR targets

Patch
-
diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
index 4ed390e4cf9..c7b70812d5c 100644
--- a/gcc/config/avr/avr.cc
+++ b/gcc/config/avr/avr.cc
@@ -6043,9 +6043,6 @@ out_shift_with_cnt (const char *templ, rtx_insn
*insn, rtx operands[],
   op[2] = operands[2];
   op[3] = operands[3];

-  if (plen)
-*plen = 0;
-
   if (CONST_INT_P (operands[2]))
 {
   /* Operand 3 is a scratch register if this is a
@@ -6150,96 +6147,68 @@ out_shift_with_cnt (const char *templ, rtx_insn
*insn, rtx operands[],
 /* 8bit shift left ((char)x << i)   */

 const char *
-ashlqi3_out (rtx_insn *insn, rtx operands[], int *len)
+ashlqi3_out (rtx_insn *insn, rtx operands[], int *plen)
 {
   if (CONST_INT_P (operands[2]))
 {
-  int k;
-
-  if (!len)
- len = 
-
   switch (INTVAL (operands[2]))
  {
  default:
   if (INTVAL (operands[2]) < 8)
 break;

-  *len = 1;
-  return "clr %0";
-
- case 1:
-  *len = 1;
-  return "lsl %0";
-
- case 2:
-  *len = 2;
-  return ("lsl %0" CR_TAB
-  "lsl %0");
-
- case 3:
-  *len = 3;
-  return ("lsl %0" CR_TAB
-  "lsl %0" CR_TAB
-  "lsl %0");
+return avr_asm_len ("clr %0", operands, plen, 1);

  case 4:
   if (test_hard_reg_class (LD_REGS, operands[0]))
 {
-  *len = 2;
-  return ("swap %0" CR_TAB
-  "andi %0,0xf0");
+return avr_asm_len ("swap %0" CR_TAB
+  "andi %0,0xf0", operands, plen, 2);
 }
-  *len = 4;
-  return ("lsl %0" CR_TAB
+return avr_asm_len ("lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
-  "lsl %0");
+  "lsl %0", operands, plen, 4);

  case 5:
   if (test_hard_reg_class (LD_REGS, operands[0]))
 {
-  *len = 3;
-  return ("swap %0" CR_TAB
+return avr_asm_len ("swap %0" CR_TAB
   "lsl %0"  CR_TAB
-  "andi %0,0xe0");
+  "andi %0,0xe0", operands, plen, 3);
 }
-  *len = 5;
-  return ("lsl %0" CR_TAB
+return avr_asm_len ("lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
-  "lsl %0");
+  "lsl %0", operands, plen, 5);

  case 6:
   if (test_hard_reg_class (LD_REGS, operands[0]))
 {
-  *len = 4;
-  return ("swap %0" CR_TAB
+return avr_asm_len ("swap %0" CR_TAB
   "lsl %0"  CR_TAB
   "lsl %0"  CR_TAB
-  "andi %0,0xc0");
+  "andi %0,0xc0", operands, plen, 4);
 }
-  *len = 6;
-  return ("lsl %0" CR_TAB
+return avr_asm_len ("lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
   "lsl %0" CR_TAB
-  "lsl %0");
+  "lsl %0", operands, plen, 6);

  case 7:
-  *len = 3;
-  return ("ror %0" CR_TAB
+return avr_asm_len ("ror %0" CR_TAB
   "clr %0" CR_TAB
-  "ror %0");
+  "ror %0", operands, plen, 3);
  }
 }
   else if (CONSTANT_P (operands[2]))
 fatal_insn ("internal compiler error.  Incorrect shift:", insn);

   out_shift_with_cnt ("lsl %0",
-  insn, operands, len, 1);
+  insn, operands, plen, 1);
   return "";
 }

@@ -6247,7 +6216,7 @@ ashlqi3_out (rtx_insn *insn, rtx operands[], int *len)
 /* 16bit shift left ((short)x << i)   */

 const char *
-ashlhi3_out (rtx_insn *insn, rtx operands[], int *len)
+ashlhi3_out (rtx_insn *insn, rtx operands[], int *plen)
 {
   if (CONST_INT_P (operands[2]))
 {
@@ -6255,11 +6224,6 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int
*len)
  && XVECLEN (PATTERN (insn), 0) == 3
  && REG_P (operands[3]));
   int ldi_ok = test_hard_reg_class (LD_REGS, operands[0]);
-  int k;
-  int *t = len;
-
-  if (!len)
- len = 

   swit