RE: missing conditional propagation in cprop.c pass

2011-09-29 Thread Rahul Kharche

 insn 882  : cc <- compare (r684, 0)
 jump_insn 883 : if (cc != 0) goto insn 46
 insn 49: r291 <- r684
 ..
 insn 46
 
 cc contains the result of subtracting 0 from r684; control flow falls
 through to insn 49 only if (cc == 0), which implies (r684 == 0).
 Then at insn 49 we have a conditional constant propagation
 opportunity, r684 <- 0, so insn 49 becomes r291 <- 0. Is that right?
 

I believe the optimization you are referring to is value range propagation,
which predicates values on the branch conditions that guard them. GCC
definitely applies VRP at the tree level; I am not sure whether there is an
RTL pass that does the same.




RE: missing conditional propagation in cprop.c pass

2011-09-29 Thread Rahul Kharche
 On 09/29/11 17:36, Jeff Law wrote:
  On 09/29/11 09:26, Bernd Schmidt wrote:
  ISTR cse.c has some support for this.
  cprop.c -- see references to implicit_sets.
 
 cse too: record_jump_equiv.

Interesting. Are the two approaches subtly different
or do they apply precisely the same predication?


RE: Clustering switch cases

2010-09-03 Thread Rahul Kharche
 I have been working on your patch but I didn't manage to get it
 working yet. Unfortunately our current stable GCC version is 4.3.4 and
 our 4.4.x still has some issues.
 I tried backporting your patch to GCC 4.3.4 but it introduced several
 regressions.  It's my guess this is due to my incorrect manual
 patching of 4.3.
 Before I go through the regressions on our GCC 4.3 and try to fix the
 patching, do you have by any chance a patch against 4.3.x?

Can you try the attached patch again with GCC 4.4.1 and let me know if it
fixes your regressions? There are related changes to how basic blocks are
split that may need changing when using switch clustering. The changes
to stmt.c are the same as before; the additional changes are to cfgrtl.c
and cfgbuild.c.

Cheers,
Rahul



switch_case_clusters.patch
Description: switch_case_clusters.patch


RE: Clustering switch cases

2010-08-31 Thread Rahul Kharche
 I will be looking at the patch Rahul posted and will try to see if I
 can improve on it.

See attached patch (again) that Paulo is referring to. Sending to GCC
failed due to email client issues.

I have another patch for http://gcc.gnu.org/ml/gcc/2010-08/msg00413.html,
which I will send out shortly.

Rahul


switch_case_clusters.patch
Description: switch_case_clusters.patch


RE: branch probabilities on multiway branches

2010-05-04 Thread Rahul Kharche
Hi Edmar,

I have been testing your patch on our port of GCC4.4.1. The patch
works well, although I stumbled on an issue.

Many of our interesting cases have large case values, so profiling
the default range of values 0 - 255, or even as large as 0 - 2047, is
insufficient, and profiling a larger range is impractical. I am
attempting to use edge frequencies instead of value profiling to
effect the same transformation.

We also use an adapted form of the patch below to form sparse clusters
of contiguous large case values and shrink the switch case jump table.
I have merged your work into this patch so we can form a
probability/frequency based balanced decision tree for more likely
cases, and jump tables with clusters otherwise.

http://gcc.gnu.org/ml/gcc-patches/2004-07/msg02479.html


Here's a modified test case from your patch to demonstrate the need
for clusters in addition to a cost-balanced decision tree.

enum { NEVER=1000, HARDLY=10001, FOO=1002, FOO1=1003, FOO2=1004,
   FOO3=1005, FOO4=1006, OFTEN=2500, BAR=2501, DUMMY=2502,
   DUMMY1=2503, DUMMY2=2504, DUMMY3=2505};

int seq[] = {
  OFTEN, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY,
  BAR, FOO, OFTEN, FOO, OFTEN, HARDLY,
  OFTEN, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY,
  OFTEN, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY,
  OFTEN, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY,
  300, -1, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY,
};
int result;

int
test_it (int k, int i)
{
  int j;

  j = k;
switch (seq[i])
{
case NEVER:
  j = j * i;
  break;
case FOO:
  j = j + k;
  break;
case FOO1:
  j = j + (k << 2);
  break;
case FOO2:
  j = j + (k << 5);
  break;
case FOO3:
  j = j + (k << 6);
  break;
case FOO4:
  j = j + (k << 7);
  break;
case BAR:
  j = j + i + k;
case DUMMY:
  j = j * (k + i);
  break;
case DUMMY1:
  j = j * (k + (i << 2));
  break;
case OFTEN:
  j = j + i;
  break;
case DUMMY2:
  j = j * (k + (i << 5));
  break;
case DUMMY3:
  j = j * (k + (i << 6));
  break;
default:
  j = j * k;
  break;
}
  return j;
}

int
main (int argc, char **argv)
{
  int i;
  for (i = 0; i < sizeof(seq)/sizeof(seq[0]); i++)
result = test_it (1, i);
  return 0;
}


Cheers,
Rahul


-Original Message-
From: Rahul Kharche 
Sent: 22 April 2010 11:24
To: Edmar Wienskoski
Cc: Steven Bosscher; Jan Hubicka; gcc@gcc.gnu.org; sdkteam-gnu
Subject: RE: branch probabilities on multiway branches

Thanks Edmar! I will try and work your patch into our GCC 4.4.1
port and get some results.

Cheers,
Rahul


RE: branch probabilities on multiway branches

2010-04-22 Thread Rahul Kharche
Thanks Edmar! I will try and work your patch into our GCC 4.4.1
port and get some results.

Cheers,
Rahul


RE: branch probabilities on multiway branches

2010-04-20 Thread Rahul Kharche
Hi Steven,

 There is a transformation implemented in GCC's
 value profiling to put the most-frequently taken case-label of a
 switch-expr at the top of the switch

Could you point me to the bit of code that does this?

I'm exploring the idea of implementing source hints much like
__builtin_expect which would assign a probability to a case range
in a switch construct. I will have to tweak the case expansion
phase to implement a similar transformation to the one you mentioned
earlier.

Cheers,
Rahul


RE: branch probabilities on multiway branches

2010-04-15 Thread Rahul Kharche
The "calculate branch probabilities" algorithm (Algorithm 1) in the
Wu-Larus paper also evenly distributes branch probabilities when the
number of outgoing edges is > 2, e.g. switch cases implemented as jump
tables.

Are there any known heuristics to generate better branch probabilities
in this case?


-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: 14 April 2010 00:44
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-gnu
Subject: Re: branch probabilities on multiway branches

 Hi All,
 
 The following bit of code in predict.c implies branch probabilities
 are strictly evenly distributed for multiway branches at present. The
 comment suggests it is possible to generate better estimates for the
 more generic case too, though it is more involved. Could anyone point
 me to the reference and/or an existing implementation.

There is a Wu & Larus paper cited in the comment at the beginning of the file.

Honza


RE: branch probabilities on multiway branches

2010-04-15 Thread Rahul Kharche
 What is the problem you're trying to solve?

Generally speaking, I was looking for better logic, based on estimated
branch probability, to decide between the binary search tree and jump
table implementations of a switch case.

One interesting test case is where the gross structure of a function
is a switch case. The function is parameterized on input to the switch
i.e. of the following type:

void foo (int val, int arg1, ...)
{
  switch (val)
  {
  case 100: /* blah1 */ break;
  case 101: /* blah2 */ break;
.
.
.
  case 1: /* blah1 */ break;

  default:
;
  }
}

On the basis of the static call graph, the estimated frequencies of
the calls to foo (), and assuming argument val is constant at each
call site, is it possible to optimize the switch case in foo ()?
Perhaps along the lines of pulling out the most frequent case like you
mentioned in your example, or specializing foo () on frequent values
of val.

In this case inlining and VRP would have automatically done what I am
suggesting. Inlining however is not attempted because foo () is
estimated to be large.


Many Thanks,
Rahul


RE: branch probabilities on multiway branches

2010-04-15 Thread Rahul Kharche
The other case I'm working on is selectively applying tail call
optimization when optimizing for size. Clearly tail call optimization
is desirable along frequently executed edges; otherwise we found it
generates a sibcall_epilogue for each tail call, which has a
significant impact on size, especially when the generated epilogue is
large. I have a patch which prunes tail calls along edges determined
from profile information to be infrequently executed. When no profile
information is available we are reduced to a heuristic for whether the
optimization will increase code size. However, it would also be useful
to use estimated branch probabilities. The cases I'm looking at use
switch cases heavily, where branch probability information is reduced
to even probabilities across all cases.

We're keen on improving the estimated branch probabilities in switch
cases because we cannot always gather profile information on all code
paths through a program.

Cheers,
Rahul


branch probabilities on multiway branches

2010-04-13 Thread Rahul Kharche
Hi All,

The following bit of code in predict.c implies branch probabilities
are strictly evenly distributed for multiway branches at present. The
comment suggests it is possible to generate better estimates for more
generic cases, apart from being involved. Could anyone point me to
the reference and/or if an implementation exists already.


/* When there is no successor or only one choice, prediction is easy.

   We are lazy for now and predict only basic blocks with two outgoing
   edges.  It is possible to predict generic case too, but we have to
   ignore first match heuristics and do more involved combining.
   Implement this later.  */
if (nedges != 2)
  {
    if (!bb->count)
      set_even_probabilities (bb);
    clear_bb_predictions (bb);
    if (dump_file)
      fprintf (dump_file, "%i edges in bb %i predicted to even probabilities\n",
               nedges, bb->index);
    return;
  }


Many Thanks,
Rahul


RE: Failure to combine SHIFT with ZERO_EXTEND

2010-02-09 Thread Rahul Kharche
Hi Jeff,

Many thanks for the pointers. I will make the changes and attach the
patch to the bugzilla soon.

Cheers,
Rahul

-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: 09 February 2010 00:45
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-gnu
Subject: Re: Failure to combine SHIFT with ZERO_EXTEND

On 02/04/10 08:39, Rahul Kharche wrote:
 Hi All,

 On our private port of GCC 4.4.1 we fail to combine successive SHIFT
 operations like in the following case

 #include <stdlib.h>
 #include <stdio.h>

 void f1 ()
 {
    unsigned short t1;
    unsigned short t2;

    t1 = rand();
    t2 = rand();

    t1 <<= 1; t2 <<= 1;
    t1 <<= 1; t2 <<= 1;
    t1 <<= 1; t2 <<= 1;
    t1 <<= 1; t2 <<= 1;
    t1 <<= 1; t2 <<= 1;
    t1 <<= 1; t2 <<= 1;

    printf("%d\n", (t1+t2));
 }

 This is a ZERO_EXTEND problem: combining SHIFTs of whole integers
 works correctly, as do signed values. The problem seems to arise in
 the RTL combiner, which combines the ZERO_EXTEND with the SHIFT to
 generate a SHIFT and an AND. Our architecture does not support AND
 with large constants and hence does not have a matching insn pattern
 (we prefer not adding one, because the large constants remain hanging
 at the end of all RTL optimisations and cause needless reloads).

 Fixing the combiner to convert masking AND operations to ZERO_EXTRACT
 fixes this issue without any obvious regressions. I'm adding the
 patch here against GCC 4.4.1 for any comments and/or suggestions.

Good catch.  However, note we are in a regression-bugfix-only phase of
development right now in preparation for branching for GCC 4.5.  As a
result the patch can't be checked in at this time; I would recommend you
update the patch to the current sources & attach it to bug #41998, which
contains queued patches for after GCC 4.5 branches.


 Cheers,
 Rahul


 --- combine.c 2009-04-01 21:47:37.0 +0100
 +++ combine.c 2010-02-04 15:04:41.0 +
 @@ -446,6 +446,7 @@
   static void record_truncated_values (rtx *, void *);
   static bool reg_truncated_to_mode (enum machine_mode, const_rtx);
   static rtx gen_lowpart_or_truncate (enum machine_mode, rtx);
 +static bool can_zero_extract_p (rtx, rtx, enum machine_mode);



   /* It is not safe to use ordinary gen_lowpart in combine.
 @@ -6973,6 +6974,16 @@
  make_compound_operation (XEXP (x, 0),
   next_code),
  i, NULL_RTX, 1, 1, 0, 1);
 +  else if (can_zero_extract_p (XEXP (x, 0), XEXP (x, 1), mode))
 +{
 +   unsigned HOST_WIDE_INT len = HOST_BITS_PER_WIDE_INT
 +     - CLZ_HWI (UINTVAL (XEXP (x, 1)));
 +   new_rtx = make_extraction (mode,
 + make_compound_operation (XEXP (x, 0),
 +  next_code),
 +  0, NULL_RTX, len, 1, 0,
 +  in_code == COMPARE);

There should be a comment prior to this code fragment describing the 
transformation being performed.   Something like:

/* Convert (and (shift X Y) MASK) into ... when ... */

That will make it clear in the future when your transformation applies 
rather than forcing someone scanning the code to read it in detail.


 +}

 break;

 @@ -7245,6 +7256,25 @@
   return simplify_gen_unary (TRUNCATE, mode, x, GET_MODE (x));
   }

 +static bool
 +can_zero_extract_p (rtx x, rtx mask_rtx, enum machine_mode mode)

There should be a comment prior to this function which briefly describes
what the function does, the parameters & return value.  Use comments
prior to other functions to guide you.
 @@ -8957,7 +8987,6 @@
   op0 = UNKNOWN;

 *pop0 = op0;
 -
 /* ??? Slightly redundant with the above mask, but not entirely.
Moving this above means we'd have to sign-extend the mode mask
for the final test.  */

Useless diff fragment.  Remove this change as it's completely unrelated 
and useless.

You should also write a ChangeLog entry for your patch.  ChangeLogs 
describe what changed, not why something changed.  So a suitable entry 
might look something like:

date  Your Name yourem...@somewhere

 * combine.c (make_compound_operation): Convert shifts and masks
 into zero_extract in certain cases.
 (can_zero_extract_p): New function.


If you could make those changes and attach the result to PR 41998 they 
should be able to go in once 4.5 branches from the mainline.

Jeff



Failure to combine SHIFT with ZERO_EXTEND

2010-02-04 Thread Rahul Kharche
Hi All,

On our private port of GCC 4.4.1 we fail to combine successive SHIFT
operations like in the following case

#include <stdlib.h>
#include <stdio.h>

void f1 ()
{
  unsigned short t1;
  unsigned short t2;

  t1 = rand();
  t2 = rand();

  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;

  printf("%d\n", (t1+t2));
}

This is a ZERO_EXTEND problem: combining SHIFTs of whole integers
works correctly, as do signed values. The problem seems to arise in
the RTL combiner, which combines the ZERO_EXTEND with the SHIFT to
generate a SHIFT and an AND. Our architecture does not support AND
with large constants and hence does not have a matching insn pattern
(we prefer not adding one, because the large constants remain hanging
at the end of all RTL optimisations and cause needless reloads).

Fixing the combiner to convert masking AND operations to ZERO_EXTRACT
fixes this issue without any obvious regressions. I'm adding the
patch here against GCC 4.4.1 for any comments and/or suggestions.

Cheers,
Rahul


--- combine.c   2009-04-01 21:47:37.0 +0100
+++ combine.c   2010-02-04 15:04:41.0 +
@@ -446,6 +446,7 @@
 static void record_truncated_values (rtx *, void *);
 static bool reg_truncated_to_mode (enum machine_mode, const_rtx);
 static rtx gen_lowpart_or_truncate (enum machine_mode, rtx);
+static bool can_zero_extract_p (rtx, rtx, enum machine_mode);
 

 
 /* It is not safe to use ordinary gen_lowpart in combine.
@@ -6973,6 +6974,16 @@
   make_compound_operation (XEXP (x, 0),
next_code),
   i, NULL_RTX, 1, 1, 0, 1);
+  else if (can_zero_extract_p (XEXP (x, 0), XEXP (x, 1), mode))
+{
+ unsigned HOST_WIDE_INT len = HOST_BITS_PER_WIDE_INT
+   - CLZ_HWI (UINTVAL (XEXP (x, 1)));
+ new_rtx = make_extraction (mode,
+   make_compound_operation (XEXP (x, 0),
+next_code),
+0, NULL_RTX, len, 1, 0,
+in_code == COMPARE);
+}
 
   break;
 
@@ -7245,6 +7256,25 @@
 return simplify_gen_unary (TRUNCATE, mode, x, GET_MODE (x));
 }
 
+static bool
+can_zero_extract_p (rtx x, rtx mask_rtx, enum machine_mode mode)
+{
+  unsigned HOST_WIDE_INT count_lz, count_tz;
+  unsigned HOST_WIDE_INT nonzero, mask_all;
+  unsigned HOST_WIDE_INT mask_value = UINTVAL (mask_rtx);
+
+  mask_all = (unsigned HOST_WIDE_INT) -1;
+  nonzero = nonzero_bits (x, mode);
+  count_lz = CLZ_HWI (mask_value);
+  count_tz = CTZ_HWI (mask_value);
+
+  if (count_tz >= (unsigned HOST_WIDE_INT) CTZ_HWI (nonzero)
+      && ((mask_all >> (count_lz + count_tz)) << count_tz) == mask_value)
+return true;
+  
+  return false;
+}
+
 /* See if X can be simplified knowing that we will only refer to it in
MODE and will only refer to those bits that are nonzero in MASK.
If other bits are being computed or if masking operations are done
@@ -8957,7 +8987,6 @@
 op0 = UNKNOWN;
 
   *pop0 = op0;
-
   /* ??? Slightly redundant with the above mask, but not entirely.
  Moving this above means we'd have to sign-extend the mode mask
  for the final test.  */


RE: RFC PRE-ing REFERENCE expressions

2009-11-02 Thread Rahul Kharche
Hi Richard,

We would like to hang on to 4.4 for a little while. I was hoping I could
grab only the alias analysis improvements from the trunk. Do you suspect
this would be troublesome?

Cheers,
Rahul

-Original Message-
From: Richard Guenther [mailto:richard.guent...@gmail.com] 
Sent: 30 October 2009 14:50
To: Rahul Kharche
Cc: rgue...@gcc.gnu.org; gcc@gcc.gnu.org; sdkteam-gnu
Subject: Re: RFC PRE-ing REFERENCE expressions

On Fri, Oct 30, 2009 at 3:22 PM, Rahul Kharche ra...@icerasemi.com wrote:
 Hi Richi,

 Following up your suggestion on PR41488, I am continuing to test with
 loop header copy before FRE in GCC 4.4.1. One regression I am trying
 to tackle is from the test case below.

 typedef struct {
  int degree;
  int c[(256)];
 } swbcht_Polynomial;


 typedef struct {
  int m;
  int n;
  const swbcht_Polynomial *primPoly;
  unsigned short alpha[(1 << (13))];
 } swbcht_GF;


 typedef struct {
  swbcht_GF gf;
 } swbcht_Handle;


 static const swbcht_Polynomial _primPoly13;

 void _ComputeGFLookupTables (swbcht_Handle *bchH, int m)
 {
  int i, j;

  swbcht_GF *gf = &bchH->gf;

  gf->m = m;
  gf->n = (1 << m) - 1;

  gf->primPoly = &_primPoly13;

  for (i = 0; i < gf->m; i++) {
    gf->alpha[gf->m] ^= (gf->primPoly->c[i] * gf->alpha[i]);
  }
 }


 The dump after CH -> FRE looks like

 _ComputeGFLookupTables (struct swbcht_Handle * bchH, int m)
 {
  int i;
  short unsigned int D.1241;
  short int D.1240;
  short int D.1239;
  short unsigned int D.1238;
  short unsigned int D.1237;
  short unsigned int D.1236;
  const int D.1235;
  const struct swbcht_Polynomial * D.1233;
  short int D.1232;
  short unsigned int D.1231;
  int D.1230;
  int D.1229;
  int D.1228;

 <bb 2>:
  bchH_2(D)->gf.m = m_4(D);
  D.1228_5 = 1 << m_4(D);
  D.1229_6 = D.1228_5 + -1;
  bchH_2(D)->gf.n = D.1229_6;
  bchH_2(D)->gf.primPoly = &_primPoly13;
  if (m_4(D) > 0)
    goto <bb 5>;
  else
    goto <bb 6>;

 <bb 6>:
  goto <bb 4>;

 <bb 5>:

 <bb 3>:
  # i_35 = PHI <0(5), i_23(7)>
  D.1230_10 = bchH_2(D)->gf.m;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = bchH_2(D)->gf.primPoly;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (i_23 < D.1230_8)
    goto <bb 7>;
  else
    goto <bb 8>;

 <bb 7>:
  goto <bb 3>;

 <bb 8>:

 <bb 4>:
  return;
 }

 1. Why does FRE miss eliminating the expression bchH_2(D)->gf.primPoly
 in bb 3 in favour of &_primPoly13? It is eliminated if we run FRE
 then CH.

 Further, PRE identifies bchH_2(D)->gf.primPoly as a partially redundant
 expression and hoists it into predecessor bb 7, and we get

 <bb 3>:
  # i_35 = PHI <0(5), i_23(7)>
  # prephitmp.25_49 = PHI <m_4(D)(5), D.1230_8(7)>
  # prephitmp.27_51 = PHI <&_primPoly13(5), pretmp.26_50(7)>
  D.1230_10 = prephitmp.25_49;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = prephitmp.27_51;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (D.1230_8 > i_23)
    goto <bb 7>;
  else
    goto <bb 8>;

 <bb 7>:
  pretmp.26_50 = bchH_2(D)->gf.primPoly;
  goto <bb 3>;


 Clearly prephitmp.27_51 will make a redundant induction variable.
 Stepping through insert_into_preds_of_block in tree-ssa-pre.c, I can
 see that the value numbers for the expression bchH_2(D)->gf.primPoly
 available through bb 3 and through bb 2 are different.

 2. Is this because alias analysis cannot determine that bchH_2(D)->gf
 has a unique target?

You should move to GCC 4.5 - the alias-oracle in GCC 4.4 is too weak
to discover all this and virtual operands are not helpful because these
are all indirect accesses through pointers without points-to information.

Richard.


 Many Thanks,
 Rahul




RFC PRE-ing REFERENCE expressions

2009-10-30 Thread Rahul Kharche
Hi Richi,

Following up your suggestion on PR41488, I am continuing to test with
loop header copy before FRE in GCC 4.4.1. One regression I am trying
to tackle is from the test case below.

typedef struct {
  int degree;
  int c[(256)];
} swbcht_Polynomial;


typedef struct {
  int m;
  int n;
  const swbcht_Polynomial *primPoly;
  unsigned short alpha[(1 << (13))];
} swbcht_GF;


typedef struct {
  swbcht_GF gf;
} swbcht_Handle;


static const swbcht_Polynomial _primPoly13;

void _ComputeGFLookupTables (swbcht_Handle *bchH, int m)
{
  int i, j;

  swbcht_GF *gf = &bchH->gf;

  gf->m = m;
  gf->n = (1 << m) - 1;

  gf->primPoly = &_primPoly13;

  for (i = 0; i < gf->m; i++) {
    gf->alpha[gf->m] ^= (gf->primPoly->c[i] * gf->alpha[i]);
  }
}


The dump after CH -> FRE looks like

_ComputeGFLookupTables (struct swbcht_Handle * bchH, int m)
{
  int i;
  short unsigned int D.1241;
  short int D.1240;
  short int D.1239;
  short unsigned int D.1238;
  short unsigned int D.1237;
  short unsigned int D.1236;
  const int D.1235;
  const struct swbcht_Polynomial * D.1233;
  short int D.1232;
  short unsigned int D.1231;
  int D.1230;
  int D.1229;
  int D.1228;

<bb 2>:
  bchH_2(D)->gf.m = m_4(D);
  D.1228_5 = 1 << m_4(D);
  D.1229_6 = D.1228_5 + -1;
  bchH_2(D)->gf.n = D.1229_6;
  bchH_2(D)->gf.primPoly = &_primPoly13;
  if (m_4(D) > 0)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 6>:
  goto <bb 4>;

<bb 5>:

<bb 3>:
  # i_35 = PHI <0(5), i_23(7)>
  D.1230_10 = bchH_2(D)->gf.m;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = bchH_2(D)->gf.primPoly;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (i_23 < D.1230_8)
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 7>:
  goto <bb 3>;

<bb 8>:

<bb 4>:
  return;
}

1. Why does FRE miss eliminating the expression bchH_2(D)->gf.primPoly
in bb 3 in favour of &_primPoly13? It is eliminated if we run FRE
then CH.

Further, PRE identifies bchH_2(D)->gf.primPoly as a partially redundant
expression and hoists it into predecessor bb 7, and we get

<bb 3>:
  # i_35 = PHI <0(5), i_23(7)>
  # prephitmp.25_49 = PHI <m_4(D)(5), D.1230_8(7)>
  # prephitmp.27_51 = PHI <&_primPoly13(5), pretmp.26_50(7)>
  D.1230_10 = prephitmp.25_49;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = prephitmp.27_51;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (D.1230_8 > i_23)
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 7>:
  pretmp.26_50 = bchH_2(D)->gf.primPoly;
  goto <bb 3>;


Clearly prephitmp.27_51 will make a redundant induction variable.
Stepping through insert_into_preds_of_block in tree-ssa-pre.c, I can
see that the value numbers for the expression bchH_2(D)->gf.primPoly
available through bb 3 and through bb 2 are different.

2. Is this because alias analysis cannot determine that bchH_2(D)->gf
has a unique target?


Many Thanks,
Rahul



RFC: missed loop optimizations from loop induction variable copies

2009-09-22 Thread Rahul Kharche
The following causes missed loop optimizations at O2 by creating
unnecessary loop induction variables; or rather, it is a case of IVOpts
not being able to coalesce copies of induction variables. A previous
post related to this was made in PR41026, which involved a copy of a
type-promoted loop index variable. I believe this example makes the
problem more obvious.

struct struct_t {
  int* data;
};

void testAutoIncStruct (struct struct_t* sp, int start, int end)
{
  int i;
  for (i = 0; i+start < end; i++)
    {
      sp->data[i+start] = 0;
    }
}

With GCC 4.4.1 (release) and gcc -O2 -fdump-tree-all on the above case
we get the following dump from IVOpts:

testAutoIncStruct (struct struct_t * sp, int start, int end)
{
  unsigned int D.1283;
  unsigned int D.1284;
  int D.1282;
  unsigned int ivtmp.32;
  int * pretmp.17;
  int i;
  int * D.1245;
  unsigned int D.1244;
  unsigned int D.1243;

<bb 2>:
  if (start_3(D) < end_5(D))
    goto <bb 3>;
  else
    goto <bb 6>;

<bb 3>:
  pretmp.17_22 = sp_6(D)->data;
  D.1282_23 = start_3(D) + 1;
  ivtmp.32_25 = (unsigned int) D.1282_23;
  D.1283_27 = (unsigned int) end_5(D);
  D.1284_28 = D.1283_27 + 1;

<bb 4>:
  # start_20 = PHI <start_4(5), start_3(D)(3)>
  # ivtmp.32_7 = PHI <ivtmp.32_24(5), ivtmp.32_25(3)>
  D.1243_9 = (unsigned int) start_20;
  D.1244_10 = D.1243_9 * 4;
  D.1245_11 = pretmp.17_22 + D.1244_10;
  *D.1245_11 = 0;
  start_26 = (int) ivtmp.32_7;
  start_4 = start_26;
  ivtmp.32_24 = ivtmp.32_7 + 1;
  if (ivtmp.32_24 != D.1284_28)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  return;

}

IVOpts cannot identify start_26, start_4 and ivtmp.32_7 as copies.
The root cause is that the expression 'i + start' is identified as a
common expression between the test in the header and the index
operation in the latch. This is unified by copy propagation or FRE
prior to the loop optimizations, creating a new induction variable.

If we disable tree copy propagation and FRE with
gcc -O2 -fno-tree-copy-prop -fno-tree-fre -fdump-tree-all we get

testAutoIncStruct (struct struct_t * sp, int start, int end)
{
  unsigned int D.1287;
  unsigned int D.1288;
  unsigned int D.1289;
  int D.1290;
  unsigned int D.1284;
  unsigned int D.1285;
  unsigned int D.1286;
  int * pretmp.17;
  int i;
  int * D.1245;
  unsigned int D.1244;
  unsigned int D.1243;
  int D.1242;
  int * D.1241;

<bb 2>:
  if (start_3(D) < end_5(D))
    goto <bb 3>;
  else
    goto <bb 6>;

<bb 3>:
  pretmp.17_23 = sp_6(D)->data;
  D.1287_27 = (unsigned int) end_5(D);
  D.1288_28 = (unsigned int) start_3(D);
  D.1289_29 = D.1287_27 - D.1288_28;
  D.1290_30 = (int) D.1289_29;

<bb 4>:
  # i_20 = PHI <i_12(5), 0(3)>
  D.1241_7 = pretmp.17_23;
  D.1284_26 = (unsigned int) start_3(D);
  D.1285_25 = (unsigned int) i_20;
  D.1286_24 = D.1284_26 + D.1285_25;
  MEM[base: pretmp.17_23, index: D.1286_24, step: 4] = 0;
  i_12 = i_20 + 1;
  if (i_12 != D.1290_30)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  return;

}

The correct single induction variable has been identified here. This is
not a loop header copying problem either: if we disable loop header
copying, we still get multiple induction variables. In fact, in the
above case loop header copying correctly enables post-increment mode on
our port.

testAutoIncStruct (struct struct_t * sp, int start, int end)
{
  unsigned int D.1282;
  unsigned int ivtmp.31;
  unsigned int ivtmp.29;
  int i;
  int * D.1245;
  unsigned int D.1244;
  unsigned int D.1243;
  int D.1242;
  int * D.1241;

<bb 2>:
  ivtmp.29_18 = (unsigned int) start_3(D);
  D.1282_21 = (unsigned int) start_3(D);
  ivtmp.31_22 = D.1282_21 * 4;
  goto <bb 4>;

<bb 3>:
  D.1241_7 = sp_6(D)->data;
  D.1244_10 = ivtmp.31_19;
  D.1245_11 = D.1241_7 + D.1244_10;
  *D.1245_11 = 0;
  ivtmp.29_17 = ivtmp.29_8 + 1;
  ivtmp.31_20 = ivtmp.31_19 + 4;

<bb 4>:
  # ivtmp.29_8 = PHI <ivtmp.29_18(2), ivtmp.29_17(3)>
  # ivtmp.31_19 = PHI <ivtmp.31_22(2), ivtmp.31_20(3)>
  D.1242_23 = (int) ivtmp.29_8;
  if (D.1242_23 < end_5(D))
    goto <bb 3>;
  else
    goto <bb 5>;

<bb 5>:
  return;

}

Does this imply we should try not to copy propagate or FRE potential
induction variables? Or is this simply a missed case in IVOpts?

Rahul


RE: [4.5] Find more autoinc addressing for induction variables

2009-05-22 Thread Rahul Kharche
Hi,

I am trialing this patch on a private GCC port that I'm working on.
The patch works well with several test cases we have. However, fails
on the following

int
main ()
{
  const int arr[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
  int i;

  for (i = 0; i < 10; i++)
    {
      printf("arr[%d] : %d\n", i, arr[i]);
    }
}

Looking at the pass dump, it seems the SSA re-scan of the symbol 'arr'
marks its use inside the loop as volatile:

<bb 2>:
  arr[0] = 1;
  arr[1] = 2;
  arr[2] = 3;
  arr[3] = 4;
  arr[4] = 5;
  arr[5] = 6;
  arr[6] = 7;
  arr[7] = 8;
  arr[8] = 9;
  arr[9] = 10;
  __builtin_loop_start (1, 9);
  ivtmp.42_23 = (long unsigned int) &arr[1];

<bb 3>:
  # i_19 = PHI <i_6(4), 0(2)>
  # prephitmp.14_18 = PHI <pretmp.13_1(4), 1(2)>
  # ivtmp.42_21 = PHI <ivtmp.42_22(4), ivtmp.42_23(2)>
  __builtin_loop_iteration (1);
  printf (&"arr[%d] == var : %d\n"[0], i_19, prephitmp.14_18);
  i_6 = i_19 + 1;
  if (i_6 != 10)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  pretmp.13_1 ={v} MEM[index: ivtmp.42_21];
  ivtmp.42_22 = ivtmp.42_21 + 4;
  goto <bb 3>;

<bb 5>:
  return;


And subsequently dead code elimination removes the array initialization
completely.

<bb 2>:
  __builtin_loop_start (1, 9);
  ivtmp.42_23 = (long unsigned int) &arr[1];

<bb 3>:
  # i_19 = PHI <i_6(4), 0(2)>
  # prephitmp.14_18 = PHI <pretmp.13_1(4), 1(2)>
  # ivtmp.42_21 = PHI <ivtmp.42_22(4), ivtmp.42_23(2)>
  __builtin_loop_iteration (1);
  printf (&"arr[%d] == var : %d\n"[0], i_19, prephitmp.14_18);
  i_6 = i_19 + 1;
  if (i_6 != 10)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  pretmp.13_1 ={v} MEM[index: ivtmp.42_21];
  ivtmp.42_22 = ivtmp.42_21 + 4;
  goto <bb 3>;

<bb 5>:
  return;

}

Rahul


-Original Message-
From: gcc-patches-ow...@gcc.gnu.org
[mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Bernd Schmidt
Sent: 16 May 2009 01:05
To: Steven Bosscher
Cc: GCC Patches; F. Kenneth Zadeck; Zdenek Dvorak
Subject: Re: [4.5] Find more autoinc addressing for induction variables

I wrote:

 I'll see whether I can achieve a similar effect by modifying
 tree-ssa-loop-ivopts instead.  Maybe I can add new candidates that get
 incremented in suitable basic blocks other than just the last one.

So here's a draft of what that would look like.  For every use/cand pair
for which autoinc is possible (which now requires that the use's block
is not in a nested loop, and dominates the latch), we create a new
candidate which is incremented at the point of the use, and compute its
cost without including the cost of the step.  This also gets rid of the
special handling in iv_ca_recount_cost.

That could be extended later to have candidates that are incremented
several times, e.g. to create two 16-bit postinc addressing modes for a
candidate that's incremented by 4.

Does this look like an approach that everyone can live with?


Bernd
--
This footer brought to you by insane German lawmakers.
Analog Devices GmbH  Wilhelm-Wagenfeld-Str. 6  80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif


RE: [4.5] Find more autoinc addressing for induction variables

2009-05-22 Thread Rahul Kharche
4.4.1, GCC 4.4 branch. Will I need dependent changes from the 4.5
branch?

Many Thanks,
Rahul


-Original Message-
From: Bernd Schmidt [mailto:bernds_...@t-online.de] 
Sent: 22 May 2009 16:56
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-all
Subject: Re: [4.5] Find more autoinc addressing for induction variables

Hi,

Rahul Kharche wrote:
 I am trialing this patch on a private GCC port that I'm working on.
 The patch works well with several test cases we have. However, it fails
 on the following:
 
 int
 main ()
 {
   const int arr[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
   int i;
 
  for (i = 0; i < 10; i++)
 {
   printf("arr[%d] : %d\n", i, arr[i]);
 }
 }

I'm not getting anything that looks like the dump you sent.  What 
version of gcc is your port based on?


Bernd


RE: [4.5] Find more autoinc addressing for induction variables

2009-05-22 Thread Rahul Kharche
Hi,

It's fixed for me. I was missing get_ref_tag () in copy_ref_info () when
I patched against 4.5.

Thanks again,
Rahul

-Original Message-
From: Bernd Schmidt [mailto:bernds_...@t-online.de] 
Sent: 22 May 2009 16:56
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-all
Subject: Re: [4.5] Find more autoinc addressing for induction variables

Hi,

Rahul Kharche wrote:
 I am trialing this patch on a private GCC port that I'm working on.
 The patch works well with several test cases we have. However, it fails
 on the following:
 
 int
 main ()
 {
   const int arr[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
   int i;
 
  for (i = 0; i < 10; i++)
 {
   printf("arr[%d] : %d\n", i, arr[i]);
 }
 }

I'm not getting anything that looks like the dump you sent.  What 
version of gcc is your port based on?


Bernd


RE: Constant folding and Constant propagation

2009-02-27 Thread Rahul Kharche
 - If I patch in this code, actually I get the same results I did
 before where the constants are propagated. It seems that in 4.3.2,
 every part of the compiler is trying to do that.

There are at least two forward propagation passes, one before and
another after GCSE. I haven't tried to tackle those passes, but I
believe they already have a cost model in place.

This patch will prevent partial sums involving registers and
constants from being combined during GCSE.

 - Just for testing I turned off the GCSE pass and it still
 propagated and folded the constants...

GCSE won't help with your trimmed-down example:

int main(void)
{
    long a = 0xcafecafe;

    printf("Final: %lx %lx %lx\n", a, a+5, a+15);
    return EXIT_SUCCESS;
}

I believe Paolo's canon_reg solution together with tweaking
rtx_cost of constants with outer code PLUS might help. Any
luck with that pass?


RE: Constant folding and Constant propagation

2009-02-26 Thread Rahul Kharche
Hi Jean,

 Do you have a link to that patch? So that I can see if it applies for
me?

See patch below. I put in some comments to try and explain what I have
attempted to do.

The hash SET generation, to track potential candidates in replacing a
register USE, needed some tweaking. This function was not tracking
register copies if the associated INSN had a NOTE attached that
associated the register with a constant. Also, only the last register
or constant in an assignment chain was being added to the hash SET.
It may be profitable to have a closer register copy in the assignment
chain as a potential replacement.
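Schematically, the assignment-chain case looks like this (instruction and register numbers are illustrative, not from a real dump):

```
(insn 10 (set (reg 100) (const_int 0x12345678)))     ; expensive to materialize
(insn 11 (set (reg 101) (reg 100)))                  ; first register copy
(insn 12 (set (reg 102) (reg 101)))                  ; second register copy
...
(insn 20 (set (reg 103) (plus (reg 102) (reg 99))))  ; use of r102
```

Recording only the end of the chain offers just the const_int as a replacement for reg 102 at insn 20; recording the intermediate copies as well makes the cheaper reg 101 available as a candidate.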

Before the actual replacing operation, the cost of the temporary
replacement is determined using rtx_cost. A cost cannot be reliably
formed if we come across jump INSNs, because they may alter jumps or
successors; the default replacement is used in that case.

Rahul


Index: gcse.c
===================================================================
--- gcse.c  (revision 141156)
+++ gcse.c  (working copy)
@@ -520,7 +520,7 @@
 static void record_set_info (rtx, const_rtx, void *);
 static void compute_sets (void);
 static void hash_scan_insn (rtx, struct hash_table *, int);
-static void hash_scan_set (rtx, rtx, struct hash_table *);
+static void hash_scan_set (rtx, rtx, struct hash_table *, bool);
 static void hash_scan_clobber (rtx, rtx, struct hash_table *);
 static void hash_scan_call (rtx, rtx, struct hash_table *);
 static int want_to_gcse_p (rtx);
@@ -560,7 +560,7 @@
 static void compute_cprop_data (void);
 static void find_used_regs (rtx *, void *);
 static int try_replace_reg (rtx, rtx, rtx);
-static struct expr *find_avail_set (int, rtx);
+static struct expr *find_avail_set (int, rtx, bool);
 static int cprop_jump (basic_block, rtx, rtx, rtx, rtx);
 static void mems_conflict_for_gcse_p (rtx, const_rtx, void *);
 static int load_killed_in_block_p (const_basic_block, int, const_rtx, int);
@@ -1671,11 +1671,11 @@
expression one).  */
 
 static void
-hash_scan_set (rtx pat, rtx insn, struct hash_table *table)
+hash_scan_set (rtx pat, rtx insn, struct hash_table *table, bool use_note)
 {
   rtx src = SET_SRC (pat);
   rtx dest = SET_DEST (pat);
-  rtx note;
+  rtx note = NULL_RTX;
 
   if (GET_CODE (src) == CALL)
 hash_scan_call (src, insn, table);
@@ -1685,16 +1685,21 @@
   unsigned int regno = REGNO (dest);
   rtx tmp;
 
-      /* See if a REG_NOTE shows this equivalent to a simpler expression.
-         This allows us to do a single GCSE pass and still eliminate
-         redundant constants, addresses or other expressions that are
-         constructed with multiple instructions.  */
-  note = find_reg_equal_equiv_note (insn);
-      if (note != 0
-          && (table->set_p
-              ? gcse_constant_p (XEXP (note, 0))
-              : want_to_gcse_p (XEXP (note, 0))))
-   src = XEXP (note, 0), pat = gen_rtx_SET (VOIDmode, dest, src);
+      /* Do not substitute insn SET_SRC for note if the caller passes
+         use_note = false.  */
+  if (use_note)
+{
+          /* See if a REG_NOTE shows this equivalent to a simpler expression.
+             This allows us to do a single GCSE pass and still eliminate
+             redundant constants, addresses or other expressions that are
+             constructed with multiple instructions.  */
+ note = find_reg_equal_equiv_note (insn);
+          if (note != 0
+              && (table->set_p
+                  ? gcse_constant_p (XEXP (note, 0))
+                  : want_to_gcse_p (XEXP (note, 0))))
+            src = XEXP (note, 0), pat = gen_rtx_SET (VOIDmode, dest, src);
+   }
 
   /* Only record sets of pseudo-regs in the hash table.  */
   if (! table-set_p
@@ -1835,14 +1840,30 @@
  what's been modified.  */
 
   if (GET_CODE (pat) == SET)
-hash_scan_set (pat, insn, table);
+    {
+      /* If we're hashing SETs for constant/copy propagation,
+         also hash a SET ignoring any insn notes.  */
+      if (table->set_p)
+        {
+          hash_scan_set (pat, insn, table, false);
+        }
+      hash_scan_set (pat, insn, table, true);
+    }
   else if (GET_CODE (pat) == PARALLEL)
 for (i = 0; i  XVECLEN (pat, 0); i++)
   {
rtx x = XVECEXP (pat, 0, i);
 
if (GET_CODE (x) == SET)
- hash_scan_set (x, insn, table);
+          {
+            /* If we're hashing SETs for constant/copy propagation,
+               also hash a SET ignoring any insn notes.  */
+            if (table->set_p)
+              {
+                hash_scan_set (x, insn, table, false);
+              }
+            hash_scan_set (x, insn, table, true);
+          }
else if (GET_CODE (x) == CLOBBER)
  hash_scan_clobber (x, insn, table);
else if (GET_CODE (x) == CALL)
@@ -2071,7 +2092,7 @@
    if (table->set_p
        && implicit_sets[current_bb->index] != NULL_RTX)
      hash_scan_set (implicit_sets[current_bb->index],
-                    BB_HEAD (current_bb), table);
+ 

Re: Constant folding and Constant propagation

2009-02-10 Thread Rahul Kharche
I am looking at a related problem in GCSE, GCC 4.3 whereby constants
are propagated to their use irrespective of the final instruction cost
of generating them (machine cycles or instruction count).

Global constant propagation effectively voids common expressions that
form large constants, identified by global CSE. This is especially
true of SYMBOL_REF constants like section anchors generated while
indexing global arrays.

For the GCC port I work on, I have fixed this by weighing the rtx_cost
of propagating a register copy vs. propagating the constant into an insn.
I have an initial patch for this problem.


Rahul Vijay Kharche