I'm thinking about adding bitwise dataflow analysis support to RTL.

Is this a good idea?  Bad idea?  Already done? Please review if interested.

Thank you,
Silvius


Motivation
==========

int foo(int x)
{
  int i = 100;

  do
    {
      if (x > 0)
        x = x & 1;  /* After this insn, all bits except 1 are 0.  */
      else
        x = x & 2;  /* After this insn, all bits except 2 are 0.  */
      i--;
    }
  while (i);

  return x & 4;   /* Before this insn, only bits 1 and 2 may be nonzero.  */
}

"foo" should simply return 0.
This optimization is currently missed at -O2 and -O3 on x86_64.
(Cases with simpler control are solved by the "combine" pass.)


Proposal
========

1. Dataflow framework to propagate bitwise register properties.
   (Integrated with the current dataflow framework.)
2. Forward bitwise dataflow analysis: constant bit propagation.
3. Backward bitwise dataflow analysis: dead bit propagation.
4. Target applications: improve dce and see.  (Others?)


Preliminary Design
==================

1. I don't have enough understanding of GCC to decide whether it should be
   done at RTL level or tree level.  After some brainstorming with Ian Taylor,
   Diego Novillo and others, we decided to go with RTL.

2. This problem could be solved using iterative dataflow with bitmaps.
   However, memory size would increase significantly compared to scalar
   analysis, as would processing time.
   For constant bit propagation, we need to keep 2 bits of state for each
   register bit.  For 64 bit registers, that's a factor of 128x over
   the scalar reaching definition problem.

   Instead, I propose a sparse bitwise dataflow framework.  We would still
   use the existing RTL dataflow framework to build scalar DU/UD chains.
   Once they are available, bitwise information is propagated only over
   these chains, analogous to the sparse constant propagation described
   by Wegman & Zadeck TOPLAS 1991.

3. This might be too much detail at this point, but just in case, here is
   a brief description of a bit constant propagation algorithm.

   For each instruction I in the function body
     For each register R in instruction I
       def_constant_bits(I, R) = collect constants from AND/OR/... operations.

   Iterate until the def_constant_bits don't change:
     For each instruction I in the function body
       For each register R used at I
         use_constant_bits(I, R) = merge (def_constant_bits(D, R))
                                   across all definitions D of R that reach I
       For each register R defined at I
         def_constant_bits(I, R) = transfer (use_constant_bits(I, RU))
                                   for all register uses RU, based on opcodes.

The data structures and routines "collect", "merge" and "transfer"
depend on the problem solved.

4. Complexity considerations.
   The solver visits every DU edge once for each iteration of the fixed point
   convergence loop.  The maximum number of iterations is given by the
   height of the state lattice multiplied by the number of bits.
   Although this can be as high as 128 for constant bit propagation on x86_64,
   in practice we expect much lower times.
   Also, lower complexity guarantees can be given if less accurate information
   is allowed, e.g., byte level rather than bit level.  For byte constants,
   the upper bound constant factor drops from 128 to 16.

Some of these ideas came from discussions with Preston Briggs,
Sriraman Tallam and others.

Reply via email to