Re: RFC: An alternative -fsched-pressure implementation

2012-01-10 Thread Vladimir Makarov

On 01/09/2012 07:45 AM, Bernd Schmidt wrote:

On 12/23/2011 12:46 PM, Richard Sandiford wrote:

In the end I tried an ad-hoc approach in an attempt to do something
about (2), (3) and (4b).  The idea was to construct a preliminary
"model" schedule in which the only objective is to keep register
pressure to a minimum.  This schedule ignores pipeline characteristics,
latencies, and the number of available registers.  The maximum pressure
seen in this initial model schedule (MP) is then the benchmark for ECC(X).

Interesting. I had also thought about trying to address the problem in
the scheduler, so I'll just share some of my thoughts (not intended as a
negative comment on your approach).

Essentially the problem I usually see is that the wider your machine
gets, the happier sched1 will be to fill unused slots with instructions
that could just as well be scheduled 200 cycles later. That's really an
artifact of forward scheduling, so why not schedule forwards and
then backwards (or the other way round)? Produce an initial schedule,
then perform a fixup phase: start at the other end and look for
instructions that can be scheduled in a wide range of cycles, and move
them if doing so can be shown to reduce register pressure while
retaining a correct schedule. This can be done without trying to have a
global pressure estimate, which has the problems you noted.

I saw such approaches in the literature.  It would be interesting to see
how they work in GCC.


Although, I believe that a register pressure minimization pass before RA
and a selective insn scheduler used only after RA could solve all these
problems.  But it needs a lot of investigation to confirm this.  It is
also hard for me to say which approach would require more effort to
implement.

Doing this would require constructing a backwards DFA, and I gave up on
this for the moment after a day or so spent with genautomata, but it
should be possible.

Long ago (10 years ago) I considered doing exactly this in order to remove
the 2nd separate insn scheduler, by inserting spill code into an already
existing insn schedule.  But I decided not to do it.


I also considered generating DFAs with repeated insn issue to better
serve the modulo scheduler.


There are many directions for developing automatically generated
pipeline hazard recognizers that go beyond (N)DFAs, e.g. simulating
hidden register renaming or the look-ahead buffers/queues of different
OOO processors.  All of these directions could be interesting research
projects.  But even in its current state, IMHO the GCC (N)DFA pipeline
hazard recognizer, after its 10 years of existence, is the most powerful
one used in industrial compilers.




Re: RFC: An alternative -fsched-pressure implementation

2012-01-10 Thread Vladimir Makarov

On 12/28/2011 08:51 AM, Richard Sandiford wrote:

Vladimir Makarov  writes:

In the end I tried an ad-hoc approach in an attempt to do something
about (2), (3) and (4b).  The idea was to construct a preliminary
"model" schedule in which the only objective is to keep register
pressure to a minimum.  This schedule ignores pipeline characteristics,
latencies, and the number of available registers.  The maximum pressure
seen in this initial model schedule (MP) is then the benchmark for ECC(X).


I always had the impression that the code before the scheduler is close to
minimal register pressure because of the specific way expressions are
generated.  Maybe I was wrong and some optimizations (global ones like PRE)
change this a lot.

One of the examples I was looking at was:

-
#include <stdint.h>

#define COUNT 8

void
loop (uint8_t *__restrict dst, uint8_t *__restrict src,
      uint8_t *__restrict ff_cropTbl, int dstStride, int srcStride)
{
  const int w = COUNT;
  uint8_t *cm = ff_cropTbl + 1024;
  for (int i = 0; i < w; i++)
  {
    const int srcB = src[-2*srcStride];
    const int srcA = src[-1*srcStride];
    const int src0 = src[0*srcStride];
    const int src1 = src[1*srcStride];
    const int src2 = src[2*srcStride];
    const int src3 = src[3*srcStride];
    const int src4 = src[4*srcStride];
    const int src5 = src[5*srcStride];
    const int src6 = src[6*srcStride];
    const int src7 = src[7*srcStride];
    const int src8 = src[8*srcStride];
    const int src9 = src[9*srcStride];
    const int src10 = src[10*srcStride];
    dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
    dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
    dst[2*dstStride] = cm[(((src2+src3)*20 - (src1+src4)*5 + (src0+src5)) + 16)>>5];
    dst[3*dstStride] = cm[(((src3+src4)*20 - (src2+src5)*5 + (src1+src6)) + 16)>>5];
    dst[4*dstStride] = cm[(((src4+src5)*20 - (src3+src6)*5 + (src2+src7)) + 16)>>5];
    dst[5*dstStride] = cm[(((src5+src6)*20 - (src4+src7)*5 + (src3+src8)) + 16)>>5];
    dst[6*dstStride] = cm[(((src6+src7)*20 - (src5+src8)*5 + (src4+src9)) + 16)>>5];
    dst[7*dstStride] = cm[(((src7+src8)*20 - (src6+src9)*5 + (src5+src10)) + 16)>>5];
    dst++;
    src++;
  }
}
-

(based on the libav h264 code).  In this example the loads from src and
stores to dst are still in their original order by the time we reach sched1,
so src, dst, srcA, srcB, and src0..10 are all live at once.  There's no
aliasing reason why they can't be reordered, and we do that during
scheduling.


Thanks for the example.




As for the patch itself, I will look at it with more attention at the
beginning of next year.  As I understand it, there is no rush because
we are still not at stage 1.

Thanks, appreciate it.  And yeah, there's definitely no rush: there's
no way this can go in 4.7.


Ok.  Thanks, Richard.


Re: RFC: An alternative -fsched-pressure implementation

2012-01-09 Thread Bernd Schmidt
On 12/23/2011 12:46 PM, Richard Sandiford wrote:
> In the end I tried an ad-hoc approach in an attempt to do something
> about (2), (3) and (4b).  The idea was to construct a preliminary
> "model" schedule in which the only objective is to keep register
> pressure to a minimum.  This schedule ignores pipeline characteristics,
> latencies, and the number of available registers.  The maximum pressure
> seen in this initial model schedule (MP) is then the benchmark for ECC(X).

Interesting. I had also thought about trying to address the problem in
the scheduler, so I'll just share some of my thoughts (not intended as a
negative comment on your approach).

Essentially the problem I usually see is that the wider your machine
gets, the happier sched1 will be to fill unused slots with instructions
that could just as well be scheduled 200 cycles later. That's really an
artifact of forward scheduling, so why not schedule forwards and
then backwards (or the other way round)? Produce an initial schedule,
then perform a fixup phase: start at the other end and look for
instructions that can be scheduled in a wide range of cycles, and move
them if doing so can be shown to reduce register pressure while
retaining a correct schedule. This can be done without trying to have a
global pressure estimate, which has the problems you noted.
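
(A toy, self-contained sketch of the kind of fixup pass I mean -- not GCC
code, every structure and name invented, and "pressure" reduced to a
per-insn delta -- just to make the shape of the idea concrete.  A real
pass would also have to check pipeline resources and verify that each
move actually lowers the pressure seen in between; the toy simply sinks
slack-rich, pressure-increasing insns to their latest legal cycle.)

-
/* Toy sketch of the backward fixup phase described above.  Each "insn"
   carries the cycle chosen by the forward schedule, its
   dependence-limited window [earliest, latest] of legal cycles, and its
   register-pressure delta (+1 if it starts a live range, -1 if it ends
   one).  Walking backwards, insns with a lot of slack that increase
   pressure are moved to their latest legal cycle, shortening the live
   ranges they start while keeping the schedule correct w.r.t.
   dependences.  */
#include <stdio.h>

struct toy_insn
{
  int cycle;      /* cycle chosen by the forward schedule         */
  int earliest;   /* earliest legal cycle (from dependences)      */
  int latest;     /* latest legal cycle (from dependences)        */
  int delta;      /* register-pressure change caused by the insn  */
};

static void
fixup_pass (struct toy_insn *insns, int n, int min_slack)
{
  for (int i = n - 1; i >= 0; i--)
    {
      struct toy_insn *x = &insns[i];
      /* Only consider insns schedulable over a wide range of cycles
         that make pressure worse where they currently sit.  */
      if (x->latest - x->earliest < min_slack || x->delta <= 0)
        continue;
      if (x->cycle < x->latest)
        x->cycle = x->latest;
    }
}

int
main (void)
{
  struct toy_insn insns[] = {
    { 0, 0, 0, 1 },     /* needed immediately                 */
    { 0, 0, 200, 1 },   /* could just as well run much later  */
    { 1, 1, 1, -1 },
  };
  fixup_pass (insns, 3, 4);
  for (int i = 0; i < 3; i++)
    printf ("insn %d -> cycle %d\n", i, insns[i].cycle);
  return 0;
}
-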

Doing this would require constructing a backwards DFA, and I gave up on
this for the moment after a day or so spent with genautomata, but it
should be possible.


Bernd


Re: RFC: An alternative -fsched-pressure implementation

2011-12-28 Thread Richard Sandiford
Vladimir Makarov  writes:
>> In the end I tried an ad-hoc approach in an attempt to do something
>> about (2), (3) and (4b).  The idea was to construct a preliminary
>> "model" schedule in which the only objective is to keep register
>> pressure to a minimum.  This schedule ignores pipeline characteristics,
>> latencies, and the number of available registers.  The maximum pressure
>> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>>
> I always had the impression that the code before the scheduler is close to
> minimal register pressure because of the specific way expressions are
> generated.  Maybe I was wrong and some optimizations (global ones like PRE)
> change this a lot.

One of the examples I was looking at was:

-
#include <stdint.h>

#define COUNT 8

void
loop (uint8_t *__restrict dst, uint8_t *__restrict src,
      uint8_t *__restrict ff_cropTbl, int dstStride, int srcStride)
{
  const int w = COUNT;
  uint8_t *cm = ff_cropTbl + 1024;
  for (int i = 0; i < w; i++)
  {
    const int srcB = src[-2*srcStride];
    const int srcA = src[-1*srcStride];
    const int src0 = src[0*srcStride];
    const int src1 = src[1*srcStride];
    const int src2 = src[2*srcStride];
    const int src3 = src[3*srcStride];
    const int src4 = src[4*srcStride];
    const int src5 = src[5*srcStride];
    const int src6 = src[6*srcStride];
    const int src7 = src[7*srcStride];
    const int src8 = src[8*srcStride];
    const int src9 = src[9*srcStride];
    const int src10 = src[10*srcStride];
    dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
    dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
    dst[2*dstStride] = cm[(((src2+src3)*20 - (src1+src4)*5 + (src0+src5)) + 16)>>5];
    dst[3*dstStride] = cm[(((src3+src4)*20 - (src2+src5)*5 + (src1+src6)) + 16)>>5];
    dst[4*dstStride] = cm[(((src4+src5)*20 - (src3+src6)*5 + (src2+src7)) + 16)>>5];
    dst[5*dstStride] = cm[(((src5+src6)*20 - (src4+src7)*5 + (src3+src8)) + 16)>>5];
    dst[6*dstStride] = cm[(((src6+src7)*20 - (src5+src8)*5 + (src4+src9)) + 16)>>5];
    dst[7*dstStride] = cm[(((src7+src8)*20 - (src6+src9)*5 + (src5+src10)) + 16)>>5];
    dst++;
    src++;
  }
}
-

(based on the libav h264 code).  In this example the loads from src and
stores to dst are still in their original order by the time we reach sched1,
so src, dst, srcA, srcB, and src0..10 are all live at once.  There's no
aliasing reason why they can't be reordered, and we do that during
scheduling.

>> During the main scheduling, an instruction X that occurs at or before
>> the next point of maximum pressure in the model schedule is measured
>> based on the current register pressure.  If X doesn't increase the
>> current pressure beyond the current maximum, its ECC(X) is zero,
>> otherwise ECC(X) is the cost of going from MP to the new maximum.
>> The idea is that the final net pressure of scheduling a given set of
>> instructions is going to be the same regardless of the order; we simply
>> want to keep the intermediate pressure under control.  An ECC(X) of zero
>> usually[*] means that scheduling X next won't send the rest of the
>> sequence beyond the current maximum pressure.
>>
>>[*] but not always.  There's more about this in the patch comments.
>>
>> If an instruction X occurs _after_ the next point of maximum pressure,
>> X is measured based on that maximum pressure.  If the current maximum
>> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
>> going from MP to MP' + dP.
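
(As a rough pseudo-C paraphrase of the rule just described -- only an
illustration, not the patch's code, with every name invented and a cost
of 1 per excess register assumed:)

-
/* Sketch of ECC(X) for a single pressure class.  MP is the benchmark
   maximum pressure taken from the model schedule.  Not the patch's
   code; the names and the 1-per-register cost are assumptions made
   for illustration.  */
static int
sketch_ecc (int mp,               /* MP: model-schedule maximum         */
            int cur_max,          /* MP': maximum pressure seen so far  */
            int cur_pressure,     /* pressure at the current point      */
            int dp,               /* dP: pressure change caused by X    */
            int before_max_point) /* X at/before the next point of
                                     maximum pressure in the model?     */
{
  int new_max;

  if (before_max_point)
    {
      new_max = cur_pressure + dp;
      if (new_max <= cur_max)
        return 0;        /* X doesn't raise the current maximum.  */
    }
  else
    new_max = cur_max + dp;

  /* Cost of going from MP to the new maximum, or zero if we stay
     at or below MP.  */
  return new_max > mp ? new_max - mp : 0;
}
-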
>>
>> Of course, this all depends on how good a value MP is, and therefore
>> on how good the model schedule is.  I tried a few variations before
>> settling on the one in the patch (which I hope makes conceptual sense).
>>
>> I initially stayed with the idea above about assigning different costs to
>> (Ra), (Rb) and (Rc).  This produces some good results, but was still a
>> little too conservative in general, in that other tests were still worse
>> with -fsched-pressure than without.  I described some of the problems
>> with these costs above.  Another is that if an instruction X has a spill
>> cost of 6, say, then:
>>
>>ECC(X) + delay(X)
>>
>> will only allow X to be scheduled if the next instruction without
>> a spill cost has a delay of 6 cycles or more.  This is overly harsh,
>> especially seeing as few ARM instructions have such a high latency.
>> The benefit of spilling is often to avoid a series of short
>> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>>
>> I then adjusted positive ECC(X) values based on the priority of X
>> relative to the highest-priority zero-cost instruction.  This was
>> better, but a DES filter in particular still suffered from the
>> "lots of short stalls" problem.
>>
>> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
>> and simply treating each extra spill as having a cost of 1.  Somewhat
>> to my surprise, there was only one test in which some improvement was
>> lost; that test was now only slightly better than without -fsched-pressure.
>> But the fixed cost of 1 per spill achieved the goal of being as good as
>> or better than -fno-sched-pressure in almost all cases, while still being
>> significantly better in some of the cases we care about.
>>
>> Assigning a cost of just 1 to each excess spill is clearly going to
>> interfere with the normal list scheduler less often tha

Re: RFC: An alternative -fsched-pressure implementation

2011-12-23 Thread Vladimir Makarov

On 12/23/2011 06:46 AM, Richard Sandiford wrote:

So it looks like two pieces of work related to scheduling and register
pressure are being posted close together.  This one is an RFC for a less
aggressive form of -fsched-pressure.  I think it should complement
rather than compete with Bernd's IRA patch.  It seems like a good idea
to take register pressure into account during the first scheduling pass,
where we can still easily look at things like instruction latencies
and pipeline utilisation.  Better rematerialisation in the register
allocator would obviously be a good thing too though.

This patch started when we (Linaro) saw a big drop in performance
from vectorising an RGB to YIQ filter on ARM.  The first scheduling
pass was overly aggressive in creating a "wide" schedule, and caused
the newly-vectorised loop to contain lots of spills.  The loop grew so
big that it even had a constant pool in the middle of it.

-fsched-pressure did a very good job on this loop, creating far fewer
spills and consequently avoiding the constant pool.  However, it seemed
to make several other cases significantly worse.  The idea was therefore
to try to come up with a form of -fsched-pressure that could be turned
on for ARM by default.

Current -fsched-pressure works by assigning an "excess (pressure) cost change"
to each instruction; here I'll write that as ECC(X).  -fsched-pressure also
changes the way that the main list scheduler handles stalls due to data
dependencies.  If an instruction would stall for N cycles, the scheduler
would normally add it to the "now+N" queue, then add it to the ready queue
after N cycles.  With -fsched-pressure, it instead adds the instruction
to the ready queue immediately, while still recording that the instruction
would require N stalls.  I'll write the number of stalls on X as delay(X).

This arrangement allows the scheduler to choose between increasing register
pressure and introducing a deliberate stall.  Instructions are ranked by:

   (a) lowest ECC(X) + delay(X)
   (b) lowest delay(X)
   (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)

Note that since delay(X) is measured in cycles, ECC(X) is effectively
measured in cycles too.
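
(For illustration, the ranking above could be expressed as a comparison
function along these lines; this is only a sketch with invented struct
and field names, not the patch's actual code:)

-
/* Sketch of the candidate ranking described above; returns true if A
   should be scheduled in preference to B.  */
#include <stdbool.h>

struct candidate
{
  int ecc;        /* ECC(X): excess pressure cost change, in cycles  */
  int delay;      /* delay(X): stall cycles if issued now            */
  int priority;   /* normal list-scheduler priority                  */
};

static bool
better_candidate_p (const struct candidate *a, const struct candidate *b)
{
  /* (a) lowest ECC(X) + delay(X) */
  if (a->ecc + a->delay != b->ecc + b->delay)
    return a->ecc + a->delay < b->ecc + b->delay;
  /* (b) lowest delay(X) */
  if (a->delay != b->delay)
    return a->delay < b->delay;
  /* (c) fall back to the normal ranking (e.g. INSN_PRIORITY) */
  return a->priority > b->priority;
}
-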

Several things seemed to be causing the degradations we were seeing
with -fsched-pressure:

   (1) The -fsched-pressure schedule is based purely on instruction latencies
   and issue rate; it doesn't take the DFA into account.  This means that
   we attempt to "dual issue" things like vector operations, loads and
   stores on Cortex A8 and A9.  In the examples I looked at, these sorts
   of inaccuracy seemed to accumulate, so that the value of delay(X)
   became based on slightly unrealistic cycle times.

   Note that this also affects code that doesn't have any pressure
   problems; it isn't limited to code that does.

   This may simply be historical.  It became much easier to use the
   DFA here after Bernd's introduction of prune_ready_list, but the
   original -fsched-pressure predates that.

   (2) We calculate ECC(X) by walking the unscheduled part of the block
   in its original order, then recording the pressure at each instruction.
   This seemed to make ECC(X) quite sensitive to that original order.
   I saw blocks that started out naturally "narrow" (not much ILP,
   e.g. from unrolled loops) and others that started naturally "wide"
   (a lot of ILP, such as in the libav h264 code), and walking the
   block in order meant that the two styles would be handled differently.

   (3) When calculating the pressure of the original block (as described
   in (2)), we ignore the deaths of registers that are used by more
   than one unscheduled instruction.  This tended to hurt long(ish)
   loops in things like filters, where the same value is often used
   as an input to two calculations.  The effect was that instructions
   towards the end of the block would appear to have very high pressure.
   This in turn made the algorithm very conservative; it wouldn't
   promote instructions from later in the block because those
   instructions seemed to have a prohibitively large cost.

   I asked Vlad about this, and he confirmed that it was a deliberate
   decision.  He'd tried honouring REG_DEAD notes instead, but it
   produced worse results on x86.  I'll return to this at the end.

   (4) ECC(X) is based on the pressure over and above ira_available_class_regs
   (the number of allocatable registers in a given class).  ARM has 14
   allocatable GENERAL_REGS: 16 minus the stack pointer and program
   counter.  So if 14 integer variables are live across a loop but
   not referenced within it, we end up scheduling that loop in a context
   of permanent pressure.  Pressure becomes the overriding concern,
   and we don't get much ILP.

   I suppose there are at least two ways of viewing this:

   (4a) We'r

Re: RFC: An alternative -fsched-pressure implementation

2011-12-23 Thread Richard Guenther
On Fri, Dec 23, 2011 at 12:46 PM, Richard Sandiford
 wrote:
> So it looks like two pieces of work related to scheduling and register
> pressure are being posted close together.  This one is an RFC for a less
> aggressive form of -fsched-pressure.  I think it should complement
> rather than compete with Bernd's IRA patch.  It seems like a good idea
> to take register pressure into account during the first scheduling pass,
> where we can still easily look at things like instruction latencies
> and pipeline utilisation.  Better rematerialisation in the register
> allocator would obviously be a good thing too though.
>
> This patch started when we (Linaro) saw a big drop in performance
> from vectorising an RGB to YIQ filter on ARM.  The first scheduling
> pass was overly aggressive in creating a "wide" schedule, and caused
> the newly-vectorised loop to contain lots of spills.  The loop grew so
> big that it even had a constant pool in the middle of it.
>
> -fsched-pressure did a very good job on this loop, creating far fewer
> spills and consequently avoiding the constant pool.  However, it seemed
> to make several other cases significantly worse.  The idea was therefore
> to try to come up with a form of -fsched-pressure that could be turned
> on for ARM by default.
>
> Current -fsched-pressure works by assigning an "excess (pressure) cost change"
> to each instruction; here I'll write that as ECC(X).  -fsched-pressure also
> changes the way that the main list scheduler handles stalls due to data
> dependencies.  If an instruction would stall for N cycles, the scheduler
> would normally add it to the "now+N" queue, then add it to the ready queue
> after N cycles.  With -fsched-pressure, it instead adds the instruction
> to the ready queue immediately, while still recording that the instruction
> would require N stalls.  I'll write the number of stalls on X as delay(X).
>
> This arrangement allows the scheduler to choose between increasing register
> pressure and introducing a deliberate stall.  Instructions are ranked by:
>
>  (a) lowest ECC(X) + delay(X)
>  (b) lowest delay(X)
>  (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)
>
> Note that since delay(X) is measured in cycles, ECC(X) is effectively
> measured in cycles too.
>
> Several things seemed to be causing the degradations we were seeing
> with -fsched-pressure:
>
>  (1) The -fsched-pressure schedule is based purely on instruction latencies
>      and issue rate; it doesn't take the DFA into account.  This means that
>      we attempt to "dual issue" things like vector operations, loads and
>      stores on Cortex A8 and A9.  In the examples I looked at, these sorts
>      of inaccuracy seemed to accumulate, so that the value of delay(X)
>      became based on slightly unrealistic cycle times.
>
>      Note that this also affects code that doesn't have any pressure
>      problems; it isn't limited to code that does.
>
>      This may simply be historical.  It became much easier to use the
>      DFA here after Bernd's introduction of prune_ready_list, but the
>      original -fsched-pressure predates that.
>
>  (2) We calculate ECC(X) by walking the unscheduled part of the block
>      in its original order, then recording the pressure at each instruction.
>      This seemed to make ECC(X) quite sensitive to that original order.
>      I saw blocks that started out naturally "narrow" (not much ILP,
>      e.g. from unrolled loops) and others that started naturally "wide"
>      (a lot of ILP, such as in the libav h264 code), and walking the
>      block in order meant that the two styles would be handled differently.
>
>  (3) When calculating the pressure of the original block (as described
>      in (2)), we ignore the deaths of registers that are used by more
>      than one unscheduled instruction.  This tended to hurt long(ish)
>      loops in things like filters, where the same value is often used
>      as an input to two calculations.  The effect was that instructions
>      towards the end of the block would appear to have very high pressure.
>      This in turn made the algorithm very conservative; it wouldn't
>      promote instructions from later in the block because those
>      instructions seemed to have a prohibitively large cost.
>
>      I asked Vlad about this, and he confirmed that it was a deliberate
>      decision.  He'd tried honouring REG_DEAD notes instead, but it
>      produced worse results on x86.  I'll return to this at the end.
>
>  (4) ECC(X) is based on the pressure over and above ira_available_class_regs
>      (the number of allocatable registers in a given class).  ARM has 14
>      allocatable GENERAL_REGS: 16 minus the stack pointer and program
>      counter.  So if 14 integer variables are live across a loop but
>      not referenced within it, we end up scheduling that loop in a context
>      of permanent pressure.  Pressure becomes the overriding concern,
>      and we don't get muc