Vladimir Makarov <vmaka...@redhat.com> writes:
>> In the end I tried an ad-hoc approach in an attempt to do something
>> about (2), (3) and (4b).  The idea was to construct a preliminary
>> "model" schedule in which the only objective is to keep register
>> pressure to a minimum.  This schedule ignores pipeline characteristics,
>> latencies, and the number of available registers.  The maximum pressure
>> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>>
> I always had the impression that the code before the scheduler is close
> to minimal register pressure because of specific expression generation.
> Maybe I was wrong and some optimizations (global ones like PRE) change
> this a lot.

One of the examples I was looking at was:

-----------------------------------------------------------------------------
#include <stdint.h>

#define COUNT 8

void
loop (uint8_t *__restrict dst, uint8_t *__restrict src,
      uint8_t *__restrict ff_cropTbl, int dstStride, int srcStride)
{
  const int w = COUNT;
  uint8_t *cm = ff_cropTbl + 1024;
  for(int i=0; i<w; i++)
    {
      const int srcB = src[-2*srcStride];
      const int srcA = src[-1*srcStride];
      const int src0 = src[0 *srcStride];
      const int src1 = src[1 *srcStride];
      const int src2 = src[2 *srcStride];
      const int src3 = src[3 *srcStride];
      const int src4 = src[4 *srcStride];
      const int src5 = src[5 *srcStride];
      const int src6 = src[6 *srcStride];
      const int src7 = src[7 *srcStride];
      const int src8 = src[8 *srcStride];
      const int src9 = src[9 *srcStride];
      const int src10 = src[10*srcStride];

      dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
      dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
      dst[2*dstStride] = cm[(((src2+src3)*20 - (src1+src4)*5 + (src0+src5)) + 16)>>5];
      dst[3*dstStride] = cm[(((src3+src4)*20 - (src2+src5)*5 + (src1+src6)) + 16)>>5];
      dst[4*dstStride] = cm[(((src4+src5)*20 - (src3+src6)*5 + (src2+src7)) + 16)>>5];
      dst[5*dstStride] = cm[(((src5+src6)*20 - (src4+src7)*5 + (src3+src8)) + 16)>>5];
      dst[6*dstStride] = cm[(((src6+src7)*20 - (src5+src8)*5 + (src4+src9)) + 16)>>5];
      dst[7*dstStride] = cm[(((src7+src8)*20 - (src6+src9)*5 + (src5+src10)) + 16)>>5];
      dst++;
      src++;
    }
}
-----------------------------------------------------------------------------

(based on the libav h264 code).  In this example the loads from src and
the stores to dst are still in their original order by the time we reach
sched1, so src, dst, srcA, srcB and src0..src10 are all live at once.
Thanks to the __restrict qualifiers there's no aliasing reason why they
can't be reordered, and we do reorder them during scheduling.
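
To make that concrete, here is a hand-interleaved version of the start
of the same loop body (an illustrative sketch only, not actual scheduler
output): issuing each store as soon as its six inputs are available
kills one value per store, so the peak number of live srcN temporaries
drops from thirteen to about seven.

-----------------------------------------------------------------------------
      /* Fragment of the loop body above, reordered by hand.  */
      const int srcB = src[-2*srcStride];
      const int srcA = src[-1*srcStride];
      const int src0 = src[0 *srcStride];
      const int src1 = src[1 *srcStride];
      const int src2 = src[2 *srcStride];
      const int src3 = src[3 *srcStride];
      dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
      /* srcB is now dead.  */
      const int src4 = src[4 *srcStride];
      dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
      /* srcA is now dead.  src5..src10 and dst[2..7] continue the same
         load-one-store-one pattern, so at most seven srcN values are
         live at any point.  */
-----------------------------------------------------------------------------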

>> During the main scheduling, an instruction X that occurs at or before
>> the next point of maximum pressure in the model schedule is measured
>> based on the current register pressure.  If X doesn't increase the
>> current pressure beyond the current maximum, its ECC(X) is zero,
>> otherwise ECC(X) is the cost of going from MP to the new maximum.
>> The idea is that the final net pressure of scheduling a given set of
>> instructions is going to be the same regardless of the order; we simply
>> want to keep the intermediate pressure under control.  An ECC(X) of zero
>> usually[*] means that scheduling X next won't send the rest of the
>> sequence beyond the current maximum pressure.
>>
>>    [*] but not always.  There's more about this in the patch comments.
>>
>> If an instruction X occurs _after_ the next point of maximum pressure,
>> X is measured based on that maximum pressure.  If the current maximum
>> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
>> going from MP to MP' + dP.
>>
>> Of course, this all depends on how good a value MP is, and therefore
>> on how good the model schedule is.  I tried a few variations before
>> settling on the one in the patch (which I hope makes conceptual sense).
>>
>> I initially stayed with the idea above about assigning different costs to
>> (Ra), (Rb) and (Rc).  This produced some good results, but was still a
>> little too conservative in general, in that other tests were still worse
>> with -fsched-pressure than without.  I described some of the problems
>> with these costs above.  Another is that if an instruction X has a spill
>> cost of 6, say, then:
>>
>>    ECC(X) + delay(X)
>>
>> will only allow X to be scheduled if the next instruction without
>> a spill cost has a delay of 6 cycles or more.  This is overly harsh,
>> especially seeing as few ARM instructions have such a high latency.
>> The benefit of spilling is often to avoid a series of short
>> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>>
>> I then adjusted positive ECC(X) values based on the priority of X
>> relative to the highest-priority zero-cost instruction.  This was
>> better, but a DES filter in particular still suffered from the
>> "lots of short stalls" problem.
>>
>> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
>> and simply treating each extra spill as having a cost of 1.  Somewhat
>> to my surprise, there was only one test in which some improvement was
>> lost; that test was now only slightly better than without -fsched-pressure.
>> But the fixed cost of 1 per spill achieved the goal of being as good as
>> or better than -fno-sched-pressure in almost all cases, while still being
>> significantly better in some of the cases we care about.
>>
>> Assigning a cost of just 1 to each excess spill is clearly going to
>> interfere with the normal list scheduler less often than assigning each
>> spill a higher cost.  Given that this approach also gives the most important
>> benefits, it felt like a good default.
>>
> We should use whatever works better, but I always try to find an
> explanation and reason for unexpected results.  If it is an OOO
> processor, I have no problem understanding this.

Yeah, it was an out-of-order processor, but the behaviour appeared to be
independent of that.  The differences in performance were often
also reflected in the execution times that GCC itself predicted
during sched2.

The reasons appeared to be the ones I mentioned above:

 (1) Using a memory-based spill cost is arbitrary because we don't
     know how many or what type of spill instructions will be needed,
     and where they will be placed in relation to the other instructions.

     E.g. I saw a few cases where keeping the number of live GPRs under
     14 + X, but close to that limit for more of the time, produced more
     spills and worse sched2 code than allowing the number of live GPRs
     to peak at a higher value and come down sooner.  And that's no
     surprise, of course: this is very much a heuristic.

 (2) If the load+store cost is 6 (which seems a reasonable value for
     ARM GPRs) then with the current costs, we will only schedule an
     instruction that increases pressure by 1 if all instructions
     that don't increase pressure have a latency of 6 cycles or more.

     I tried to get around this by adjusting the costs based on priorities.
     However, I still saw situations where we had several instructions
     of a similar priority that all increased pressure.  Once we scheduled
     one of those -- X say -- we wouldn't schedule another (Y) until X's
     dependents themselves increased pressure or until those dependents
     reached a low enough priority to cancel out the load+store cost
     in ECC(Y).  Scheduling Y at that point gives the same final pressure
     as scheduling Y straight after X would have done, so the intervening
     stalls bought us nothing; the sketch below spells out the arithmetic.

Maybe it was just the choice of examples I looked at in detail,
but (2) seemed like the bigger factor.
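
Here is a small, self-contained sketch of that comparison under both
cost schemes.  The candidate structure and the helper functions are
hypothetical stand-ins invented for illustration; the real comparison
in haifa-sched.c tracks pressure per register class and involves much
more state.

-----------------------------------------------------------------------------
#include <stdio.h>

/* Hypothetical stand-in for a ready-list candidate.  */
typedef struct { int ecc; int delay; } candidate;

/* Excess pressure relative to a benchmark pressure: the model
   schedule's MP in the proposed algorithm, the number of available
   registers in the current one.  Zero if issuing the instruction
   stays at or below the benchmark.  */
static int
excess_cost_change (int pressure, int dp, int benchmark)
{
  int new_max = pressure + dp;
  return new_max > benchmark ? new_max - benchmark : 0;
}

/* Rank two ready candidates: the lower ECC + delay wins.  */
static int
x_better_p (const candidate *x, const candidate *y)
{
  return x->ecc + x->delay < y->ecc + y->delay;
}

int
main (void)
{
  /* X increases pressure one register beyond the benchmark; Y is
     pressure-neutral but stalls for 3 cycles.  */
  candidate x = { 0, 0 }, y = { 0, 3 };

  /* Current costs: the excess register is charged the full load+store
     cost of 6, so X loses to anything stalling for fewer than 6
     cycles -- problem (2) above.  */
  x.ecc = 6 * excess_cost_change (14, 1, 14);
  printf ("weighted: schedule %s\n", x_better_p (&x, &y) ? "X" : "Y");

  /* Fixed cost of 1 per excess spill: a 3-cycle stall is already
     enough to justify scheduling X.  */
  x.ecc = 1 * excess_cost_change (14, 1, 14);
  printf ("model:    schedule %s\n", x_better_p (&x, &y) ? "X" : "Y");
  return 0;
}
-----------------------------------------------------------------------------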

>> All in all, the patch is a complicated way of being quite timid,
> Yes, it scares me a bit that the pace of changes in the scheduler could
> make it as complicated as reload.  On the other hand, the pass is one of
> the most target-dependent and solves an NP-hard problem.  I think it is
> hard to generate good code without some complications; an optimal
> solution approach is the exception, of course, but that will hardly ever
> be acceptable from a compilation-time point of view.

Yeah, complication is definitely a problem here.  We did try running
benchmarks with the first scheduling pass disabled, but unfortunately,
the pass is just too important to consider dropping.
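
That said, the model-schedule construction on its own is conceptually
simple.  Here is a toy, self-contained sketch of the idea; the
dependency masks, pressure deltas and single pressure counter are all
invented for illustration, whereas the patch tracks pressure per
register class and keeps far more state.

-----------------------------------------------------------------------------
#include <stdio.h>

#define N 6

/* Net pressure change of each toy instruction...  */
static const int delta[N]    = { 1, 1, 1, -1, -1, -1 };
/* ...and a bitmask of the instructions it depends on.  */
static const unsigned dep[N] = { 0, 0, 1u << 0, 3u, 1u << 1, 1u << 2 };

int
main (void)
{
  unsigned done = 0;
  int pressure = 0, mp = 0;

  for (int step = 0; step < N; step++)
    {
      /* Ignore latencies and pipeline hazards entirely: of the
         instructions whose dependencies are satisfied, greedily pick
         the one with the smallest pressure delta.  */
      int best = -1;
      for (int i = 0; i < N; i++)
        if (!(done & (1u << i))
            && (dep[i] & done) == dep[i]
            && (best < 0 || delta[i] < delta[best]))
          best = i;

      done |= 1u << best;
      pressure += delta[best];
      if (pressure > mp)
        mp = pressure;
      printf ("issue insn %d, pressure now %d\n", best, pressure);
    }
  printf ("MP = %d\n", mp);  /* the benchmark for ECC(X)  */
  return 0;
}
-----------------------------------------------------------------------------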

>> but I hope it could also be used as a basis for more aggressive
>> versions in future.  Things I haven't looked at include whether
>> it would make sense to disable the priority-based adjustment of
>> ECC for -Os.  (That's a question of whether this patch can improve
>> size still further over -fno-sched-pressure.)
>>
>> Being timid ought to make it fit quite well with Bernd's patch,
>> although I haven't had a chance to try the two together.
>>
>> I talked with Vlad about this a couple of months ago (thanks again for
>> the help, Vlad).  He explained that some of the things I was seeing --
>> especially (3) -- were deliberate decisions that had improved the
>> results for x86.  And I think it's probably true that the patch below
>> is only going to work well on fairly regular RISCy machines.
>>
>> For that reason, the patch adds an -fsched-pressure-algorithm=
>> option to choose between the two versions.  For historical reasons
>> I called the current algorithm "weighted" and the proposed one "model",
>> but I'm sure there are better names.  I've kept the option undocumented
>> because it's really an internal thing.  "weighted" remains the default.
>>
>> Of course, adding an alternative like this would only be acceptable if
>> -fsched-pressure and -fsched-pressure-algorithm=model became the default
>> for ARM (or another target).  That's a decision for the ARM maintainers;
>> I've sent Ramana the results I got, which unfortunately I can't share
>> more widely.
>>
> I guess the result is positive, otherwise you would not have submitted
> the patch.  Although I always place the performance improvement first,
> it would be interesting to know how the compilation time changes.

I tried compiling SPEC2000 (CPU + FP) for arm-linux-gnueabi on
x86_64-linux-gnu, first without -fsched-pressure and then with both
-fsched-pressure and -fsched-pressure-algorithm=model.
I couldn't detect a difference outside the noise.

> As for the patch itself, I will look at it with more attention at the
> beginning of next year.  As I understand it, there is no rush because
> we are not yet at stage 1.

Thanks, appreciate it.  And yeah, there's definitely no rush: there's
no way this can go in 4.7.

Richard
