Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-08 Thread Kyrill Tkachov



This patch fixes a flaw in the relationship between the way that
SCHED_PRESSURE_MODEL calculates the alap and depth vs how it uses
them in model_order_p.  A comment in model_order_p says:

  /* Combine the length of the longest path of satisfied true dependencies
 that leads to each instruction (depth) with the length of the longest
 path of any dependencies that leads from the instruction (alap).
 Prefer instructions with the greatest combined length.  If the combined
 lengths are equal, prefer instructions with the greatest depth.

 The idea is that, if we have a set S of "equal" instructions that each
 have ALAP value X, and we pick one such instruction I, any true-dependent
 successors of I that have ALAP value X - 1 should be preferred over S.
 This encourages the schedule to be "narrow" rather than "wide".
 However, if I is a low-priority instruction that we decided to
 schedule because of its model_classify_pressure, and if there
 is a set of higher-priority instructions T, the aforementioned
 successors of I should not have the edge over T.  */

The expectation was that scheduling an instruction X would give a
greater priority to the highest-priority successor instructions Y than
X had: Y.depth would be X.depth + 1 and Y.alap would be X.alap - 1,
giving an equal combined height, but with the greater depth winning as
a tie-breaker. But this doesn't work if the alap value was ultimately
determined by an anti-dependence.

This is particularly bad when --param max-pending-list-length kicks in,
since we then start adding fake anti-dependencies in order to keep the
list length down.  These fake dependencies tend to be on the critical
path.

The attached patch avoids that by making the alap calculation only
look at true dependencies.  This shouldn't be too bad, since we use
INSN_PRIORITY as the final tie-breaker than that does take
anti-dependencies into account.

This reduces the number of spills in the hot function from 436.cactusADM
by 14% on aarch64 at -O3 (and the number of instructions in general).
SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
Thanks to Wilco for the benchmarking.

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this ok for trunk?

Thanks,
Kyrill

2018-11-08  Richard Sandiford  

gcc/
* haifa-sched.c (model_analyze_insns): Only add 1 to the consumer's
ALAP if the dependence is a true dependence.
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 1fdc9df9fb26f23758ec8326cec91eecc4c917c1..01825de440c2e818eceab5ab7411b20b05ee54f1 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -3504,8 +3504,10 @@ model_analyze_insns (void)
 	FOR_EACH_DEP (iter, SD_LIST_FORW, sd_it, dep)
 	  {
 	con = MODEL_INSN_INFO (DEP_CON (dep));
-	if (con->insn && insn->alap < con->alap + 1)
-	  insn->alap = con->alap + 1;
+	unsigned int min_alap
+	  = con->alap + (DEP_TYPE (dep) == REG_DEP_TRUE);
+	if (con->insn && insn->alap < min_alap)
+	  insn->alap = min_alap;
 	  }
 
 	insn->old_queue = QUEUE_INDEX (iter);


Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-09 Thread Jeff Law
On 11/8/18 5:10 AM, Kyrill Tkachov wrote:
> 
> 
> This patch fixes a flaw in the relationship between the way that
> SCHED_PRESSURE_MODEL calculates the alap and depth vs how it uses
> them in model_order_p.  A comment in model_order_p says:
> 
>   /* Combine the length of the longest path of satisfied true dependencies
>  that leads to each instruction (depth) with the length of the longest
>  path of any dependencies that leads from the instruction (alap).
>  Prefer instructions with the greatest combined length.  If the
> combined
>  lengths are equal, prefer instructions with the greatest depth.
> 
>  The idea is that, if we have a set S of "equal" instructions that each
>  have ALAP value X, and we pick one such instruction I, any
> true-dependent
>  successors of I that have ALAP value X - 1 should be preferred over S.
>  This encourages the schedule to be "narrow" rather than "wide".
>  However, if I is a low-priority instruction that we decided to
>  schedule because of its model_classify_pressure, and if there
>  is a set of higher-priority instructions T, the aforementioned
>  successors of I should not have the edge over T.  */
> 
> The expectation was that scheduling an instruction X would give a
> greater priority to the highest-priority successor instructions Y than
> X had: Y.depth would be X.depth + 1 and Y.alap would be X.alap - 1,
> giving an equal combined height, but with the greater depth winning as
> a tie-breaker. But this doesn't work if the alap value was ultimately
> determined by an anti-dependence.
> 
> This is particularly bad when --param max-pending-list-length kicks in,
> since we then start adding fake anti-dependencies in order to keep the
> list length down.  These fake dependencies tend to be on the critical
> path.
> 
> The attached patch avoids that by making the alap calculation only
> look at true dependencies.  This shouldn't be too bad, since we use
> INSN_PRIORITY as the final tie-breaker than that does take
> anti-dependencies into account.
> 
> This reduces the number of spills in the hot function from 436.cactusADM
> by 14% on aarch64 at -O3 (and the number of instructions in general).
> SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
> Thanks to Wilco for the benchmarking.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Is this ok for trunk?
> 
> Thanks,
> Kyrill
> 
> 2018-11-08  Richard Sandiford  
> 
> gcc/
>     * haifa-sched.c (model_analyze_insns): Only add 1 to the consumer's
>     ALAP if the dependence is a true dependence.
So at the least the documentation of the ALAP field would need to be
updated as well as the comment you referenced (the "any dependencies").

But more importantly, it seems like blindly ignoring anti dependencies
is just a hack that happens to work.  I wonder if we could somehow mark
the fake dependencies we add, and avoid bumping the ALAP when we
encounter those fake dependencies.

It probably wouldn't be a bad idea to look at the default for
MAX_PENDING_LIST_LENGTH.  Based on the current default value and the
comments in the code that value could well have been tuned 25 or more
years ago!  How often are we falling over that during a bootstrap and
during spec builds?


Jeff


Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-16 Thread Kyrill Tkachov

Hi Jeff,

On 10/11/18 00:04, Jeff Law wrote:

On 11/8/18 5:10 AM, Kyrill Tkachov wrote:



This patch fixes a flaw in the relationship between the way that
SCHED_PRESSURE_MODEL calculates the alap and depth vs how it uses
them in model_order_p.  A comment in model_order_p says:

   /* Combine the length of the longest path of satisfied true dependencies
  that leads to each instruction (depth) with the length of the longest
  path of any dependencies that leads from the instruction (alap).
  Prefer instructions with the greatest combined length.  If the
combined
  lengths are equal, prefer instructions with the greatest depth.

  The idea is that, if we have a set S of "equal" instructions that each
  have ALAP value X, and we pick one such instruction I, any
true-dependent
  successors of I that have ALAP value X - 1 should be preferred over S.
  This encourages the schedule to be "narrow" rather than "wide".
  However, if I is a low-priority instruction that we decided to
  schedule because of its model_classify_pressure, and if there
  is a set of higher-priority instructions T, the aforementioned
  successors of I should not have the edge over T.  */

The expectation was that scheduling an instruction X would give a
greater priority to the highest-priority successor instructions Y than
X had: Y.depth would be X.depth + 1 and Y.alap would be X.alap - 1,
giving an equal combined height, but with the greater depth winning as
a tie-breaker. But this doesn't work if the alap value was ultimately
determined by an anti-dependence.

This is particularly bad when --param max-pending-list-length kicks in,
since we then start adding fake anti-dependencies in order to keep the
list length down.  These fake dependencies tend to be on the critical
path.

The attached patch avoids that by making the alap calculation only
look at true dependencies.  This shouldn't be too bad, since we use
INSN_PRIORITY as the final tie-breaker than that does take
anti-dependencies into account.

This reduces the number of spills in the hot function from 436.cactusADM
by 14% on aarch64 at -O3 (and the number of instructions in general).
SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
Thanks to Wilco for the benchmarking.

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this ok for trunk?

Thanks,
Kyrill

2018-11-08  Richard Sandiford  

gcc/
 * haifa-sched.c (model_analyze_insns): Only add 1 to the consumer's
 ALAP if the dependence is a true dependence.

So at the least the documentation of the ALAP field would need to be
updated as well as the comment you referenced (the "any dependencies").


Ah, I can easily update the patch for that.


But more importantly, it seems like blindly ignoring anti dependencies
is just a hack that happens to work.  I wonder if we could somehow mark
the fake dependencies we add, and avoid bumping the ALAP when we
encounter those fake dependencies.


I did experiment with this. I added a new property to dep_t
to mark it as "artificial", that is created in the parts of sched-deps.c
that add dependencies when MAX_PENDING_LIST_LENGTH is exceeded.

Then ALAP is bumped only when the dependency is not artificial in this way.
This didn't help much on the testcase we were looking at (the hot function in 
cactus from SPEC2006).

The code size increase and number of spills decreased by only 6 (out of ~800) 
whereas with Richards'
patch it improved much more (~140 decrease, with a corresponding improvement in 
stack usage and code size).

Richard did suggest that anti-dependencies are already taken into account in 
the INSN_PRIORITY tie-breaker,
so perhaps that is a better scheme indeed?



It probably wouldn't be a bad idea to look at the default for
MAX_PENDING_LIST_LENGTH.  Based on the current default value and the
comments in the code that value could well have been tuned 25 or more
years ago!


Probably. I see that s390 and spu increase that param in their backends to much 
larger values than the default
I played around with increasing it on aarch64. It improved things somewhat, but 
Richard's patch still gave superior results.

Thanks,
Kyrill


How often are we falling over that during a bootstrap and
during spec builds?


Jeff




Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-16 Thread Jeff Law
On 11/16/18 5:35 AM, Kyrill Tkachov wrote:
> 
>> But more importantly, it seems like blindly ignoring anti dependencies
>> is just a hack that happens to work.  I wonder if we could somehow mark
>> the fake dependencies we add, and avoid bumping the ALAP when we
>> encounter those fake dependencies.
> 
> I did experiment with this. I added a new property to dep_t
> to mark it as "artificial", that is created in the parts of sched-deps.c
> that add dependencies when MAX_PENDING_LIST_LENGTH is exceeded.
> 
> Then ALAP is bumped only when the dependency is not artificial in this way.
> This didn't help much on the testcase we were looking at (the hot
> function in cactus from SPEC2006).
> 
> The code size increase and number of spills decreased by only 6 (out of
> ~800) whereas with Richards'
> patch it improved much more (~140 decrease, with a corresponding
> improvement in stack usage and code size).
> 
> Richard did suggest that anti-dependencies are already taken into
> account in the INSN_PRIORITY tie-breaker,
> so perhaps that is a better scheme indeed?
ISTM that your experiment indicates that it's not the artificial
dependencies that are the problem.   It's the real anti-dependencies
that are mucking things up.  That's fine, it just means I don't think we
want/need to do anything special for the artificial dependencies.

We certainly use INSN_PRIORITY as one of the (many) tie breakers in the
rank_for_schedule routine.  BUt I don't know if that's a better place or
not.

I trust Richard, so if he thinks the patch is the best approach, then
let's go with it after that trivial comment fix I mentioned in my
previous message.


> 
>>
>> It probably wouldn't be a bad idea to look at the default for
>> MAX_PENDING_LIST_LENGTH.  Based on the current default value and the
>> comments in the code that value could well have been tuned 25 or more
>> years ago!
> 
> Probably. I see that s390 and spu increase that param in their backends
> to much larger values than the default
> I played around with increasing it on aarch64. It improved things
> somewhat, but Richard's patch still gave superior results.
ACK.  Thanks for testing.  If you want to adjust, that would seem fine
as a follow-up.

jeff


Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-16 Thread Kyrill Tkachov

Hi Jeff,

On 16/11/18 17:08, Jeff Law wrote:

On 11/16/18 5:35 AM, Kyrill Tkachov wrote:

But more importantly, it seems like blindly ignoring anti dependencies
is just a hack that happens to work.  I wonder if we could somehow mark
the fake dependencies we add, and avoid bumping the ALAP when we
encounter those fake dependencies.

I did experiment with this. I added a new property to dep_t
to mark it as "artificial", that is created in the parts of sched-deps.c
that add dependencies when MAX_PENDING_LIST_LENGTH is exceeded.

Then ALAP is bumped only when the dependency is not artificial in this way.
This didn't help much on the testcase we were looking at (the hot
function in cactus from SPEC2006).

The code size increase and number of spills decreased by only 6 (out of
~800) whereas with Richards'
patch it improved much more (~140 decrease, with a corresponding
improvement in stack usage and code size).

Richard did suggest that anti-dependencies are already taken into
account in the INSN_PRIORITY tie-breaker,
so perhaps that is a better scheme indeed?

ISTM that your experiment indicates that it's not the artificial
dependencies that are the problem.   It's the real anti-dependencies
that are mucking things up.  That's fine, it just means I don't think we
want/need to do anything special for the artificial dependencies.


I must apologise. Since I sent this out earlier I found a bug in my 
implementation
of the above experiment which meant I wasn't marking the dependencies properly 
in all cases :(
With that fixed, the approach removes ~100 spills which is much better than 
what I reported initially
however still not as good as Richards' patch (removed ~140 spills).

I've kicked off a SPEC2006 benchmarking run to see if it has any any effect.


We certainly use INSN_PRIORITY as one of the (many) tie breakers in the
rank_for_schedule routine.  BUt I don't know if that's a better place or
not.

I trust Richard, so if he thinks the patch is the best approach, then
let's go with it after that trivial comment fix I mentioned in my
previous message.


I'll respin Richard's patch with the comment updates and resend that,
unless the benchmark run above shows something interesting.


It probably wouldn't be a bad idea to look at the default for
MAX_PENDING_LIST_LENGTH.  Based on the current default value and the
comments in the code that value could well have been tuned 25 or more
years ago!

Probably. I see that s390 and spu increase that param in their backends
to much larger values than the default
I played around with increasing it on aarch64. It improved things
somewhat, but Richard's patch still gave superior results.

ACK.  Thanks for testing.  If you want to adjust, that would seem fine
as a follow-up.


I'd need to benchmark such a change, but thanks.

MAX_PENDING_LIST_LENGTH only exists to limit compile-time, right?


Thanks, and sorry for the confusion.

Kyrill


jeff




Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-16 Thread Jeff Law
On 11/16/18 10:21 AM, Kyrill Tkachov wrote:
> 
 It probably wouldn't be a bad idea to look at the default for
 MAX_PENDING_LIST_LENGTH.  Based on the current default value and the
 comments in the code that value could well have been tuned 25 or more
 years ago!
>>> Probably. I see that s390 and spu increase that param in their backends
>>> to much larger values than the default
>>> I played around with increasing it on aarch64. It improved things
>>> somewhat, but Richard's patch still gave superior results.
>> ACK.  Thanks for testing.  If you want to adjust, that would seem fine
>> as a follow-up.
> 
> I'd need to benchmark such a change, but thanks.
> 
> MAX_PENDING_LIST_LENGTH only exists to limit compile-time, right?
I believe so.

jeff


Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-16 Thread Pat Haugen
On 11/8/18 6:10 AM, Kyrill Tkachov wrote:
> The attached patch avoids that by making the alap calculation only
> look at true dependencies.  This shouldn't be too bad, since we use
> INSN_PRIORITY as the final tie-breaker than that does take
> anti-dependencies into account.
> 
> This reduces the number of spills in the hot function from 436.cactusADM
> by 14% on aarch64 at -O3 (and the number of instructions in general).
> SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
> Thanks to Wilco for the benchmarking.

I tried the patch on PowerPC since it also uses SCHED_PRESSURE_MODEL algorithm. 
For CPU2006 only cactusADM had a noticeable difference, but I'm seeing a 5% 
degradation. Looking at the generated asm for function 
bench_staggeredleapfrog2_(), I see about a 1% increase in number of loads and 
stores generated and an extra 100 bytes allocated on the stack.

-Pat



Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-16 Thread Kyrill Tkachov



On 16/11/18 18:19, Pat Haugen wrote:

On 11/8/18 6:10 AM, Kyrill Tkachov wrote:
> The attached patch avoids that by making the alap calculation only
> look at true dependencies.  This shouldn't be too bad, since we use
> INSN_PRIORITY as the final tie-breaker than that does take
> anti-dependencies into account.
>
> This reduces the number of spills in the hot function from 436.cactusADM
> by 14% on aarch64 at -O3 (and the number of instructions in general).
> SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
> Thanks to Wilco for the benchmarking.

I tried the patch on PowerPC since it also uses SCHED_PRESSURE_MODEL algorithm. 
For CPU2006 only cactusADM had a noticeable difference, but I'm seeing a 5% 
degradation. Looking at the generated asm for function 
bench_staggeredleapfrog2_(), I see about a 1% increase in number of loads and 
stores generated and an extra 100 bytes allocated on the stack.



Thanks for trying it out!
Given that, I'll try to polish up the more targeted approach that filters out 
only the fake dependencies, which should give a more targeted approach.

Thanks,
Kyrill


-Pat





Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-19 Thread Kyrill Tkachov


On 16/11/18 18:19, Pat Haugen wrote:

On 11/8/18 6:10 AM, Kyrill Tkachov wrote:

The attached patch avoids that by making the alap calculation only
look at true dependencies.  This shouldn't be too bad, since we use
INSN_PRIORITY as the final tie-breaker than that does take
anti-dependencies into account.

This reduces the number of spills in the hot function from 436.cactusADM
by 14% on aarch64 at -O3 (and the number of instructions in general).
SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
Thanks to Wilco for the benchmarking.

I tried the patch on PowerPC since it also uses SCHED_PRESSURE_MODEL algorithm. 
For CPU2006 only cactusADM had a noticeable difference, but I'm seeing a 5% 
degradation. Looking at the generated asm for function 
bench_staggeredleapfrog2_(), I see about a 1% increase in number of loads and 
stores generated and an extra 100 bytes allocated on the stack.

-Pat



This is a follow-up from 
https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01525.html
This version introduces an "artificial" property of the dependencies produced in
sched-deps.c that is recorded when they are created due to 
MAX_PENDING_LIST_LENGTH
and they are thus ignored in the model_analyze_insns ALAP calculation.

This approach gives most of the benefits of the original patch [1] on aarch64.
I tried it on the cactusADM hot function (bench_staggeredleapfrog2_) on 
powerpc64le-unknown-linux-gnu
with -O3 and found that the initial version proposed did indeed increase the 
instruction count
and stack space. This version gives a small improvement on powerpc in terms of 
instruction count
(number of st* instructions stays the same), so I'm hoping this version 
addresses Pat's concerns.
Pat, could you please try this version out if you've got the chance?

In terms of implementation, it required extending the various add_dependency 
functions/helpers to
take an extra argument marking the dependency as "artificial", which is what 
most of the patch diff
does. It is otherwise a fairly simple patch.

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this ok for trunk if the PPC performance results look ok?


[1] https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01514.html

2018-11-19  Richard Sandiford  
Kyrylo Tkachov  

* haifa-sched.c (model_analyze_insns): Avoid counting artificial
anti-dependencies.
* sched-int.h (struct _dep): Add artificial field.
(DEP_ARTIFICIAL): Define accessor.
(DEP_ANTI_ARTIFICIAL): Define.
(DEP_POSTPONED): Adjust definition.
(add_dependence): Add default bool argument to prototype.
* sched-deps.c (init_dep_1): Initialize artificial field.
(add_dependence_1): Add default bool parameter.  Handle in definition.
(add_dependence_list): Likewise.
(add_dependence_list_and_free): Likewise.
(flush_pending_lists): Likewise.
(haifa_note_dep): Handle DEP_ANTI_ARTIFICIAL in ds.
(sched_analyze_1): Mark new dependencies created as part of handling
MAX_PENDING_LIST_LENGTH limit as artificial.
(sched_analyze_2): Likewise.
(sched_analyze_insn): Likewise.
(deps_analyze_insn): Likewise.
(dump_ds): Handle DEP_ANTI_ARTIFICIAL.

diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 2c84ce3814357b30e1aaed57f1de3034d99afd57..c1787d01c5f4765a63986efe04c59748792e4932 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -3504,8 +3504,13 @@ model_analyze_insns (void)
 	FOR_EACH_DEP (iter, SD_LIST_FORW, sd_it, dep)
 	  {
 	con = MODEL_INSN_INFO (DEP_CON (dep));
-	if (con->insn && insn->alap < con->alap + 1)
-	  insn->alap = con->alap + 1;
+	/* Consider all dependencies except those introduced artificially
+	   by the scheduler as part of adhering to the
+	   MAX_PENDING_LIST_LENGTH limit.  */
+	unsigned int min_alap
+	  = con->alap + !DEP_ARTIFICIAL (dep);
+	if (con->insn && insn->alap < min_alap)
+	  insn->alap = min_alap;
 	  }
 
 	insn->old_queue = QUEUE_INDEX (iter);
diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
index f89f28269fd5ecf96688ed255d07b6976d2180c4..a83b779086099e9a1a0690b480230f4d7ad3e0ba 100644
--- a/gcc/sched-deps.c
+++ b/gcc/sched-deps.c
@@ -100,6 +100,7 @@ init_dep_1 (dep_t dep, rtx_insn *pro, rtx_insn *con, enum reg_note type, ds_t ds
   DEP_COST (dep) = UNKNOWN_DEP_COST;
   DEP_NONREG (dep) = 0;
   DEP_MULTIPLE (dep) = 0;
+  DEP_ARTIFICIAL (dep) = 0;
   DEP_REPLACE (dep) = NULL;
 }
 
@@ -472,16 +473,18 @@ static int cache_size;
 static bool mark_as_hard;
 
 static int deps_may_trap_p (const_rtx);
-static void add_dependence_1 (rtx_insn *, rtx_insn *, enum reg_note);
+static void add_dependence_1 (rtx_insn *, rtx_insn *, enum reg_note,
+			  bool = false);
 static void add_dependence_list (rtx_insn *, rtx_insn_list *, int,
- enum reg_note, bool);
+ enum reg_note, bool, bool = false);
 static void add_dependence_list_and_free (struct deps_desc *, rtx_insn *,
 	  rtx_insn_list **, int, enum reg_note,
-	  bool);
+	  bool, bo

Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-19 Thread Pat Haugen
On 11/19/18 11:54 AM, Kyrill Tkachov wrote:
> On 16/11/18 18:19, Pat Haugen wrote:
>> On 11/8/18 6:10 AM, Kyrill Tkachov wrote:
>>> The attached patch avoids that by making the alap calculation only
>>> look at true dependencies.  This shouldn't be too bad, since we use
>>> INSN_PRIORITY as the final tie-breaker than that does take
>>> anti-dependencies into account.
>>>
>>> This reduces the number of spills in the hot function from 436.cactusADM
>>> by 14% on aarch64 at -O3 (and the number of instructions in general).
>>> SPEC2017 shows a minor improvement on Cortex-A72 (about 0.1% overall).
>>> Thanks to Wilco for the benchmarking.
>> I tried the patch on PowerPC since it also uses SCHED_PRESSURE_MODEL 
>> algorithm. For CPU2006 only cactusADM had a noticeable difference, but I'm 
>> seeing a 5% degradation. Looking at the generated asm for function 
>> bench_staggeredleapfrog2_(), I see about a 1% increase in number of loads 
>> and stores generated and an extra 100 bytes allocated on the stack.
>>
>> -Pat
>>
> 
> This is a follow-up from 
> https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01525.html
> This version introduces an "artificial" property of the dependencies produced 
> in
> sched-deps.c that is recorded when they are created due to 
> MAX_PENDING_LIST_LENGTH
> and they are thus ignored in the model_analyze_insns ALAP calculation.
> 
> This approach gives most of the benefits of the original patch [1] on aarch64.
> I tried it on the cactusADM hot function (bench_staggeredleapfrog2_) on 
> powerpc64le-unknown-linux-gnu
> with -O3 and found that the initial version proposed did indeed increase the 
> instruction count
> and stack space. This version gives a small improvement on powerpc in terms 
> of instruction count
> (number of st* instructions stays the same), so I'm hoping this version 
> addresses Pat's concerns.
> Pat, could you please try this version out if you've got the chance?
> 

I tried the new verison on cactusADM, it's showing a 2% degradation. I've 
kicked off a full CPU2006 run just to see if any others are affected.

-Pat



Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-20 Thread Pat Haugen
On 11/19/18 2:30 PM, Pat Haugen wrote:
>> This is a follow-up from 
>> https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01525.html
>> This version introduces an "artificial" property of the dependencies 
>> produced in
>> sched-deps.c that is recorded when they are created due to 
>> MAX_PENDING_LIST_LENGTH
>> and they are thus ignored in the model_analyze_insns ALAP calculation.
>>
>> This approach gives most of the benefits of the original patch [1] on 
>> aarch64.
>> I tried it on the cactusADM hot function (bench_staggeredleapfrog2_) on 
>> powerpc64le-unknown-linux-gnu
>> with -O3 and found that the initial version proposed did indeed increase the 
>> instruction count
>> and stack space. This version gives a small improvement on powerpc in terms 
>> of instruction count
>> (number of st* instructions stays the same), so I'm hoping this version 
>> addresses Pat's concerns.
>> Pat, could you please try this version out if you've got the chance?
>>
> I tried the new verison on cactusADM, it's showing a 2% degradation. I've 
> kicked off a full CPU2006 run just to see if any others are affected.

The other benchmarks were neutral. So the only benchmark showing a change is 
the 2% degradation on cactusADM. Comparing the generated .s files for 
bench_staggeredleapfrog2_(), there is about a 0.7% increase in load insns and 
still the 1% increase in store insns.

-Pat



Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-20 Thread Kyrill Tkachov



On 20/11/18 16:48, Pat Haugen wrote:

On 11/19/18 2:30 PM, Pat Haugen wrote:

This is a follow-up from 
https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01525.html
This version introduces an "artificial" property of the dependencies produced in
sched-deps.c that is recorded when they are created due to 
MAX_PENDING_LIST_LENGTH
and they are thus ignored in the model_analyze_insns ALAP calculation.

This approach gives most of the benefits of the original patch [1] on aarch64.
I tried it on the cactusADM hot function (bench_staggeredleapfrog2_) on 
powerpc64le-unknown-linux-gnu
with -O3 and found that the initial version proposed did indeed increase the 
instruction count
and stack space. This version gives a small improvement on powerpc in terms of 
instruction count
(number of st* instructions stays the same), so I'm hoping this version 
addresses Pat's concerns.
Pat, could you please try this version out if you've got the chance?


I tried the new verison on cactusADM, it's showing a 2% degradation. I've 
kicked off a full CPU2006 run just to see if any others are affected.

The other benchmarks were neutral. So the only benchmark showing a change is 
the 2% degradation on cactusADM. Comparing the generated .s files for 
bench_staggeredleapfrog2_(), there is about a 0.7% increase in load insns and 
still the 1% increase in store insns.


Sigh :(
What options are you compiling with? I tried a powerpc64le compiler with plain 
-O3 and saw got a slight improvement (by manual expection)

Thanks,
Kyrill



-Pat





Re: Tweak ALAP calculation in SCHED_PRESSURE_MODEL

2018-11-20 Thread Pat Haugen
On 11/20/18 10:53 AM, Kyrill Tkachov wrote:
> On 20/11/18 16:48, Pat Haugen wrote:
>> On 11/19/18 2:30 PM, Pat Haugen wrote:
 This is a follow-up from 
 https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01525.html
 This version introduces an "artificial" property of the dependencies 
 produced in
 sched-deps.c that is recorded when they are created due to 
 MAX_PENDING_LIST_LENGTH
 and they are thus ignored in the model_analyze_insns ALAP calculation.

 This approach gives most of the benefits of the original patch [1] on 
 aarch64.
 I tried it on the cactusADM hot function (bench_staggeredleapfrog2_) on 
 powerpc64le-unknown-linux-gnu
 with -O3 and found that the initial version proposed did indeed increase 
 the instruction count
 and stack space. This version gives a small improvement on powerpc in 
 terms of instruction count
 (number of st* instructions stays the same), so I'm hoping this version 
 addresses Pat's concerns.
 Pat, could you please try this version out if you've got the chance?

>>> I tried the new verison on cactusADM, it's showing a 2% degradation. I've 
>>> kicked off a full CPU2006 run just to see if any others are affected.
>> The other benchmarks were neutral. So the only benchmark showing a change is 
>> the 2% degradation on cactusADM. Comparing the generated .s files for 
>> bench_staggeredleapfrog2_(), there is about a 0.7% increase in load insns 
>> and still the 1% increase in store insns.
> 
> Sigh :(
> What options are you compiling with? I tried a powerpc64le compiler with 
> plain -O3 and saw got a slight improvement (by manual expection)

I was using the following: -O3 -mcpu=power8 -fpeel-loops -funroll-loops 
-ffast-math -mpopcntd -mrecip=all. When I run with just -O3 -mcpu=power8 I see 
just under a 1% degradation.

-Pat