Re: Question on patch -fprofile-partial-training

2023-05-15 Thread Qing Zhao via Gcc-patches


> On May 11, 2023, at 12:08 PM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> 
>> On May 10, 2023, at 9:15 AM, Jan Hubicka  wrote:
>> 
>>> Honza,
 Main motivation for this was profiling programs that contain specific
 code paths for different CPUs (such as the graphics library in Firefox or the
 Linux kernel). In the situation where the training machine differs from the
 machine the program is run on later, we end up optimizing all code paths
 for size except the ones taken by the specific CPU.  This patch essentially
 tells gcc to consider every non-trained function as built without profile
 feedback.
>>> Makes sense.
 
 For Firefox it had an important impact on graphics rendering tests back
 then, since the building machine had AVX while the benchmarking machine did
 not.  Some benchmarks improved several times over, which is not a surprise if
 you consider a tight graphics rendering loop optimized for size versus a
 vectorized one.
>>> 
>>> That’s a lot of improvement. So, without -fprofile-partial-training, did 
>>> PGO hurt the performance for those cases? 
>> 
>> Yes, to get code size improvements we assume that the non-trained part
>> of code is cold, and with -Os we are very aggressive in optimizing for
>> size.  We now have a two-level optimize_for_size, so I think we could
>> make this more fine-grained this stage1.
> 
> Okay. I see. 
> 
> Thanks a lot for the info.
> 
> Another question (which is confusing us very much right now):
> 
> When we lower the following parameter from 999 to 950 (in GCC8):
> 
> DEFPARAM (HOT_BB_COUNT_WS_PERMILLE,
>   "hot-bb-count-ws-permille",
>   "A basic block profile count is considered hot if it contributes to "
>   "the given permillage of the entire profiled execution.",
>   999, 0, 1000)
> 
> the size of the "text.hot" section is 4x SMALLER than with the default value. 
> Is this expected behavior? 

Based on my further study of GCC8: yes, this is the expected behavior. :-)
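
The reason, as far as I understand the GCC8 code, is that this parameter selects
a working-set coverage target rather than a per-block ratio: the profile counters
are accumulated from the hottest down until they cover the given permillage of
the total profiled execution, and the last counter included becomes the hot
cutoff.  Lowering the permillage from 999 to 950 stops that accumulation earlier,
which raises the cutoff, so fewer blocks and functions qualify as hot and
text.hot shrinks.  A minimal sketch with made-up numbers (my own illustration,
not the actual GCC code):

/* Minimal sketch (not GCC code): how a working-set based hot cutoff behaves.
   The counter values below are invented for illustration.  */
#include <stdio.h>

static long
hot_cutoff (const long *counts_desc, int n, int permille)
{
  long total = 0, running = 0;
  for (int i = 0; i < n; i++)
    total += counts_desc[i];
  for (int i = 0; i < n; i++)
    {
      running += counts_desc[i];
      if (running * 1000 >= total * permille)
        return counts_desc[i];  /* every count >= this value is "hot" */
    }
  return counts_desc[n - 1];
}

int
main (void)
{
  long counts[] = { 900, 60, 30, 9, 1 };  /* hottest first, total 1000 */
  printf ("permille 999 -> cutoff %ld\n", hot_cutoff (counts, 5, 999)); /* 9 */
  printf ("permille 950 -> cutoff %ld\n", hot_cutoff (counts, 5, 950)); /* 60 */
  return 0;
}

With permille 999 the cutoff is 9 and four of the five counters are hot; with
permille 950 the cutoff is 60 and only two are, which matches the smaller
text.hot section we observe.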

Qing
> (From my reading of the GCC8 source code, when this parameter gets smaller, 
> more basic blocks and functions will be considered as HOT by GCC, so the 
> text.hot section should be larger, not smaller. Am I missing anything here?)
> 
> Thanks a lot for your help.
> 
> Qing
> 
>> 
>> Honza
>>> 
 The patch has a bad effect on code size which in turn
 impacts performance too, so I think it makes sense to use
 -fprofile-partial-training with a bit of care (i.e. only on code where
 such scenarios are likely).
>>> 
>>> Right. 
 
 As for backporting, I do not have a checkout of GCC 8 right now. It
 depends on the profile infrastructure that was added in 2017 (so stage1 of
 GCC 8), so the patch may backport quite easily.  I am not 100% sure
 what shape the infrastructure was in for the first version, but I am quite
 convinced it had the necessary bits - it was able to tell the difference
 between a 0 profile count and missing profile feedback.
>>> 
>>> This is good to know. I will try to backport it to GCC8 and let them test to 
>>> see whether there is any good impact.
>>> 
>>> Qing
 
 Honza



Re: Question on patch -fprofile-partial-training

2023-05-11 Thread Qing Zhao via Gcc-patches


> On May 10, 2023, at 9:15 AM, Jan Hubicka  wrote:
> 
>> Honza,
>>> Main motivation for this was profiling programs that contain specific
>>> code paths for different CPUs (such as the graphics library in Firefox or the
>>> Linux kernel). In the situation where the training machine differs from the
>>> machine the program is run on later, we end up optimizing all code paths
>>> for size except the ones taken by the specific CPU.  This patch essentially
>>> tells gcc to consider every non-trained function as built without profile
>>> feedback.
>> Makes sense.
>>> 
>>> For Firefox it had an important impact on graphics rendering tests back
>>> then, since the building machine had AVX while the benchmarking machine did
>>> not.  Some benchmarks improved several times over, which is not a surprise if
>>> you consider a tight graphics rendering loop optimized for size versus a
>>> vectorized one.
>> 
>> That’s a lot of improvement. So, without -fprofile-partial-training, did PGO 
>> hurt the performance for those cases? 
> 
> Yes, to get code size improvements we assume that the non-trained part
> of code is cold, and with -Os we are very aggressive in optimizing for
> size.  We now have a two-level optimize_for_size, so I think we could
> make this more fine-grained this stage1.

Okay. I see. 

Thanks a lot for the info.

Another question (which is confusing us very much right now):

When we lower the following parameter from 999 to 950 (in GCC8):

DEFPARAM (HOT_BB_COUNT_WS_PERMILLE,
  "hot-bb-count-ws-permille",
  "A basic block profile count is considered hot if it contributes to "
  "the given permillage of the entire profiled execution.",
  999, 0, 1000)

the size of the "text.hot" section is 4x SMALLER than with the default value. Is 
this expected behavior? 
(From my reading of the GCC8 source code, when this parameter gets smaller, 
more basic blocks and functions will be considered as HOT by GCC, so the 
text.hot section should be larger, not smaller. Am I missing anything here?)

Thanks a lot for your help.

Qing

> 
> Honza
>> 
>>> The patch has a bad effect on code size which in turn
>>> impacts performance too, so I think it makes sense to use
>>> -fprofile-partial-training with a bit of care (i.e. only on code where
>>> such scenarios are likely).
>> 
>> Right. 
>>> 
>>> As for backporting, I do not have a checkout of GCC 8 right now. It
>>> depends on the profile infrastructure that was added in 2017 (so stage1 of
>>> GCC 8), so the patch may backport quite easily.  I am not 100% sure
>>> what shape the infrastructure was in for the first version, but I am quite
>>> convinced it had the necessary bits - it was able to tell the difference
>>> between a 0 profile count and missing profile feedback.
>> 
>> This is good to know. I will try to backport it to GCC8 and let them test to 
>> see whether there is any good impact.
>> 
>> Qing
>>> 
>>> Honza



Re: Question on patch -fprofile-partial-training

2023-05-10 Thread Jan Hubicka via Gcc-patches
> Honza,
> > Main motivation for this was profiling programs that contain specific
> > code paths for different CPUs (such as the graphics library in Firefox or the
> > Linux kernel). In the situation where the training machine differs from the
> > machine the program is run on later, we end up optimizing all code paths
> > for size except the ones taken by the specific CPU.  This patch essentially
> > tells gcc to consider every non-trained function as built without profile
> > feedback.
> Makes sense.
> > 
> > For Firefox it had an important impact on graphics rendering tests back
> > then, since the building machine had AVX while the benchmarking machine did
> > not.  Some benchmarks improved several times over, which is not a surprise if
> > you consider a tight graphics rendering loop optimized for size versus a
> > vectorized one.
> 
> That’s a lot of improvement. So, without -fprofile-partial-training, did PGO 
> hurt the performance for those cases? 

Yes, to get code size improvements we assume that the non-trained part
of code is cold, and with -Os we are very aggressive in optimizing for
size.  We now have a two-level optimize_for_size, so I think we could
make this more fine-grained this stage1.

Honza
> 
> > The patch has a bad effect on code size which in turn
> > impacts performance too, so I think it makes sense to use
> > -fprofile-partial-training with a bit of care (i.e. only on code where
> > such scenarios are likely).
> 
> Right. 
> > 
> > As for backporting, I do not have a checkout of GCC 8 right now. It
> > depends on the profile infrastructure that was added in 2017 (so stage1 of
> > GCC 8), so the patch may backport quite easily.  I am not 100% sure
> > what shape the infrastructure was in for the first version, but I am quite
> > convinced it had the necessary bits - it was able to tell the difference
> > between a 0 profile count and missing profile feedback.
> 
> This is good to know. I will try to backport it to GCC8 and let them test to 
> see whether there is any good impact.
> 
> Qing
> > 
> > Honza
> >> 
> 


Re: Question on patch -fprofile-partial-training

2023-05-09 Thread Qing Zhao via Gcc-patches
Honza,

Thanks a lot for your comments. 

> On May 9, 2023, at 6:22 AM, Jan Hubicka  wrote:
> 
> 
> From my understanding, -fprofile-partial-training is one important option 
> for PGO performance.
 
 I don't think so; the speed benefit would be rather small, I guess.
>>> I saw some articles online introducing this option for gcc10,
>>> https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html#sec-gcc10-pgo
>> 
>> Hi.
>> 
>> Ah, I see.
>> 
>>> And also based on my previous experience in the Studio compiler, I guess 
>>> that this one might have
>>> some good performance impact on PGO.  Is there any old performance data on 
>>> this option? (I cannot find any online)
>> 
>> Maybe Honza can chime in here? Or Martin who is the author of the white 
>> paper.
> 
> Main motivation for this was profiling programs that contain specific
> code paths for different CPUs (such as the graphics library in Firefox or the
> Linux kernel). In the situation where the training machine differs from the
> machine the program is run on later, we end up optimizing all code paths
> for size except the ones taken by the specific CPU.  This patch essentially
> tells gcc to consider every non-trained function as built without profile
> feedback.
Makes sense.
> 
> For Firefox it had an important impact on graphics rendering tests back
> then, since the building machine had AVX while the benchmarking machine did
> not.  Some benchmarks improved several times over, which is not a surprise if
> you consider a tight graphics rendering loop optimized for size versus a
> vectorized one.

That’s a lot of improvement. So, without -fprofile-partial-training, did PGO 
hurt the performance for those cases? 

> The patch has a bad effect on code size which in turn
> impacts performance too, so I think it makes sense to use
> -fprofile-partial-training with a bit of care (i.e. only on code where
> such scenarios are likely).

Right. 
> 
> As for backporting, I do not have a checkout of GCC 8 right now. It
> depends on the profile infrastructure that was added in 2017 (so stage1 of
> GCC 8), so the patch may backport quite easily.  I am not 100% sure
> what shape the infrastructure was in for the first version, but I am quite
> convinced it had the necessary bits - it was able to tell the difference
> between a 0 profile count and missing profile feedback.

This is good to know. I will try to backport it to GCC8 and let them test to see 
whether there is any good impact.

Qing
> 
> Honza
>> 



Re: Question on patch -fprofile-partial-training

2023-05-09 Thread Jan Hubicka via Gcc-patches
> > > > 
> > > >  From my understanding, -fprofile-partial-training is one important 
> > > > option for PGO performance.
> > > 
> > > I don't think so; the speed benefit would be rather small, I guess.
> > I saw some articles online introducing this option for gcc10,
> > https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html#sec-gcc10-pgo
> 
> Hi.
> 
> Ah, I see.
> 
> > And also based on my previous experience in the Studio compiler, I guess that 
> > this one might have
> > some good performance impact on PGO.  Is there any old performance data on 
> > this option? (I cannot find any online)
> 
> Maybe Honza can chime in here? Or Martin who is the author of the white paper.

Main motivation for this was profiling programs that contain specific
code paths for different CPUs (such as the graphics library in Firefox or the
Linux kernel). In the situation where the training machine differs from the
machine the program is run on later, we end up optimizing all code paths
for size except the ones taken by the specific CPU.  This patch essentially
tells gcc to consider every non-trained function as built without profile
feedback.
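
To make the scenario concrete, here is a minimal sketch of the kind of runtime
CPU dispatch involved (the function names below are invented for illustration;
only __builtin_cpu_supports and the -fprofile-generate / -fprofile-use /
-fprofile-partial-training options are real GCC features):

#include <stddef.h>

static void
scale_avx (float *dst, const float *src, size_t n, float k)
{
  /* In a real library this would be the AVX/vector-friendly variant.  */
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i] * k;
}

static void
scale_generic (float *dst, const float *src, size_t n, float k)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i] * k;
}

void
scale (float *dst, const float *src, size_t n, float k)
{
  /* If the training machine has no AVX, the first branch collects only
     zero counts.  With plain -fprofile-use, scale_avx is then treated as
     cold and optimized for size; with -fprofile-partial-training it is
     treated as if it had been built without profile feedback instead.  */
  if (__builtin_cpu_supports ("avx"))
    scale_avx (dst, src, n, k);
  else
    scale_generic (dst, src, n, k);
}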

For Firefox it had an important impact on graphics rendering tests back
then, since the building machine had AVX while the benchmarking machine did
not.  Some benchmarks improved several times over, which is not a surprise
if you consider a tight graphics rendering loop optimized for size versus a
vectorized one.  The patch has a bad effect on code size which in turn
impacts performance too, so I think it makes sense to use
-fprofile-partial-training with a bit of care (i.e. only on code where
such scenarios are likely).

As for backporting, I do not have a checkout of GCC 8 right now. It
depends on the profile infrastructure that was added in 2017 (so stage1 of
GCC 8), so the patch may backport quite easily.  I am not 100% sure
what shape the infrastructure was in for the first version, but I am quite
convinced it had the necessary bits - it was able to tell the difference
between a 0 profile count and missing profile feedback.
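
For illustration, a minimal sketch (my own, not GCC's internal API; the enum
and function below are made up) of the decision that distinction enables: a
measured zero entry count is normally trusted and the function is treated as
cold, while with -fprofile-partial-training it falls back to a guessed profile,
as if no feedback had been read for it:

/* Sketch only; illustrative, not GCC's actual API.  */
enum profile_status { PROFILE_READ, PROFILE_GUESSED };

static enum profile_status
status_for_function (unsigned long entry_count, int partial_training)
{
  if (entry_count == 0 && partial_training)
    /* Untrained code path: drop the zero counts and keep the static
       (guessed) profile, so the code is still optimized for speed.  */
    return PROFILE_GUESSED;
  /* Otherwise trust the feedback; a zero count really means cold.  */
  return PROFILE_READ;
}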

Honza
> 
> Martin
> 
> > 
> > thanks.
> > 
> > Qing
> > 
> > > 
> > > > I’d like
> > > > to see whether there is any big technical difficulty preventing it from 
> > > > being backported to GCC8.
> > > 
> > > There might of course be some patch dependencies, and I don't see why we 
> > > should waste
> > > time with that.
> > > 
> > > Cheers,
> > > Martin
> > > 
> > > > 
> > > > Thanks.
> > > > 
> > > > Qing
> > > > 
> > > > > 
> > > > > Martin
> > > > > 
> > > > > > 
> > > > > > Can this patch be backported to GCC8 easily? I am wondering whether 
> > > > > > there is any significant
> > > > > > change between GCC8 and GCC10 that might make the backporting very 
> > > > > > hard.
> > > > > > Thanks a lot for your help.
> > > > > > 
> > > > > > Qing
> > 
> 


Re: Question on patch -fprofile-partial-training

2023-05-09 Thread Martin Liška

On 5/4/23 15:37, Qing Zhao wrote:




On May 4, 2023, at 9:05 AM, Martin Liška  wrote:

On 5/4/23 14:54, Qing Zhao wrote:




On May 4, 2023, at 4:30 AM, Martin Liška  wrote:

On 5/3/23 21:10, Qing Zhao via Gcc-patches wrote:

Hi, Jan,

You added the following patch into gcc10:

 From 34fbe3f0946f88828765184ed6581bda62cdf49f Mon Sep 17 00:00:00 2001
From: Jan Hubicka 
Date: Thu, 5 Dec 2019 19:12:51 +0100
Subject: [PATCH] cgraphclones.c (localize_profile): New function.

   * cgraphclones.c (localize_profile): New function.
   (cgraph_node::create_clone): Use it for partial profiles.
   * common.opt (fprofile-partial-training): New flag.
   * doc/invoke.texi (-fprofile-partial-training): Document.
   * ipa-cp.c (update_profiling_info): For partial profiles do not
   set function profile to zero.
   * profile.c (compute_branch_probabilities): With partial profile
   watch if edge count is zero and turn all probabilities to guessed.
   (compute_branch_probabilities): For partial profiles do not apply
   profile when entry count is zero.
   * tree-profile.c (tree_profiling): Only do value_profile_transformations
   when profile is read.

My question is:


Hello.

Why would anybody backport such a change to the unsupported code stream of GCC 8?
Generally speaking, I discourage doing that.


Yes, I agree.
However, many users still use GCC8 right now, and some of them have recently been 
asking for more performance
from PGO. That’s the reason I am studying this right now.


I understand there are products that are based on GCC8, but as the branch is 
officially unsupported, I don't
see a reason to backport a new feature from a newer release. It's just asking for 
trouble. If your clients are
interested in more performance, then they should use a recent supported release.

We are trying to persuade them to use a newer GCC, but it’s quite hard...




 From my understanding, -fprofile-partial-training is one important option for 
PGO performance.


I don't think so; the speed benefit would be rather small, I guess.

I saw some articles online introducing this option for gcc10,
https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html#sec-gcc10-pgo


Hi.

Ah, I see.


And also based on my previous experience in the Studio compiler, I guess that this 
one might have
some good performance impact on PGO.  Is there any old performance data on this 
option? (I cannot find any online)


Maybe Honza can chime in here? Or Martin who is the author of the white paper.

Martin



thanks.

Qing




I’d like
to see whether there is any big technical difficulty preventing it from being backported to GCC8.


There might of course be some patch dependencies, and I don't see why we 
should waste
time with that.

Cheers,
Martin



Thanks.

Qing



Martin



Can this patch be backported to GCC8 easily? I am wondering whether there is any 
significant change between GCC8 and GCC10 that might make the backporting very hard.
Thanks a lot for your help.

Qing






Re: Question on patch -fprofile-partial-training

2023-05-04 Thread Qing Zhao via Gcc-patches


> On May 4, 2023, at 9:05 AM, Martin Liška  wrote:
> 
> On 5/4/23 14:54, Qing Zhao wrote:
>> 
>> 
>>> On May 4, 2023, at 4:30 AM, Martin Liška  wrote:
>>> 
>>> On 5/3/23 21:10, Qing Zhao via Gcc-patches wrote:
 Hi, Jan,
 
 You added the following patch into gcc10:
 
 From 34fbe3f0946f88828765184ed6581bda62cdf49f Mon Sep 17 00:00:00 2001
 From: Jan Hubicka 
 Date: Thu, 5 Dec 2019 19:12:51 +0100
 Subject: [PATCH] cgraphclones.c (localize_profile): New function.
 
   * cgraphclones.c (localize_profile): New function.
   (cgraph_node::create_clone): Use it for partial profiles.
   * common.opt (fprofile-partial-training): New flag.
   * doc/invoke.texi (-fprofile-partial-training): Document.
   * ipa-cp.c (update_profiling_info): For partial profiles do not
   set function profile to zero.
   * profile.c (compute_branch_probabilities): With partial profile
   watch if edge count is zero and turn all probabilities to guessed.
   (compute_branch_probabilities): For partial profiles do not apply
   profile when entry count is zero.
   * tree-profile.c (tree_profiling): Only do 
 value_profile_transformations
   when profile is read.
 
 My question is:
>>> 
>>> Hello.
>>> 
>>> Why would anybody backport such a change to the unsupported code stream of GCC 8?
>>> Generally speaking, I discourage doing that.
>> 
>> Yes, I agree.
>> However, many users still use GCC8 right now, and some of them have recently 
>> been asking for more performance
>> from PGO. That’s the reason I am studying this right now. 
> 
> I understand there are products that are based on GCC8, but as the branch is 
> officially unsupported, I don't
> see a reason to backport a new feature from a newer release. It's just asking 
> for trouble. If your clients are
> interested in more performance, then they should use a recent supported 
> release.
We are trying to persuade them to use a newer GCC, but it’s quite hard...
> 
>> 
>> From my understanding, -fprofile-partial-training is one important option 
>> for PGO performance.
> 
> I don't think so; the speed benefit would be rather small, I guess.
I saw some articles online introducing this option for gcc10, 
https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html#sec-gcc10-pgo
And also based on my previous experience in the Studio compiler, I guess that this 
one might have
some good performance impact on PGO.  Is there any old performance data on this 
option? (I cannot find any online)

thanks.

Qing

> 
>> I’d like
>> to see whether there is any big technical difficulty preventing it from being 
>> backported to GCC8. 
> 
> There might of course be some patch dependencies, and I don't see why we 
> should waste
> time with that.
> 
> Cheers,
> Martin
> 
>> 
>> Thanks.
>> 
>> Qing
>> 
>>> 
>>> Martin
>>> 
 
 Can this patch be backported to GCC8 easily? I am wondering whether there is 
 any significant
 change between GCC8 and GCC10 that might make the backporting very hard. 
 Thanks a lot for your help.
 
 Qing



Re: Question on patch -fprofile-partial-training

2023-05-04 Thread Martin Liška
On 5/4/23 14:54, Qing Zhao wrote:
> 
> 
>> On May 4, 2023, at 4:30 AM, Martin Liška  wrote:
>>
>> On 5/3/23 21:10, Qing Zhao via Gcc-patches wrote:
>>> Hi, Jan,
>>>
>>> You added the following patch into gcc10:
>>>
>>> From 34fbe3f0946f88828765184ed6581bda62cdf49f Mon Sep 17 00:00:00 2001
>>> From: Jan Hubicka 
>>> Date: Thu, 5 Dec 2019 19:12:51 +0100
>>> Subject: [PATCH] cgraphclones.c (localize_profile): New function.
>>>
>>>* cgraphclones.c (localize_profile): New function.
>>>(cgraph_node::create_clone): Use it for partial profiles.
>>>* common.opt (fprofile-partial-training): New flag.
>>>* doc/invoke.texi (-fprofile-partial-training): Document.
>>>* ipa-cp.c (update_profiling_info): For partial profiles do not
>>>set function profile to zero.
>>>* profile.c (compute_branch_probabilities): With partial profile
>>>watch if edge count is zero and turn all probabilities to guessed.
>>>(compute_branch_probabilities): For partial profiles do not apply
>>>profile when entry count is zero.
>>>* tree-profile.c (tree_profiling): Only do 
>>> value_profile_transformations
>>>when profile is read.
>>>
>>> My question is:
>>
>> Hello.
>>
>> Why would anybody backport such a change to the unsupported code stream of GCC 8?
>> Generally speaking, I discourage doing that.
> 
> Yes, I agree.
> However, many users still use GCC8 right now, and some of them have recently 
> been asking for more performance
> from PGO. That’s the reason I am studying this right now. 

I understand there are products that are based on GCC8, but as the branch is 
officially unsupported, I don't
see a reason to backport a new feature from a newer release. It's just asking for 
trouble. If your clients are
interested in more performance, then they should use a recent supported release.

> 
> From my understanding, -fprofile-partial-training is one important option for 
> PGO performance.

I don't think so; the speed benefit would be rather small, I guess.

> I’d like
> to see whether there is any big technical difficulty preventing it from being 
> backported to GCC8. 

There might of course be some patch dependencies, and I don't see why we 
should waste
time with that.

Cheers,
Martin

> 
> Thanks.
> 
> Qing
> 
>>
>> Martin
>>
>>>
>>> Can this patch be backported to GCC8 easily? I am wondering whether there is 
>>> any significant change between GCC8 and GCC10 that might make the backporting 
>>> very hard. 
>>> Thanks a lot for your help.
>>>
>>> Qing
> 



Re: Question on patch -fprofile-partial-training

2023-05-04 Thread Qing Zhao via Gcc-patches


> On May 4, 2023, at 4:30 AM, Martin Liška  wrote:
> 
> On 5/3/23 21:10, Qing Zhao via Gcc-patches wrote:
>> Hi, Jan,
>> 
>> You added the following patch into gcc10:
>> 
>> From 34fbe3f0946f88828765184ed6581bda62cdf49f Mon Sep 17 00:00:00 2001
>> From: Jan Hubicka 
>> Date: Thu, 5 Dec 2019 19:12:51 +0100
>> Subject: [PATCH] cgraphclones.c (localize_profile): New function.
>> 
>>* cgraphclones.c (localize_profile): New function.
>>(cgraph_node::create_clone): Use it for partial profiles.
>>* common.opt (fprofile-partial-training): New flag.
>>* doc/invoke.texi (-fprofile-partial-training): Document.
>>* ipa-cp.c (update_profiling_info): For partial profiles do not
>>set function profile to zero.
>>* profile.c (compute_branch_probabilities): With partial profile
>>watch if edge count is zero and turn all probabilities to guessed.
>>(compute_branch_probabilities): For partial profiles do not apply
>>profile when entry count is zero.
>>* tree-profile.c (tree_profiling): Only do 
>> value_profile_transformations
>>when profile is read.
>> 
>> My question is:
> 
> Hello.
> 
> Why would anybody backport such a change to the unsupported code stream of GCC 8?
> Generally speaking, I discourage doing that.

Yes, I agree.
However, many users still use GCC8 right now, and some of them have recently been 
asking for more performance
from PGO. That’s the reason I am studying this right now. 

From my understanding, -fprofile-partial-training is one important option for 
PGO performance. I’d like
to see whether there is any big technical difficulty preventing it from being 
backported to GCC8. 

Thanks.

Qing

> 
> Martin
> 
>> 
>> Can this patch be backported to GCC8 easily? I am wondering whether there is 
>> any significant change between GCC8 and GCC10 that might make the backporting 
>> very hard. 
>> Thanks a lot for your help.
>> 
>> Qing



Re: Question on patch -fprofile-partial-training

2023-05-04 Thread Martin Liška
On 5/3/23 21:10, Qing Zhao via Gcc-patches wrote:
> Hi, Jan,
> 
> You added the following patch into gcc10:
> 
> From 34fbe3f0946f88828765184ed6581bda62cdf49f Mon Sep 17 00:00:00 2001
> From: Jan Hubicka 
> Date: Thu, 5 Dec 2019 19:12:51 +0100
> Subject: [PATCH] cgraphclones.c (localize_profile): New function.
> 
> * cgraphclones.c (localize_profile): New function.
> (cgraph_node::create_clone): Use it for partial profiles.
> * common.opt (fprofile-partial-training): New flag.
> * doc/invoke.texi (-fprofile-partial-training): Document.
> * ipa-cp.c (update_profiling_info): For partial profiles do not
> set function profile to zero.
> * profile.c (compute_branch_probabilities): With partial profile
> watch if edge count is zero and turn all probabilities to guessed.
> (compute_branch_probabilities): For partial profiles do not apply
> profile when entry count is zero.
> * tree-profile.c (tree_profiling): Only do 
> value_profile_transformations
> when profile is read.
> 
> My question is:

Hello.

Why would anybody backport such a change to the unsupported code stream of GCC 8?
Generally speaking, I discourage doing that.

Martin

> 
> Can this patch be backported to GCC8 easily? I am wondering whether there is any 
> significant change between GCC8 and GCC10 that might make the backporting very 
> hard. 
> Thanks a lot for your help.
> 
> Qing