On Fri 05 Aug 2011 09:32:05 AM CEST, Richard Guenther <richard.guent...@gmail.com> wrote:

On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <j...@suse.de> wrote:
Did you try using FDO with -Os?  FDO should make hot code parts
optimized similarly to -O3 but leave other pieces optimized for size.
Using FDO with -O3 gives you the opposite, cold portions optimized
for size while the rest is optimized for speed.

FDO with -Os still optimizes for size, even in hot parts.

I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
BBs/functions
be optimized for speed even at -Os?  Hm, I see predict.c indeed always
returns false for optimize_size :(
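For illustration, here is a minimal, self-contained C sketch of the behaviour being described; it is not the actual predict.c code and all names are hypothetical, but it shows how an unconditional optimize_size check makes the hot predicate ignore profile counts entirely:

  #include <stdbool.h>
  #include <stdint.h>

  struct bb_sketch { uint64_t count; };      /* profile count of a basic block */

  static bool optimize_size_flag;            /* stands in for -Os              */
  static uint64_t hot_count_threshold;       /* hypothetical hotness cutoff    */

  static bool
  sketch_maybe_hot_bb_p (const struct bb_sketch *bb)
  {
    if (optimize_size_flag)
      return false;                          /* -Os: never considered hot      */
    return bb->count >= hot_count_threshold;
  }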

It was the outcome of a discussion held some time ago. I think it was Mark promoting the point that users optimize for size when they use -Os, period.

I thought we optimized only the parts that are neither cold nor hot according
to optimize_size. I originally wanted the HOT attribute to override -Os, so that well-annotated sources (i.e. the kernel) could compile with -Os by default, explicitly declare the hot parts hot, and get them compiled appropriately.
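As an aside, a small example of the annotation scheme meant here (the hot and cold function attributes are existing GCC extensions; whether hot actually overrides -Os is exactly the open question above):

  /* Build the file with -Os by default, but mark the known-hot routine
     so it can be optimized for speed.  */
  __attribute__((hot)) unsigned
  checksum (const unsigned char *buf, unsigned len)
  {
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++)
      sum += buf[i];
    return sum;
  }

  /* Rarely executed path, fine to keep small.  */
  __attribute__((cold)) void report_fatal_error (const char *msg);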

With profile feedback, however, the current logic is binary - i.e. blocks are either hot, when their count is bigger than the threshold, or cold. We don't really have an "I don't really know" state there. In some cases it would make sense - i.e. there are optimizations that we want to do only in the hottest parts of the code, but we don't have any logic for that.

My plan is to extend ipa-profile to do better hot/cold partitioning first: at the moment we decide based on a fixed fraction of the maximal count in the program. This is unnecessarily conservative for programs whose profiles are not terribly flat. At the IPA level we could collect a histogram of instruction counts (i.e. figure out how much time we spend on instructions executed N times) and then work out the threshold at which 99% of executed instructions belong to the hot region. This should give noticeably smaller binaries.
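To make the idea concrete, here is a rough, illustrative sketch (not ipa-profile code; names are made up) of how such a histogram could be turned into a threshold: walk the buckets from the hottest down and stop once 99% of all dynamically executed instructions are covered.

  #include <stddef.h>
  #include <stdint.h>

  /* One bucket: how many static instructions have this execution count.  */
  struct hist_entry { uint64_t count; uint64_t ninsns; };

  /* Entries must be sorted by count, hottest first.  Returns the count
     threshold at or above which blocks would be considered hot.  */
  static uint64_t
  hot_count_threshold (const struct hist_entry *h, size_t n)
  {
    uint64_t total = 0, covered = 0;
    for (size_t i = 0; i < n; i++)
      total += h[i].count * h[i].ninsns;     /* dynamically executed insns */
    for (size_t i = 0; i < n; i++)
      {
        covered += h[i].count * h[i].ninsns;
        if (covered * 100 >= total * 99)     /* 99% of execution covered   */
          return h[i].count;
      }
    return 0;
  }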

 So to get reasonable
speedups you need -O3+FDO.  -O3+FDO effectively defaults to -Os in cold
portions of the program.

Well, but unless your training coverage is 100%, all parts with no coverage
get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
isn't even close to 100%.  Thus I think recommending -O3 for FDO is
usually a bad idea.

Code with no coverage is cold in our model (as is code executed only once or so) and thus optimized as with -Os even at -O3+FDO. This is a bit aggressive on the optimize-for-size side. We might consider changing this policy, but so far I haven't seen any complaints about it...
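A tiny sketch of that policy, purely as illustration (not the actual GCC classification code): with feedback available, blocks that were never executed during training are treated as cold and compiled for size even at -O3.

  #include <stdint.h>

  enum bb_class { BB_HOT, BB_COLD };

  static enum bb_class
  classify_with_profile (uint64_t count, uint64_t hot_threshold)
  {
    if (count == 0)                  /* no coverage in the train run: cold */
      return BB_COLD;
    return count >= hot_threshold ? BB_HOT : BB_COLD;
  }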

Honza

So - did you try FDO with -O2? ;)

Still, -Os+FDO should be somewhat faster than -Os alone, so a slowdown is a
bug.  It is not very thoroughly tested, though, since it is not really used in practice.

Also, do you get any warnings about profile mismatches? Perhaps something
is wrong to the degree that the relevant part of the profile gets
misapplied.

I don't get any warnings about profile mismatches. I only get a "few"
missing-gcda-file warnings, but that's expected.

Perhaps you could compile one of the less trivial files you are sure is
covered by the train run and send me the -fdump-tree-all-blocks -fdump-ipa-all dumps
of the compilation so I can double-check that the profile seems sane. This could
be a good start to rule out something stupid.

Honza

Cheers,

Mike
