Re: HotSpot code cache and instruction prefetching

2017-02-10 Thread 'Nitsan Wakart' via mechanical-sympathy
We have no availability-based heuristics, nor does anyone else AFAIK, for 
dynamically enabling/disabling/tuning code generation to fit into smaller code 
cache pools, or for reacting to low-space indications. Running out of code 
cache space is pretty rare, and you can configure a bigger cache if you like.

Zing and OpenJDK, as well as other compilers, make aggressive and optimistic 
optimizations which minimize code size: implicit null checks, branch/exception 
elimination, constant folding, and Class Hierarchy Analysis, to name a few. 
The general way these work is that the compiler either proves some code is 
redundant or optimistically avoids generating some unlikely code path (with a 
de-optimization fallback).

As you point out, some optimizations bloat the code (unrolling etc.) or result 
in code duplication (inlining). The compilers have different heuristics for 
how much to inline, with a lot of seemingly arbitrary weights for different 
parameters. I leave it to people more involved in determining these heuristics 
to answer how much they worry about code size. I recommend looking at the GA 
sample in JMH for an interesting approach to exploring the parameter space for 
inlining.
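
To make the "optimistic, with a de-optimization fallback" point concrete, here 
is a minimal Java sketch (hypothetical classes, not from any JDK or Zing 
source) of a CHA-driven devirtualization:

  abstract class Shape {
      abstract double area();
  }

  class Circle extends Shape {
      private final double r;
      Circle(double r) { this.r = r; }
      double area() { return Math.PI * r * r; }
  }

  public class ChaDemo {
      // While Circle is the only loaded Shape subclass, Class Hierarchy
      // Analysis lets the JIT prove s.area() has a single target, so it
      // can devirtualize and inline the call in this hot loop.
      static double sum(Shape[] shapes) {
          double total = 0;
          for (Shape s : shapes) {
              total += s.area();
          }
          return total;
      }

      public static void main(String[] args) {
          Shape[] shapes = new Shape[100_000];
          java.util.Arrays.fill(shapes, new Circle(1.0));
          double sink = 0;
          for (int i = 0; i < 200; i++) {
              sink += sum(shapes); // warm up until sum() gets compiled
          }
          // Loading a second Shape subclass here would break the CHA
          // assumption: sum() is de-optimized ("made not entrant") and
          // later recompiled with a real virtual call.
          System.out.println(sink);
      }
  }

On OpenJDK you can watch the decisions with -XX:+PrintCompilation and 
-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining; the inlining weights 
mentioned above surface as flags such as -XX:MaxInlineSize, 
-XX:FreqInlineSize and -XX:InlineSmallCode.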



Re: HotSpot code cache and instruction prefetching

2017-02-09 Thread Sergey Melnikov
Nitsan, just out of curiosity: in Zing, do you have any optimizations for
code size? I mean, most advanced performance optimizations require
additional code (code layout, aggressive versioning, unrolling, ...). So the
code-size growth caused by performance optimizations may exhaust the code
cache.

--Sergey




Re: HotSpot code cache and instruction prefetching

2017-02-08 Thread 'Nitsan Wakart' via mechanical-sympathy
We've implemented a code cache allocation scheme to this effect in recent 
versions of Zing. Zing's code cache was similarly naive, and since Zing has 
been doing tiered compilation for a while now, we started at a similar point 
to what you describe.

The hypothesis (supported by some evidence) was that in sufficiently large 
compiled-code workloads, with sufficiently numerous hot methods (one of those 
flat profiles where the list of methods taking > 1% of cycles is long), and 
given long enough for late compiles to kick in, you can end up with a 
dispersed set of code blobs in your code cache. Ignoring the risk of code 
cache exhaustion, the cost we were seeing for some workloads was in iTLB 
misses.

The scheme we ended up with is different from JEP 197: we still have one code 
heap, and the segmentation is internal and follows a pretty simple scheme, but 
it seems to help :)

The change improved some workloads (client code, a finance application) by up 
to 4%; the impact varied by CPU. As these things go, a modest win.

Relocating observed hot paths together is complex, and as Aleksey points out, 
if they are very strongly correlated, inlining already helps this case. I can 
imagine a workload where it would help, but I doubt it justifies the work.

So, your intuition was to a large extent correct, and vendors are actively 
pursuing it, with some solutions already in the field and some around the 
corner. At least for Zing, for certain real-world workloads we saw a 
measurable positive effect, and I expect the OpenJDK solution will deliver 
similarly in these workloads.

A further, and more significant, boost was achieved by allocating larger code 
cache pages. This is internal to Zing and does not require OS configuration. 
Increasing the page size to 2M improved certain workloads by more than 10% 
(large number of compilation units + memory pressure). A similar improvement 
should be possible on OpenJDK by enabling -XX:+UseLargePages; I believe Sergey 
Kuksenko describes such a case in one of his talks. I've not used this option 
myself so cannot comment on its suitability.

Both optimizations are on by default in the latest Zing versions.
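
For anyone wanting to try the OpenJDK side of this, a rough sketch (the sizes 
are illustrative, app.jar is a placeholder, and on Linux -XX:+UseLargePages 
only takes effect if the OS has huge pages available, e.g. via 
/proc/sys/vm/nr_hugepages):

  # Larger code cache, backed by large (2M) pages where the OS allows it;
  # -XX:+PrintCodeCache reports code cache usage on VM exit.
  java -XX:ReservedCodeCacheSize=256m -XX:+UseLargePages \
       -XX:+PrintCodeCache -jar app.jar

  # Rough check of whether iTLB misses actually drop (Linux perf; event
  # names vary by CPU):
  perf stat -e iTLB-load-misses java -jar app.jar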
Hope this helps,
Nitsan



Re: HotSpot code cache and instruction prefetching

2017-02-04 Thread Chris Newland
Hi Aleksey,

On Friday, 3 February 2017 09:54:35 UTC, Aleksey Shipilev wrote:
>
> On 02/03/2017 10:26 AM, Chris Newland wrote:
> > Do you think the HotSpot designers took this into account but found
> > empirically that the simple algorithm is adequate (cost/complexity
> > outweighs gains and hot methods are generally JIT-compiled together)?
>
> Let's ask another question: do you have an example where that matters?
>

No, this was just pure curiosity :) My gut feel was that the CPU 
instruction cache would smooth out any hiccups in the prefetcher.
 

>
> > Could there be any benefit in relocating blocks for hot call chains to
> > match the call pattern once the program has reached a steady state?
> > (assuming inlining has already succeeded as much as possible).
>
> Well, the "hot path" is supposed to be inlined and the critical path laid
> out sequentially within the compilation unit, so it is not catastrophic.
>
> > Since tiered compilation became the default, do you think the many
> > (possibly unconnected) intermediate compilations have made prefetching
> > worse?
>
> ...but yes, for tiered, there are versions of the code that are known to be
> temporary (e.g. compilations on levels 1, 2, 3), while the final compilation
> stays around for longer (level 4). This is why the Segmented Code Cache was
> implemented in JDK 9: http://openjdk.java.net/jeps/197
>
This JEP makes a lot more sense now that I've got an understanding of the 
current code cache.

Thanks,

Chris

 




Re: HotSpot code cache and instruction prefetching

2017-02-03 Thread Aleksey Shipilev
On 02/03/2017 10:26 AM, Chris Newland wrote:
> Do you think the HotSpot designers took this into account but found
> empirically that the simple algorithm is adequate (cost/complexity
> outweighs gains and hot methods are generally JIT-compiled together)?

Let's ask another question: do you have an example where that matters?

> Could there be any benefit in relocating blocks for hot call chains to
> match the call pattern once the program has reached a steady state?
> (assuming inlining has already succeeded as much as possible).

Well, the "hot path" is supposed to be inlined and the critical path laid out
sequentially within the compilation unit, so it is not catastrophic.

> Since tiered compilation became the default, do you think the many (possibly
> unconnected) intermediate compilations have made prefetching worse?

...but yes, for tiered, there are versions of the code that are known to be
temporary (e.g. compilations on levels 1, 2, 3), while the final compilation
stays around for longer (level 4). This is why the Segmented Code Cache was
implemented in JDK 9: http://openjdk.java.net/jeps/197
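
As a rough illustration (sizes purely illustrative, MyApp a placeholder), the
segmented cache splits the one code heap into three and can be tuned like so:

  java -XX:+SegmentedCodeCache \
       -XX:NonNMethodCodeHeapSize=8m \
       -XX:ProfiledCodeHeapSize=128m \
       -XX:NonProfiledCodeHeapSize=128m \
       MyApp

The non-method heap holds VM-internal code (stubs, adapters), the profiled
heap the typically short-lived tier 2/3 compilations, and the non-profiled
heap the long-lived tier 1/4 code.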

IIRC, there were improvements on torturous workloads, and improvements in
nmethod scans.

Thanks,
-Aleksey



HotSpot code cache and instruction prefetching

2017-02-03 Thread Chris Newland
Hi,

I've been looking into how HotSpot arranges JIT-compiled native code in the 
code cache and the method appears to be:

1) Search the free-list (a linked list of blocks freed up by old methods 
removed from the code cache)

2) If there is a large enough free block then use it. If not, carve a new 
block off the unallocated space at the end of the code cache (until you reach 
the code cache size limit). A rough sketch of this scheme follows below.
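
To make that concrete, here is a much-simplified, hypothetical Java model of 
the two-step policy (names are mine; the real implementation is C++, in 
HotSpot's CodeHeap):

  import java.util.Iterator;
  import java.util.LinkedList;

  class CodeCacheModel {
      // A free block left behind by a flushed nmethod.
      record Block(long offset, int size) {}

      private final LinkedList<Block> freeList = new LinkedList<>();
      private final long capacity;
      private long highWaterMark = 0; // end of the allocated region

      CodeCacheModel(long capacity) { this.capacity = capacity; }

      // Returns the offset of the allocated block, or -1 when the cache is full.
      long allocate(int size) {
          // 1) Search the free list first.
          Iterator<Block> it = freeList.iterator();
          while (it.hasNext()) {
              Block b = it.next();
              if (b.size() >= size) {
                  it.remove(); // first fit; real code would split oversized blocks
                  return b.offset();
              }
          }
          // 2) No fit: carve a new block from the unused tail of the cache.
          if (highWaterMark + size > capacity) {
              return -1; // code cache exhausted
          }
          long offset = highWaterMark;
          highWaterMark += size;
          return offset;
      }

      // Called when an nmethod is evicted; its block becomes reusable.
      void free(long offset, int size) {
          freeList.add(new Block(offset, size));
      }
  }

Note how, once free() starts being called, later allocations can land wherever 
a freed block happens to fit, which is exactly the loss of sequential layout I 
ask about below.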

I've added a visualisation for this to JITWatch[1] 
https://www.youtube.com/watch?v=XeTgtS3xdcc using the information found in 
the LogCompilation nmethod tags.
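
(For reference, the log in question comes from diagnostic flags along these 
lines; a sketch, and YourApp is a placeholder:

  java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation \
       -XX:LogFile=hotspot.log YourApp

The nmethod entries in hotspot.log carry the placement details that JITWatch 
parses for the visualisation.)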

My question is: is HotSpot's placement of compiled methods optimal with 
regard to the CPU's instruction prefetching?

After methods start getting removed from the code cache, and nmethods are 
placed less sequentially, in blocks taken from the free-list, will this make 
the layout worse for prefetching?

Do you think the HotSpot designers took this into account but found 
empirically that the simple algorithm is adequate (cost/complexity outweighs 
gains and hot methods are generally JIT-compiled together)?

Could there be any benefit in relocating blocks for hot call chains to 
match the call pattern once the program has reached a steady state? 
(assuming inlining has already succeeded as much as possible).

Since tiered compilation became the default, do you think the many 
(possibly unconnected) intermediate compilations have made prefetching 
worse?

Sorry for so many questions! Just interested in whether this matters or not 
to modern CPUs.

Many thanks,

Chris
@chriswhocodes
[1] https://github.com/AdoptOpenJDK/jitwatch
