Hi, Reynold and others

I agree with your comments on mid-tenured objects and GC. In fact, dealing with 
mid-tenured objects is the major challenge for all Java GC implementations.

I am wondering whether anyone has played with the -XX:+PrintTenuringDistribution 
flag to see what the age distribution actually looks like when your program runs. 
My output with -XX:+PrintGCDetails looks like the one below (Oracle JDK 8 update 60, 
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).

Ages 1-5 are the young objects; ages 13, 14, 15 are the old ones. The objects in 
the middle have to be copied multiple times before they die, and once they end up 
in old regions it normally takes a major GC to clean them up.
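For reference, the flags can be turned on with something like the following (the 
class name and heap sizes here are placeholders, not a recommendation):

```shell
# Print a per-age histogram of survivor-space objects at every young GC.
# MyApp and the heap sizes below are placeholders.
java -XX:+PrintGCDetails \
     -XX:+PrintTenuringDistribution \
     -Xms8g -Xmx8g \
     MyApp
```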

Desired survivor size 2583691264 bytes, new threshold 15 (max 15)
- age   1:   13474960 bytes,   13474960 total
- age   2:    2815592 bytes,   16290552 total
- age   3:     632784 bytes,   16923336 total
- age   4:     428432 bytes,   17351768 total
- age   5:     648696 bytes,   18000464 total
- age   6:     572328 bytes,   18572792 total
- age   7:     549216 bytes,   19122008 total
- age   8:     539544 bytes,   19661552 total
- age   9:     422256 bytes,   20083808 total
- age  10:     552928 bytes,   20636736 total
- age  11:     430464 bytes,   21067200 total
- age  12:     753320 bytes,   21820520 total
- age  13:     230864 bytes,   22051384 total
- age  14:     276288 bytes,   22327672 total
- age  15:     809272 bytes,   23136944 total

I’d love to see what others’ object age distributions look like. Once we know the 
age distribution for particular use cases, we can find ways to avoid Full GC. 
Full GC is expensive because both the CMS and G1 Full GC are single-threaded. 
GC tuning nowadays largely comes down to avoiding Full GC completely.
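For a Spark job, here is a sketch of how the same diagnostic flags could be shipped 
to the executors via spark.executor.extraJavaOptions (the jar name is a placeholder):

```shell
# Pass the GC-logging flags to every executor JVM.
# your-app.jar is a placeholder.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintTenuringDistribution" \
  your-app.jar
```

The resulting tenuring histograms then show up in the executor stdout logs rather than the driver's.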

Thanks
-yanping

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, August 25, 2015 6:05 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe

There is a lot of GC activity because the non-code-gen path is sloppy about 
garbage creation. This is not exactly what happens, but take this as an example:

rdd.map { i: Int => i + 1 }

Under the hood this becomes a closure that boxes on every input and every 
output, creating two extra objects per element.
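To make the boxing concrete, here is a minimal plain-Java sketch of the same 
pattern (the class name is made up; the Scala closure above compiles to roughly 
the boxed variant):

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class BoxingDemo {
    public static void main(String[] args) {
        // Boxed path: every element is wrapped in an Integer on input and
        // again on output of the lambda, i.e. two short-lived objects per
        // element (outside the small Integer cache of -128..127).
        Iterator<Integer> boxed =
                IntStream.range(0, 1000).boxed()   // int -> Integer
                         .map(i -> i + 1)          // unbox, add, re-box
                         .iterator();
        long sum = 0;
        while (boxed.hasNext()) {
            sum += boxed.next();                   // unbox yet again
        }
        System.out.println(sum);                   // 500500

        // Primitive path: the same computation with no per-element
        // allocations at all.
        long sum2 = IntStream.range(0, 1000).map(i -> i + 1).asLongStream().sum();
        System.out.println(sum2);                  // 500500
    }
}
```

The two pipelines compute the same answer; only the boxed one produces per-element garbage.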

The reality is more complicated than this -- but here's a simpler view of what 
happens with GC in these cases. You might've heard from other places that the 
JVM is very efficient about transient object allocations. That is true when you 
look at these allocations in isolation, but unfortunately not true when you 
look at them in aggregate.

First, due to the way the iterator interface is constructed, it is hard for the 
JIT compiler to stack-allocate these objects via escape analysis. Then two things happen:

1. They pile up and cause more young-gen GCs to happen.
2. After a few young-gen GCs, some mid-tenured objects (e.g. an aggregation 
map) get copied into the old gen, and eventually require a full GC to free 
them. Full GCs are much more expensive than young-gen GCs (they usually involve 
copying all the data in the old gen).

So the more garbage that is created -> the more frequently full GC happens.

The more long-lived objects there are in the old gen (e.g. cache) -> the more 
expensive full GC is.
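The usual levers for the mid-tenured problem are sketched below (the class name 
and all values are purely illustrative; the right settings depend entirely on 
the workload):

```shell
# Illustrative knobs only; MyApp and all values are placeholders.
# MaxTenuringThreshold: how many young GCs an object may survive
#   before being promoted to the old gen.
# Xmn: a larger young gen gives transient objects more time to die
#   before they are ever copied.
java -XX:+UseConcMarkSweepGC \
     -XX:MaxTenuringThreshold=15 \
     -XX:SurvivorRatio=8 \
     -Xmn4g \
     MyApp
```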



On Tue, Aug 25, 2015 at 5:19 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory 
and each worker has 32GB of memory, so there seems to be a lot of free memory 
available. I wonder how Spark can hit GC issues with such a setup?

Reynold Xin <r...@databricks.com>

On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:

It seems that there is a nice improvement with Tungsten enabled when data is 
persisted in memory: 2x and 3x. However, the improvement is not as nice for 
Parquet: 1.5x. What’s interesting is that with Tungsten enabled, the aggregation 
performance on in-memory data and on Parquet data is similar. Could anyone 
comment on this? It seems counterintuitive to me.

Local performance was not as good as what Reynold had: I get around 1.5x, he had 
5x. However, local mode is not that interesting.


I think a large part of that is coming from the pressure created by JVM GC. 
Putting more data in memory makes GC worse, unless GC is well tuned.


