Thank you for the explanation. The size of the 100M dataset is ~1.4GB in memory
and each worker has 32GB of memory, so there seems to be plenty of free memory
available. I wonder how Spark can hit GC pressure with such a setup?
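
In case it helps, here is roughly how I would check whether GC is actually the
bottleneck (a sketch; the app name is a placeholder and the extraJavaOptions
value is just the standard HotSpot GC-logging flags; the Spark UI also shows
per-task GC time in the stage details):

import org.apache.spark.{SparkConf, SparkContext}

// Print executor GC activity to stdout so we can see whether the
// aggregation time is dominated by collections.
val conf = new SparkConf()
  .setAppName("gc-check")
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)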

Reynold Xin <r...@databricks.com>


On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander
<alexander.ula...@hp.com> wrote:

There is a nice improvement with Tungsten enabled when the data is persisted
in memory: 2x to 3x. However, the improvement is more modest for Parquet,
around 1.5x. Interestingly, with Tungsten enabled, aggregation performance
over in-memory data and over Parquet data is similar. Could anyone comment on
this? It seems counterintuitive to me.

Local performance was not as good as what Reynold reported: I see around 1.5x,
while he had 5x. However, local mode is not that interesting.
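
For reference, the kind of comparison I mean is roughly of this shape (a
sketch, not the exact benchmark; the Parquet path and column names are
placeholders, and spark.sql.tungsten.enabled is the Spark 1.5 switch):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Flip to "false" to measure the pre-Tungsten path.
sqlContext.setConf("spark.sql.tungsten.enabled", "true")

val df = sqlContext.read.parquet("/path/to/100M.parquet")
df.cache().count() // materialize for the in-memory variant; skip for Parquet

val t0 = System.nanoTime()
df.groupBy("key").sum("value").collect()
println(s"aggregation took ${(System.nanoTime() - t0) / 1e9} s")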


I think a large part of that is coming from the pressure created by JVM GC.
Putting more data in memory makes GC worse, unless GC is well tuned.
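
One common mitigation (not specific to Tungsten) is to cache the data
serialized, so each partition is a handful of large byte arrays rather than
millions of small objects the collector has to trace. A minimal sketch, where
`rdd` stands for whatever you are caching:

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER trades some CPU on access for far fewer live
// objects, which shrinks what every GC cycle has to scan.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)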



