Thank you for the explanation. The size of the 100M data is ~1.4GB in memory, and each worker has 32GB of memory, so there seems to be a lot of free memory available. I wonder how Spark can hit GC with such a setup?
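For reference, the back-of-the-envelope arithmetic behind this question can be sketched as follows. This is only a rough estimate: it assumes "100M" means 100 million records, that 1 GB is 10^9 bytes, and that the 32GB figure is total worker memory rather than the configured executor heap. None of these is stated in the thread, and the variable names are illustrative.

```python
# Rough check of the figures quoted above (not a measurement).
# Assumptions: "100M" = 100 million records, 1 GB = 10**9 bytes,
# and "32GB" is total worker memory, not the executor heap size.
ROWS = 100_000_000
DATASET_BYTES = 1.4e9
WORKER_MEMORY_BYTES = 32e9

# Average in-memory footprint of one record.
bytes_per_row = DATASET_BYTES / ROWS

# Share of a single worker's memory the whole dataset would occupy.
fraction_of_worker = DATASET_BYTES / WORKER_MEMORY_BYTES

print(f"{bytes_per_row:.1f} bytes per row")
print(f"{fraction_of_worker:.1%} of one worker's memory")
```

By this estimate the data occupies under 5% of one worker's memory, which is what makes the GC explanation surprising to me.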
Reynold Xin <r...@databricks.com>

On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

It seems that there is a nice improvement with Tungsten enabled when the data is persisted in memory: 2x and 3x. However, the improvement is not as good for Parquet, at 1.5x. Interestingly, with Tungsten enabled, aggregation performance on in-memory data and on Parquet data is similar. Could anyone comment on this? It seems counterintuitive to me.

Local performance was not as good as Reynold's: I got around 1.5x, while he got 5x. However, local mode is not interesting.

I think a large part of that is coming from the pressure created by JVM GC. Putting more data in memory makes GC worse, unless GC is well tuned.