Hi Pramod
Is your data compressed? I encountered a similar problem; even after turning codegen on, the GC time was still very long. The input to my map task is an LZO file of about 100 MB. My query is:

    select ip, count(*) as c from stage_bitauto_adclick_d group by ip sort by c limit 100

Thanks,
Zhang Xiongfei

At 2015-05-21 12:18:35, "Reynold Xin" <r...@databricks.com> wrote:

> Does this turn codegen on? I think the performance is fairly different when codegen is turned on. For 1.5, we are investigating having codegen on by default, so users get much better performance out of the box.

On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote:

> Hi,
> Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have a data point regarding the performance of Group By, indicating there's excessive GC and it's impacting throughput. I want to know if the new memory manager for aggregations (https://github.com/apache/spark/pull/5725/) is going to address this kind of issue.
>
> I have only a small amount of data on each node (~360 MB) with a large heap (18 GB), yet I still see 2-3 minor collections whenever I do a Select Sum() with a Group By(). I have tried different sizes for the Young Generation without much effect, though not different GC algorithms (Hm.. I ought to try reducing the RDD storage fraction, perhaps). I have made a chart of my results [1] by adding timing code to Aggregates.scala. The query is Query 2 from Berkeley's AmpLab benchmark, running over 10 million records. The chart is from one of the 4 worker nodes in the cluster.
>
> I am trying to square this with a claim in the Project Tungsten blog post [2]: "When profiling Spark user applications, we've found that a large fraction of the CPU time is spent waiting for data to be fetched from main memory."
>
> Am I correct in assuming that Spark SQL is yet to reach that level of efficiency, at least in aggregation operations?
>
> Thanks.
> [1] https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
> [2] https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
>
> Pramod
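[For readers of the archive: a minimal sketch of trying Reynold's suggestion on the Spark 1.3/1.4 line, where codegen is off by default and is toggled via the `spark.sql.codegen` property on the SQLContext. The table name comes from Zhang's mail above; this assumes a live SQLContext, so it is not a standalone program.]

```scala
// Spark 1.3/1.4: enable expression code generation (off by default in 1.x;
// per Reynold's note above, 1.5 is expected to turn it on out of the box).
sqlContext.setConf("spark.sql.codegen", "true")

// Re-run the aggregation with codegen active, e.g. Zhang's query:
sqlContext.sql(
  "SELECT ip, count(*) AS c FROM stage_bitauto_adclick_d " +
  "GROUP BY ip SORT BY c LIMIT 100").show()
```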
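[Also for the archive: the GC-side knobs Pramod mentions (young-generation size, RDD storage fraction) map onto SparkConf settings roughly as below. This is a configuration sketch only; the values are illustrative examples, not recommendations.]

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Lower the fraction of heap reserved for cached RDDs (default 0.6 in 1.x),
  // leaving more headroom for shuffle and aggregation buffers.
  .set("spark.storage.memoryFraction", "0.3")
  // Executor JVM flags: -Xmn sizes the young generation explicitly and
  // -XX:+PrintGCDetails logs the minor collections Pramod is counting.
  // (Total heap must be set via spark.executor.memory, not -Xmx here.)
  .set("spark.executor.extraJavaOptions", "-Xmn4g -XX:+PrintGCDetails")
  .set("spark.executor.memory", "18g")
```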