Does this run have codegen turned on? Performance is fairly different
when codegen is enabled.

For 1.5, we are investigating having codegen on by default, so users get
much better performance out of the box.
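
For reference, on 1.3/1.4 codegen is gated by the spark.sql.codegen
flag. A minimal sketch of flipping it on (the table and column names
below are placeholders, not from your benchmark):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("codegen-check"))
    val sqlContext = new SQLContext(sc)

    // Enable expression code generation; it is off by default before 1.5.
    sqlContext.setConf("spark.sql.codegen", "true")

    // Re-run the aggregation with codegen enabled.
    sqlContext.sql("SELECT sourceIP, SUM(adRevenue) FROM uservisits " +
      "GROUP BY sourceIP")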


On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri <pramodbilig...@gmail.com>
wrote:

> Hi,
> Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have
> a data point on the performance of Group By, indicating that excessive GC
> is hurting throughput. I want to know if the new memory manager for
> aggregations (https://github.com/apache/spark/pull/5725/) is going to
> address this kind of issue.
>
> I only have a small amount of data on each node (~360 MB) with a large heap
> (18 GB). I still see 2-3 minor collections happening whenever I do a
> SELECT SUM() with a GROUP BY. I have tried different sizes for the Young
> Generation without much effect, though not different GC algorithms.
> (Hmm, I ought to try reducing the RDD storage fraction, perhaps.)
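>
> (A sketch of the kind of tuning I mean; the -Xmn value and storage
> fraction below are illustrative, not my actual settings:)
>
>     import org.apache.spark.SparkConf
>
>     val conf = new SparkConf()
>       // Larger young generation, so short-lived aggregation objects
>       // can be reclaimed without frequent minor collections.
>       .set("spark.executor.extraJavaOptions", "-Xmn4g -verbose:gc")
>       // Reserve less of the heap for cached RDDs (default 0.6 in
>       // 1.x), leaving more headroom for execution.
>       .set("spark.storage.memoryFraction", "0.3")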
>
> I have made a chart of my results [1] by adding timing code to
> Aggregates.scala. The query is Query 2 from Berkeley's AMPLab benchmark,
> running over 10 million records. The chart is from one of the 4 worker
> nodes in the cluster.
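>
> (For context, Query 2 is roughly the following; the SUBSTR length
> varies across the benchmark's 2a/2b/2c variants:)
>
>     sqlContext.sql("""
>       SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue)
>       FROM uservisits
>       GROUP BY SUBSTR(sourceIP, 1, 8)
>     """)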
>
> I am trying to square this with a claim in the Project Tungsten blog post
> [2]: "When profiling Spark user applications, we’ve found that a large
> fraction of the CPU time is spent waiting for data to be fetched from main
> memory."
>
> Am I correct in assuming that Spark SQL has yet to reach that level of
> efficiency, at least in aggregation operations?
>
> Thanks.
>
> [1]
> https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
> [2]
> https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
>
> Pramod
>
