Yup it is a different path. It runs GeneratedAggregate.

On Wed, May 20, 2015 at 11:43 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote:

> I hadn't turned on codegen. I enabled it and ran it again, and it is running
> 4-5 times faster now! :)
> Since my log statements are no longer appearing, I presume the code path
> is quite different from the earlier hashmap-related stuff in
> Aggregates.scala?
>
> Pramod
>
> On Wed, May 20, 2015 at 9:18 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Does this turn codegen on? I think the performance is fairly different
>> when codegen is turned on.
>>
>> For 1.5, we are investigating having codegen on by default, so users get
>> much better performance out of the box.
>>
>> On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote:
>>
>>> Hi,
>>> Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I
>>> have a data point regarding the performance of Group By, indicating there's
>>> excessive GC and it's impacting throughput. I want to know if the new
>>> memory manager for aggregations
>>> (https://github.com/apache/spark/pull/5725/) is going to address this
>>> kind of issue.
>>>
>>> I only have a small amount of data on each node (~360 MB) with a large
>>> heap size (18 GB). I still see 2-3 minor collections happening whenever I
>>> do a Select Sum() with a Group By(). I have tried different sizes for the
>>> Young Generation without much effect, though not different GC
>>> algorithms (Hmm... I ought to try reducing the RDD storage fraction, perhaps).
>>>
>>> I have made a chart of my results [1] by adding timing code to
>>> Aggregates.scala. The query is actually Query 2 from Berkeley's AmpLab
>>> benchmark, running over 10 million records. The chart is from one of the 4
>>> worker nodes in the cluster.
>>>
>>> I am trying to square this with a claim in the Project Tungsten blog
>>> post [2]: "When profiling Spark user applications, we've found that a
>>> large fraction of the CPU time is spent waiting for data to be fetched from
>>> main memory."
>>>
>>> Am I correct in assuming that Spark SQL is yet to reach that level of
>>> efficiency, at least in aggregation operations?
>>>
>>> Thanks.
>>>
>>> [1] https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
>>> [2] https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
>>>
>>> Pramod