Hi Zhang,
No, my data is not compressed; I'm trying to minimize the load on the CPU.
GC time dropped for me after turning codegen on.
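
In case it helps, a minimal sketch of how I have codegen enabled (this is the
standard Spark 1.x SQLConf flag; adjust for your version):

  // Turn on expression code generation for Spark SQL (Spark 1.x flag)
  sqlContext.setConf("spark.sql.codegen", "true")
  // or at submit time: --conf spark.sql.codegen=true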

Pramod

On Thu, May 21, 2015 at 3:43 AM, zhangxiongfei <zhangxiongfei0...@163.com>
wrote:

> Hi Pramod
>
> Is your data compressed? I ran into a similar problem; however, even after
> turning codegen on, the GC time was still very long. The input for my map
> task is an LZO file of about 100 MB.
> My query is "select ip, count(*) as c from stage_bitauto_adclick_d group
> by ip sort by c limit 100".
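>
> For completeness, roughly how I run it (a sketch; "sort by" is HiveQL, and
> the table is assumed to already exist in the metastore):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.hive.HiveContext
>
>   val sc = new SparkContext(new SparkConf().setAppName("adclick-groupby"))
>   val sqlContext = new HiveContext(sc)
>   // Submit the aggregation through the SQL interface
>   val result = sqlContext.sql(
>     "select ip, count(*) as c from stage_bitauto_adclick_d " +
>     "group by ip sort by c limit 100")
>   result.collect().foreach(println)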
>
> Thanks
> Zhang Xiongfei
>
>
>
> At 2015-05-21 12:18:35, "Reynold Xin" <r...@databricks.com> wrote:
>
> Does this run have codegen turned on? Performance is fairly different when
> codegen is enabled.
>
> For 1.5, we are investigating having codegen on by default, so users get
> much better performance out of the box.
>
>
> On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri <pramodbilig...@gmail.com
> > wrote:
>
>> Hi,
>> Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have
>> a data point on the performance of Group By: it shows excessive GC that is
>> hurting throughput. I want to know whether the new memory manager for
>> aggregations (https://github.com/apache/spark/pull/5725/) is going to
>> address this kind of issue.
>>
>> I only have a small amount of data on each node (~360 MB) with a large
>> heap (18 GB), yet I still see 2-3 minor collections whenever I run a
>> SELECT SUM(...) with a GROUP BY. I have tried different Young Generation
>> sizes without much effect, though not different GC algorithms. (Hmm, I
>> ought to try reducing the RDD storage fraction, perhaps.)
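>>
>> For reference, a sketch of the knobs I mean (the property names are the
>> standard Spark 1.x ones; the values are illustrative, not what I ran):
>>
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     // Shrink the cache share so more of the 18 GB heap goes to execution
>>     .set("spark.storage.memoryFraction", "0.3")
>>     // Pin the young generation size so short-lived aggregation buffers
>>     // die there instead of being promoted
>>     .set("spark.executor.extraJavaOptions",
>>          "-XX:NewSize=6g -XX:MaxNewSize=6g -XX:+PrintGCDetails")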
>>
>> I have made a chart of my results [1] by adding timing code to
>> Aggregates.scala. The query is Query 2 from Berkeley's AMPLab benchmark,
>> running over 10 million records. The chart is from one of the 4 worker
>> nodes in the cluster.
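>>
>> The timing code itself is nothing fancy; a minimal sketch of the shape
>> (the helper name and usage below are placeholders, not the actual code):
>>
>>   // Time an arbitrary block and print wall-clock duration
>>   def timed[T](label: String)(body: => T): T = {
>>     val start = System.nanoTime()
>>     val out = body
>>     val elapsedMs = (System.nanoTime() - start) / 1e6
>>     println(s"$label took $elapsedMs ms")
>>     out
>>   }
>>
>>   // e.g. timed("aggregate") { ...existing aggregation loop... }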
>>
>> I am trying to square this with a claim in the Project Tungsten blog post
>> [2]: "When profiling Spark user applications, we’ve found that a large
>> fraction of the CPU time is spent waiting for data to be fetched from main
>> memory."
>>
>> Am I correct in assuming that Spark SQL has yet to reach that level of
>> efficiency, at least for aggregation operations?
>>
>> Thanks.
>>
>> [1]
>> https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
>> [2]
>> https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
>>
>> Pramod
>>
>
>
>
>
