Tweaking spark.shuffle.sort.bypassMergeThreshold might help. You could also
try switching the shuffle manager from sort to hash. You can find more
shuffle configuration options here:
<https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior>.
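
Something along these lines (a sketch, not tested on your workload; the 400
is just an example value, and note the hash manager trades sort's merging
for more open files per reducer):

    import org.apache.spark.SparkConf

    // set these before creating the SparkContext
    val conf = new SparkConf()
      .set("spark.shuffle.sort.bypassMergeThreshold", "400") // default is 200
      .set("spark.shuffle.manager", "hash")                  // default is "sort"

or pass the equivalent --conf flags to spark-submit.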

Thanks
Best Regards

On Fri, Jul 24, 2015 at 3:33 AM, Mohit Jaggi <mohitja...@gmail.com> wrote:

> Hi There,
> I am testing Spark DataFrames and haven't been able to get my code to
> finish due to what I suspect are GC issues. My guess is that GC interferes
> with heartbeating, so executors are detected as failed. The data is ~50
> numeric columns and ~100 million rows in a CSV file.
> We are doing a groupBy on one of the columns and calculating the average
> of each of the other columns (roughly the query sketched below). The
> groupBy key has about 250k unique values.
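>
> For reference, the query is essentially this (a sketch; the spark-csv
> package, the path, and the column name "key" are placeholders, not our
> actual code):
>
>     import org.apache.spark.sql.functions.avg
>
>     // load the CSV into a DataFrame
>     val df = sqlContext.read
>       .format("com.databricks.spark.csv")
>       .option("header", "true")
>       .load("data.csv")
>
>     // average every non-key column, grouped by the key column
>     val aggs = df.columns.filter(_ != "key").map(avg)
>     df.groupBy("key").agg(aggs.head, aggs.tail: _*)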
> It seems that Spark is creating a lot of temporary objects (see the jmap
> output below) while calculating the averages, which I am surprised to see.
> Why doesn't it reuse the same temp variable? Am I missing something? Do I
> need to set a config flag to enable code generation and avoid this?
>
>
> Mohit.
>
> [xxxxx app-20150723142604-0002]$ jmap -histo 12209
>
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:     258615458     8275694656  scala.collection.immutable.$colon$colon
>    2:     103435856     7447381632  org.apache.spark.sql.catalyst.expressions.Cast
>    3:     103435856     4964921088  org.apache.spark.sql.catalyst.expressions.Coalesce
>    4:       1158643     4257400112  [B
>    5:      51717929     4137434320  org.apache.spark.sql.catalyst.expressions.SumFunction
>    6:      51717928     3723690816  org.apache.spark.sql.catalyst.expressions.Add
>    7:      51717929     2896204024  org.apache.spark.sql.catalyst.expressions.CountFunction
>    8:      51717928     2896203968  org.apache.spark.sql.catalyst.expressions.MutableLiteral
>    9:      51717928     2482460544  org.apache.spark.sql.catalyst.expressions.Literal
>   10:      51803728     1243289472  java.lang.Double
>   11:      51717755     1241226120  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
>   12:        975810      850906320  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
>   13:      51717754      827484064  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$org$apache$spark$sql$catalyst$expressions$Cast$$cast$1
>   14:        982451       47157648  java.util.HashMap$Entry
>   15:        981132       34981720  [Ljava.lang.Object;
>   16:       1049984       25199616  org.apache.spark.sql.types.UTF8String
>   17:        978296       23479104  org.apache.spark.sql.catalyst.expressions.GenericRow
>   18:        117166       15944560  <methodKlass>
>   19:        117166       14986224  <constMethodKlass>
>   20:          1567       12891952  [Ljava.util.HashMap$Entry;
>   21:          9103       10249728  <constantPoolKlass>
>   22:          9103        9278592  <instanceKlassKlass>
>   23:          5072        5691320  [I
>   24:          7281        5335040  <constantPoolCacheKlass>
>   25:         46420        4769600  [C
>   26:        105984        3391488  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
