Hi There,
I am testing Spark DataFrames and haven't been able to get my code to finish,
due to what I suspect are GC issues. My guess is that long GC pauses interfere
with heartbeating, so the executors are detected as failed. The data is ~50
numeric columns and ~100 million rows in a CSV file.
We are doing a groupBy on one of the columns and calculating the average of
each of the other columns. The groupBy key has about 250k unique values.
It seems that Spark is creating a lot of temporary objects (see the jmap
output below) while calculating the averages, which I am surprised to see.
Why doesn't it reuse the same temporary variable? Am I missing something? Do
I need to set a config flag to enable code generation and avoid this?
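
For reference, a minimal sketch of the kind of job described above, with the
Spark 1.x expression code generation flag (`spark.sql.codegen`) set
explicitly. Column names, the file path, and the use of the external
spark-csv package are assumptions, not taken from the actual job:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

// Assumes an existing SparkContext `sc`; "key" and the CSV path are hypothetical.
val sqlContext = new SQLContext(sc)

// Opt in to Catalyst expression code generation (off by default in Spark 1.x);
// this is the config flag the question refers to.
sqlContext.setConf("spark.sql.codegen", "true")

val df = sqlContext.read
  .format("com.databricks.spark.csv") // external CSV package in Spark 1.x
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/data.csv")

// groupBy on one column, average of every other column.
val valueCols = df.columns.filter(_ != "key")
val averages = df.groupBy("key")
  .agg(avg(valueCols.head), valueCols.tail.map(c => avg(c)): _*)
averages.show()
```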


Mohit.

[xxxxx app-20150723142604-0002]$ jmap -histo 12209


 num     #instances         #bytes  class name
----------------------------------------------
   1:     258615458     8275694656  scala.collection.immutable.$colon$colon
   2:     103435856     7447381632  org.apache.spark.sql.catalyst.expressions.Cast
   3:     103435856     4964921088  org.apache.spark.sql.catalyst.expressions.Coalesce
   4:       1158643     4257400112  [B
   5:      51717929     4137434320  org.apache.spark.sql.catalyst.expressions.SumFunction
   6:      51717928     3723690816  org.apache.spark.sql.catalyst.expressions.Add
   7:      51717929     2896204024  org.apache.spark.sql.catalyst.expressions.CountFunction
   8:      51717928     2896203968  org.apache.spark.sql.catalyst.expressions.MutableLiteral
   9:      51717928     2482460544  org.apache.spark.sql.catalyst.expressions.Literal
  10:      51803728     1243289472  java.lang.Double
  11:      51717755     1241226120  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
  12:        975810      850906320  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
  13:      51717754      827484064  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$org$apache$spark$sql$catalyst$expressions$Cast$$cast$1
  14:        982451       47157648  java.util.HashMap$Entry
  15:        981132       34981720  [Ljava.lang.Object;
  16:       1049984       25199616  org.apache.spark.sql.types.UTF8String
  17:        978296       23479104  org.apache.spark.sql.catalyst.expressions.GenericRow
  18:        117166       15944560  <methodKlass>
  19:        117166       14986224  <constMethodKlass>
  20:          1567       12891952  [Ljava.util.HashMap$Entry;
  21:          9103       10249728  <constantPoolKlass>
  22:          9103        9278592  <instanceKlassKlass>
  23:          5072        5691320  [I
  24:          7281        5335040  <constantPoolCacheKlass>
  25:         46420        4769600  [C
  26:        105984        3391488  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
