Hi There, I am testing Spark DataFrame and havn't been able to get my code to finish due to what I suspect are GC issues. My guess is that GC interferes with heartbeating and executors are detected as failed. The data is ~50 numeric columns, ~100million rows in a CSV file. We are doing a groupBy using one of the columns and trying to calculate the average of each of the other columns. The groupBy key has about 250k unique values. It seems that Spark is creating a lot of temp objects (see jmap output below) while calculating the average which I am surprised to see. Why doesn't it use the same temp variable? Am I missing something? Do I need to specify a config flag to enable code generation and not do this?
Mohit. [xxxxx app-20150723142604-0002]$ jmap -histo 12209 num #instances #bytes class name ---------------------------------------------- 1: 258615458 8275694656 scala.collection.immutable.$colon$colon 2: 103435856 7447381632 org.apache.spark.sql.catalyst.expressions.Cast 3: 103435856 4964921088 org.apache.spark.sql.catalyst.expressions.Coalesce 4: 1158643 4257400112 [B 5: 51717929 4137434320 org.apache.spark.sql.catalyst.expressions.SumFunction 6: 51717928 3723690816 org.apache.spark.sql.catalyst.expressions.Add 7: 51717929 2896204024 org.apache.spark.sql.catalyst.expressions.CountFunction 8: 51717928 2896203968 org.apache.spark.sql.catalyst.expressions.MutableLiteral 9: 51717928 2482460544 org.apache.spark.sql.catalyst.expressions.Literal 10: 51803728 1243289472 java.lang.Double 11: 51717755 1241226120 org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5 12: 975810 850906320 [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction; 13: 51717754 827484064 org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$org$apache$spark$sql$catalyst$expressions$Cast$$cast$1 14: 982451 47157648 java.util.HashMap$Entry 15: 981132 34981720 [Ljava.lang.Object; 16: 1049984 25199616 org.apache.spark.sql.types.UTF8String 17: 978296 23479104 org.apache.spark.sql.catalyst.expressions.GenericRow 18: 117166 15944560 <methodKlass> 19: 117166 14986224 <constMethodKlass> 20: 1567 12891952 [Ljava.util.HashMap$Entry; 21: 9103 10249728 <constantPoolKlass> 22: 9103 9278592 <instanceKlassKlass> 23: 5072 5691320 [I 24: 7281 5335040 <constantPoolCacheKlass> 25: 46420 4769600 [C 26: 105984 3391488 io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry