Alex Baretta created SPARK-5314:
-----------------------------------

             Summary: java.lang.OutOfMemoryError in SparkSQL with GROUP BY
                 Key: SPARK-5314
                 URL: https://issues.apache.org/jira/browse/SPARK-5314
             Project: Spark
          Issue Type: Bug
            Reporter: Alex Baretta
I am running a SparkSQL GROUP BY query on a largish Parquet table (a few hundred million rows), weighing in at about 50 GB. My cluster has 1.7 TB of RAM, so it should have more than enough resources to cope with this query.

WARN TaskSetManager: Lost task 279.0 in stage 22.0 (TID 1229, ds-model-w-21.c.eastern-gravity-771.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
	at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
	at scala.collection.AbstractSeq.distinct(Seq.scala:40)
	at org.apache.spark.sql.catalyst.expressions.Coalesce.resolved$lzycompute(nullFunctions.scala:33)
	at org.apache.spark.sql.catalyst.expressions.Coalesce.resolved(nullFunctions.scala:33)
	at org.apache.spark.sql.catalyst.expressions.Coalesce.dataType(nullFunctions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:100)
	at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:101)
	at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
	at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:81)
	at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
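For context, a minimal sketch of the kind of job that hits this code path, assuming the Spark 1.2-era SQLContext API. The Parquet path, table name, and column names (key, value) are hypothetical placeholders, not taken from the report; the actual query only needs a GROUP BY with a SUM over a COALESCE'd expression to reach the Coalesce/SumFunction frames in the stack trace above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object GroupByOomSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SPARK-5314 sketch"))
        val sqlContext = new SQLContext(sc)

        // Hypothetical ~50 GB Parquet table with a few hundred million rows.
        val events = sqlContext.parquetFile("/data/events.parquet")
        events.registerTempTable("events")

        // GROUP BY with a SUM over COALESCE, matching the Coalesce.eval /
        // SumFunction.update frames seen in the OOM stack trace.
        val grouped = sqlContext.sql(
          "SELECT key, SUM(COALESCE(value, 0)) AS total FROM events GROUP BY key")

        grouped.collect().foreach(println)
      }
    }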