Unfortunately, groupBy is not the most efficient operation here: it shuffles every value for each key across the network before any aggregation happens, which is often what exhausts the heap. What is it you're trying to do? It may be possible with one of the other *byKey transformations, which combine values map-side before the shuffle.
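For example, if the goal is a per-key aggregate, reduceByKey does the same job with far less shuffle traffic. A minimal sketch, assuming a count keyed on the first column of a CSV — the paths, key extraction, and aggregation are placeholders, since your actual job isn't shown:

    import org.apache.spark.{SparkConf, SparkContext}

    object ByKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("byKey-sketch"))

        // Hypothetical pair RDD standing in for the large table;
        // replace the path and the key extraction with your own.
        val pairs = sc.textFile("hdfs:///path/to/input")
          .map(line => (line.split(",")(0), 1L))

        // groupByKey ships every value for a key to a single executor
        // before anything is aggregated -- the pattern that tends to
        // OOM at this scale:
        //   val counts = pairs.groupByKey().mapValues(_.sum)

        // reduceByKey pre-combines on each partition, so only one
        // partial sum per key per partition crosses the network:
        val counts = pairs.reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///path/to/output")
        sc.stop()
      }
    }

aggregateByKey and combineByKey give the same map-side combining when the result type differs from the value type.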
From: "SAHA, DEBOBROTA" Date: Wednesday, September 2, 2015 at 7:46 PM To: "'user@spark.apache.org<mailto:'user@spark.apache.org>'" Subject: Unbale to run Group BY on Large File Hi , I am getting below error while I am trying to select data using SPARK SQL from a RDD table. java.lang.OutOfMemoryError: GC overhead limit exceeded "Spark Context Cleaner" java.lang.InterruptedException The file or table size is around 113 GB and I am running SPARK 1.4 on a standalone cluster. Tried to extend the heap size but extending to 64GB also didn’t help. I would really appreciate any help on this. Thanks, Debobrota