Hi devs,

Is there any connection between input file size and available RAM when sorting with SparkSQL? I tried a 1 GB file on a machine with 8 GB RAM and 4 cores and got java.lang.OutOfMemoryError: GC overhead limit exceeded. Or could it be failing for some other reason? The same setup works fine for other SparkSQL operations.
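For context, the job is essentially the following. This is a simplified sketch matching the call sites in the log (textFile at SortSQL.scala:20, the String.split at SqlRunner.scala:26, the parsed query, saveAsTextFile at SortSQL.scala:24); the Person class, comma delimiter, and output path are placeholders, not the exact code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Placeholder row class: the real schema is unknown; the query only shows a column B.
case class Person(a: String, b: Double)

object SortSQL {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortSQL"))
    val hiveContext = new HiveContext(sc)
    import hiveContext._  // implicit RDD -> SchemaRDD conversion for case-class RDDs

    // SortSQL.scala:20 in the log: read the 1 GB input
    // (Hadoop splits it into 33 partitions of 32 MB each)
    val people = sc.textFile("file:/home/devan/skypeFiles/pcaData1GB.1.txt")
      .map(_.split(","))  // SqlRunner.scala:26 shows a String.split per line; "," is an assumed delimiter
      .map(p => Person(p(0), p(1).toDouble))

    people.registerTempTable("people")

    // The exact query from the log; SORT BY orders rows within each partition
    val sorted = hiveContext.sql("SELECT * FROM people SORT BY B DESC")

    // SortSQL.scala:24 in the log; the output path here is hypothetical
    sorted.saveAsTextFile("file:/home/devan/skypeFiles/sortedOutput")
  }
}

The failing frames in the trace below sit inside org.apache.spark.sql.execution.Sort (basicOperators.scala:207-209), which calls toArray to materialize each partition in memory before sorting it.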
15/02/28 16:33:03 INFO Utils: Successfully started service 'sparkDriver' on port 41392.
15/02/28 16:33:03 INFO SparkEnv: Registering MapOutputTracker
15/02/28 16:33:03 INFO SparkEnv: Registering BlockManagerMaster
15/02/28 16:33:03 INFO DiskBlockManager: Created local directory at /tmp/spark-ecf4d6f0-c526-48fa-bd8a-d74a8bf64820/spark-4865c193-05e6-4aa1-999b-ab8c426479ab
15/02/28 16:33:03 INFO MemoryStore: MemoryStore started with capacity 944.7 MB
15/02/28 16:33:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/28 16:33:03 INFO HttpFileServer: HTTP File server directory is /tmp/spark-af545c0b-15e6-4efa-a151-2c73faba8948/spark-987f58b4-5735-4965-91d1-38f238f4bb11
15/02/28 16:33:03 INFO HttpServer: Starting HTTP Server
15/02/28 16:33:03 INFO Utils: Successfully started service 'HTTP file server' on port 44588.
15/02/28 16:33:08 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/28 16:33:08 INFO SparkUI: Started SparkUI at http://10.30.9.7:4040
15/02/28 16:33:08 INFO Executor: Starting executor ID <driver> on host localhost
15/02/28 16:33:08 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@10.30.9.7:41392/user/HeartbeatReceiver
15/02/28 16:33:08 INFO NettyBlockTransferService: Server created on 34475
15/02/28 16:33:08 INFO BlockManagerMaster: Trying to register BlockManager
15/02/28 16:33:08 INFO BlockManagerMasterActor: Registering block manager localhost:34475 with 944.7 MB RAM, BlockManagerId(<driver>, localhost, 34475)
15/02/28 16:33:08 INFO BlockManagerMaster: Registered BlockManager
15/02/28 16:33:09 INFO MemoryStore: ensureFreeSpace(193213) called with curMem=0, maxMem=990550425
15/02/28 16:33:09 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 188.7 KB, free 944.5 MB)
15/02/28 16:33:09 INFO MemoryStore: ensureFreeSpace(25432) called with curMem=193213, maxMem=990550425
15/02/28 16:33:09 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.8 KB, free 944.5 MB)
15/02/28 16:33:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:34475 (size: 24.8 KB, free: 944.6 MB)
15/02/28 16:33:09 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/28 16:33:09 INFO SparkContext: Created broadcast 0 from textFile at SortSQL.scala:20
15/02/28 16:33:10 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/28 16:33:10 INFO ObjectStore: ObjectStore, initialize called
15/02/28 16:33:10 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/28 16:33:10 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/28 16:33:12 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/28 16:33:12 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/02/28 16:33:13 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/28 16:33:13 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/28 16:33:13 INFO ObjectStore: Initialized ObjectStore
15/02/28 16:33:14 INFO HiveMetaStore: Added admin role in metastore
15/02/28 16:33:14 INFO HiveMetaStore: Added public role in metastore
15/02/28 16:33:14 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/28 16:33:14 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/28 16:33:14 INFO ParseDriver: Parsing command: SELECT * FROM people SORT BY B DESC
15/02/28 16:33:14 INFO ParseDriver: Parse Completed
15/02/28 16:33:14 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/02/28 16:33:14 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/02/28 16:33:14 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/02/28 16:33:14 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/02/28 16:33:14 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/02/28 16:33:15 INFO FileInputFormat: Total input paths to process : 1
15/02/28 16:33:15 INFO SparkContext: Starting job: saveAsTextFile at SortSQL.scala:24
15/02/28 16:33:15 INFO DAGScheduler: Got job 0 (saveAsTextFile at SortSQL.scala:24) with 33 output partitions (allowLocal=false)
15/02/28 16:33:15 INFO DAGScheduler: Final stage: Stage 0(saveAsTextFile at SortSQL.scala:24)
15/02/28 16:33:15 INFO DAGScheduler: Parents of final stage: List()
15/02/28 16:33:15 INFO DAGScheduler: Missing parents: List()
15/02/28 16:33:15 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at saveAsTextFile at SortSQL.scala:24), which has no missing parents
15/02/28 16:33:15 INFO MemoryStore: ensureFreeSpace(130464) called with curMem=218645, maxMem=990550425
15/02/28 16:33:15 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 127.4 KB, free 944.3 MB)
15/02/28 16:33:15 INFO MemoryStore: ensureFreeSpace(78527) called with curMem=349109, maxMem=990550425
15/02/28 16:33:15 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 76.7 KB, free 944.3 MB)
15/02/28 16:33:15 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:34475 (size: 76.7 KB, free: 944.6 MB)
15/02/28 16:33:15 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/28 16:33:15 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/02/28 16:33:15 INFO DAGScheduler: Submitting 33 missing tasks from Stage 0 (MappedRDD[7] at saveAsTextFile at SortSQL.scala:24)
15/02/28 16:33:15 INFO TaskSchedulerImpl: Adding task set 0.0 with 33 tasks
15/02/28 16:33:15 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1308 bytes)
15/02/28 16:33:15 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1308 bytes)
15/02/28 16:33:15 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 1308 bytes)
15/02/28 16:33:15 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 1308 bytes)
15/02/28 16:33:15 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
15/02/28 16:33:15 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
15/02/28 16:33:15 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/28 16:33:15 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/28 16:33:15 INFO HadoopRDD: Input split: file:/home/devan/skypeFiles/pcaData1GB.1.txt:0+33554432
15/02/28 16:33:15 INFO HadoopRDD: Input split: file:/home/devan/skypeFiles/pcaData1GB.1.txt:33554432+33554432
15/02/28 16:33:15 INFO HadoopRDD: Input split: file:/home/devan/skypeFiles/pcaData1GB.1.txt:67108864+33554432
15/02/28 16:33:15 INFO HadoopRDD: Input split: file:/home/devan/skypeFiles/pcaData1GB.1.txt:100663296+33554432
15/02/28 16:36:44 WARN NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at io.netty.util.internal.MpscLinkedQueue.offer(MpscLinkedQueue.java:129)
    at io.netty.util.internal.MpscLinkedQueue.add(MpscLinkedQueue.java:230)
    at io.netty.util.concurrent.SingleThreadEventExecutor.fetchFromDelayedQueue(SingleThreadEventExecutor.java:270)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:369)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:722)
15/02/28 16:36:49 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.String.split(String.java:2333)
    at java.lang.String.split(String.java:2403)
    at Test.sparkSQL.SqlRunner$$anonfun$2.apply(SqlRunner.scala:26)
    at Test.sparkSQL.SqlRunner$$anonfun$2.apply(SqlRunner.scala:26)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:209)
    at org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
15/02/28 16:36:55 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, PROCESS_LOCAL, 1308 bytes)
15/02/28 16:36:55 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.String.split(String.java:2333)
    at java.lang.String.split(String.java:2403)
    at Test.sparkSQL.SqlRunner$$anonfun$2.apply(SqlRunner.scala:26)
    at Test.sparkSQL.SqlRunner$$anonfun$2.apply(SqlRunner.scala:26)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:209)
    at org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
15/02/28 16:36:55 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
15/02/28 16:36:58 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.String.split(String.java:2333)
    at java.lang.String.split(String.java:2403)
    at Test.sparkSQL.SqlRunner$$anonfun$2.apply(SqlRunner.scala:26)
    at Test.sparkSQL.SqlRunner$$anonfun$2.apply(SqlRunner.scala:26)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:209)
    at org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
15/02/28 16:37:04 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job
Process finished with exit code 52