Hi,

My Spark jobs suddenly started hanging, and here is the debugging that led me to this point. Stepping through the program, it appears to be stuck whenever I perform any collect(), count(), or rdd.saveAsParquetFile(). AFAIK, any operation that requires data to flow back to the driver triggers this. I increased the memory to 5 MB, and according to my debug statements the available memory is sufficient. I also increased -Xss and -Xmx.
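For concreteness, these are the kinds of calls where the hang occurs (a sketch only; the RDD name and output path are placeholders, and the method names are the standard Spark 1.1/1.2 RDD and SchemaRDD API):

```scala
// All three of these are actions: they launch a job and either return
// data to the driver or write it out, which is where the hang shows up.
val rows = schemaRDD.collect()              // pulls every row back to the driver
val n    = schemaRDD.count()                // full job, but returns a single Long
schemaRDD.saveAsParquetFile("/tmp/out.parquet")  // placeholder path
```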
15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(264808) called with curMem=0, maxMem=1019782103
15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 258.6 KB, free 972.3 MB)
15/01/17 11:44:16 INFO spark.SparkContext: Starting job: collect at SparkPlan.scala:85
15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(210344) called with curMem=264808, maxMem=1019782103
15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 205.4 KB, free 972.1 MB)
15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(282200) called with curMem=475152, maxMem=1019782103
15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 275.6 KB, free 971.8 MB)
15/01/17 11:44:16 INFO spark.SparkContext: Starting job: RangePartitioner at Exchange.scala:79

A bit of background, which may or may not be relevant: the program was working fine in Eclipse but hung when submitted to the cluster. In an attempt to debug that, I changed the versions in build.sbt to match the ones on the cluster. This was the sbt configuration when the program was working:

"org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
"spark.jobserver" % "job-server-api" % "0.4.0",
"com.github.nscala-time" %% "nscala-time" % "1.6.0",
"org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"

During debugging, I changed this to:

"org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.2.0" % "provided",
"spark.jobserver" % "job-server-api" % "0.4.0",
"com.github.nscala-time" %% "nscala-time" % "1.6.0",
"org.apache.hadoop" % "hadoop-client" % "2.5.0" % "provided"

This is when the program started hanging at the first rdd.count(). Now, even after reverting the changes in build.sbt, my program hangs at the same point.
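To keep the Spark artifacts from drifting apart while I flip versions back and forth, I could pin them through a single value in build.sbt (a sketch; the `sparkVersion` and `hadoopVersion` names are my own):

```scala
// build.sbt sketch: one val per version, so spark-core and spark-sql
// always move together and stay matched to the cluster.
val sparkVersion  = "1.1.0"  // must match the cluster exactly
val hadoopVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark"      %% "spark-core"     % sparkVersion  % "provided",
  "org.apache.spark"      %% "spark-sql"      % sparkVersion  % "provided",
  "spark.jobserver"        % "job-server-api" % "0.4.0",
  "com.github.nscala-time" %% "nscala-time"   % "1.6.0",
  "org.apache.hadoop"      % "hadoop-client"  % hadoopVersion % "provided"
)
```

After reverting versions I also run `sbt clean` so no classes compiled against 1.2.0 linger in target/.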
I tried these config changes in addition to raising -Xmx and -Xss in eclipse.ini to 5 MB each, and set the properties below programmatically:

sparkConf.set("spark.akka.frameSize", "10")
sparkConf.set("spark.shuffle.spill", "true")
sparkConf.set("spark.driver.memory", "512m")
sparkConf.set("spark.executor.memory", "1g")
sparkConf.set("spark.driver.maxResultSize", "1g")

Please note: in Eclipse as well as in the sbt console, the program kept throwing StackOverflowError; increasing -Xss to 5 MB eliminated that problem. Could this be something unrelated to memory? The SchemaRDDs have close to 400 columns, so I am using StructType(StructField) and performing applySchema. I cannot share my code right now; if required, I will edit it and post it.

Regards,
Sunita
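For reference, this is roughly how I build the wide schema (a sketch, not my actual code; the column names, input path, and all-StringType assumption are placeholders, and the imports assume the Spark 1.1/1.2 Scala API where StructType/StructField are re-exported through the org.apache.spark.sql package object):

```scala
import org.apache.spark.sql.{Row, StructType, StructField, StringType}

// Build the ~400-field StructType from a flat Seq rather than one giant
// nested literal, so the source has no deeply nested expression.
val columnNames: Seq[String] = (1 to 400).map(i => s"col_$i") // hypothetical names
val schema = StructType(
  columnNames.map(name => StructField(name, StringType, nullable = true)))

// Parse each input line into a Row and attach the schema.
val rowRDD = sc.textFile("input.txt") // placeholder path
  .map(_.split(","))
  .map(fields => Row(fields: _*))
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
```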