Hi,

My Spark jobs suddenly started getting hung. Following the program, it seems
to be stuck whenever I do a collect(), count(), or rdd.saveAsParquetFile();
AFAIK, any operation that requires data to flow back to the master causes
this (an illustrative sketch of such calls follows the log below). I
increased the memory to 5 MB, and as per the debug statements the memory is
sufficient. I also increased -Xss. Here is the debug output leading up to the
hang:

15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(264808) called
with curMem=0, maxMem=1019782103
15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 258.6 KB, free 972.3 MB)
15/01/17 11:44:16 INFO spark.SparkContext: Starting job: collect at
SparkPlan.scala:85
15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(210344) called
with curMem=264808, maxMem=1019782103
15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_1 stored as
values in memory (estimated size 205.4 KB, free 972.1 MB)
15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(282200) called
with curMem=475152, maxMem=1019782103
15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_2 stored as
values in memory (estimated size 275.6 KB, free 971.8 MB)
15/01/17 11:44:16 INFO spark.SparkContext: Starting job: RangePartitioner
at Exchange.scala:79
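
To make the failure point concrete, the calls where it hangs look roughly
like this (illustrative only, since the real code cannot be shared yet; the
output path is a placeholder):

    // Illustrative only -- not the real job. schemaRDD stands for one of the
    // SchemaRDDs built via applySchema (see the sketch further down); these
    // are the kinds of actions at which the program hangs.
    schemaRDD.count()                                // hangs at the first count
    schemaRDD.collect()                              // or at a collect
    schemaRDD.saveAsParquetFile("/tmp/out.parquet")  // or when saving as Parquet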

A bit of background which may or may not be relevant: the program was working
fine in Eclipse; however, it was getting hung upon submission to the cluster.
In an attempt to debug, I changed the versions in build.sbt to match the ones
on the cluster.

sbt config when the program was working:
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
  "spark.jobserver" % "job-server-api" % "0.4.0",
  "com.github.nscala-time" %% "nscala-time" % "1.6.0",
  "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"


During debugging, I changed this to:
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.2.0" % "provided",
  "spark.jobserver" % "job-server-api" % "0.4.0",
  "com.github.nscala-time" %% "nscala-time" % "1.6.0",
  "org.apache.hadoop" % "hadoop-client" % "2.5.0" % "provided"

This is when the program started getting hung at the first rdd.count().
Now, even after reverting the changes in build.sbt, my program is getting
hung at the same point.

I tried the following config changes: in addition to setting -Xmx and -Xss in
eclipse.ini to 5 MB each, I set the variables below programmatically:

    sparkConf.set("spark.akka.frameSize","10")
    sparkConf.set("spark.shuffle.spill","true")
    sparkConf.set("spark.driver.memory","512m")
    sparkConf.set("spark.executor.memory","1g")
    sparkConf.set("spark.driver.maxResultSize","1g")

Please note: in Eclipse as well as in the sbt shell, the program kept throwing
StackOverflowError; increasing -Xss to 5 MB eliminated that problem. Could
this be something unrelated to memory? The SchemaRDDs have close to 400
columns, hence I am using StructType(StructField) and performing applySchema,
roughly as sketched below.
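
A simplified sketch of that pattern (made-up field names and input path; the
real schema has close to 400 columns, and sc/sqlContext are the contexts from
above):

    import org.apache.spark.sql._

    // Simplified sketch of the StructType/StructField + applySchema pattern
    // (Spark 1.1/1.2 API). Field names and input path are illustrative.
    val fieldNames = Seq("col1", "col2", "col3")   // ~400 names in the real job
    val schema = StructType(
      fieldNames.map(name => StructField(name, StringType, nullable = true)))

    val rowRDD = sc.textFile("input.txt")          // placeholder input
      .map(_.split(","))
      .map(parts => Row(parts: _*))

    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("myTable")
    schemaRDD.count()                              // this is where it hangs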

My code cannot be shared right now. If required, I will edit it and post.
Regards,
Sunita
