I was able to resolve this by adding rdd.collect() after every stage. This
forced RDD evaluation at each step and helped avoid the choke point.
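
For anyone hitting the same thing, the workaround looks roughly like this (a
minimal sketch; the input path and transformations are hypothetical
placeholders, assuming a SparkContext named sc):

    // Hypothetical input and transformations, only to illustrate the workaround.
    val raw = sc.textFile("hdfs:///path/to/input")
    val parsed = raw.map(_.split(",")).filter(_.nonEmpty)
    parsed.collect()   // materializes this stage before moving on

    val keyed = parsed.map(fields => (fields(0), fields.length))
    keyed.collect()    // materializes again ahead of the next stage

Note that collect() pulls the whole RDD back to the driver, so on larger
datasets a count() (or caching) is a cheaper way to force evaluation.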

regards
Sunita Koppar

On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind <sunitarv...@gmail.com>
wrote:

> Hi,
>
> My Spark jobs suddenly started getting hung, and here is the debug output
> leading up to it. Following the program, it seems to be stuck whenever I do a
> collect(), count() or rdd.saveAsParquetFile(). AFAIK, any operation that
> requires data to flow back to the master causes this. I increased the memory
> to 5 MB, and as per the debug statements the memory is sufficient. I also
> increased -Xss. The log:
>
> 15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(264808) called
> with curMem=0, maxMem=1019782103
> 15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_0 stored as
> values in memory (estimated size 258.6 KB, free 972.3 MB)
> 15/01/17 11:44:16 INFO spark.SparkContext: Starting job: collect at
> SparkPlan.scala:85
> 15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(210344) called
> with curMem=264808, maxMem=1019782103
> 15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_1 stored as
> values in memory (estimated size 205.4 KB, free 972.1 MB)
> 15/01/17 11:44:16 INFO storage.MemoryStore: ensureFreeSpace(282200) called
> with curMem=475152, maxMem=1019782103
> 15/01/17 11:44:16 INFO storage.MemoryStore: Block broadcast_2 stored as
> values in memory (estimated size 275.6 KB, free 971.8 MB)
> 15/01/17 11:44:16 INFO spark.SparkContext: Starting job: RangePartitioner
> at Exchange.scala:79
>
> A bit of background which may or may not be relevant: the program was
> working fine in Eclipse; however, it was getting hung upon submission to the
> cluster. In an attempt to debug, I changed the versions in build.sbt to
> match the ones on the cluster.
>
> sbt config when the program was working:
>   "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
>   "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
>   "spark.jobserver" % "job-server-api" % "0.4.0",
>   "com.github.nscala-time" %% "nscala-time" % "1.6.0",
>   "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"
>
>
> During debugging, I changed this to:
>   "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
>   "org.apache.spark" %% "spark-sql" % "1.2.0" % "provided",
>   "spark.jobserver" % "job-server-api" % "0.4.0",
>   "com.github.nscala-time" %% "nscala-time" % "1.6.0",
>   "org.apache.hadoop" % "hadoop-client" % "2.5.0" % "provided"
>
> This is when the program started getting hung at the first rdd.count().
> Now, even after reverting the changes in build.sbt, my program is getting
> hung at the same point.
>
> In addition to setting -Xmx and -Xss in eclipse.ini to 5 MB each, I tried
> these config changes and set the vars below programmatically:
>
>     sparkConf.set("spark.akka.frameSize","10")
>     sparkConf.set("spark.shuffle.spill","true")
>     sparkConf.set("spark.driver.memory","512m")
>     sparkConf.set("spark.executor.memory","1g")
>     sparkConf.set("spark.driver.maxResultSize","1g")
>
> Please note: in Eclipse as well as in the sbt shell, the program kept throwing
> StackOverflowError. Increasing -Xss to 5 MB eliminated that problem.
> Could this be something unrelated to memory? The SchemaRDDs have close to
> 400 columns, hence I am building the schema with StructType(StructField) and
> performing applySchema.
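>
> Roughly, the schema is built along these lines (a minimal sketch; the column
> names, sqlContext and the input RDD are placeholders, not the real ones):
>
>     import org.apache.spark.sql._
>
>     // Placeholder column names; the real schema has close to 400 fields.
>     val columns = Seq("col1", "col2", "col3")
>
>     // One StructField per column, built programmatically.
>     val schema = StructType(columns.map(name => StructField(name, StringType, nullable = true)))
>
>     // Convert each input record (placeholder RDD of delimited strings) to a Row.
>     val rowRDD = rawLines.map(_.split(",")).map(fields => Row(fields: _*))
>
>     // Apply the schema to get a SchemaRDD (Spark 1.1/1.2 API).
>     val schemaRDD = sqlContext.applySchema(rowRDD, schema)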
>
> My code cannot be shared right now. If required, I will edit it and post.
> regards
> Sunita
>
>
>
