We experienced a similar problem when implementing LDA on Spark. We now call RDD.checkpoint every 10 iterations to cut the lineage DAG. Note that checkpointing hurts performance, since it submits an extra job to write the data to HDFS.
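The every-N-iterations pattern can be sketched without a cluster. This is a toy model only (FakeRdd and its depth counter are illustrative, not Spark APIs): it tracks how deep the lineage chain gets per transformation and shows that checkpointing every 10 iterations keeps the depth bounded instead of letting it grow with the iteration count.

```scala
// Toy sketch (not actual Spark code): model each RDD as a node whose
// lineage depth grows by one per transformation; "checkpointing" resets
// it, mirroring how RDD.checkpoint truncates the dependency chain.
object LineageSketch {
  final case class FakeRdd(depth: Int) {
    def map: FakeRdd = FakeRdd(depth + 1)   // each transformation adds a dependency
    def checkpoint: FakeRdd = FakeRdd(0)    // checkpoint cuts the lineage chain
  }

  // Run `iterations` transformations, checkpointing every `checkpointInterval`
  // steps, and return the deepest lineage chain ever observed.
  def run(iterations: Int, checkpointInterval: Int): Int = {
    var rdd = FakeRdd(0)
    var maxDepth = 0
    for (i <- 1 to iterations) {
      rdd = rdd.map
      maxDepth = math.max(maxDepth, rdd.depth)
      if (i % checkpointInterval == 0) rdd = rdd.checkpoint
    }
    maxDepth
  }

  def main(args: Array[String]): Unit = {
    println(run(100, 10))   // prints 10: depth stays bounded instead of reaching 100
  }
}
```

In real Spark code the same shape corresponds to calling SparkContext.setCheckpointDir once up front and rdd.checkpoint() every 10 iterations, followed by an action so the checkpoint is actually materialized.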
On Tue, Jan 28, 2014 at 5:15 PM, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> wrote:

> I see this error thrown from Executor.scala:
>
>     task = ser.deserialize[Task[Any]](taskBytes,
>       Thread.currentThread.getContextClassLoader)
>
> Any suggestions to break the task down into smaller chunks to avoid this?
>
> Thanks,
> Qiuzhuang
>
> On Sun, Jan 26, 2014 at 2:52 PM, Shao, Saisai <saisai.s...@intel.com> wrote:
>
> > In my tests I found this phenomenon might be caused by the RDD's long
> > dependency chain: the chain is serialized into each task and sent to the
> > executors, and deserializing the task causes the stack overflow.
> >
> > This happens especially in iterative jobs, like:
> >
> >     var rdd = ..
> >     for (i <- 0 to 100)
> >       rdd = rdd.map(x => x)
> >     rdd = rdd.cache
> >
> > Here rdd's dependencies are chained, and at some point a stack overflow
> > will occur.
> >
> > You can check
> > (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
> > and
> > (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
> > for details. The current workaround is to cut the dependency chain by
> > checkpointing the RDD; a better way might be to clean the dependency
> > chain after the materialized stage is executed.
> >
> > Thanks
> > Jerry
> >
> > -----Original Message-----
> > From: Reynold Xin [mailto:r...@databricks.com]
> > Sent: Sunday, January 26, 2014 2:04 PM
> > To: dev@spark.incubator.apache.org
> > Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow
> > with too many iterations"?
> >
> > I'm not entirely sure, but two candidates are:
> >
> > the visit function in stageDependsOn
> >
> > submitStage
> >
> > On Sat, Jan 25, 2014 at 10:01 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> >
> > > I'm an idiot, but which part of the DAGScheduler is recursive here?
> > > Seems like processEvent shouldn't have inherently recursive properties.
> > >
> > > On Sat, Jan 25, 2014 at 9:57 PM, Reynold Xin <r...@databricks.com> wrote:
> > >
> > > > It seems to me that fixing the DAGScheduler to make it not recursive
> > > > is the better solution here, given the cost of checkpointing.
> > > >
> > > > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > The description of this bug, as submitted by Matei, is as follows:
> > > > >
> > > > > The tipping point seems to be around 50. We should fix this by
> > > > > checkpointing the RDDs every 10-20 iterations to break the lineage
> > > > > chain, but checkpointing currently requires HDFS installed, which
> > > > > not all users will have.
> > > > >
> > > > > We might also be able to fix DAGScheduler to not be recursive.
> > > > >
> > > > > regards,
> > > > > Andrew
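The alternative fix discussed in the thread, making the DAGScheduler's recursive graph walk non-recursive, amounts to replacing call-stack recursion with an explicit stack on the heap. The sketch below is a toy illustration of that transformation only; Node, countRecursive, and countIterative are made-up names, not Spark's actual internals.

```scala
// Toy sketch: a recursive dependency-graph visit vs. an iterative one with
// an explicit stack. The iterative version survives lineage chains far
// deeper than the JVM call stack allows.
object IterativeVisit {
  // A node depends on its parents; a long chain models a deep lineage DAG.
  final case class Node(id: Int, parents: List[Node])

  // Recursive visit: overflows the call stack on a chain of ~100k nodes.
  def countRecursive(n: Node): Int =
    1 + n.parents.map(countRecursive).sum

  // Iterative visit with an explicit stack: depth is bounded by heap, not
  // the call stack. (A real DAG visit would also track visited nodes to
  // avoid double-counting shared parents; a chain has none.)
  def countIterative(root: Node): Int = {
    var count = 0
    var stack = List(root)
    while (stack.nonEmpty) {
      val n = stack.head
      stack = n.parents ++ stack.tail
      count += 1
    }
    count
  }

  // Build a dependency chain of the given length, iteratively.
  def chain(length: Int): Node =
    (1 to length).foldLeft(Node(0, Nil))((parent, i) => Node(i, List(parent)))

  def main(args: Array[String]): Unit = {
    println(countIterative(chain(200000)))  // 200001 nodes, no stack overflow
  }
}
```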