We experienced a similar problem when implementing LDA on Spark. We now call RDD.checkpoint every 10 iterations to cut the lineage DAG. Note that checkpointing hurts performance, since it submits an extra job to write the data to HDFS.
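The every-N-iterations pattern can be sketched without a cluster. This is a toy model only (FakeRdd and its depth counter are illustrative, not Spark APIs): it tracks how deep the lineage chain gets per transformation and shows that checkpointing every 10 iterations keeps the depth bounded instead of letting it grow with the iteration count.

```scala
// Toy sketch (not actual Spark code): model each RDD as a node whose
// lineage depth grows by one per transformation; "checkpointing" resets
// it, mirroring how RDD.checkpoint truncates the dependency chain.
object LineageSketch {
  final case class FakeRdd(depth: Int) {
    def map: FakeRdd = FakeRdd(depth + 1)   // each transformation adds a dependency
    def checkpoint: FakeRdd = FakeRdd(0)    // checkpoint cuts the lineage chain
  }

  // Run `iterations` transformations, checkpointing every `checkpointInterval`
  // steps, and return the deepest lineage chain ever observed.
  def run(iterations: Int, checkpointInterval: Int): Int = {
    var rdd = FakeRdd(0)
    var maxDepth = 0
    for (i <- 1 to iterations) {
      rdd = rdd.map
      maxDepth = math.max(maxDepth, rdd.depth)
      if (i % checkpointInterval == 0) rdd = rdd.checkpoint
    }
    maxDepth
  }

  def main(args: Array[String]): Unit = {
    println(run(100, 10))   // prints 10: depth stays bounded instead of reaching 100
  }
}
```

In real Spark code the same shape corresponds to calling SparkContext.setCheckpointDir once up front and rdd.checkpoint() every 10 iterations, followed by an action so the checkpoint is actually materialized.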
On Tue, Jan 28, 2014 at 5:15 PM, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> wrote:

> I see this error thrown from Executor.scala:
>
>     task = ser.deserialize[Task[Any]](taskBytes,
>       Thread.currentThread.getContextClassLoader)
>
> Any suggestions to break the task down into smaller chunks to avoid this?
>
> Thanks,
> Qiuzhuang
>
> On Sun, Jan 26, 2014 at 2:52 PM, Shao, Saisai <saisai.s...@intel.com> wrote:
>
> > In my tests I found this phenomenon might be caused by the RDD's long
> > dependency chain: the chain is serialized into each task and sent to the
> > executors, and deserializing the task causes the stack overflow.
> >
> > This happens especially in iterative jobs, like:
> >
> >     var rdd = ..
> >     for (i <- 0 to 100)
> >       rdd = rdd.map(x => x)
> >     rdd = rdd.cache
> >
> > Here rdd's dependencies are chained, and at some point a stack overflow
> > will occur.
> >
> > You can check
> > (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
> > and
> > (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
> > for details. The current workaround is to cut the dependency chain by
> > checkpointing the RDD; a better way might be to clean the dependency
> > chain after the materialized stage is executed.
> >
> > Thanks
> > Jerry
> >
> > -----Original Message-----
> > From: Reynold Xin [mailto:r...@databricks.com]
> > Sent: Sunday, January 26, 2014 2:04 PM
> > To: dev@spark.incubator.apache.org
> > Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow
> > with too many iterations"?
> >
> > I'm not entirely sure, but two candidates are:
> >
> > the visit function in stageDependsOn
> >
> > submitStage
> >
> > On Sat, Jan 25, 2014 at 10:01 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> >
> > > I'm an idiot, but which part of the DAGScheduler is recursive here?
> > > Seems like processEvent shouldn't have inherently recursive properties.
> > >
> > > On Sat, Jan 25, 2014 at 9:57 PM, Reynold Xin <r...@databricks.com> wrote:
> > >
> > > > It seems to me that fixing the DAGScheduler to make it not recursive
> > > > is the better solution here, given the cost of checkpointing.
> > > >
> > > > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > The description of this bug, as submitted by Matei, is as follows:
> > > > >
> > > > > The tipping point seems to be around 50. We should fix this by
> > > > > checkpointing the RDDs every 10-20 iterations to break the lineage
> > > > > chain, but checkpointing currently requires HDFS installed, which
> > > > > not all users will have.
> > > > >
> > > > > We might also be able to fix DAGScheduler to not be recursive.
> > > > >
> > > > > regards,
> > > > > Andrew
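The alternative fix discussed in the thread, making the DAGScheduler's recursive graph walk non-recursive, amounts to replacing call-stack recursion with an explicit stack on the heap. The sketch below is a toy illustration of that transformation only; Node, countRecursive, and countIterative are made-up names, not Spark's actual internals.

```scala
// Toy sketch: a recursive dependency-graph visit vs. an iterative one with
// an explicit stack. The iterative version survives lineage chains far
// deeper than the JVM call stack allows.
object IterativeVisit {
  // A node depends on its parents; a long chain models a deep lineage DAG.
  final case class Node(id: Int, parents: List[Node])

  // Recursive visit: overflows the call stack on a chain of ~100k nodes.
  def countRecursive(n: Node): Int =
    1 + n.parents.map(countRecursive).sum

  // Iterative visit with an explicit stack: depth is bounded by heap, not
  // the call stack. (A real DAG visit would also track visited nodes to
  // avoid double-counting shared parents; a chain has none.)
  def countIterative(root: Node): Int = {
    var count = 0
    var stack = List(root)
    while (stack.nonEmpty) {
      val n = stack.head
      stack = n.parents ++ stack.tail
      count += 1
    }
    count
  }

  // Build a dependency chain of the given length, iteratively.
  def chain(length: Int): Node =
    (1 to length).foldLeft(Node(0, Nil))((parent, i) => Node(i, List(parent)))

  def main(args: Array[String]): Unit = {
    println(countIterative(chain(200000)))  // 200001 nodes, no stack overflow
  }
}
```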