Hi John,

Thanks for sharing the design issue.
As you pointed out, let's keep the recursive structure. We can revisit
optimizing it if we encounter problems.

Thanks.
-Gon


On Sun, Jun 3, 2018 at 2:37 PM, John Yang <[email protected]> wrote:

> It seems that the stack overflow errors in Spark are mostly caused by the
> 'scheduler' in the master recursively visiting all of the RDDs that
> belong to an iterative job. I think the error I bumped into when trying
> to run ALS on Spark was also caused by this. On the other hand, I suspect
> the chain of RDDs pipelined into each task would be much shorter than
> that. Also, a quick search tells me that the Java call stack is thousands
> of levels deep by default, which makes it even more unlikely that our
> Task will run into this error.
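>
> (As a rough reference, here is a minimal Java sketch for estimating the
> default call stack depth on a given JVM; the class and variable names
> are just for illustration:)
>
>   public class StackDepthProbe {
>     private static int depth = 0;
>
>     // Recurse until the JVM runs out of stack, counting frames.
>     private static void recurse() {
>       depth++;
>       recurse();
>     }
>
>     public static void main(String[] args) {
>       try {
>         recurse();
>       } catch (StackOverflowError e) {
>         // With the default -Xss, this typically prints a value in the
>         // thousands or tens of thousands.
>         System.out.println("Overflowed at depth: " + depth);
>       }
>     }
>   }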
>
> So, as the first implementation goal, it may be better to keep the
> recursive structure and just do a bit of refactoring to handle the other
> issues. I'd be happy to work on those. We can handle stack overflow
> errors once they actually become a problem for our users.
>
> John
>
>
>
> On Sun, Jun 3, 2018, 1:43 PM John Yang <[email protected]> wrote:
>
> > Hi Nemo devs,
> >
> >
> >
> > I’d like to get your feedback on some ideas I have regarding TaskExecutor
> > (formerly TaskGroupExecutor).
> >
> >
> > At the moment, TaskExecutor executes the IRVertex DAG of a task like the
> > following pseudocode. The ‘rootVertex’ here is either a source vertex or
> > a vertex that fetches data from a parent stage.
> >
> >
> > prepareDataStructuresForIRVertexDataHandlers(irVertexDAG)
> >
> > for each rootVertex {
> >   for each rootVertex’s incoming element {
> >     recursivelyExecute(childIRVertexDataHandler(rootVertex), element)
> >   }
> > }
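> >
> > (To make the recursion concrete, here is a rough Java sketch of the
> > pattern above; this is not the actual TaskExecutor code, and the
> > VertexDataHandler interface and its methods are placeholders:)
> >
> >   import java.util.List;
> >
> >   interface VertexDataHandler {
> >     Object process(Object element);          // apply this vertex's transform
> >     List<VertexDataHandler> getChildren();   // handlers of child vertices
> >   }
> >
> >   final class RecursiveExecutor {
> >     // Each child adds a stack frame, so a chain of N vertices pushes
> >     // N frames for every element that trickles down.
> >     static void recursivelyExecute(VertexDataHandler handler, Object element) {
> >       Object output = handler.process(element);
> >       for (VertexDataHandler child : handler.getChildren()) {
> >         recursivelyExecute(child, output);
> >       }
> >     }
> >   }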
> >
> >
> > While this code has served us well so far, I feel there’s room for
> > improvement.
> >
> >
> > Most importantly, this code runs into stack overflow errors when handling
> > a long chain of vertices, which appears quite frequently in user
> > applications. A quick search for “stackoverflowerror spark” on Google
> > returns quite a few bug reports from users trying to run long RDD
> > lineages on Spark, which also employs a recursive execution model. It’d
> > be nice if Nemo avoided this error.
> >
> >
> > Another issue I see is that we are consuming all of a root vertex's data
> > before moving on to the next root vertex. I feel this is unfair to root
> > vertices that are chosen later. In the case of stream-joining two active
> > data streams, this effectively results in consuming only one of the
> > streams and never looking at the other. It'd be better if we gave a fair
> > chance to every root vertex.
> >
> >
> > Finally, if we're to do away with recursion, it'd be nice to refactor
> > IRVertexDataHandler and related data structures a bit. With recursion,
> > the data structures are used to recursively reference child vertices.
> > Without recursion, we may want to loop through the vertices in
> > topological order and reference the OutputCollectors of each vertex's
> > parents to consume data, if any exists. Of course, the per-element
> > overheads would remain small, involving only pointer dereferences and
> > boolean checks.
> >
> >
> > So, a sketch of what I want to do is something like this:
> >
> >
> > (rootHandlers, nonRootHandlers) = topoSortAndSetUpHandlers(irVertexDAG)
> >
> > while (!allDone) {
> >   // Fair chance across roots
> >   for rootHandler {
> >     readElement(rootHandler)
> >   }
> >
> >   // Trickle down the element(s) to the leaf nodes in topological order
> >   while (!nonRootsDoneForTheseRootElements) {
> >     for nonRootHandler {
> >       execute(nonRootHandler)
> >     }
> >   }
> > }
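> >
> > (A rough Java sketch of this loop; the Handler interface and its
> > methods are my placeholders, not the actual Nemo API:)
> >
> >   import java.util.List;
> >
> >   interface Handler {
> >     boolean tryReadElement();        // root: read one element, if available
> >     boolean consumeParentOutputs();  // non-root: drain parents' OutputCollectors;
> >                                      // returns true if anything was consumed
> >   }
> >
> >   final class LoopExecutor {
> >     static void execute(List<Handler> roots, List<Handler> nonRoots) {
> >       boolean allDone = false;
> >       while (!allDone) {
> >         // Fair chance across roots: at most one element each per round.
> >         allDone = true;
> >         for (Handler root : roots) {
> >           if (root.tryReadElement()) {
> >             allDone = false;
> >           }
> >         }
> >         // Trickle the elements down in topological order. The flat loop
> >         // replaces recursion, so stack depth stays constant no matter
> >         // how long the vertex chain is.
> >         boolean moved = true;
> >         while (moved) {
> >           moved = false;
> >           for (Handler nonRoot : nonRoots) {  // topologically sorted
> >             moved |= nonRoot.consumeParentOutputs();
> >           }
> >         }
> >       }
> >     }
> >   }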
> >
> >
> > Here, we can also wrap the root vertices using a handler with an
> > OutputCollector. This makes things consistent, and also sets things up
> > to handle the case of a streaming join where only one of the data
> > streams is active: we can simply choose not to add an element to the
> > OutputCollector for the inactive stream, without blocking on it. (Note
> > that we would also need to change the iterator logic a bit to make this
> > work.)
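> >
> > (As an illustration, a non-blocking root read could look roughly like
> > this; the OutputCollector.emit call and the non-blocking hasNext
> > semantics are my assumptions about the changed iterator logic:)
> >
> >   import java.util.Iterator;
> >
> >   final class RootHandler implements Handler {
> >     private final Iterator<Object> source;    // data from a source or parent stage
> >     private final OutputCollector collector;  // same abstraction as non-root vertices
> >
> >     RootHandler(Iterator<Object> source, OutputCollector collector) {
> >       this.source = source;
> >       this.collector = collector;
> >     }
> >
> >     @Override
> >     public boolean tryReadElement() {
> >       // If this stream has nothing ready (e.g. the inactive side of a
> >       // streaming join), skip it this round instead of blocking.
> >       if (source.hasNext()) {
> >         collector.emit(source.next());
> >         return true;
> >       }
> >       return false;
> >     }
> >
> >     @Override
> >     public boolean consumeParentOutputs() {
> >       return false;  // roots have no parents
> >     }
> >   }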
> >
> >
> > Do you think this will work? Looking forward to hearing your thoughts. :)
> >
> >
> >
> > Thanks,
> >
> > John
> >
> >
> >
>



-- 
Byung-Gon Chun
