[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

Sean Owen (JIRA) Mon, 26 Jan 2015 13:22:02 -0800

    [ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292431#comment-14292431 ]


Sean Owen commented on SPARK-2688:
----------------------------------

As [~irashid] says, #1 is just syntactic sugar on what you can already do in 
Spark. If so, I'm not clear how anything can need this functionality badly. 
Either it isn't really blocking anything, and we should confirm that, or we 
should discuss what is actually needed beyond #1.

What I think people want is a miniature "push-based evaluation" method inside 
of Spark's pull-based DAG evaluation: force evaluation of N children of 1 
parent at once. The outcome of a sidebar I had with Sandy on this was that it's 
probably a) fraught with gotchas, given the push-vs-pull mismatch, but not 
impossible, and b) would force the children to be persisted in the general 
case, with possible optimizations in other special cases.
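
To make the push-vs-pull distinction concrete, here is a toy model in plain 
Python. It is not Spark code: LazyNode, make_parent, pull_evaluate and 
push_evaluate are names invented for this sketch, and the parent node merely 
plays the role of rdd2. Under those assumptions, pulling each child separately 
re-evaluates the parent once per child, while a push-style pass evaluates the 
parent once and feeds all children together.

```python
# Toy model only: plain-Python stand-ins, not Spark's API.
from typing import Callable, Iterable, Iterator, List


class LazyNode:
    """A pull-based dataset: nothing runs until an action asks for rows."""

    def __init__(self, source: Callable[[], Iterable[int]]) -> None:
        self._source = source
        self.computations = 0  # number of times this node has been evaluated

    def rows(self) -> Iterator[int]:
        self.computations += 1
        yield from self._source()

    def map(self, fn: Callable[[int], int]) -> "LazyNode":
        # The child pulls from this node every time the child itself is evaluated.
        return LazyNode(lambda: (fn(x) for x in self.rows()))

    def collect(self) -> List[int]:
        return list(self.rows())


def make_parent() -> LazyNode:
    # Plays the role of rdd1 -> rdd2.
    return LazyNode(lambda: range(5)).map(lambda x: x * 10)


# Three child transformations, standing in for rdd3..rdd5.
CHILD_FNS: List[Callable[[int], int]] = [lambda x, k=k: x + k for k in range(3)]


def pull_evaluate(parent: LazyNode) -> List[List[int]]:
    """Pull-based: each child's action walks the lineage back through the parent."""
    return [parent.map(fn).collect() for fn in CHILD_FNS]


def push_evaluate(parent: LazyNode) -> List[List[int]]:
    """Push-based: one pass over the parent feeds every child at once."""
    outputs: List[List[int]] = [[] for _ in CHILD_FNS]
    for row in parent.rows():
        for out, fn in zip(outputs, CHILD_FNS):
            out.append(fn(row))  # each child's output is buffered as it is produced
    return outputs


pull_parent, push_parent = make_parent(), make_parent()
pulled = pull_evaluate(pull_parent)
pushed = push_evaluate(push_parent)
print(pull_parent.computations, push_parent.computations)  # prints "3 1"
```

Both strategies produce the same child outputs; only the number of parent 
evaluations differs. Note that the push version has to hold every child's 
output until the pass finishes, which is loosely the toy analogue of point b) 
above.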

Is that the kind of thing Hive on Spark needs? If so, can we see a concrete 
example, so we can compare it with what's possible now? I still sense a 
mismatch between the perception and the reality of what the current API 
allows. Hence, there may be some really good news here.
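
As a point of comparison, one pattern that may already cover part of the 
request is to keep the shared parent around after its first evaluation and to 
trigger each child from its own thread. Spark RDDs can be persisted, and jobs 
can be submitted from several threads of one application, so the rough 
analogue would be persisting rdd2 and launching the child actions 
concurrently. The sketch below is a plain-Python toy model of that idea, not 
Spark code; PersistedNode and run_children are hypothetical names, and whether 
this matches what Hive on Spark needs is exactly the open question.

```python
# Toy model only: plain-Python stand-ins, not Spark's API.
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


class PersistedNode:
    """A shared parent whose rows are kept after the first evaluation."""

    def __init__(self, compute: Callable[[], Iterable[int]]) -> None:
        self._compute = compute
        self._lock = threading.Lock()
        self._cached: Optional[List[int]] = None
        self.computations = 0  # number of times the parent was really computed

    def rows(self) -> List[int]:
        with self._lock:  # only the first caller computes; later callers reuse
            if self._cached is None:
                self.computations += 1
                self._cached = list(self._compute())
            return self._cached


def run_children(
    parent: PersistedNode, child_fns: List[Callable[[int], int]]
) -> List[List[int]]:
    """Trigger every child concurrently, one thread per child."""

    def action(fn: Callable[[int], int]) -> List[int]:
        return [fn(x) for x in parent.rows()]

    with ThreadPoolExecutor(max_workers=len(child_fns)) as pool:
        # pool.map returns results in the same order as child_fns.
        return list(pool.map(action, child_fns))


shared = PersistedNode(lambda: (x * 10 for x in range(5)))  # stands in for rdd2
results = run_children(shared, [lambda x, k=k: x + k for k in range(4)])
print(shared.computations, results)
```

In this toy, the four children run concurrently and the parent is computed 
only once. The lock is a simplification that guarantees a single computation 
here; it says nothing about how a real cache behaves when several jobs hit an 
unmaterialized parent at the same time.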

> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 
> -> rdd3. To make things worse, when we call rdd4.foreach(fn), rdd2 will be 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be 
> a way to do so. Tez already recognized the importance of this (TEZ-391), so I 
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
