[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289798#comment-14289798 ]

Sean Owen commented on SPARK-2688:
----------------------------------

[~airhorns] Persisting does not have to mean hitting disk; you can persist in 
memory. And if you do persist to disk, where is the avoidable slowdown? Either 
way, you compute the partitions and write them to disk once. That's what Spark 
does.

Yes, Spark does not persist by default, but it can, and this is an example of 
exactly why you would persist. I don't understand the discussion about push 
vs. pull, but it does indeed sound like a completely different architecture, 
and therefore pretty infeasible.

The rest of your discussion seems like a case for "MultipleOutputs in Spark", 
covered separately in https://issues.apache.org/jira/browse/SPARK-3622

I am not a decider, but the use case requested by this JIRA, as opposed to 
SPARK-3622, seems clearly possible in Spark right now, with no change, using 
persist().
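
For concreteness, here is a minimal sketch of what I mean, shaped like the DAG 
quoted below. The transformation bodies, the data, and the master setting are 
placeholder assumptions, not from the JIRA:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("multi-pipeline").setMaster("local[4]"))

// Hypothetical stand-ins for the shared rdd1 -> rdd2 stage.
val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = rdd1.map(_ * 2).persist(StorageLevel.MEMORY_ONLY) // or MEMORY_AND_DISK

// Four pipelines derived from the shared rdd2.
val rdd3 = rdd2.filter(_ % 3 == 0)
val rdd4 = rdd2.filter(_ % 5 == 0)
val rdd5 = rdd2.map(_ + 1)
val rdd6 = rdd2.map(_ - 1)

// The first action computes rdd1 -> rdd2 once and caches rdd2;
// the other three actions read the cached partitions instead of recomputing.
Seq(rdd3, rdd4, rdd5, rdd6).foreach(r => r.foreach(_ => ()))

rdd2.unpersist()
sc.stop()
{code}

If the four downstream jobs should also run concurrently rather than back to 
back, they can be submitted from separate threads (for example, Scala Futures), 
since Spark's scheduler accepts jobs from multiple threads within one 
application.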

> Need a way to run multiple data pipelines concurrently
> ------------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> 
> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will 
> be recomputed. This is very inefficient. Ideally, we should be able to 
> trigger the execution of the whole graph and reuse rdd2, but there doesn't 
> seem to be a way of doing so. Tez already realized the importance of this 
> (TEZ-391), so I think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).


