[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865 ]
Patrick Wendell edited comment on SPARK-3622 at 9/22/14 3:24 AM:
-----------------------------------------------------------------

Do you mind clarifying a little how Hive would use this (maybe with a code example)? Say you had a transformation that went from a single RDD A to two RDDs, B and C. The normal way to do this, if you want to avoid recomputing A, is to persist it and then use it to derive both B and C (this will make multiple passes over A, but it won't fully recompute A twice). I think doing this in the general case is not possible by definition: the user might use B and C at different times, so there is no way to guarantee that A is computed only once unless you persist A.

> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>
>                 Key: SPARK-3622
>                 URL: https://issues.apache.org/jira/browse/SPARK-3622
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those that take
> user-supplied functions, such as mapPartitions(). However, sometimes a
> user-provided function may need to output multiple RDDs; for instance, a
> filter function that divides the input RDD into several RDDs.
> While it's possible to get multiple RDDs by transforming the same RDD
> multiple times, it may be more efficient to do this concurrently in one
> shot, especially when the user's existing function is already generating
> different data sets. This is the case in Hive on Spark, where Hive's map
> function and reduce function can output different data sets to be consumed
> by subsequent stages.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
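[Editor's note] The persist-then-derive pattern discussed in the comment (compute A once, then derive B and C from the persisted result rather than recomputing A per output) can be sketched in plain Python. This is a toy illustration, not Spark API code; `expensive_source` and the even/odd split are made-up stand-ins for RDD A and the two derived RDDs:

```python
# Stand-in for computing RDD A; the counter tracks how many times the
# expensive source computation actually runs.
compute_count = 0

def expensive_source():
    global compute_count
    compute_count += 1
    return list(range(10))

# Naive approach: derive B and C independently, so the source is
# computed once per derived output (twice in total).
b = [x for x in expensive_source() if x % 2 == 0]
c = [x for x in expensive_source() if x % 2 == 1]
assert compute_count == 2

# "Persisted" approach: compute A once, keep the result, and make two
# separate passes over the cached data to derive B and C.
compute_count = 0
a = expensive_source()               # analogous to A.persist()
b = [x for x in a if x % 2 == 0]     # first pass over cached A
c = [x for x in a if x % 2 == 1]     # second pass over cached A
assert compute_count == 1
```

As the comment notes, the second version still makes two passes over A's data, but A itself is only computed once; without persistence there is no general way to guarantee that, since B and C may be consumed at different times.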