[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865 ]
Patrick Wendell edited comment on SPARK-3622 at 9/22/14 3:24 AM:
-----------------------------------------------------------------

Do you mind clarifying a little how Hive would use this (maybe with a code example)? Say you had a transformation that went from a single RDD A to two RDDs, B and C. The normal way to do this, if you want to avoid recomputing A, is to persist it and then use it to derive both B and C (this will make multiple passes over A, but it won't fully recompute A twice). I think doing this in the general case is not possible by definition: the user might use B and C at different times, so there is no way to guarantee that A is computed only once unless you persist A.

> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>
>                 Key: SPARK-3622
>                 URL: https://issues.apache.org/jira/browse/SPARK-3622
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those that take
> user-supplied functions, such as mapPartitions(). However, sometimes a
> user-provided function may need to output multiple RDDs; for instance, a
> filter function that divides the input RDD into several RDDs.
> While it's possible to get multiple RDDs by transforming the same RDD
> multiple times, it may be more efficient to do this concurrently in one
> shot, especially when the user's existing function is already generating
> different data sets. This is the case in Hive on Spark, where Hive's map
> function and reduce function can output different data sets to be consumed
> by subsequent stages.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
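[Editor's note] The persist-then-derive pattern discussed in the comment (compute A once, then derive B and C from the persisted result rather than recomputing A per output) can be sketched in plain Python. This is a toy illustration, not Spark API code; `expensive_source` and the even/odd split are made-up stand-ins for RDD A and the two derived RDDs:

```python
# Stand-in for computing RDD A; the counter tracks how many times the
# expensive source computation actually runs.
compute_count = 0

def expensive_source():
    global compute_count
    compute_count += 1
    return list(range(10))

# Naive approach: derive B and C independently, so the source is
# computed once per derived output (twice in total).
b = [x for x in expensive_source() if x % 2 == 0]
c = [x for x in expensive_source() if x % 2 == 1]
assert compute_count == 2

# "Persisted" approach: compute A once, keep the result, and make two
# separate passes over the cached data to derive B and C.
compute_count = 0
a = expensive_source()               # analogous to A.persist()
b = [x for x in a if x % 2 == 0]     # first pass over cached A
c = [x for x in a if x % 2 == 1]     # second pass over cached A
assert compute_count == 1
```

As the comment notes, the second version still makes two passes over A's data, but A itself is only computed once; without persistence there is no general way to guarantee that, since B and C may be consumed at different times.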