[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sandy Ryza updated SPARK-2688:
------------------------------
    Issue Type: New Feature  (was: Improvement)

> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing:
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the
> execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 ->
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be
> recomputed. This is very inefficient. Ideally, we should be able to trigger
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be a
> way to do so. Tez already realized the importance of this (TEZ-391), so I
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).
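
The recomputation described in the report can be illustrated with a small toy model. The sketch below is plain Python, not Spark: `LazyDataset`, `build_graph`, and `run_all` are hypothetical names invented for this illustration, and the class only mimics the two behaviours the report relies on (transformations are lazy, and an action on one branch re-runs the shared parent unless the parent is cached).

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, compute: Callable[[], List[int]], name: str, log: List[str]):
        self._compute = compute
        self.name = name
        self._log = log                      # shared record of every computation
        self._cache_enabled = False
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int], name: str) -> "LazyDataset":
        # Lazy transformation: only records how to derive the child from self.
        return LazyDataset(lambda: [fn(x) for x in self.collect()], name, self._log)

    def cache(self) -> "LazyDataset":
        # Keep the result after the first computation.
        self._cache_enabled = True
        return self

    def collect(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self._log.append(self.name)          # note that this dataset was computed
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    def foreach(self, fn: Callable[[int], None]) -> None:
        # Action: forces computation of this dataset and its lineage only.
        for x in self.collect():
            fn(x)


def build_graph(cache_rdd2: bool):
    """Build rdd1 -> rdd2 -> {rdd3, rdd4, rdd5, rdd6} as in the report."""
    log: List[str] = []
    rdd1 = LazyDataset(lambda: [1, 2, 3], "rdd1", log)
    rdd2 = rdd1.map(lambda x: x * 10, "rdd2")
    if cache_rdd2:
        rdd2.cache()
    branches = [rdd2.map(lambda x, k=k: x + k, f"rdd{k}") for k in (3, 4, 5, 6)]
    return log, branches


def run_all(cache_rdd2: bool) -> int:
    """Trigger every branch with a dummy function; return how often rdd2 ran."""
    log, branches = build_graph(cache_rdd2)
    for branch in branches:
        branch.foreach(lambda _: None)
    return log.count("rdd2")


if __name__ == "__main__":
    print("rdd2 computations without cache:", run_all(False))
    print("rdd2 computations with cache:", run_all(True))
```

In this model, caching the shared parent removes the repeated work, but each branch still needs its own action, and the actions run one after another. In Spark, the closest existing counterpart is marking rdd2 with `cache()` or `persist()` before running the per-branch actions; the request in this issue goes further and asks for a way to trigger the whole graph at once.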