[ https://issues.apache.org/jira/browse/SPARK-13378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-13378.
-------------------------------
    Resolution: Not A Problem

No, we don't want to do this. You'd end up making hundreds of methods to "do X, 
and then return an RDD". This is entirely unnecessary in Spark. Just invoke an 
operation on an RDD, and then other operations. There's no such thing as "tee" 
being required in a DAG.
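
A minimal sketch of what that could look like for the pipeline in the report
quoted below. It is an illustration, not code from the ticket: the object name
SideOutputSketch, the run helper, the local master, and all paths are
hypothetical. The intermediate RDD is cached, saved with ordinary actions, and
then reused for the rest of the pipeline.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SideOutputSketch {
  // Hypothetical helper: `input` and `outDir` are illustrative paths.
  def run(sc: SparkContext, input: String, outDir: String): Unit = {
    val parsed = sc.textFile(input)
      .map(_.split("\t"))
      .map(f => (f(0), f(1).toInt, f(2)))
      .cache() // ask Spark to keep the intermediate RDD for reuse by later jobs

    // Each save is an action and runs as its own job.
    parsed.saveAsTextFile(s"$outDir/tee-data-1")
    parsed.filter(_._2 >= 10).saveAsTextFile(s"$outDir/tee-data-2")

    // Continue the pipeline from the same RDD.
    parsed.groupBy(_._1).saveAsTextFile(s"$outDir/out-data")
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("side-output-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try run(sc, "data/input.tsv", "output")
    finally sc.stop()
  }
}
```

This still runs three jobs, one per save, but with the RDD cached the input
should only need to be read and parsed once, provided the cached partitions
are not evicted.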

> Add tee method to RDD
> ---------------------
>
>                 Key: SPARK-13378
>                 URL: https://issues.apache.org/jira/browse/SPARK-13378
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Richard Ding
>
> In our application, we sometimes need to save partial/intermediate 
> results to side files in the middle of a data pipeline/DAG. The only way to 
> do this today is the saveAsTextFile method, which only runs at the end of a 
> pipeline; otherwise multiple jobs are needed. We've implemented a 'tee' 
> method on RDD that is similar to the Unix tee utility. The proposed methods are:
> {code}
> def tee(path: String) : RDD[T]
> {code}
> Returns a new RDD identical to this RDD, and also saves a copy of this 
> RDD to a text file, using the string representation of each element.
> {code}
> def tee(path: String, f: (T) => Boolean): RDD[T]
> {code}
> Returns a new RDD identical to this RDD, and also saves to a text file a 
> copy of the elements of this RDD that satisfy the predicate f, using the 
> string representation of each element.
> These methods can be used in RDD pipelines much like the tee utility 
> in a Unix command pipeline. For example:
> {code}
> sc.textFile(dataFile).map(x => x.split("\t"))
>     .map(x => (x(0), x(1).toInt, x(2)))
>     .tee("output/tee-data-1")
>     .tee("output/tee-data-2", x => x._2 >= 10)
>     .groupBy(x => x._1)
>     .saveAsTextFile("output/out-data")
> {code}



