Richard Ding created SPARK-13378:
------------------------------------

             Summary: Add tee method to RDD
                 Key: SPARK-13378
                 URL: https://issues.apache.org/jira/browse/SPARK-13378
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.6.0
            Reporter: Richard Ding


In our application, we sometimes need to save the partial/intermediate results 
to side files in the middle of a data pipeline/DAG. The only way now to do this 
is to use saveAsTextFile method which only runs at the end of a pipeline. 
Otherwise multiple jobs are needed. We’ve implemented ‘tee’ method on RDD that 
is similar to Unix tee utility. Below are the proposed methods:

{code}
def tee(path: String) : RDD[T]
{code}
Return a new RDD that is the same as this RDD but also save a copy of this RDD 
to a text file, using string representation of elements.

{code}
def tee(path: String, f: (T) => Boolean): RDD[T]
{code}
Return a new RDD that is the same as this RDD but also save to a text file a 
copy of the elements in this RDD that satisfy a predicate , using string 
representation of elements.

These methods can be used in RDD pipelines in ways similar to the tee utility 
in Unix command pipeline, for example, 

{code}
sc.textFile(dataFile).map(x => x.split(“\t”)
    .map(x => (x(0), x(1).toInt, x(2))
    .tee(“output/tee-data-1”)
    .tee(“output/tee-data-2”, x=> x._2 >= 10)
    .groupBy(x => x._1)
    .saveAsTextFile(“output/out-data”)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to