Richard Ding created SPARK-13378:
------------------------------------

             Summary: Add tee method to RDD
                 Key: SPARK-13378
                 URL: https://issues.apache.org/jira/browse/SPARK-13378
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.6.0
            Reporter: Richard Ding

In our application, we sometimes need to save partial or intermediate results to side files in the middle of a data pipeline (DAG). Currently the only way to do this is the saveAsTextFile method, which runs only at the end of a pipeline; saving intermediate results otherwise requires multiple jobs.

We've implemented a 'tee' method on RDD that is similar to the Unix tee utility. Below are the proposed methods:

{code}
def tee(path: String): RDD[T]
{code}

Return a new RDD that is the same as this RDD, but also save a copy of this RDD to a text file, using the string representation of the elements.

{code}
def tee(path: String, f: (T) => Boolean): RDD[T]
{code}

Return a new RDD that is the same as this RDD, but also save to a text file a copy of the elements in this RDD that satisfy the predicate f, using the string representation of the elements.

These methods can be used in RDD pipelines in much the same way as the tee utility in a Unix command pipeline. For example:

{code}
sc.textFile(dataFile)
  .map(x => x.split("\t"))
  .map(x => (x(0), x(1).toInt, x(2)))
  .tee("output/tee-data-1")
  .tee("output/tee-data-2", x => x._2 >= 10)
  .groupBy(x => x._1)
  .saveAsTextFile("output/out-data")
{code}
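
To illustrate the intended pass-through behavior, here is a minimal sketch in plain Scala, without Spark. It is not the implementation proposed in this issue. The names `TeeSketch` and `teeIterator` are hypothetical, and the sketch assumes a local file whose parent directory already exists. It wraps an iterator so that each element (optionally filtered by a predicate) is written to a side file as the iterator is consumed, while every element is still passed downstream unchanged.

```scala
import java.io.PrintWriter
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark: a lazy "tee" over a plain iterator.
object TeeSketch {
  // Returns an iterator that yields the same elements as `iter`, and writes
  // the string form of each element satisfying `keep` to `file`, one per line.
  // The file is opened when the wrapper is created and closed once the
  // underlying iterator reports that it is exhausted.
  def teeIterator[T](iter: Iterator[T],
                     file: Path,
                     keep: T => Boolean = (_: T) => true): Iterator[T] = {
    val out = new PrintWriter(Files.newBufferedWriter(file))
    new Iterator[T] {
      private var closed = false

      def hasNext: Boolean = {
        val more = iter.hasNext
        if (!more && !closed) { out.close(); closed = true }
        more
      }

      def next(): T = {
        val x = iter.next()
        if (keep(x)) out.println(x.toString)
        x // every element is passed through, whether or not it was written
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("tee-sketch")
    val rows = Iterator(("a", 5, "x"), ("b", 12, "y"), ("c", 30, "z"))

    // Mirrors the two tee calls in the example pipeline above.
    val all = teeIterator(rows, dir.resolve("tee-data-1"))
    val big = teeIterator(all, dir.resolve("tee-data-2"),
                          (r: (String, Int, String)) => r._2 >= 10)

    // Consuming the outer iterator drives both side files.
    val grouped = big.toList.groupBy(_._1)
    println(grouped.keys.toList.sorted)
  }
}
```

Inside Spark, a function of this shape could presumably be applied per partition (for example through `mapPartitionsWithIndex`, writing one file per partition index), but that wiring, the choice of file system, and the handling of task retries are left open here.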