Re: Filter operation to return two RDDs at once.
As far as I know, Spark doesn't support multiple outputs.

On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote:

> Why do you need to do that if the filter and the content of the resulting
> RDDs are exactly the same? You may as well declare them as one RDD.

--
Best Regards

Jeff Zhang
Re: Filter operation to return two RDDs at once.
I checked RDD#randomSplit; it is much more like multiple one-to-one transformations than a one-to-multiple transformation. I wrote one sample as follows, and it generates 3 stages. Although we can use cache here to make it better, if Spark supported multiple outputs, only 2 stages would be needed. (This would be useful for Pig's multi-query execution and Hive's self joins.)

    val data = sc.textFile("/Users/jzhang/a.log")
      .flatMap(line => line.split("\\s"))
      .map(w => (w, 1))
    val parts = data.randomSplit(Array(0.2, 0.8))
    val joinResult = parts(0).join(parts(1))
    println(joinResult.toDebugString)

    (1) MapPartitionsRDD[8] at join at WordCount.scala:22 []
     |  MapPartitionsRDD[7] at join at WordCount.scala:22 []
     |  CoGroupedRDD[6] at join at WordCount.scala:22 []
     +-(1) PartitionwiseSampledRDD[4] at randomSplit at WordCount.scala:21 []
     |  |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
     |  |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
     |  |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
     |  |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
     +-(1) PartitionwiseSampledRDD[5] at randomSplit at WordCount.scala:21 []
        |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
        |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
        |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
        |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []

On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote:

> In the sense here, Spark actually does have operations that make multiple
> RDDs, like randomSplit. However there is no equivalent of the partition
> operation which gives the elements that matched and did not match at once.

--
Best Regards

Jeff Zhang
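The one-to-multiple behaviour discussed above can be sketched over plain Scala collections (not Spark; `splitOnce` is a hypothetical helper, not an RDD method): a single fold routes each element into one of two buckets, so the input is traversed only once.

```scala
// Hypothetical one-pass split: each element lands in either the
// "matched" or the "unmatched" bucket in a single traversal.
def splitOnce[A](xs: List[A])(p: A => Boolean): (List[A], List[A]) =
  xs.foldRight((List.empty[A], List.empty[A])) {
    case (x, (matched, unmatched)) =>
      if (p(x)) (x :: matched, unmatched) else (matched, x :: unmatched)
  }

val (evens, odds) = splitOnce(List(1, 2, 3, 4, 5))(_ % 2 == 0)
// evens == List(2, 4), odds == List(1, 3, 5)
```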
Re: Filter operation to return two RDDs at once.
In the sense here, Spark actually does have operations that make multiple RDDs, like randomSplit. However there is no equivalent of the partition operation, which gives the elements that matched and did not match at once.

On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote:

> As far as I know, Spark doesn't support multiple outputs.
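The partition operation Sean refers to exists on ordinary Scala collections, and it has exactly the single-pass shape the hypothetical magicFilter would need. A minimal illustration over plain collections (not RDDs; the sample data is made up):

```scala
// List.partition traverses the list once and returns both the
// elements that satisfy the predicate and those that do not.
val NULL_VALUE = ""
val rawQtSession = List("s1" -> "qt42", "s2" -> NULL_VALUE, "s3" -> "qt7")

val (qtSessionsWithQt, guidUidMapSessions) =
  rawQtSession.partition(_._2 != NULL_VALUE)
// qtSessionsWithQt   == List("s1" -> "qt42", "s3" -> "qt7")
// guidUidMapSessions == List("s2" -> NULL_VALUE)
```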
Re: Filter operation to return two RDDs at once.
Why do you need to do that if the filter and the content of the resulting RDDs are exactly the same? You may as well declare them as one RDD.

On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

> I want to do this:
>
>     val qtSessionsWithQt   = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
>     val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)
>
> This will run two different stages. Can this be done in one stage?
>
>     val (qtSessionsWithQt, guidUidMapSessions) =
>       rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)
>
> --
> Deepak
Filter operation to return two RDDs at once.
I want to do this:

    val qtSessionsWithQt   = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
    val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)

This will run two different stages. Can this be done in one stage?

    val (qtSessionsWithQt, guidUidMapSessions) =
      rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)

--
Deepak