Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
As far as I know, Spark doesn't support multiple outputs.

On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote:

 Why do you need to do that if the filter and the content of the resulting RDD
 are exactly the same? You may as well declare them as one RDD.
 On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I want to do this

 val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)

 val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)

 This will run two different stages; can this be done in one stage?

 val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)




 --
 Deepak




-- 
Best Regards

Jeff Zhang


Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
I checked RDD#randomSplit; it is more like multiple one-to-one
transformations than a single one-to-multiple transformation.

I wrote the sample code below; it generates 3 stages. Although we can use
cache here to make it better, if Spark supported multiple outputs only 2
stages would be needed. (This would be useful for Pig's multi-query
execution and Hive's self joins.)


val data = sc.textFile("/Users/jzhang/a.log")
  .flatMap(line => line.split("\\s"))
  .map(w => (w, 1))
val parts = data.randomSplit(Array(0.2, 0.8))
val joinResult = parts(0).join(parts(1))
println(joinResult.toDebugString)


(1) MapPartitionsRDD[8] at join at WordCount.scala:22 []
 |  MapPartitionsRDD[7] at join at WordCount.scala:22 []
 |  CoGroupedRDD[6] at join at WordCount.scala:22 []
 +-(1) PartitionwiseSampledRDD[4] at randomSplit at WordCount.scala:21 []
 |  |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
 |  |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
 |  |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
 |  |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
 +-(1) PartitionwiseSampledRDD[5] at randomSplit at WordCount.scala:21 []
    |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
    |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
    |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
    |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
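
(A minimal sketch of the cache variant mentioned above, assuming the same sc, input path, and split weights as in the example: caching the upstream RDD lets both random splits reuse the parsed data instead of re-reading the file.)

val data = sc.textFile("/Users/jzhang/a.log")
  .flatMap(line => line.split("\\s"))
  .map(w => (w, 1))
  .cache()  // materialize once; both random splits read the cached data

val parts = data.randomSplit(Array(0.2, 0.8))
val joinResult = parts(0).join(parts(1))
println(joinResult.toDebugString)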


On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote:

 In the sense here, Spark actually does have operations that make multiple
 RDDs, like randomSplit. However, there is not an equivalent of the partition
 operation, which gives the elements that matched and did not match at once.

 On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote:

 As far as I know, Spark doesn't support multiple outputs.

 On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote:

 Why do you need to do that if the filter and the content of the resulting RDD
 are exactly the same? You may as well declare them as one RDD.
 On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I want to do this

 val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)

 val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)

 This will run two different stages; can this be done in one stage?

 val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)




 --
 Deepak




 --
 Best Regards

 Jeff Zhang




-- 
Best Regards

Jeff Zhang


Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Sean Owen
In the sense here, Spark actually does have operations that make multiple
RDDs, like randomSplit. However, there is not an equivalent of the partition
operation, which gives the elements that matched and did not match at once.
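
(A minimal sketch of what such a partition-style helper could look like as a user-level workaround; splitByPredicate is a hypothetical name, not part of Spark's API. It evaluates the predicate once over a cached, tagged RDD and derives both sides from it, so the source is not recomputed, though the two results remain separate RDDs and still trigger separate jobs when acted on.)

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: tag each element with its predicate result once,
// cache the tagged RDD, then filter it for each side.
def splitByPredicate[T: ClassTag](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
  val tagged = rdd.map(x => (p(x), x)).cache()
  val matched = tagged.filter { case (hit, _) => hit }.map(_._2)
  val rest    = tagged.filter { case (hit, _) => !hit }.map(_._2)
  (matched, rest)
}

// Applied to the example from this thread:
// val (qtSessionsWithQt, guidUidMapSessions) =
//   splitByPredicate(rawQtSession)(_._2.qualifiedTreatmentId != NULL_VALUE)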

On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote:

 As far as I know, Spark doesn't support multiple outputs.

 On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote:

 Why do you need to do that if the filter and the content of the resulting RDD
 are exactly the same? You may as well declare them as one RDD.
 On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I want to do this

 val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)

 val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)

 This will run two different stages; can this be done in one stage?

 val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)




 --
 Deepak




 --
 Best Regards

 Jeff Zhang



Re: Filter operation to return two RDDs at once.

2015-06-03 Thread ayan guha
Why do you need to do that if the filter and the content of the resulting RDD
are exactly the same? You may as well declare them as one RDD.
On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I want to do this

 val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)

 val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)

 This will run two different stages; can this be done in one stage?

 val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)




 --
 Deepak




Filter operation to return two RDDs at once.

2015-06-02 Thread ๏̯͡๏
I want to do this

val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)

val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)

This will run two different stages; can this be done in one stage?

val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)
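
(A minimal sketch of a common workaround, assuming rawQtSession and NULL_VALUE as defined above: caching the source RDD lets the two filters avoid recomputing it, although they still run as separate jobs.)

rawQtSession.cache()  // evaluate the source once; both filters reuse the cached data

val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)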




-- 
Deepak