If you have duplicate values for a key, join creates all pairs. E.g., if you have 2 values for key X in RDD A and 2 values for key X in RDD B, then A.join(B) will have 4 records for key X.
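A minimal sketch of that pair semantics in plain Python (no Spark needed), assuming simple lists of (key, value) pairs; the hypothetical `left_outer_join` helper mirrors what `leftOuterJoin` does per key:

```python
from collections import defaultdict

def left_outer_join(left, right):
    """left, right: lists of (key, value) pairs, duplicates allowed.

    For each left record, emit one output record per matching right
    value, or a single (value, None) record if the key has no match —
    the same per-key cross product Spark's leftOuterJoin produces.
    """
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)

    out = []
    for k, v in left:
        matches = right_by_key.get(k)
        if matches:
            # every (left value, right value) combination for this key
            out.extend((k, (v, w)) for w in matches)
        else:
            out.append((k, (v, None)))
    return out

# 2 left values and 2 right values for key "X" -> 4 records for "X";
# key "Y" has no match, so it contributes one (3, None) record.
left = [("X", 1), ("X", 2), ("Y", 3)]
right = [("X", "a"), ("X", "b")]
result = left_outer_join(left, right)
print(len(result))  # 5 records total, even though the left side has 3
```

So a left outer join can return *more* records than its left input whenever the right side has duplicate keys, which is the effect described in the question below.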
On Thu, Feb 19, 2015 at 3:39 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:
> Consider the following left outer join:
>
> potentialDailyModificationsRDD =
>     reducedDailyPairRDD.leftOuterJoin(baselinePairRDD)
>         .partitionBy(new HashPartitioner(1024))
>         .persist(StorageLevel.MEMORY_AND_DISK_SER());
>
> Below are the record counts for the RDDs involved:
> Number of records for reducedDailyPairRDD: 2565206
> Number of records for baselinePairRDD: 56102812
> Number of records for potentialDailyModificationsRDD: 2570115
>
> Below are the partitioners for the RDDs involved:
> Partitioner for reducedDailyPairRDD: Some(org.apache.spark.HashPartitioner@400)
> Partitioner for baselinePairRDD: Some(org.apache.spark.HashPartitioner@400)
> Partitioner for potentialDailyModificationsRDD: Some(org.apache.spark.HashPartitioner@400)
>
> I realize that the .partitionBy in the statement above is probably not needed, as the underlying RDDs used in the left outer join are already hash partitioned.
>
> My question is how the resulting RDD (potentialDailyModificationsRDD) can end up with more records than reducedDailyPairRDD. I would think the number of records in potentialDailyModificationsRDD should be 2565206 instead of 2570115. Am I missing something, or is this possibly a bug?
>
> I'm using Apache Spark 1.2 on a stand-alone cluster on EC2. To get the counts for the records, I'm using .count() on the RDD.
>
> Thanks.
>
> Darin.