If you have duplicate values for a key, join creates all pairs (the same
holds for leftOuterJoin).  E.g., if you have 2 values for key X in RDD a
and 2 values for key X in RDD b, then a.join(b) will have 4 records for
key X.
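For example, here is a minimal sketch of that behavior using the Java API
(the RDD names and the JavaSparkContext sc are illustrative, not taken
from your job):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Assumes an existing JavaSparkContext named sc.
// Two values for key "X" on each side.
JavaPairRDD<String, Integer> a = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("X", 1), new Tuple2<>("X", 2)));
JavaPairRDD<String, Integer> b = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("X", 10), new Tuple2<>("X", 20)));

// Each of the 2 left values pairs with each of the 2 right values,
// so the join emits 2 * 2 = 4 records for key "X".
System.out.println(a.leftOuterJoin(b).count()); // prints 4

In your case the extra 2570115 - 2565206 = 4909 records most likely mean
that some keys in reducedDailyPairRDD match more than one record in
baselinePairRDD.  You can check for duplicate keys by comparing
baselinePairRDD.count() with baselinePairRDD.keys().distinct().count().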

On Thu, Feb 19, 2015 at 3:39 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid>
wrote:

> Consider the following left outer join
>
> potentialDailyModificationsRDD = reducedDailyPairRDD
>     .leftOuterJoin(baselinePairRDD)
>     .partitionBy(new HashPartitioner(1024))
>     .persist(StorageLevel.MEMORY_AND_DISK_SER());
>
>
> Below are the record counts for the RDDs involved
> Number of records for reducedDailyPairRDD: 2565206
> Number of records for baselinePairRDD: 56102812
> Number of records for potentialDailyModificationsRDD: 2570115
>
> Below are the partitioners for the RDDs involved.
> Partitioner for reducedDailyPairRDD: Some(org.apache.spark.HashPartitioner@400)
> Partitioner for baselinePairRDD: Some(org.apache.spark.HashPartitioner@400)
> Partitioner for potentialDailyModificationsRDD: Some(org.apache.spark.HashPartitioner@400)
>
>
> I realize in the above statement that the .partitionBy is probably not
> needed as the underlying RDDs used in the left outer join are already hash
> partitioned.
>
> My question is how the resulting RDD (potentialDailyModificationsRDD) can
> end up with more records than reducedDailyPairRDD.  I would think the
> number of records in potentialDailyModificationsRDD should be 2565206
> instead of 2570115.  Am I missing something, or is this possibly a bug?
>
> I'm using Apache Spark 1.2 on a standalone cluster on EC2.  To get the
> record counts, I'm calling .count() on each RDD.
>
> Thanks.
>
> Darin.
>
