Hi, I am trying to remove duplicates from a set of RDD tuples in an iterative algorithm. I have discovered that RDD.distinct can be replaced by RDD.mapPartitions: I first partition the RDD with a HashPartitioner, then deduplicate each partition locally inside a mapPartitions transformation. I expect this to be much faster in an iterative algorithm, and when I compared the two results they were equal. But I am not sure it always works correctly. My concern is that it may not remove all duplicates, because hash codes can sometimes collide, and if copies of the same element end up in different partitions, the code below will not work; all copies of a duplicate would have to land in the same partition. Any suggestion will be appreciated.
Code:

    import org.apache.spark.HashPartitioner

    val pairRDD = inputPairs.partitionBy(new HashPartitioner(slices))

    // Baseline: global deduplication, which shuffles the whole RDD.
    val distinctCount = pairRDD.distinct().count()

    // Local deduplication within each partition; preservesPartitioning = true
    // tells Spark the HashPartitioner is still valid, avoiding another shuffle.
    val mapPartitionCount = pairRDD.mapPartitions(iterator =>
      iterator.toList.distinct.toIterator, preservesPartitioning = true).count()

    println("distinct : " + distinctCount)
    println("mapPartitionCount : " + mapPartitionCount)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/substitute-mapPartitions-by-distinct-tp26876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
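To make the collision concern concrete, here is a small plain-Scala sketch (no Spark required) of the partition-assignment logic. The `nonNegativeMod` helper mirrors what Spark's HashPartitioner does internally (`key.hashCode` modulo the partition count, forced non-negative); the function names here are my own, not Spark API. The point it illustrates: equal tuples always produce equal hash codes, so every copy of a duplicate is routed to the same partition. A collision only means two *different* keys share a partition, which a per-partition distinct handles fine.

```scala
// Mirrors HashPartitioner's assignment: hashCode mod numPartitions,
// adjusted so the result is never negative (Java's % can return < 0).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val r = x % mod
  if (r < 0) r + mod else r
}

// Hypothetical helper: which partition a key would land in.
def partitionOf(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

val a = ("user", 42)
val b = ("user", 42) // a duplicate built independently of a

// Equal tuples have equal hashCodes, so both copies share a partition.
println(partitionOf(a, 8) == partitionOf(b, 8)) // prints true
```

So as long as the RDD is partitioned by the full tuple (as `partitionBy(new HashPartitioner(slices))` on a pair RDD keyed by the tuple would do), duplicates cannot be split across partitions.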