Also, you may want to use .lookup() instead of .filter(). From the API docs:

    def lookup(key: K): Seq[V]

    Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

You might want to partition your first batch of data with .partitionBy() using your CustomTuple hash implementation, persist it, and avoid running any operation on it that removes its partitioner object (for example, .map() drops the partitioner, while .mapValues() and .filter() keep it).
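
Here is a rough sketch of how that could look. The CustomTuple case class, the sample data, the partition count of 4, and the local[2] master are placeholders I made up for illustration, not anything from your code. I am also assuming the default HashPartitioner, which relies on the key's hashCode, so the key type needs consistent equals/hashCode (a case class gives you that for free).

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark releases before 1.3 you would also need:
// import org.apache.spark.SparkContext._

// Hypothetical stand-in for the CustomTuple key type from the thread.
final case class CustomTuple(a: Int, b: String)

object LookupSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone test run.
    val conf = new SparkConf().setAppName("lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs: RDD[(CustomTuple, String)] = sc.parallelize(Seq(
        CustomTuple(1, "x") -> "first",
        CustomTuple(2, "y") -> "second",
        CustomTuple(1, "x") -> "third"
      ))

      // Partition by key hash and persist, so later lookups reuse the layout.
      val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()

      // A partitioner is now known, so lookup only searches one partition.
      assert(partitioned.partitioner.isDefined)
      val hits: Seq[String] = partitioned.lookup(CustomTuple(1, "x"))
      assert(hits.sorted == Seq("first", "third"))

      // mapValues keeps the partitioner; map may change keys, so it drops it.
      assert(partitioned.mapValues(_.toUpperCase).partitioner.isDefined)
      assert(partitioned.map { case (k, v) => (k, v) }.partitioner.isEmpty)
    } finally {
      sc.stop()
    }
  }
}
```

Checking .partitioner on an RDD, as in the asserts above, is a cheap way to confirm that a given transformation has not thrown the partitioning away before you rely on .lookup() being fast.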