The original problem is in biology but the following captures the CS issues, Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks and a key that fits a lock will have a high score. There may be many keys fitting one lock and a key may fit no locks well. The object is to find the best fitting lock for each key.
Assume that the number of keys and locks is high enough that taking the cartesian product of the two is computationally impractical. Also assume that keys and locks have an attached location which is accurate within an error (say 1 Km). Only keys and locks within 1 Km need be compared. Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks> . I could divide the locations into 1 Km squared bins and look only within a few bins. Assume that it is practical to take a cartesian product for all elements in a bin but not to keep all elements in memory. I could map my RDDs into PairRDDs where the key is the bin assigned by location I know how to take the cartesian product of two JavaRDDs but not how to take a cartesian product of sets of elements sharing a common key (bin), Any suggestions. Assume that in the worst cases the number of elements in a bin are too large to keep in memory although if a bin were subdivided into, say 100 subbins elements would fit in memory. Any thoughts as to how to attack the problem