Hi Steve, Are you talking about sequence alignment ?
— FG On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis <lordjoe2...@gmail.com> wrote: > The original problem is in biology but the following captures the CS > issues, Assume I have a large number of locks and a large number of keys. > There is a scoring function between keys and locks and a key that fits a > lock will have a high score. There may be many keys fitting one lock and a > key may fit no locks well. The object is to find the best fitting lock for > each key. > Assume that the number of keys and locks is high enough that taking the > cartesian product of the two is computationally impractical. Also assume > that keys and locks have an attached location which is accurate within an > error (say 1 Km). Only keys and locks within 1 Km need be compared. > Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks> . I could > divide the locations into 1 Km squared bins and look only within a few > bins. Assume that it is practical to take a cartesian product for all > elements in a bin but not to keep all elements in memory. I could map my > RDDs into PairRDDs where the key is the bin assigned by location > I know how to take the cartesian product of two JavaRDDs but not how to > take a cartesian product of sets of elements sharing a common key (bin), > Any suggestions. Assume that in the worst cases the number of elements in a > bin are too large to keep in memory although if a bin were subdivided into, > say 100 subbins elements would fit in memory. > Any thoughts as to how to attack the problem