Re: A Spark Design Problem

francois . garillot Fri, 31 Oct 2014 10:52:37 -0700

Hi Steve,

Are you talking about sequence alignment ?




—
FG

On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis <lordjoe2...@gmail.com>
wrote:

>  The original problem is in biology but the following captures the CS
> issues, Assume I  have a large number of locks and a large number of keys.
> There is a scoring function between keys and locks and a key that  fits a
> lock will have a high score. There may be many keys fitting one lock and a
> key may fit no locks well. The object is to find the best fitting lock for
> each key.
> Assume that the number of keys and locks is high enough that taking the
> cartesian product of the two is computationally impractical. Also assume
> that keys and locks have an attached location which is accurate within an
> error (say 1 Km). Only keys and locks within 1 Km need be compared.
> Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks> . I could
> divide the locations into 1 Km squared bins and look only within a few
> bins. Assume that it is practical to take a cartesian product for all
> elements in a bin but not to keep all elements in memory. I could map my
> RDDs into PairRDDs where the key is the bin assigned by location
> I know how to take the cartesian product of two JavaRDDs but not how to
> take a cartesian product of sets of elements sharing a common key (bin),
> Any suggestions. Assume that in the worst cases the number of elements in a
> bin are too large to keep in memory although if a bin were subdivided into,
> say 100 subbins elements would fit in memory.
> Any thoughts as to how to attack the problem

Re: A Spark Design Problem

Reply via email to