Hi,

if I understood correctly, the "key" in this case would be a fuzzy/probabilistic key. I'm not sure this can be computed with either the sort-based or the hash-based joining/grouping strategies of Flink. Maybe we can find something if you elaborate.
Cheers,
Aljoscha

On Wed, 25 May 2016 at 10:22 Simone Robutti <simone.robu...@radicalbit.io> wrote:

> Hello,
>
> I'm implementing MinHash for recommendation on Flink. I'm almost done, but
> I need an efficient way to merge sets of similar keys together (and later
> join these sets of keys with more data).
>
> The actual data structure is of the form DataSet[(Int,Set[Int])], where the
> left element of the tuple is an ID for the right element, which is a set of
> keys. I want to merge these sets together only if they share at least one
> element.
>
> I'm fairly sure I have studied an efficient solution to this problem in
> a local environment, but I don't really know how to handle it in a
> distributed environment. Any suggestions?
>
> Thanks,
>
> Simone
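
For reference, the merge described in the quoted message (combine sets whenever they share at least one element) can be read as a connected-components problem: each key is a node, and keys that appear in the same set are linked. The following is only a minimal local sketch in plain Python using union-find, not Flink code. It assumes the merge is meant to be transitive (if A overlaps B and B overlaps C, all three are merged), which the original message does not state explicitly. The function name `merge_overlapping` and the sample data are hypothetical.

```python
from collections import defaultdict


def merge_overlapping(sets_by_id):
    """Merge sets that share at least one element (assumed transitive).

    sets_by_id: dict mapping an ID to a set of integer keys, mirroring
    the (Int, Set[Int]) tuples from the question. Hypothetical helper.
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for keys in sets_by_id.values():
        keys = list(keys)
        for k in keys:
            parent.setdefault(k, k)
        # Link every key of a set to the set's first key.
        for k in keys[1:]:
            union(keys[0], k)

    # Collect keys by their root to obtain the merged sets.
    groups = defaultdict(set)
    for k in parent:
        groups[find(k)].add(k)
    return list(groups.values())


example = {1: {10, 11}, 2: {11, 12}, 3: {20}, 4: {20, 21}, 5: {30}}
merged = merge_overlapping(example)
print(sorted(sorted(g) for g in merged))
# prints [[10, 11, 12], [20, 21], [30]]
```

Union-find relies on shared mutable state, so it does not carry over directly to a distributed DataSet. There the same connected-components view would more likely be expressed as an iterative label-propagation job, which is a suggestion here rather than something the thread confirms.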