I don't see that sorting the data helps. The answer has to be all the associations. In this case the answer has to be:

a , b   --> it was an error in the question, sorry
b , d
c , d
x , y
y , y
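To make the "all the associations" goal concrete, here is a minimal plain-Python sketch (not Spark code; the function name `propagate` is mine) of the iterative idea: each round, every node's successor is replaced by its successor's successor, repeated until nothing changes. In Spark, each round would correspond roughly to a self-join on the pair RDD.

```python
def propagate(pairs):
    """Resolve chains like a->b->c->d into a->d, b->d, c->d.

    Each pass shortcuts every node's successor by one hop
    (pointer doubling); repeat until a fixed point is reached.
    Self-loops such as (y, y) are already stable.
    """
    nxt = dict(pairs)
    changed = True
    while changed:
        changed = False
        for node, succ in list(nxt.items()):
            hop = nxt.get(succ)            # the successor's own successor, if any
            if hop is not None and hop != succ:
                nxt[node] = hop            # shortcut one level of the chain
                changed = True
    return sorted(nxt.items())

print(propagate([("a", "b"), ("x", "y"), ("b", "c"), ("y", "y"), ("c", "d")]))
# [('a', 'd'), ('b', 'd'), ('c', 'd'), ('x', 'y'), ('y', 'y')]
```

Because the number of passes grows only logarithmically with the chain length, the same scheme stays cheap even for long chains.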
I feel like all the data which is associated should be in the same executor. In this case, if I sort the inputs:

a , b
x , y
b , c
y , y
c , d

becomes:

a , b
b , c
c , d
x , y
y , y

Now "a,b" and "b,c" end up in one partition, for example, "c,d" and "x,y" in another one, and so on. I could get the relation between "a,b,c", but not between "d" and "a,b,c". Am I wrong? I hope to be wrong! It seems that it could be done with GraphX, but as you said, it seems like a bit of overhead.

2016-02-25 5:43 GMT+01:00 James Barney <jamesbarne...@gmail.com>:

> Guillermo,
> I think you're after an associative algorithm where A is ultimately
> associated with D, correct? Jakob would be correct if that is a typo; a sort
> would be all that is necessary in that case.
>
> I believe you're looking for something else though, if I understand
> correctly.
>
> This seems like a similar algorithm to PageRank, no?
> https://github.com/amplab/graphx/blob/master/python/examples/pagerank.py
> Except return the "neighbor" itself, not necessarily the rank of the
> page.
>
> If you wanted to, you could use Scala and GraphX for this problem. It might
> be a bit of overhead though: construct a node for each member of each tuple
> with an edge between them, then traverse the graph for all sets of nodes
> that are connected. That result set would quickly explode in size, but you
> could restrict results to a minimum of N connections. I'm not super familiar
> with GraphX myself, however. My intuition is saying 'graph problem' though.
>
> Thoughts?
>
>
> On Wed, Feb 24, 2016 at 6:43 PM, Jakob Odersky <ja...@odersky.com> wrote:
>
>> Hi Guillermo,
>> assuming that the first "a,b" is a typo and you actually meant "a,d",
>> this is a sorting problem.
>>
>> You could easily model your data as an RDD of tuples (or as a
>> dataframe/set) and use the sortBy (or orderBy for dataframes/sets)
>> methods.
>>
>> best,
>> --Jakob
>>
>> On Wed, Feb 24, 2016 at 2:26 PM, Guillermo Ortiz <konstt2...@gmail.com>
>> wrote:
>> > I want to do some algorithm in Spark.
>> > I know how to do it on a single
>> > machine where all the data is together, but I don't know a good way
>> > to do it in Spark.
>> >
>> > If someone has an idea...
>> > I have some data like this:
>> > a , b
>> > x , y
>> > b , c
>> > y , y
>> > c , d
>> >
>> > I want something like:
>> > a , d
>> > b , d
>> > c , d
>> > x , y
>> > y , y
>> >
>> > I need to know that a->b->c->d, so a->d, b->d and c->d.
>> > I don't want the code, just an idea of how I could deal with it.
>> >
>> > Any idea?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>