Like Robin said, please explore Pregel. You could do it without Pregel, but it might be laborious. I have a simple outline below. You will need more iterations if the number of levels is higher.
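As a concrete illustration, one chain-growing iteration of the outline below can be simulated with plain Python collections. This is only a sketch of the idea, not Spark code; in Spark each step would be the corresponding RDD operation:

```python
# Sketch: simulate one iteration of the outline with plain Python
# collections (in Spark these would be RDD operations).
edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("b", "e"), ("e", "f"), ("f", "c")]
paths = [list(e) for e in edges]

# flatMapToPair + aggregateByKey: key each path by both endpoints
grouped = {}
for p in paths:
    for v in (p[0], p[-1]):
        grouped.setdefault(v, []).append(p)

# filter: drop keys with fewer than 2 values
grouped = {v: ps for v, ps in grouped.items() if len(ps) >= 2}

# flatMap: splice every path ending at v onto every path starting at v
longer = []
for v, ps in grouped.items():
    for a in (p for p in ps if p[-1] == v):
        for b in (p for p in ps if p[0] == v):
            longer.append(a + b[1:])

print(sorted("-".join(p) for p in longer))
# ['a-b-c', 'a-b-e', 'b-c-d', 'b-e-f', 'e-f-c', 'f-c-d']
```

Note that this naive splice also surfaces f-c-d in addition to the five three-vertex paths listed in the outline; you would repeat the same grouping step (keyed by shared sub-paths) until no paths can be extended.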
Edges: a-b, b-c, c-d, b-e, e-f, f-c

flatMapToPair:
  a -> (a-b)
  b -> (a-b)
  b -> (b-c)
  c -> (b-c)
  c -> (c-d)
  d -> (c-d)
  b -> (b-e)
  e -> (b-e)
  e -> (e-f)
  f -> (e-f)
  f -> (f-c)
  c -> (f-c)

aggregateByKey:
  a -> (a-b)
  b -> (a-b, b-c, b-e)
  c -> (b-c, c-d, f-c)
  d -> (c-d)
  e -> (b-e, e-f)
  f -> (e-f, f-c)

filter to remove keys with fewer than 2 values:
  b -> (a-b, b-c, b-e)
  c -> (b-c, c-d, f-c)
  e -> (b-e, e-f)
  f -> (e-f, f-c)

flatMap:
  a-b-c
  a-b-e
  b-c-d
  b-e-f
  e-f-c

flatMapToPair followed by aggregateByKey:
  (a-b) -> (a-b-c, a-b-e)
  (b-c) -> (a-b-c, b-c-d)
  (c-d) -> (b-c-d)
  (b-e) -> (b-e-f)
  (e-f) -> (b-e-f, e-f-c)
  (f-c) -> (e-f-c)

filter out keys with fewer than 2 values:
  (b-c) -> (a-b-c, b-c-d)
  (e-f) -> (b-e-f, e-f-c)

mapValues:
  a-b-c-d
  b-e-f-c

flatMap:
  a,d
  b,d
  c,d
  b,c
  e,c
  f,c

On Thu, Feb 25, 2016 at 6:19 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> I'm taking a look at Pregel. It seems it's a good way to do it. The only
> negative thing that I see is that it's not a really complex graph with a
> lot of edges between the vertices; they are more like a lot of isolated
> small graphs.
>
> 2016-02-25 12:32 GMT+01:00 Robin East <robin.e...@xense.co.uk>:
>
>> The structures you are describing look like edges of a graph, and you
>> want to follow the graph to a terminal vertex and then propagate that
>> value back up the path. On this assumption it would be simple to create
>> the structures as graphs in GraphX and use Pregel for the algorithm
>> implementation.
>>
>> -------------------------------------------------------------------------------
>> Robin East
>> Spark GraphX in Action, Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>> On 25 Feb 2016, at 10:52, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>
>> Oh, the letters were just an example, it could be:
>> a , t
>> b , o
>> t , k
>> k , c
>>
>> So..
>> a -> t -> k -> c, and the result is: a,c; t,c; k,c and b,o.
>> I don't know if you were thinking about sortBy because of the other
>> example, where the letters were consecutive.
>>
>> 2016-02-25 9:42 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:
>>
>>> I don't see that sorting the data helps.
>>> The answer has to be all the associations. In this case the answer
>>> has to be:
>>> a , b  --> it was an error in the question, sorry.
>>> b , d
>>> c , d
>>> x , y
>>> y , y
>>>
>>> I feel like all the data which is associated should be in the same
>>> executor. In this case, if I order the inputs:
>>> a , b
>>> x , y
>>> b , c
>>> y , y
>>> c , d
>>> --> to
>>> a , b
>>> b , c
>>> c , d
>>> x , y
>>> y , y
>>>
>>> Now, "a,b" and "b,c" go in one partition, for example, "c,d" and
>>> "x,y" in another one, and so on.
>>> I could get the relation between "a,b,c", but not between "d" and
>>> "a,b,c". Am I wrong? I hope to be wrong!
>>>
>>> It seems that it could be done with GraphX, but as you said, it
>>> seems a little bit of overhead.
>>>
>>> 2016-02-25 5:43 GMT+01:00 James Barney <jamesbarne...@gmail.com>:
>>>
>>>> Guillermo,
>>>> I think you're after an associative algorithm where A is ultimately
>>>> associated with D, correct? Jakob would be correct if that is a
>>>> typo: a sort would be all that is necessary in that case.
>>>>
>>>> I believe you're looking for something else though, if I understand
>>>> correctly.
>>>>
>>>> This seems like a similar algorithm to PageRank, no?
>>>> https://github.com/amplab/graphx/blob/master/python/examples/pagerank.py
>>>> Except return the "neighbor" itself, not necessarily the rank of
>>>> the page.
>>>>
>>>> If you wanted to, you could use Scala and GraphX for this problem.
>>>> It might be a bit of overhead though: construct a node for each
>>>> member of each tuple, with an edge between them. Then traverse the
>>>> graph for all sets of nodes that are connected.
>>>> That result set would quickly explode in size, but you could
>>>> restrict results to a minimum of N connections. I'm not super
>>>> familiar with GraphX myself, however. My intuition is saying
>>>> "graph problem" though.
>>>>
>>>> Thoughts?
>>>>
>>>> On Wed, Feb 24, 2016 at 6:43 PM, Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>>
>>>>> Hi Guillermo,
>>>>> assuming that the first "a,b" is a typo and you actually meant
>>>>> "a,d", this is a sorting problem.
>>>>>
>>>>> You could easily model your data as an RDD of tuples (or as a
>>>>> dataframe/dataset) and use the sortBy (or orderBy for
>>>>> dataframes/datasets) methods.
>>>>>
>>>>> best,
>>>>> --Jakob
>>>>>
>>>>> On Wed, Feb 24, 2016 at 2:26 PM, Guillermo Ortiz
>>>>> <konstt2...@gmail.com> wrote:
>>>>> > I want to implement an algorithm in Spark. I know how to do it
>>>>> > on a single machine where all the data is together, but I don't
>>>>> > know a good way to do it in Spark.
>>>>> >
>>>>> > If someone has an idea..
>>>>> > I have some data like this:
>>>>> > a , b
>>>>> > x , y
>>>>> > b , c
>>>>> > y , y
>>>>> > c , d
>>>>> >
>>>>> > I want something like:
>>>>> > a , d
>>>>> > b , d
>>>>> > c , d
>>>>> > x , y
>>>>> > y , y
>>>>> >
>>>>> > I need to know that a->b->c->d, so a->d, b->d and c->d.
>>>>> > I don't want the code, just an idea of how I could deal with it.
>>>>> >
>>>>> > Any idea?
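For the shape of data in the original question, where every vertex has at most one outgoing edge, the whole problem reduces to following each chain to its terminal vertex. A minimal single-machine sketch in plain Python (the single-successor assumption and the cycle guard are mine, not from the thread):

```python
# Sketch for the original example: each vertex has at most one
# successor, so chase each chain to its terminal vertex.
# A self-loop like y -> y terminates the chain at itself.
succ = {"a": "b", "b": "c", "c": "d", "x": "y", "y": "y"}

def terminal(v, succ):
    """Follow successor links until a vertex with no successor,
    or a self-loop, is reached."""
    seen = {v}
    while v in succ and succ[v] != v:
        v = succ[v]
        if v in seen:      # guard against longer cycles
            break
        seen.add(v)
    return v

result = {v: terminal(s, succ) for v, s in succ.items()}
print(result)
# {'a': 'd', 'b': 'd', 'c': 'd', 'x': 'y', 'y': 'y'}
```

On a cluster this per-vertex chase would be replaced by the iterative join/aggregate passes outlined at the top of this message, or by Pregel-style message passing in GraphX, where each vertex forwards the terminal value it has learned so far.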