Like Robin said, please explore Pregel. You could do it without Pregel, but it might be laborious. I have a simple outline below. You will need more iterations if the number of levels is higher.
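As a concrete illustration, one chain-growing iteration of the outline below can be simulated with plain Python collections. This is only a sketch of the idea, not Spark code; in Spark each step would be the corresponding RDD operation:

```python
# Sketch: simulate one iteration of the outline with plain Python
# collections (in Spark these would be RDD operations).
edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("b", "e"), ("e", "f"), ("f", "c")]
paths = [list(e) for e in edges]

# flatMapToPair + aggregateByKey: key each path by both endpoints
grouped = {}
for p in paths:
    for v in (p[0], p[-1]):
        grouped.setdefault(v, []).append(p)

# filter: drop keys with fewer than 2 values
grouped = {v: ps for v, ps in grouped.items() if len(ps) >= 2}

# flatMap: splice every path ending at v onto every path starting at v
longer = []
for v, ps in grouped.items():
    for a in (p for p in ps if p[-1] == v):
        for b in (p for p in ps if p[0] == v):
            longer.append(a + b[1:])

print(sorted("-".join(p) for p in longer))
# ['a-b-c', 'a-b-e', 'b-c-d', 'b-e-f', 'e-f-c', 'f-c-d']
```

Note that this naive splice also surfaces f-c-d in addition to the five three-vertex paths listed in the outline; you would repeat the same grouping step (keyed by shared sub-paths) until no paths can be extended.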
Edges: a-b, b-c, c-d, b-e, e-f, f-c

flatMapToPair:
  a -> (a-b)
  b -> (a-b)
  b -> (b-c)
  c -> (b-c)
  c -> (c-d)
  d -> (c-d)
  b -> (b-e)
  e -> (b-e)
  e -> (e-f)
  f -> (e-f)
  f -> (f-c)
  c -> (f-c)

aggregateByKey:
  a -> (a-b)
  b -> (a-b, b-c, b-e)
  c -> (b-c, c-d, f-c)
  d -> (c-d)
  e -> (b-e, e-f)
  f -> (e-f, f-c)

filter to remove keys with fewer than 2 values:
  b -> (a-b, b-c, b-e)
  c -> (b-c, c-d, f-c)
  e -> (b-e, e-f)
  f -> (e-f, f-c)

flatMap:
  a-b-c
  a-b-e
  b-c-d
  b-e-f
  e-f-c

flatMapToPair followed by aggregateByKey:
  (a-b) -> (a-b-c, a-b-e)
  (b-c) -> (a-b-c, b-c-d)
  (c-d) -> (b-c-d)
  (b-e) -> (b-e-f)
  (e-f) -> (b-e-f, e-f-c)
  (f-c) -> (e-f-c)

filter out keys with fewer than 2 values:
  (b-c) -> (a-b-c, b-c-d)
  (e-f) -> (b-e-f, e-f-c)

mapValues:
  a-b-c-d
  b-e-f-c

flatMap:
  a,d
  b,d
  c,d
  b,c
  e,c
  f,c

On Thu, Feb 25, 2016 at 6:19 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> I'm taking a look at Pregel. It seems it's a good way to do it. The only
> negative thing that I see is that it's not a really complex graph with a
> lot of edges between the vertices; they are more like a lot of isolated
> small graphs.
>
> 2016-02-25 12:32 GMT+01:00 Robin East <robin.e...@xense.co.uk>:
>
>> The structures you are describing look like edges of a graph, and you
>> want to follow the graph to a terminal vertex and then propagate that
>> value back up the path. On this assumption it would be simple to create
>> the structures as graphs in GraphX and use Pregel for the algorithm
>> implementation.
>>
>> -------------------------------------------------------------------------------
>> Robin East
>> Spark GraphX in Action, Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>> On 25 Feb 2016, at 10:52, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>
>> Oh, the letters were just an example, it could be:
>> a , t
>> b , o
>> t , k
>> k , c
>>
>> So..
>> a -> t -> k -> c, and the result is: a,c; t,c; k,c and b,o.
>> I don't know if you were thinking about sortBy because of the other
>> example, where the letters were consecutive.
>>
>> 2016-02-25 9:42 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:
>>
>>> I don't see that sorting the data helps.
>>> The answer has to be all the associations. In this case the answer
>>> has to be:
>>> a , b  --> it was an error in the question, sorry.
>>> b , d
>>> c , d
>>> x , y
>>> y , y
>>>
>>> I feel like all the data which is associated should be in the same
>>> executor. In this case, if I order the inputs:
>>> a , b
>>> x , y
>>> b , c
>>> y , y
>>> c , d
>>> --> to
>>> a , b
>>> b , c
>>> c , d
>>> x , y
>>> y , y
>>>
>>> Now, "a,b" and "b,c" go in one partition, for example, "c,d" and
>>> "x,y" in another one, and so on.
>>> I could get the relation between "a,b,c", but not between "d" and
>>> "a,b,c". Am I wrong? I hope to be wrong!
>>>
>>> It seems that it could be done with GraphX, but as you said, it
>>> seems a little bit of overhead.
>>>
>>> 2016-02-25 5:43 GMT+01:00 James Barney <jamesbarne...@gmail.com>:
>>>
>>>> Guillermo,
>>>> I think you're after an associative algorithm where A is ultimately
>>>> associated with D, correct? Jakob would be correct if that is a
>>>> typo: a sort would be all that is necessary in that case.
>>>>
>>>> I believe you're looking for something else though, if I understand
>>>> correctly.
>>>>
>>>> This seems like a similar algorithm to PageRank, no?
>>>> https://github.com/amplab/graphx/blob/master/python/examples/pagerank.py
>>>> Except return the "neighbor" itself, not necessarily the rank of
>>>> the page.
>>>>
>>>> If you wanted to, you could use Scala and GraphX for this problem.
>>>> It might be a bit of overhead though: construct a node for each
>>>> member of each tuple, with an edge between them. Then traverse the
>>>> graph for all sets of nodes that are connected.
>>>> That result set would quickly explode in size, but you could
>>>> restrict results to a minimum of N connections. I'm not super
>>>> familiar with GraphX myself, however. My intuition is saying
>>>> "graph problem" though.
>>>>
>>>> Thoughts?
>>>>
>>>> On Wed, Feb 24, 2016 at 6:43 PM, Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>>
>>>>> Hi Guillermo,
>>>>> assuming that the first "a,b" is a typo and you actually meant
>>>>> "a,d", this is a sorting problem.
>>>>>
>>>>> You could easily model your data as an RDD of tuples (or as a
>>>>> dataframe/dataset) and use the sortBy (or orderBy for
>>>>> dataframes/datasets) methods.
>>>>>
>>>>> best,
>>>>> --Jakob
>>>>>
>>>>> On Wed, Feb 24, 2016 at 2:26 PM, Guillermo Ortiz
>>>>> <konstt2...@gmail.com> wrote:
>>>>> > I want to implement an algorithm in Spark. I know how to do it
>>>>> > on a single machine where all the data is together, but I don't
>>>>> > know a good way to do it in Spark.
>>>>> >
>>>>> > If someone has an idea..
>>>>> > I have some data like this:
>>>>> > a , b
>>>>> > x , y
>>>>> > b , c
>>>>> > y , y
>>>>> > c , d
>>>>> >
>>>>> > I want something like:
>>>>> > a , d
>>>>> > b , d
>>>>> > c , d
>>>>> > x , y
>>>>> > y , y
>>>>> >
>>>>> > I need to know that a->b->c->d, so a->d, b->d and c->d.
>>>>> > I don't want the code, just an idea of how I could deal with it.
>>>>> >
>>>>> > Any idea?
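For the shape of data in the original question, where every vertex has at most one outgoing edge, the whole problem reduces to following each chain to its terminal vertex. A minimal single-machine sketch in plain Python (the single-successor assumption and the cycle guard are mine, not from the thread):

```python
# Sketch for the original example: each vertex has at most one
# successor, so chase each chain to its terminal vertex.
# A self-loop like y -> y terminates the chain at itself.
succ = {"a": "b", "b": "c", "c": "d", "x": "y", "y": "y"}

def terminal(v, succ):
    """Follow successor links until a vertex with no successor,
    or a self-loop, is reached."""
    seen = {v}
    while v in succ and succ[v] != v:
        v = succ[v]
        if v in seen:      # guard against longer cycles
            break
        seen.add(v)
    return v

result = {v: terminal(s, succ) for v, s in succ.items()}
print(result)
# {'a': 'd', 'b': 'd', 'c': 'd', 'x': 'y', 'y': 'y'}
```

On a cluster this per-vertex chase would be replaced by the iterative join/aggregate passes outlined at the top of this message, or by Pregel-style message passing in GraphX, where each vertex forwards the terminal value it has learned so far.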