So, given sets, you are joining overlapping sets until all of them are
mutually disjoint, right?

If graphs are out, then I also would love to see a slick distributed
solution, but couldn't think of one. It seems like a cartesian product
won't scale.

You can write a simple method to implement this locally, like:

val sets = collection.mutable.ArrayBuffer(...)
var i = 0
while (i < sets.size) {
  val index = sets.tail.indexWhere(s => (s & sets.head).size > 0)
  if (index >= 0) {
    sets += (sets.remove(0) | sets.remove(index))
  }
  i += 1
}

... and then apply this to each partition of sets for efficiency
(making partitions small enough that loading the sets into memory
isn't a problem), and then reduce the result using the same function.
I think that might be made efficient.

On Wed, Jul 23, 2014 at 7:21 PM, Roch Denis <[email protected]> wrote:
> Hello,
>
> Most of the tasks I've accomplished in Spark were fairly straightforward but
> I can't figure the following problem using the Spark API:
>
> Basically, I have an IP with a bunch of user ID associated to it. I want to
> create a list of all user id that are associated together, even if some are
> on different IP.
>
> For example:
>
> •       IP: 1.24.22.10 / User ID: A, B, C
> •       IP: 2.24.30.11 / User ID: C, D, E
> •       IP: 3.21.30.11 / User ID: F, Z, E
> •       IP: 4.21.30.11 / User ID: T, S, R
>
> The end result Would be something two list: [A,B,C, D, E, F, Z] and [T, S,
> R]
>
> What I've tried, is a
>
>         rdd = sc.parallelize([ frozenset([1, 2]), frozenset([2,3]),
> frozenset([3,4]) ])
>
> - Cartesian / Filter ( where I remove item with no user id in common )
> - Map: Merge the two user id set into a common set.
> - Distinct : Remove duplicates.
>
> I would have to run it a couple of times, but it doesn't quite work because
> for example [1,2] would get merged with [1,2] all the time and I would get
> stuck with it. ( see below ). I assume there's a common pattern to do this
> in mapreduce but I just don't know it :\. I realize it's a graph problem but
> spark graph implementation is not available in python yet.
>
> Pass 1:
>         SET: frozenset([1, 2, 3])
>         SET: frozenset([2, 3, 4])
>         SET: frozenset([2, 3])
>         SET: frozenset([1, 2])
>         SET: frozenset([3, 4])
>
> Pass 2:
>         SET: frozenset([1, 2, 3, 4])
>         SET: frozenset([1, 2, 3])
>         SET: frozenset([2, 3, 4])
>         SET: frozenset([2, 3])
>         SET: frozenset([1, 2])
>         SET: frozenset([3, 4])
>
> Pass 3:
>         SET: frozenset([1, 2, 3, 4])
>         SET: frozenset([1, 2, 3])
>         SET: frozenset([2, 3, 4])
>         SET: frozenset([2, 3])
>         SET: frozenset([1, 2])
>         SET: frozenset([3, 4])
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Reply via email to