Hello,
Most of the tasks I've accomplished in Spark were fairly straightforward, but
I can't figure out how to solve the following problem using the Spark API.
Basically, I have IPs, each with a bunch of user IDs associated with it. I want to
create lists of all user IDs that are associated together, even if some are
on different IPs.
For example:
• IP: 1.24.22.10 / User ID: A, B, C
• IP: 2.24.30.11 / User ID: C, D, E
• IP: 3.21.30.11 / User ID: F, Z, E
• IP: 4.21.30.11 / User ID: T, S, R
The end result would be two lists: [A, B, C, D, E, F, Z] and [T, S, R].
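To pin down the expected result, here is a minimal single-machine sketch in plain Python (no Spark) that groups the user IDs with a union-find structure. The names `connected_groups` and `ips` are made up for illustration, and this is only a reference for what the output should look like, not a distributed solution.

```python
def connected_groups(ip_to_users):
    """Group user IDs linked, directly or transitively, by a shared IP."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for users in ip_to_users.values():
        users = list(users)
        for user in users:
            find(user)  # make sure single-user IPs are registered too
        for other in users[1:]:
            union(users[0], other)

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), set()).add(user)
    return list(groups.values())


# The example from this post (hypothetical variable name).
ips = {
    "1.24.22.10": ["A", "B", "C"],
    "2.24.30.11": ["C", "D", "E"],
    "3.21.30.11": ["F", "Z", "E"],
    "4.21.30.11": ["T", "S", "R"],
}
groups = connected_groups(ips)
```

On the example data this yields two groups: A, B, C, D, E, F, Z in one and T, S, R in the other.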
What I've tried, starting from
rdd = sc.parallelize([frozenset([1, 2]), frozenset([2, 3]), frozenset([3, 4])])
is:
- Cartesian / Filter (where I remove pairs with no user ID in common)
- Map: merge the two user ID sets into a common set.
- Distinct: remove duplicates.
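For reference, the three steps above can be mimicked in plain Python without a SparkContext. This is only a sketch of the logic; the comments name the RDD operation each line stands in for, and `one_pass` is a hypothetical helper name.

```python
from itertools import product


def one_pass(sets):
    """One Cartesian / Filter / Map / Distinct round over a collection of frozensets."""
    pairs = product(sets, repeat=2)                     # stands in for rdd.cartesian(rdd)
    overlapping = ((a, b) for a, b in pairs if a & b)   # filter: keep pairs sharing an ID
    merged = (a | b for a, b in overlapping)            # map: union the two sets
    return set(merged)                                  # distinct: drop duplicates


start = [frozenset([1, 2]), frozenset([2, 3]), frozenset([3, 4])]
pass1 = one_pass(start)
pass2 = one_pass(pass1)
```

Because every set is also paired with itself, each input set survives every pass alongside the larger merged sets.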
I would have to run it a couple of times, but it doesn't quite work because,
for example, [1,2] gets merged with [1,2] on every pass, so the smaller sets
never go away and I get stuck with them (see below). I assume there's a common
pattern for doing this in MapReduce, but I just don't know it :\. I realize it's
a graph problem, but Spark's graph implementation is not available in Python yet.
Pass 1:
SET: frozenset([1, 2, 3])
SET: frozenset([2, 3, 4])
SET: frozenset([2, 3])
SET: frozenset([1, 2])
SET: frozenset([3, 4])
Pass 2:
SET: frozenset([1, 2, 3, 4])
SET: frozenset([1, 2, 3])
SET: frozenset([2, 3, 4])
SET: frozenset([2, 3])
SET: frozenset([1, 2])
SET: frozenset([3, 4])
Pass 3:
SET: frozenset([1, 2, 3, 4])
SET: frozenset([1, 2, 3])
SET: frozenset([2, 3, 4])
SET: frozenset([2, 3])
SET: frozenset([1, 2])
SET: frozenset([3, 4])
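One possible way to get the passes above to converge, sketched in plain Python under the assumption that the data is small enough for a Cartesian product: after each pass, drop every set that is a strict subset of another set, and stop once a pass changes nothing. The helper names `merge_pass` and `merge_until_stable` are hypothetical, and this is not necessarily the most scalable pattern.

```python
def merge_pass(sets):
    """Merge overlapping sets, then keep only the maximal ones."""
    sets = set(sets)
    # Self-pairs are kept on purpose so that isolated sets survive the pass.
    merged = {a | b for a in sets for b in sets if a & b}
    # Drop any set strictly contained in another one.
    return {s for s in merged if not any(s < t for t in merged)}


def merge_until_stable(sets):
    """Repeat merge_pass until the collection of sets stops changing."""
    current = {frozenset(s) for s in sets}
    while True:
        following = merge_pass(current)
        if following == current:
            return following
        current = following


final_sets = merge_until_stable([[1, 2], [2, 3], [3, 4]])
```

With the subset filter, the three starting sets collapse to the single set [1, 2, 3, 4]. The loop stops because any two surviving sets that still overlap would produce a strictly larger union on the next pass, so a pass with no change means the remaining sets are disjoint.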
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.