Hi all,

I would like to apply a function to all the elements associated with each
key of a key-value RDD. For instance, suppose I have:

import numpy as np
a = np.array([[1, 'hola', 'adios'],[2, 'hi', 'bye'],[2, 'hello',
'goodbye']])
a = sc.parallelize(a)

Then I want to create a key-value RDD, using the first element of each row
as the key:

b = a.groupBy(lambda x: x[0])

Finally, I want to keep only those keys whose values all share the same
second element (or that have only one value). For key 1 there is only one
value ('hola'), whereas key 2 has two different second elements ('hi',
'hello'). Therefore, only the values associated with key 1 should be
returned:

def test(group):
    x = group[0][1]
    for g in group[1:]:
        y = g[1]
        if x != y:
            return []
        else:
            x = y
    return group

c = b.flatMap(lambda kv: test(list(kv[1])))
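
To make the expected result concrete, here is a minimal plain-Python sketch
(no Spark) of the same rule applied to the sample rows. The function name
keep_uniform_groups is hypothetical, and the keys are written as strings on
the assumption that np.array coerces the mixed rows to strings:

```python
from collections import defaultdict

# Sample rows from above, written as strings (assumption: np.array
# coerces the mixed int/str rows to strings).
rows = [['1', 'hola', 'adios'], ['2', 'hi', 'bye'], ['2', 'hello', 'goodbye']]

def keep_uniform_groups(rows):
    """Hypothetical helper: keep the rows of every key (field 0) whose
    rows all share the same second field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    kept = []
    for key in groups:
        group = groups[key]
        # One distinct second field means the whole group is kept.
        if len({row[1] for row in group}) == 1:
            kept.extend(group)
    return kept

result = keep_uniform_groups(rows)
print(result)  # [['1', 'hola', 'adios']]
```

Only the row for key 1 survives, which is the output I expect from c.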

Is there a more efficient way to do this?

Many thanks in advance,

Best
