Apply function to all elements along each key

Luis Guerra Tue, 20 Jan 2015 08:26:30 -0800

Hi all,

I would like to apply a function to all the elements that share a key
(assuming a key-value RDD). For instance, imagine I have:


import numpy as np
a = np.array([[1, 'hola', 'adios'], [2, 'hi', 'bye'],
              [2, 'hello', 'goodbye']])
a = sc.parallelize(a)

Then I want to create a key-value RDD, using the first element of each row
as the key:

b = a.groupBy(lambda x: x[0])

Finally, I want to keep only those groups in which the second element is
the same for every row under the key (or the key has only one row). For
key 1 there is only one value ('hola'), whereas key 2 has two different
values ('hi', 'hello'). Therefore, only the rows associated with key 1
should be returned:

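def test(group):
    # Compare the second field of each row with that of the previous row.
    x = group[0][1]
    for g in group[1:]:
        y = g[1]
        if x != y:
            # Values differ within this key: drop the whole group.
            return []
        else:
            x = y
    return group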

c = b.flatMap(lambda kv: test(kv[1].data))
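
To make the intended result concrete, here is a minimal plain-Python sketch of the same check that runs without Spark. The names keep_if_uniform and filter_uniform_groups are only illustrative, and sorting plus itertools.groupby is just a local stand-in for groupBy followed by flatMap, not a claim about how Spark does it:

```python
from itertools import groupby


def keep_if_uniform(rows):
    """Return the rows unchanged if they all share the same second element, else []."""
    rows = list(rows)
    # The set of second elements has exactly one member when all rows agree;
    # this also covers the case of a key with a single row.
    return rows if len({r[1] for r in rows}) == 1 else []


def filter_uniform_groups(records):
    """Group records by their first element and keep only the uniform groups."""
    # itertools.groupby only merges adjacent items, so sort by key first.
    ordered = sorted(records, key=lambda r: r[0])
    result = []
    for _, rows in groupby(ordered, key=lambda r: r[0]):
        result.extend(keep_if_uniform(rows))
    return result


records = [[1, 'hola', 'adios'], [2, 'hi', 'bye'], [2, 'hello', 'goodbye']]
print(filter_uniform_groups(records))  # [[1, 'hola', 'adios']]
```

Only the row under key 1 survives, which is the output I expect from the Spark version as well.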

Is there a more efficient way to do this?

Many thanks in advance,

Best
