Hi all,

I would like to apply a function over all the elements of each key (assuming a key-value RDD). For instance, imagine I have:

    import numpy as np

    a = np.array([[1, 'hola', 'adios'], [2, 'hi', 'bye'], [2, 'hello', 'goodbye']])
    a = sc.parallelize(a)

Then I want to create a key-value RDD, using the first element of each row as the key:

    b = a.groupBy(lambda x: x[0])

Finally, I want to keep only those groups in which the second element is the same for every row of the key (or in which the key has only one row). For key 1 there is only one row ('hola'), whereas key 2 has two different second elements ('hi', 'hello'). Therefore, only the values associated with key 1 should be returned:

    def test(group):
        # Return the group unchanged if all rows share the same
        # second element; otherwise return an empty list.
        x = group[0][1]
        for g in group[1:]:
            if g[1] != x:
                return []
        return group

    # Each element of b is a (key, values) pair; .data is the list of rows.
    c = b.flatMap(lambda kv: test(kv[1].data))

Is there a more efficient way to do this?

Many thanks in advance,
Best
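
P.S. To make the intended per-key check concrete, here is a minimal sketch in plain Python, with no Spark involved. The names keep_if_uniform, data and groups are made up for illustration, and the dict-based grouping only stands in for groupBy/flatMap; it has not been run on a cluster. The keys are strings because np.array converts mixed rows to strings.

```python
from collections import defaultdict


def keep_if_uniform(rows):
    """Return rows unchanged if they all share the same second element, else []."""
    rows = list(rows)
    distinct = {row[1] for row in rows}  # distinct second elements in this group
    return rows if len(distinct) <= 1 else []


# Local stand-in for the RDD contents (same sample rows as above, as strings).
data = [["1", "hola", "adios"], ["2", "hi", "bye"], ["2", "hello", "goodbye"]]

# Emulate groupBy(lambda x: x[0]) with a plain dict of lists.
groups = defaultdict(list)
for row in data:
    groups[row[0]].append(row)

# Emulate flatMap: concatenate whatever each group returns.
result = [row for rows in groups.values() for row in keep_if_uniform(rows)]
print(result)  # [['1', 'hola', 'adios']]
```

On the RDD itself, I assume the same function could be passed as b.flatMap(lambda kv: keep_if_uniform(kv[1])), but that still groups every row per key first, so it does not answer the efficiency question.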