Hello, here's a simple program that demonstrates my problem:
```python
from operator import add
from pyspark.streaming import StreamingContext

# sc is an existing SparkContext; 1-second batches
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint")  # required by updateStateByKey

input = [[("k1", 12), ("k2", 14)], [("k1", 22)]]
rawData = ssc.queueStream([sc.parallelize(d, 1) for d in input])

# Keep a running sum per key across batches
runningRawData = rawData.updateStateByKey(
    lambda new_values, prev: sum(new_values, prev or 0))

def stats(rdd):
    keyavg = rdd.values().reduce(add) / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

runningRawData.transform(stats).pprint()

ssc.start()
ssc.awaitTermination()
```

I have a feeling this will calculate `keyavg = rdd.values().reduce(add) / rdd.count()` inside `stats` quite a few times, depending on the number of partitions of the current RDD. What would be an alternative way to do this two-step computation without calculating the average many times?
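For reference, one variant I've considered is caching the RDD inside `stats` and computing the mean in a single action, so the state RDD isn't re-evaluated separately for each action (a minimal sketch; `cache()` and `mean()` are standard RDD methods, but I'm not sure this actually addresses the repeated work):

```python
def stats(rdd):
    # Cache so the actions below don't each recompute the RDD's lineage.
    rdd.cache()
    # mean() returns the average to the driver in a single action.
    keyavg = rdd.values().mean()
    return rdd.mapValues(lambda i: i - keyavg)
```

Is something like this the right direction, or is there a more idiomatic way to structure the two steps?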