Hello, here's a simple program that demonstrates my problem:

# sc is an existing SparkContext (e.g. from the pyspark shell)
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)

batches = [[("k1", 12), ("k2", 14)], [("k1", 22)]]

rawData = ssc.queueStream([sc.parallelize(d, 1) for d in batches])

# Running sum per key; prev is None the first time a key is seen
runningRawData = rawData.updateStateByKey(lambda nv, prev: sum(nv, prev or 0))

def stats(rdd):
    # Average over all values in the batch, then center each value around it
    keyavg = rdd.values().sum() / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

runningRawData.transform(stats).pprint()

ssc.start()
ssc.awaitTermination(5)  # run just long enough to drain the queued batches


I have a feeling this will evaluate "keyavg = rdd.values().sum() /
rdd.count()" inside stats more than once per batch: sum() and count() are
two separate actions, so the current RDD's lineage may be recomputed for
each of them, across all of its partitions.
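
The only workaround I can think of is caching the RDD before the two
actions, along these lines (I'm guessing at where cache() belongs):

def stats(rdd):
    rdd.cache()  # keep the RDD around so the two actions below don't recompute it
    keyavg = rdd.values().sum() / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

But that still runs two jobs per batch.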

What would be an alternative way to structure this two-step computation so
that the average isn't calculated many times?
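
For example, would folding the sum and count into a single aggregate()
pass be the idiomatic approach? A rough sketch of what I mean:

def stats(rdd):
    # compute (sum, count) together in one action -- my guess at an alternative
    total, n = rdd.values().aggregate(
        (0, 0),
        lambda acc, v: (acc[0] + v, acc[1] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    keyavg = total / n
    return rdd.mapValues(lambda i: i - keyavg)

Or is there a cleaner way to share the average between the two steps?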
