Hello, here's a simple program that demonstrates my problem:

# sc is an existing SparkContext (e.g. from the pyspark shell)
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)

batches = [[("k1", 12), ("k2", 14)], [("k1", 22)]]

rawData = ssc.queueStream([sc.parallelize(d, 1) for d in batches])

# Running sum per key; prev is None the first time a key is seen
runningRawData = rawData.updateStateByKey(lambda nv, prev: sum(nv, prev or 0))

def stats(rdd):
    # Average over all values in the batch, then center each value around it
    keyavg = rdd.values().sum() / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

runningRawData.transform(stats).pprint()

ssc.start()
ssc.awaitTermination(5)  # run just long enough to drain the queued batches


I have a feeling this will evaluate "keyavg = rdd.values().sum() /
rdd.count()" inside stats more than once per batch: sum() and count() are
two separate actions, so the current RDD's lineage may be recomputed for
each of them, across all of its partitions.
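
The only workaround I can think of is caching the RDD before the two
actions, along these lines (I'm guessing at where cache() belongs):

def stats(rdd):
    rdd.cache()  # keep the RDD around so the two actions below don't recompute it
    keyavg = rdd.values().sum() / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

But that still runs two jobs per batch.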

What would be an alternative way to structure this two-step computation so
that the average isn't calculated many times?
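
For example, would folding the sum and count into a single aggregate()
pass be the idiomatic approach? A rough sketch of what I mean:

def stats(rdd):
    # compute (sum, count) together in one action -- my guess at an alternative
    total, n = rdd.values().aggregate(
        (0, 0),
        lambda acc, v: (acc[0] + v, acc[1] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    keyavg = total / n
    return rdd.mapValues(lambda i: i - keyavg)

Or is there a cleaner way to share the average between the two steps?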
