Hi,
I am developing a Spark Streaming application where I want every item in my
stream to be assigned a unique, strictly increasing Long. My input data
already has RDD-local integers (from 0 to N-1) assigned, so I am doing the
following:
    var totalNumberOfItems = 0L

    // update the keys of the stream data
    val globallyIndexedItems = inputStream.map(keyVal =>
      (keyVal._1 + totalNumberOfItems, keyVal._2))

    // increase the number of total seen items
    inputStream.foreachRDD(rdd => {
      totalNumberOfItems += rdd.count
    })
Now this works on my local[*] Spark instance, but I was wondering if this
is actually an ok thing to do. I don't want this to break when going to a
YARN cluster...
The function increasing totalNumberOfItems is closing over a var and
running in the driver, so I think this is ok. Here is my concern: What
about the function in the inputStream.map(...) block? This one is closing
over a var that has a different value in every interval. Will the closure
be re-serialized with the new value in every interval? Or only once with
the initial value, so that the offset stays 0 for the whole run of the
program?
As I said, it works locally, but I was wondering if I can really assume
that the closure is serialized with a new value in every interval.
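To make my question more concrete, here is a plain-Scala sketch (no Spark)
of the two capture patterns I have in mind. Within a single JVM, a closure
over a var sees updates to it, whereas snapshotting the var into a local
val fixes the value at definition time; what I am unsure about is which of
these two behaviors I effectively get once the closure is serialized and
shipped to executors on a cluster.

```scala
object ClosureCapture {
  // driver-side counter, analogous to totalNumberOfItems
  var counter = 0L

  // Pattern 1: the closure reads the var every time it runs.
  // In the same JVM it always sees the latest value.
  val readsVar: Int => Long = i => i + counter

  // Pattern 2: snapshot the var into a local val when the
  // closure is defined; the captured value never changes.
  val snapshot: Long = counter
  val readsSnapshot: Int => Long = i => i + snapshot

  def main(args: Array[String]): Unit = {
    counter = 10L
    println(readsVar(1))      // sees the updated var: 11
    println(readsSnapshot(1)) // fixed at capture time: 1
  }
}
```

Locally, Pattern 1 behaves like my map closure appears to, but if the
serialized closure is only shipped once, executors might effectively see
Pattern 2.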
Thanks,
Tobias