How are you determining how much time is serialization taking?
I made this change in a streaming app that relies heavily on updateStateByKey.
The memory consumption went up 3x on the executors but I can't see any perf
improvement. Task execution time is the same and the serialization state
While investigating performance challenges in a Streaming application using
UpdateStateByKey, I found that serialization of state was a meaningful (not
dominant) portion of our execution time.
In StateDStream.scala, serialized persistence is required:
You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the
stateDStream returned by updateStateByKey to achieve the same. As you have
seen, the downside is greater memory usage, and also higher GC overheads
(that;s the main one usually). So I suggest you run your benchmarks for a
long enough