Re: Streaming Performance w/ UpdateStateByKey

2015-10-10 Thread Adrian Tanase
How are you determining how much time is serialization taking? I made this change in a streaming app that relies heavily on updateStateByKey. The memory consumption went up 3x on the executors but I can't see any perf improvement. Task execution time is the same and the serialization state

Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Jeff Nadler
While investigating performance challenges in a Streaming application using UpdateStateByKey, I found that serialization of state was a meaningful (not dominant) portion of our execution time. In StateDStream.scala, serialized persistence is required:

Re: Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Tathagata Das
You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the stateDStream returned by updateStateByKey to achieve the same. As you have seen, the downside is greater memory usage, and also higher GC overheads (that;s the main one usually). So I suggest you run your benchmarks for a long enough