How are you determining how much time is serialization taking?

I made this change in a streaming app that relies heavily on updateStateByKey. 
The memory consumption went up 3x on the executors but I can't see any perf 
improvement. Task execution time is the same and the serialization state metric 
in the spark UI is 0-1ms in both scenarios.

Any idea where else to look or why am I not seeing any performance uplift?

Thanks!
Adrian

Sent from my iPhone

On 06 Oct 2015, at 00:47, Tathagata Das 
<t...@databricks.com<mailto:t...@databricks.com>> wrote:

You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the stateDStream 
returned by updateStateByKey to achieve the same. As you have seen, the 
downside is greater memory usage, and also higher GC overheads (that;s the main 
one usually). So I suggest you run your benchmarks for a long enough time to 
see what is the GC overheads. If it turns out that some batches are randomly 
taking longer because of some task in some executor being stuck in GC, then its 
going to be bad.

Alternatively, you could also starting playing with CMS GC, etc.

BTW, it would be amazing, if you can share the number in your benchmarks. 
Number of states, how complex are the objects in state, whats the processing 
time and whats the improvements.

TD


On Mon, Oct 5, 2015 at 2:28 PM, Jeff Nadler 
<jnad...@srcginc.com<mailto:jnad...@srcginc.com>> wrote:

While investigating performance challenges in a Streaming application using 
UpdateStateByKey, I found that serialization of state was a meaningful (not 
dominant) portion of our execution time.

In StateDStream.scala, serialized persistence is required:

     super.persist(StorageLevel.MEMORY_ONLY_SER)

I can see why that might be a good choice for a default.    For our workload, I 
made a clone that uses StorageLevel.MEMORY_ONLY.   I've just completed some 
tests and it is indeed faster, with the expected cost of greater memory usage.  
 For us that would be a good tradeoff.

I'm not taking any particular extra risks by doing this, am I?

Should this be configurable?  Perhaps yet another signature for 
PairDStreamFunctions.updateStateByKey?

Thanks for sharing any thoughts-

Jef




Reply via email to