Hi all,

We are having a few issues with the performance of the updateStateByKey
operation in Spark Streaming (1.2.1 at the moment) and any advice would be
greatly appreciated. Specifically, on each tick of the system (which is set
at 10 secs) we need to update a state tuple where the key is the user_id and
the value is an object with some state about the user. The problem is that
with Kryo serialization and 5M users this gets so slow that we have to
increase the batch interval beyond 10 seconds so as not to fall behind.
The input for the streaming job is a Kafka stream consisting of key-value
pairs of user_ids and action codes; we join these against our checkpointed
state and update it.
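For concreteness, here is roughly what the job looks like (a trimmed-down
sketch, not our real code: the state class, checkpoint path, and Kafka
parameters are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder per-user state object, registered with Kryo below.
case class UserState(lastActionCode: Int, actionCount: Long)

val conf = new SparkConf()
  .setAppName("UserStateJob")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[UserState]))

val ssc = new StreamingContext(conf, Seconds(10))    // 10 sec batch interval
ssc.checkpoint("hdfs:///tmp/user-state-checkpoints") // placeholder path

// Kafka stream of (user_id, action_code) pairs; connection details are fake.
val actions = KafkaUtils.createStream(
  ssc, "zkhost:2181", "user-state-group", Map("user-actions" -> 1))

// Note: the update function runs for EVERY key already in the state on
// every batch, not only for the keys that appear in this batch's input.
val state = actions.updateStateByKey[UserState] {
  (newActions: Seq[String], current: Option[UserState]) =>
    val s = current.getOrElse(UserState(0, 0L))
    Some(UserState(
      newActions.lastOption.map(_.toInt).getOrElse(s.lastActionCode),
      s.actionCount + newActions.size))
}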
I understand that the reason for iterating over the whole state set is to
allow evicting items or updating everyone's state for time-dependent
computations, but neither applies to our situation, and it hurts performance
badly.
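Even if the update function takes an early exit for keys that received no
new events, something like this (applyEvents standing in for our real
logic):

val update = (events: Seq[String], current: Option[UserState]) =>
  if (events.isEmpty) current               // untouched key: keep state as-is
  else Some(applyEvents(current, events))   // placeholder for the real update

the whole 5M-entry state RDD still gets scanned and re-serialized on every
batch, so this only trims the per-key work, not the full iteration.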
Is there a possibility of implementing, in the future, an extra call in the
API for updating only a specific subset of keys?

P.S. I will try ASAP setting the DStream to a non-serialized storage level,
but then I am worried about GC and checkpointing performance.
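Concretely, I mean something along these lines (state being the DStream
returned by updateStateByKey above):

import org.apache.spark.storage.StorageLevel

// StateDStream caches MEMORY_ONLY_SER by default; a deserialized level
// should skip the per-batch Kryo round-trip at the cost of more heap
// (hence the GC worry). Checkpoint writes to HDFS still serialize anyway.
state.persist(StorageLevel.MEMORY_ONLY)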



