You can always throw more machines at this and see whether performance improves, since you haven't mentioned anything about your cluster size, number of cores, etc.
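To make the cost concrete, here is a minimal sketch (plain Python, not Spark; names and the counter-style state are illustrative, not from the original post) contrasting what a full-state pass does on every batch with the "update only the keys that received input" behavior being asked about below:

```python
# Hypothetical illustration: state maps user_id -> a simple counter.
# In the real job the value would be a richer per-user state object.

def update_state_full_scan(state, batch):
    """Full pass, like updateStateByKey: visits every key in state
    on every tick, even keys with no new input."""
    batch_by_key = {}
    for user_id, action in batch:
        batch_by_key.setdefault(user_id, []).append(action)
    new_state = {}
    for user_id, old in state.items():      # iterates ALL users each tick
        new_state[user_id] = old + len(batch_by_key.get(user_id, []))
    for user_id, actions in batch_by_key.items():
        if user_id not in state:            # keys first seen this batch
            new_state[user_id] = len(actions)
    return new_state

def update_state_delta(state, batch):
    """Requested behavior: touch only the keys present in this batch."""
    for user_id, action in batch:
        state[user_id] = state.get(user_id, 0) + 1
    return state
```

Both produce the same result when no eviction or time-based update is needed, but the delta version does work proportional to the batch size rather than to the full key space (5M users here), which is exactly why the full scan dominates the batch time.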
Thanks
Best Regards

On Wed, Mar 18, 2015 at 11:42 AM, nvrs <nvior...@gmail.com> wrote:
> Hi all,
>
> We are having a few issues with the performance of the updateStateByKey
> operation in Spark Streaming (1.2.1 at the moment), and any advice would
> be greatly appreciated. Specifically, on each tick of the system (which is
> set at 10 secs) we need to update a state tuple where the key is the
> user_id and the value is an object with some state about the user. The
> problem is that with Kryo serialization for 5M users, this gets so slow
> that we have to increase the period to more than 10 seconds so as not to
> fall behind.
> The input for the streaming job is a Kafka stream consisting of key-value
> pairs of user_ids with some sort of action codes; we join this to our
> checkpointed state by key and update the state.
> I understand that the reason for iterating over the whole state set is to
> evict items or to update state for everyone for time-dependent
> computations, but this does not apply to our situation and it hurts
> performance really badly.
> Would it be possible to add an extra call to the API in the future for
> updating only a specific subset of keys?
>
> P.S. I will try asap to set the DStream as non-serialized, but then I am
> worried about GC and checkpointing performance.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/updateStateByKey-performance-API-tp22113.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org