You can always throw more machines at this and see whether performance
improves, since you haven't mentioned anything about your number of cores,
etc.

Thanks
Best Regards

On Wed, Mar 18, 2015 at 11:42 AM, nvrs <nvior...@gmail.com> wrote:

> Hi all,
>
> We are having a few issues with the performance of the updateStateByKey
> operation in Spark Streaming (1.2.1 at the moment) and any advice would be
> greatly appreciated. Specifically, on each tick of the system (which is set
> at 10 secs) we need to update a state tuple where the key is the user_id
> and the value is an object with some state about the user. The problem is
> that with Kryo serialization and 5M users this gets really slow, to the
> point that we have to increase the period to more than 10 seconds so as
> not to fall behind.
> The input for the streaming job is a Kafka stream which consists of
> key-value pairs of user_ids and action codes; we join this with our
> checkpointed state by key and update the state.
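> Roughly, the job looks like the following sketch (simplified; the state
> class, the update logic, and the Kafka connection values here are
> illustrative placeholders, not our actual code):
>
>     import org.apache.spark.SparkConf
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>     import org.apache.spark.streaming.kafka.KafkaUtils
>
>     // Hypothetical per-user state object
>     case class UserState(actionCount: Long)
>
>     val conf = new SparkConf().setAppName("user-state")
>     val ssc = new StreamingContext(conf, Seconds(10)) // 10-second tick
>     ssc.checkpoint("/checkpoint/dir") // required by updateStateByKey
>
>     // Kafka stream of (user_id, actionCode) pairs
>     val actions = KafkaUtils.createStream(
>       ssc, "zkhost:2181", "consumer-group", Map("user-actions" -> 1))
>
>     // The update function runs for EVERY key held in the state on each
>     // batch, not just the keys seen in this batch -- with 5M users this
>     // is the expensive part.
>     val state = actions.updateStateByKey[UserState] {
>       (newActions: Seq[String], current: Option[UserState]) =>
>         Some(UserState(current.map(_.actionCount).getOrElse(0L) +
>           newActions.size))
>     }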
> I understand that the reason for iterating over the whole state set is to
> evict items or to update everyone's state for time-dependent computations,
> but neither applies to our situation, and it hurts performance badly.
> Is there a possibility of adding, in the future, an extra call to the API
> for updating only a specific subset of keys?
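> Something along the lines of this hypothetical interface (it does not
> exist in Spark today; the name is made up):
>
>     import scala.reflect.ClassTag
>     import org.apache.spark.streaming.dstream.DStream
>
>     // Hypothetical API: invoke the update function only for the keys
>     // that actually appear in the current batch, leaving all other
>     // state entries untouched.
>     trait SubsetUpdate[K, V] {
>       def updateStateByKeySubset[S: ClassTag](
>           updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
>     }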
>
> P.S. I will try ASAP to set the DStream to non-serialized storage, but
> then I am worried about GC and checkpointing performance.
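> (Concretely, I assume that means something like
>
>     import org.apache.spark.storage.StorageLevel
>     // keep the state deserialized in memory instead of the default
>     // serialized storage level of the state DStream
>     state.persist(StorageLevel.MEMORY_ONLY)
>
> on the state DStream, trading serialization cost for GC pressure.)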
>
>
>
