Don't get me wrong, we've been able to use updateStateByKey for some jobs, and it's certainly convenient. At a certain point, though, iterating through every key on every batch becomes less viable.
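For context, the timeout-based pattern Silvio describes below can be sketched as a plain update function of the shape `(Seq[V], Option[S]) => Option[S]` that you would pass to `DStream.updateStateByKey`. The `Event`/`SessionState` names and the 30-minute timeout are hypothetical, not from the thread; the sketch also shows why the cost is linear in outstanding keys, since the function is invoked once per key per batch even when no events arrived.

```scala
case class Event(timestampMs: Long)
case class SessionState(firstSeenMs: Long, lastSeenMs: Long, hits: Int)

// Hypothetical timeout; Silvio mentions sessions around 10-30 minutes.
val sessionTimeoutMs = 30 * 60 * 1000L

// Invoked once per key per batch -- including keys with no new events,
// which is exactly why updateStateByKey is linear in the number of keys.
def updateSession(nowMs: Long)(
    events: Seq[Event],
    state: Option[SessionState]): Option[SessionState] =
  if (events.nonEmpty) {
    val last = events.map(_.timestampMs).max
    val prev = state.getOrElse(SessionState(last, last, 0))
    Some(SessionState(prev.firstSeenMs min last, last, prev.hits + events.size))
  } else state match {
    // No new events: keep the session until it times out, then return
    // None so Spark drops the key from the state RDD (Silvio's aging-off).
    case Some(s) if nowMs - s.lastSeenMs < sessionTimeoutMs => Some(s)
    case _ => None
  }
```

In a streaming job this would be wired in roughly as `stream.updateStateByKey(updateSession(batchTimeMs) _)`, with `batchTimeMs` supplied per batch.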
On Wed, Jul 15, 2015 at 12:38 PM, Sean McNamara <sean.mcnam...@webtrends.com> wrote:

> I would just like to add that we do the very same/similar thing at Webtrends; updateStateByKey has been a life-saver for our sessionization use-cases.
>
> Cheers,
>
> Sean
>
> On Jul 15, 2015, at 11:20 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
> Hi Cody,
>
> I’ve had success using updateStateByKey for real-time sessionization by aging off timed-out sessions (returning None in the update function). This was on a large commercial website with millions of hits per day. This was over a year ago so I don’t have access to the stats any longer for session lengths, unfortunately, but I seem to remember they were around 10-30 minutes long. Even with peaks in volume, Spark managed to keep up very well.
>
> Thanks,
> Silvio
>
> From: Cody Koeninger
> Date: Wednesday, July 15, 2015 at 5:38 PM
> To: algermissen1971
> Cc: Tathagata Das, swetha, user
> Subject: Re: Sessionization using updateStateByKey
>
> An in-memory hash key data structure of some kind, so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys that never get hit, and for unusually long sessions you have to either drop them or hit durable storage.
>
> Maybe someone has a better idea; I'd like to hear it.
>
> On Wed, Jul 15, 2015 at 8:54 AM, algermissen1971 <algermissen1...@icloud.com> wrote:
>
>> Hi Cody,
>>
>> oh ... I thought that was one of *the* use cases for it. Do you have a suggestion / best practice for how to achieve the same thing with better scaling characteristics?
>>
>> Jan
>>
>> On 15 Jul 2015, at 15:33, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> > I personally would try to avoid updateStateByKey for sessionization when you have long sessions / a lot of keys, because it's linear on the number of keys.
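One way to read Cody's suggestion above is a session store keyed by session id where each batch touches only the keys that actually appear in it, plus a separate, less frequent sweep for keys that never get hit again. A minimal single-JVM sketch follows; the `SessionStore`/`Hit` names are hypothetical, and a plain `mutable.HashMap` stands in for whatever per-executor structure or external cache a real job would use.

```scala
import scala.collection.mutable

case class Hit(sessionId: String, timestampMs: Long)
case class Session(lastSeenMs: Long, hits: Int)

class SessionStore(timeoutMs: Long) {
  private val sessions = mutable.HashMap.empty[String, Session]

  // Linear in the number of hits in the batch, not in outstanding keys.
  def updateBatch(batch: Seq[Hit]): Unit =
    for (h <- batch) {
      val prev = sessions.getOrElse(h.sessionId, Session(h.timestampMs, 0))
      sessions(h.sessionId) =
        Session(h.timestampMs max prev.lastSeenMs, prev.hits + 1)
    }

  // The extra complexity Cody mentions: expiration for keys that never
  // get hit. This does scan all keys, but can run far less often than
  // every batch. Returns the expired sessions for downstream output.
  def expire(nowMs: Long): Seq[(String, Session)] = {
    val dead = sessions.filter { case (_, s) =>
      nowMs - s.lastSeenMs >= timeoutMs
    }.toSeq
    dead.foreach { case (k, _) => sessions.remove(k) }
    dead
  }

  def size: Int = sessions.size
}
```

Unusually long sessions would still need to be either dropped or spilled to durable storage, as Cody notes; this sketch does neither.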
>> > On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das <t...@databricks.com> wrote:
>> >
>> > [Apologies for the repost, for those who have already seen this response on the dev mailing list]
>> >
>> > 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app with updateStateByKey will not start until you set a checkpoint directory.
>> >
>> > 2. The updateStateByKey performance is more or less independent of the source being used - receiver-based or direct Kafka. The absolute performance obviously depends on a LOT of variables: size of the cluster, parallelization, etc. The key thing is that you must ensure sufficient parallelization at every stage - receiving, shuffles (updateStateByKey included), and output.
>> >
>> > Some more discussion in my talk - https://www.youtube.com/watch?v=d5UJonrruHk
>> >
>> > On Tue, Jul 14, 2015 at 4:13 PM, swetha <swethakasire...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I have a question regarding sessionization using updateStateByKey. If near-real-time state needs to be maintained in a Streaming application, what happens when the number of RDDs to maintain the state becomes very large? Does it automatically get saved to HDFS and reloaded when needed, or do I have to use code like ssc.checkpoint(checkpointDir)? Also, how is the performance if I use both DStream checkpointing for maintaining the state and the Kafka direct approach for exactly-once semantics?
>> >
>> > Thanks,
>> > Swetha
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sessionization-using-updateStateByKey-tp23838.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
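TD's first point, that an updateStateByKey app will not start without a checkpoint directory, amounts to a one-line setup requirement. A minimal configuration sketch, assuming Spark Streaming 1.x; the app name, batch interval, and HDFS path are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical names throughout; the essential line is ssc.checkpoint.
val conf = new SparkConf().setAppName("sessionization")
val ssc = new StreamingContext(conf, Seconds(10))

// Required for updateStateByKey: the state RDD is periodically
// checkpointed (snapshotted) to this fault-tolerant directory.
ssc.checkpoint("hdfs:///checkpoints/sessionization")
```

On TD's second point, the parallelism of the state stage can be tuned by passing a partitioner or partition count to updateStateByKey itself, in addition to tuning the receiving and output stages.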
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org