[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989581#comment-14989581 ]

Adrian Tanase commented on SPARK-2629:
--------------------------------------

Great feature, thanks for driving it! I've left a few comments on the API in the 
design document; let me know if that works or if you'd prefer to move the 
discussion to the PR at this point.

There's one additional use case that this API solves - I'm not sure how many 
people have run into it - emitting a new value right before removing a state 
entry (e.g. emitting a "session complete" event). I currently handle this in 
two steps: the state object is a tuple that carries a removal flag, and the 
flagged entry is actually removed in the next batch (somewhat like flash 
messages in regular web-app sessions). With the new API we can emit and remove 
in the same batch, which is great.

One question related to the timeout and "eviction" of state entries - have you 
considered offering a way to extend the timeout? In some situations the 
expiration logic may want to extend the timeout based on considerations other 
than new data arriving for that key. Would you consider a "touch" operation, or 
relaxing the restriction that update cannot be called once an entry is marked 
for timeout?
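To make the ask concrete, something along these lines - purely hypothetical, 
neither the trait nor a touch() method exists in the current proposal, this is 
only to show the shape of what I mean:

// Hypothetical sketch of a "touch": push the timeout forward based on the state itself.
trait TouchableState[S] {
  def get(): S
  def isTimingOut(): Boolean
  def touch(): Unit   // hypothetical: extend the timeout for this key
}

object TouchSketch {
  def keepAlive[S](state: TouchableState[S], stillActive: S => Boolean): Unit = {
    if (state.isTimingOut() && stillActive(state.get())) {
      state.touch()   // extend based on the state, not on new input for the key
    }
  }
}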

> Improved state management for Spark Streaming
> ---------------------------------------------
>
>                 Key: SPARK-2629
>                 URL: https://issues.apache.org/jira/browse/SPARK-2629
>             Project: Spark
>          Issue Type: Epic
>          Components: Streaming
>    Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>
> The current updateStateByKey provides stateful processing in Spark Streaming. It 
> allows the user to maintain per-key state and manage that state using an update 
> function. The update function is called for each key, and it uses the new data 
> and the existing state of the key to generate an updated state. However, based 
> on community feedback, we have learnt the following lessons:
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle 
> data, (b) returning items other than state
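> For reference, the existing API has roughly this shape (simplified), and the 
> update function is invoked for every key in the state on every batch:
>
>   def updateStateByKey[S: ClassTag](
>       updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
>
>   // Typical usage: a running count per key, where `pairs` is assumed to be a
>   // DStream[(String, Int)]. Returning None is the only way to drop a key.
>   val counts = pairs.updateStateByKey[Long] { (newValues: Seq[Int], current: Option[Long]) =>
>     Some(current.getOrElse(0L) + newValues.sum)
>   }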
> The high-level idea that I am proposing is:
> - Introduce a new API, trackStateByKey, that allows the user to update per-key 
> state and emit arbitrary records. The new API is necessary as it will have 
> significantly different semantics from the existing updateStateByKey API. This 
> API will have direct support for timeouts.
> - Internally, the system will keep the state data as a map/list within the 
> partitions of the state RDDs. The new data RDDs will be partitioned 
> appropriately, and for all the key-value data, it will look up the map/list in 
> the state RDD partition and create a new list/map of updated state data. The 
> new state RDD partition will be created based on the updated data and, if 
> necessary, the old data. 
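> As a very rough sketch of the per-partition idea (placeholder code, not the 
> actual implementation - the design doc below has the real details): only the 
> keys with new data are looked up and updated, and untouched entries are 
> carried over:
>
>   // Rough sketch: update a partition's state map using only the new key-value
>   // data routed to that partition.
>   def updatePartition[K, V, S](
>       oldState: Map[K, S],
>       newData: Iterator[(K, V)],
>       update: (Option[S], V) => S): Map[K, S] = {
>     newData.foldLeft(oldState) { case (acc, (k, v)) =>
>       acc.updated(k, update(acc.get(k), v))
>     }
>   }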
> Here is the detailed design doc. Please take a look and provide feedback as 
> comments.
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
> A WIP PR is here - https://github.com/apache/spark/pull/9256


