Hi,

Databricks runtime as you already know has this enhancement and so it is
considered a good option if you want to decouple state from the jvm.
Some arguments why to do so are given by the Flink paper along with
incremental snapshotting: http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf.
Also timers implemented in RockDb can give you higher scalability with very
large states (and many timers). I am not aware of the history behind the
FMGWS API (others could provide more info), but I was also looking at the
API recently thinking of an API for this:
https://issues.apache.org/jira/browse/SPARK-16738

Best,
Stavros

On Sun, Dec 16, 2018 at 7:58 PM Chitral Verma <chitralve...@gmail.com>
wrote:

> Hi Devs,
>
> For quite some time i've been looking at the structured streaming API to
> solve lots of use cases at my workplace, I've have some doubts I wanted to
> clarify regarding stateful aggregations over structured streaming.
>
> Currently, spark provides flatMapGroupWithState (FMGWS) /
> mapGroupWithState (MGWS) APIs to allow custom streaming aggregations by
> setting/ updating intermediate `GroupedState` which may or may not expire.
> This GroupedState is stored in form of snapshots and the latest snapshot is
> entirely in memory, what might be memory consuming approach and may result
> in OOMs.
>
> Other than this, in my opinion, FMGWS is not very flexible in terms of
> usage (aggregation logic and needs to be written on Rows and spark sql
> inbuilt functions can be used) and the timeouts require query to progress
> in order expire keys.
>
> To remedy this i have contributed to this project
> <https://github.com/chermenin/spark-states> which basically moves the
> expiration logic to state store (rocks db) and the state store is no longer
> managed by the executor jvm allowing true expiration of state with nano sec
> precision.
>
> My question is, is there a specific reason FMGWS API is designed the way
> it is and are there any down sides to the approach I have mentioned above.
>
> Do let me know you thoughts.
>
> Thanks
>

Reply via email to