Offline state manipulation tool for structured streaming query

Jungtaek Lim Sat, 13 Apr 2019 07:14:05 -0700

Hi Spark users, especially Structured Streaming users who are dealing with
stateful queries,


I'm pleased to introduce Spark State Tools, which enables offline state
manipulation for Structured Streaming queries.

Basically, the tool exposes state as a batch source and sink, so you can
read state, transform it, and even write it back. With the full feature set
of Spark SQL batch queries, you can do with your state whatever you have in
mind, including rescaling state (repartitioning) and schema evolution.
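
To make the rescaling idea concrete, here is a minimal sketch in plain Python.
It does not use Spark or the tool's API: the `partition_for` and `rescale`
helpers, the sample counts, and the CRC32 hash are all assumptions chosen for
illustration (Spark uses its own hash partitioning). It only shows the general
shape of the operation: read every state row, then regroup the rows by key
under a new partition count.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Spark does not use CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rescale(state_by_partition, new_num_partitions):
    """Redistribute (key, value) state rows across a new partition count."""
    rescaled = defaultdict(dict)
    for rows in state_by_partition.values():
        for key, value in rows.items():
            rescaled[partition_for(key, new_num_partitions)][key] = value
    return dict(rescaled)


# Hypothetical state of a streaming count aggregation, laid out in 2 partitions.
counts = {"apple": 3, "banana": 5, "cherry": 1, "durian": 7}
old_state = defaultdict(dict)
for k, v in counts.items():
    old_state[partition_for(k, 2)][k] = v

# Rescale from 2 partitions to 4; every key keeps its value.
new_state = rescale(old_state, 4)
```

No state row is lost or changed by the regrouping; only the partition each
key lives in is recomputed.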

Here is a summary of the features:

- Show the state information you'll need to provide to use the other features
  - state operator information, state schema
- Create a savepoint from an existing checkpoint of a Structured Streaming query
- Read state as a batch source of Spark SQL
- Write a DataFrame to state as a batch sink of Spark SQL
- Migrate state format from old to new
  - migrating Streaming Aggregation from ver 1 to 2
  - migrating FlatMapGroupsWithState from ver 1 to 2

Here's the GitHub repository for the tool:
https://github.com/HeartSaVioR/spark-state-tools

Artifacts are also published to Maven Central, so you can just pull the
artifact into your app.

I'd be happy to hear ideas for improvements, and contributions are much
appreciated!

Enjoy!

Thanks,
Jungtaek Lim (HeartSaVioR)
