This is an automated email from the ASF dual-hosted git repository.

fhueske pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/flink-web.git

commit 21d447a8ef07e0285c6105e0eb07460dcf8a65f1
Author: Seth Wiesman <sjwies...@gmail.com>
AuthorDate: Thu Sep 12 17:14:06 2019 -0500

    [blog] State Processor API
    
    This closes #264.
---
 _posts/2019-09-13-state-processor-api.md           |  64 +++++++++++++++++++++
 .../application-my-app-state-processor-api.png     | Bin 0 -> 49938 bytes
 .../database-my-app-state-processor-api.png        | Bin 0 -> 50174 bytes
 3 files changed, 64 insertions(+)

diff --git a/_posts/2019-09-13-state-processor-api.md 
b/_posts/2019-09-13-state-processor-api.md
new file mode 100644
index 0000000..717f1b4
--- /dev/null
+++ b/_posts/2019-09-13-state-processor-api.md
@@ -0,0 +1,64 @@
+---
+layout: post
+title: "The State Processor API: How to Read, write and modify the state of 
Flink applications"
+date: 2019-09-13T12:00:00.000Z
+category: feature
+authors:
+- Seth:
+  name: "Seth Wiesman"
+  twitter: "sjwiesman"
+
+- Fabian:
+  name: "Fabian Hueske"
+  twitter: "fhueske"
+
+excerpt: This post explores the State Processor API, introduced with Flink 1.9.0: why this feature is a big step for Flink, what you can use it for, and how to use it. It also discusses some future directions that align the feature with Apache Flink's evolution into a system for unified batch and stream processing.
+---
+
+Whether you are running Apache Flink<sup>Ⓡ</sup> in production or have evaluated Flink as a computation framework in the past, you've probably found yourself asking the question: How can I access, write, or update state in a Flink savepoint? Ask no more! [Apache Flink 1.9.0](https://flink.apache.org/news/2019/08/22/release-1.9.0.html) introduces the [State Processor API](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/libs/state_processor_api.html), a powerful extension of the [...]
+ 
+In this post, we explain why this feature is a big step for Flink, what you 
can use it for, and how to use it. Finally, we will discuss the future of the 
State Processor API and how it aligns with our plans to evolve Flink into a 
system for [unified batch and stream 
processing](https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html).
+
+## Stateful Stream Processing with Apache Flink until Flink 1.9
+
+All non-trivial stream processing applications are stateful, and most of them are designed to run for months or years. Over time, many of them accumulate a lot of valuable state that can be very expensive or even impossible to rebuild if it is lost due to a failure. In order to guarantee the consistency and durability of application state, Flink has featured a sophisticated checkpointing and recovery mechanism from very early on. With every release, the Flink community has added more and mo [...]
+
+However, a feature that was commonly requested by Flink users was the ability 
to access the state of an application “from the outside”. This request was 
motivated by the need to validate or debug the state of an application, to 
migrate the state of an application to another application, to evolve an 
application from the Heap State Backend to the RocksDB State Backend, or to 
import the initial state of an application from an external system like a 
relational database.
+
+Despite all those convincing reasons to expose application state externally, 
your access options have been fairly limited until now. Flink's Queryable State 
feature only supports key-lookups (point queries) and does not guarantee the 
consistency of returned values (the value of a key might be different before 
and after an application recovered from a failure). Moreover, queryable state 
cannot be used to add or modify the state of an application. Also, savepoints, 
which are consistent sna [...]
+
+## Reading and Writing Application State with the State Processor API
+
+The State Processor API that comes with Flink 1.9 is a true game-changer in 
how you can work with application state! In a nutshell, it extends the DataSet 
API with Input and OutputFormats to read and write savepoint or checkpoint 
data. Due to the [interoperability of DataSet and Table 
API](https://ci.apache.org/projects/flink/flink-docs-master/dev/table/common.html#integration-with-datastream-and-dataset-api),
 you can even use relational Table API or SQL queries to analyze and process st 
[...]
+
+For example, you can take a savepoint of a running stream processing application and analyze it with a DataSet batch program to verify that the application behaves correctly. Or you can read a batch of data from any store, preprocess it, and write the result to a savepoint that you use to bootstrap the state of a streaming application. It is now also possible to fix inconsistent state entries. Finally, the State Processor API opens up many ways to evolve a stateful application that were p [...]
+
+## Mapping Application State to DataSets
+
+The State Processor API maps the state of a streaming application to one or 
more data sets that can be separately processed. In order to be able to use the 
API, you need to understand how this mapping works.
+ 
+But let's first have a look at what a stateful Flink job looks like. A Flink 
job is composed of operators, typically one or more source operators, a few 
operators for the actual processing, and one or more sink operators. Each 
operator runs in parallel in one or more tasks and can work with different 
types of state. An operator can have zero, one, or more *“operator states”* 
which are organized as lists that are scoped to the operator's tasks. If the 
operator is applied on a keyed stream [...]
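To make the distinction between the two kinds of state concrete, here is a minimal, self-contained sketch in plain Java. It does not use Flink's actual runtime classes; all names and state values are illustrative only.

```java
import java.util.*;

// Hypothetical, simplified model of the two kinds of state described above.
public class StateModel {

    // "Operator state" is organized as lists scoped to the operator's
    // parallel tasks; on restore, entries may be redistributed across tasks.
    static List<List<String>> operatorState(int parallelism) {
        List<List<String>> perTaskLists = new ArrayList<>();
        for (int task = 0; task < parallelism; task++) {
            perTaskLists.add(new ArrayList<>());
        }
        return perTaskLists;
    }

    // "Keyed state" is scoped to a key extracted from each record; every
    // distinct key owns exactly one value per registered state.
    static Map<String, Long> keyedState() {
        return new HashMap<>();
    }

    public static void main(String[] args) {
        List<List<String>> os1 = operatorState(2);
        os1.get(0).add("partition-offset-0");
        os1.get(1).add("partition-offset-1");

        Map<String, Long> ks1 = keyedState();
        ks1.put("user-a", 42L);
        ks1.put("user-b", 7L);

        System.out.println(os1);
        System.out.println(ks1);
    }
}
```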
+ 
+The following figure shows the application “MyApp”, which consists of three operators called “Src”, “Proc”, and “Snk”. Src has one operator state (os1), Proc has one operator state (os2) and two keyed states (ks1, ks2), and Snk is stateless.
+
+<p style="display: block; text-align: center; margin-top: 20px; margin-bottom: 
20px">
+       <img src="{{ site.baseurl 
}}/img/blog/2019-09-13-state-processor-api-blog/application-my-app-state-processor-api.png"
 width="600px" alt="Application: My App"/>
+</p>
+
+A savepoint or checkpoint of MyApp consists of the data of all states, 
organized in a way that the states of each task can be restored. When 
processing the data of a savepoint (or checkpoint) with a batch job, we need a 
mental model that maps the data of the individual tasks' states into data sets 
or tables. In fact, we can think of a savepoint as a database. Every operator 
(identified by its UID) represents a namespace. Each operator state of an 
operator is mapped to a dedicated table i [...]
+
+<p style="display: block; text-align: center; margin-top: 20px; margin-bottom: 
20px">
+       <img src="{{ site.baseurl 
}}/img/blog/2019-09-13-state-processor-api-blog/database-my-app-state-processor-api.png"
 width="600px" alt="Database: My App"/>
+</p>
+
+The figure shows how the values of Src's operator state are mapped to a table with one column and five rows, one row for each list entry across all parallel tasks of Src. Operator state os2 of the operator “Proc” is similarly mapped to an individual table. The keyed states ks1 and ks2 are combined into a single table with three columns, one for the key, one for ks1, and one for ks2. The keyed table holds one row for each distinct key of both keyed states. Since the operator “Snk” does not  [...]
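The combination of two keyed states into a single keyed table can be sketched as follows in plain Java (hypothetical names, not Flink code). A key that is missing from one of the states yields a null column, much like an outer join in a database.

```java
import java.util.*;

// Sketch: combine two keyed states (ks1, ks2) of the same operator into a
// single three-column table with one row per distinct key of either state.
public class KeyedStateTable {

    static List<String[]> combine(Map<String, String> ks1, Map<String, String> ks2) {
        // Collect every key that appears in at least one of the states.
        SortedSet<String> keys = new TreeSet<>();
        keys.addAll(ks1.keySet());
        keys.addAll(ks2.keySet());

        List<String[]> rows = new ArrayList<>();
        for (String key : keys) {
            // Missing values become null, like an outer join.
            rows.add(new String[] {key, ks1.get(key), ks2.get(key)});
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> ks1 = Map.of("k1", "a", "k2", "b");
        Map<String, String> ks2 = Map.of("k2", "x", "k3", "y");
        for (String[] row : combine(ks1, ks2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```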
+
+The State Processor API now offers methods to create, load, and write a savepoint. You can read a DataSet from a loaded savepoint or convert a DataSet into state and add it to a savepoint. DataSets can be processed with the full feature set of the DataSet API. With these building blocks, all of the aforementioned use cases (and more) can be addressed. Please have a look at the [documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/libs/state_processor_api.htm [...]
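Based on the Flink 1.9 documentation, the read and write paths look roughly like the sketch below. It requires the `flink-state-processor-api` dependency on the classpath; the savepoint paths, uids, state names, and the `Account`/`AccountBootstrapper` types are placeholders, so treat this as an illustration of the API shape rather than a complete program.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;

ExecutionEnvironment bEnv = ExecutionEnvironment.getExecutionEnvironment();

// Reading: load an existing savepoint and read one operator state
// (here, a list state registered under the name "os1" by the streaming job).
ExistingSavepoint existing =
    Savepoint.load(bEnv, "hdfs:///path/to/savepoint", new MemoryStateBackend());
DataSet<Integer> os1 = existing.readListState("src-uid", "os1", Types.INT);

// Writing: bootstrap keyed state from a DataSet and write a new savepoint.
// AccountBootstrapper would extend KeyedStateBootstrapFunction and populate
// the keyed state for each record.
DataSet<Account> accounts = bEnv.fromElements(new Account(1, 100.0));
BootstrapTransformation<Account> transformation = OperatorTransformation
    .bootstrapWith(accounts)
    .keyBy(account -> account.id)
    .transform(new AccountBootstrapper());

Savepoint
    .create(new MemoryStateBackend(), 128) // 128 = max parallelism of the new savepoint
    .withOperator("proc-uid", transformation)
    .write("hdfs:///path/to/new-savepoint");
```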
+
+## Why DataSet API?
+
+If you are familiar with [Flink's roadmap](https://flink.apache.org/roadmap.html), you might be surprised that the State Processor API is based on the DataSet API. The Flink community plans to extend the DataStream API with the concept of *BoundedStreams* and deprecate the DataSet API. When designing this feature, we also evaluated the DataStream API and the Table API, but neither could provide the right feature set yet. Since we didn't want to block this feature on the progress of Flink' [...]
+
+## Summary
+
+Flink users have long requested a feature to access and modify the state of streaming applications from the outside. With the State Processor API, Flink 1.9.0 finally exposes application state as a data format that can be manipulated. This feature opens up many new possibilities for how users can maintain and manage Flink streaming applications, including arbitrary evolution of stream applications and exporting and bootstrapping of application state. To put it concisely, the S [...]
diff --git 
a/img/blog/2019-09-13-state-processor-api-blog/application-my-app-state-processor-api.png
 
b/img/blog/2019-09-13-state-processor-api-blog/application-my-app-state-processor-api.png
new file mode 100644
index 0000000..ca6ef6c
Binary files /dev/null and 
b/img/blog/2019-09-13-state-processor-api-blog/application-my-app-state-processor-api.png
 differ
diff --git 
a/img/blog/2019-09-13-state-processor-api-blog/database-my-app-state-processor-api.png
 
b/img/blog/2019-09-13-state-processor-api-blog/database-my-app-state-processor-api.png
new file mode 100644
index 0000000..31642fb
Binary files /dev/null and 
b/img/blog/2019-09-13-state-processor-api-blog/database-my-app-state-processor-api.png
 differ
