pnowojski commented on a change in pull request #436: URL: https://github.com/apache/flink-web/pull/436#discussion_r620031127
##########
File path: _posts/2021-04-22-release-1.13.0.md
##########
@@ -0,0 +1,374 @@
---
layout: post
title: "Apache Flink 1.13.0 Release Announcement"
date: 2021-04-22T08:00:00.000Z
categories: news
authors:
- stephan:
  name: "Stephan Ewen"
  twitter: "StephanEwen"
- dwysakowicz:
  name: "Dawid Wysakowicz"
  twitter: "dwysakowicz"

excerpt: The Apache Flink community is excited to announce the release of Flink 1.13.0! Close to xxx contributors worked on over xxx threads to bring significant improvements to usability and observability as well as new features that improve the elasticity of Flink's Application-style deployments.
---

The Apache Flink community is excited to announce the release of Flink 1.13.0! Close to xxx contributors worked on over xxx threads to bring significant improvements to usability and observability as well as new features that improve the elasticity of Flink's Application-style deployments.

This release is a big step forward in one of our major efforts: making Stream Processing Applications as natural and as simple to manage as any other application. The new reactive scaling mode means that scaling streaming applications in and out now works like for any other application: just change the number of parallel processes.

We also added a series of improvements that help users better understand the performance of their applications. When the streams don't flow as fast as you'd hope, these can help you understand why: load and backpressure visualization to identify bottlenecks, CPU flame graphs to identify hot code paths in your application, and state access latencies to see how the state backends are keeping up.

This blog post describes all major new features and improvements, important changes to be aware of, and what to expect moving forward.

{% toc %}

We encourage you to [download the release](https://flink.apache.org/downloads.html) and share your feedback with the community through the [Flink mailing lists](https://flink.apache.org/community.html#mailing-lists) or [JIRA](https://issues.apache.org/jira/projects/FLINK/summary).

## Notable Features and Improvements

### Reactive Mode

Reactive Mode is the latest piece in Flink's initiative to make Stream Processing Applications as natural and as simple to manage as any other application.

Flink has a dual nature when it comes to resource management and deployments: you can deploy clusters onto resource managers like Kubernetes or YARN in such a way that Flink actively manages the resources, allocating and releasing workers as needed. That is especially useful for jobs and applications that rapidly change their required resources, like batch applications and ad-hoc SQL queries. The application's parallelism rules; the number of workers follows. We call this active scaling.

For long-running streaming applications, it is often a nicer model to just deploy them like any other long-running application: the application doesn't really need to know that it runs on K8s, EKS, YARN, etc., and doesn't try to acquire a specific number of workers; instead, it just uses the number of workers given to it. The number of workers rules; the application's parallelism adjusts to that. We call this reactive scaling.

The Application Deployment Mode started this effort by making deployments application-like (avoiding two separate deployment steps: (1) start a cluster and (2) submit an application). The reactive scheduler completes it: you no longer need extra tools (scripts, or a K8s operator) to keep the number of workers and the application's parallelism settings in sync.

You can now put an auto-scaler around a Flink application just like around other typical applications, as long as you are mindful when configuring the auto-scaler that stateful applications still spend effort moving state around when scaling.
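To make the model concrete, here is a minimal, hypothetical sketch (class name and pipeline are made up, not taken from the announcement) of a job written for Reactive Mode: the application code simply avoids pinning a parallelism, while the mode itself is switched on in the cluster configuration of a standalone application cluster via `scheduler-mode: reactive`.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Sketch of a job intended for Reactive Mode: it does not pin any parallelism,
 * so the scheduler is free to scale it to the workers that happen to be available.
 *
 * Reactive Mode itself is enabled on the cluster side, e.g. by putting
 *   scheduler-mode: reactive
 * into flink-conf.yaml of a standalone application cluster (see the Flink 1.13
 * docs for the exact deployment steps); nothing in the job code changes.
 */
public class ReactiveModeSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // No setParallelism() calls anywhere: adding or removing TaskManagers
        // rescales the job without touching the application.
        env.fromSequence(0, 1_000_000)
                .map(i -> i * 2)
                .print();

        env.execute("reactive-mode-sketch");
    }
}
```

Scaling then amounts to adding or removing TaskManagers; the adaptive scheduler picks up the changed set of resources and rescales the running job.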
### Bottleneck Detection, Backpressure and Idleness Monitoring

One of the most important metrics to investigate when a job does not consume records as fast as you would expect is the backpressure ratio. It lets you track down bottlenecks in your pipelines. The previous mechanism had two limitations:

* It was heavy, because it worked by repeatedly taking stack trace samples of the running tasks.
* It was difficult to find out which vertex was the source of the backpressure.

In Flink 1.13, we reworked the mechanism to include new metrics for the time tasks spend being backpressured, along with a reworked graphical representation of the job (including the percentage of time particular vertices are backpressured).

<figure style="align-content: center">
  <img src="{{ site.baseurl }}/img/blog/2021-04-xx-release-1.13.0/bottleneck.png" style="width: 900px"/>
</figure>

### Support for CPU Flame Graphs in the Web UI

It is desirable to provide better visibility into how CPU resources are spent while executing user code. One of the most visually effective means to do that is Flame Graphs. They make it easy to answer questions like:

* Which methods are currently consuming CPU resources?
* How does the consumption of one method compare to the others?
* Which series of calls on the stack led to executing a particular method?

Flame Graphs are constructed by sampling stack traces a number of times. Every method call is represented by a bar, where the length of the bar is proportional to the number of times it is present in the samples. To prevent unintended impacts on production environments, Flame Graphs are currently available as an opt-in feature that needs to be enabled in the configuration. Once enabled, they are accessible via a new component in the UI at the level of the selected operator:

<figure style="align-content: center">
  <img src="{{ site.baseurl }}/img/blog/2021-04-xx-release-1.13.0/7.png" style="display: block; margin-left: auto; margin-right: auto; width: 600px"/>
</figure>

### Access Latency Metrics for State

State interactions are a crucial part of the majority of data pipelines. Especially when using RocksDB, they can be rather I/O intensive and therefore play an important role in the overall performance of a pipeline. It is thus important to be able to get insights into what is going on under the hood. To provide those insights, we now expose latency tracking metrics.

The metrics are disabled by default, but you can enable them using the `state.backend.rocksdb.latency-track-enabled` option.

### Unified Binary Savepoint Format

All available state backends now produce a single, unified binary format for their savepoints. This means that savepoints are mutually interchangeable: you are no longer locked into the state backend you chose when starting your application for the first time. This makes it easier to start with the heap-based backend and switch to RocksDB later on, if the JVM heap becomes too full (which you usually notice when GC times start to go up too much).
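As a small, hypothetical illustration (class name, pipeline, and savepoint path below are placeholders, not from the announcement), resuming a job with a different state backend now only requires configuring the new backend before restarting from the savepoint:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SwitchBackendSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The savepoint we restore from may have been written by the
        // heap-based state backend; thanks to the unified savepoint format
        // the restarted job can use RocksDB instead.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // A tiny keyed pipeline, standing in for the real job topology
        // (which must stay the same across the backend switch).
        env.fromElements(1, 2, 3, 4, 5)
                .keyBy(value -> value % 2)
                .reduce((a, b) -> a + b)
                .print();

        env.execute("restored-with-rocksdb");

        // The savepoint to restore from is given at submission time, e.g.:
        //   ./bin/flink run -s <savepoint-path> -c SwitchBackendSketch job.jar
    }
}
```

The job topology itself has to stay the same; only the backend that stores and snapshots the state changes.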
### Support for User-Specified Pod Templates in Active Kubernetes Deployments

The native Kubernetes deployment received an important update: it now supports custom pod templates. Flink now allows users to define the JobManager and TaskManager pods via template files. This makes it possible to use advanced features that are not directly exposed through Flink's Kubernetes configuration options.

## Major Observability Improvements

The workloads that run on Flink are often critical and come with SLAs, so it is important to have the right tools to understand what is happening inside the applications.

If your application does not progress as expected, or if latency is higher or throughput lower than you would expect, these features help you figure out what is going on.

### Unaligned Checkpoints - Production Ready

Review comment:
```
After fixing two more bugs related to Unaligned Checkpoints that were discovered in Flink 1.12.2, upgrading to Flink 1.12.3 or Flink 1.13.0 is recommended for all users of Unaligned Checkpoints. Combined with support for rescaling from Unaligned Checkpoints, this feature is now considered production-ready.
```
?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org