dawidwys commented on a change in pull request #436: URL: https://github.com/apache/flink-web/pull/436#discussion_r619209440
########## File path: _posts/2021-04-22-release-1.13.0.md ##########
@@ -0,0 +1,374 @@

---
layout: post
title: "Apache Flink 1.13.0 Release Announcement"
date: 2021-04-22T08:00:00.000Z
categories: news
authors:
- stephan:
  name: "Stephan Ewen"
  twitter: "StephanEwen"
- dwysakowicz:
  name: "Dawid Wysakowicz"
  twitter: "dwysakowicz"

excerpt: The Apache Flink community is excited to announce the release of Flink 1.13.0! Close to xxx contributors worked on over xxx threads to bring significant improvements to usability and observability, as well as new features that improve the elasticity of Flink's application-style deployments.
---

The Apache Flink community is excited to announce the release of Flink 1.13.0! Close to xxx contributors worked on over xxx threads to bring significant improvements to usability and observability, as well as new features that improve the elasticity of Flink's application-style deployments.

This release brings us a big step forward in one of our major efforts: making stream processing applications as natural and as simple to manage as any other application. The new reactive scaling mode means that scaling streaming applications in and out now works like in any other application, by just changing the number of parallel processes.

We also added a series of improvements that help users better understand the performance of applications. When the streams don't flow as fast as you'd hope, these can help you understand why: load and backpressure visualization to identify bottlenecks, CPU flame graphs to identify hot code paths in your application, and state access latencies to see how the state backends are keeping up.

This blog post describes all major new features and improvements, important changes to be aware of, and what to expect moving forward.

{% toc %}

We encourage you to [download the release](https://flink.apache.org/downloads.html) and share your feedback with the community through the [Flink mailing lists](https://flink.apache.org/community.html#mailing-lists) or [JIRA](https://issues.apache.org/jira/projects/FLINK/summary).

## Notable Features and Improvements

### Reactive Mode

Reactive Mode is the latest piece in Flink's initiative for making stream processing applications as natural and as simple to manage as any other application.

Flink has a dual nature when it comes to resource management and deployments: you can deploy clusters onto resource managers like Kubernetes or YARN in such a way that Flink actively manages the resources, allocating and releasing workers as needed. That is especially useful for jobs and applications that rapidly change their resource requirements, like batch applications and ad-hoc SQL queries. The application parallelism rules; the number of workers follows. We call this *active* scaling.

For long-running streaming applications, it is often a nicer model to just deploy them like any other long-running application: the application doesn't really need to know that it runs on Kubernetes, EKS, YARN, etc., and doesn't try to acquire a specific number of workers; instead, it just uses the number of workers given to it. The number of workers rules; the application parallelism adjusts to that. We call this *reactive* scaling.

The Application Deployment Mode started this effort by making deployments application-like (avoiding two separate deployment steps: (1) start a cluster and (2) submit an application). The reactive scheduler completes it: you no longer need extra tools (scripts or a Kubernetes operator) to keep the number of workers and the application parallelism settings in sync.

You can now put an auto-scaler around a Flink application like around any other typical application, as long as you are mindful when configuring the auto-scaler that stateful applications still spend effort moving state around when scaling.
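To make this concrete, here is a minimal sketch of how this plays out on a standalone cluster (the job class name is a placeholder; the scripts are the ones shipped in Flink's standard distribution): Reactive Mode is enabled through the `scheduler-mode` setting, and rescaling then simply means adding or removing TaskManagers.

```
# flink-conf.yaml: let the job's parallelism follow the available workers
scheduler-mode: reactive

# start the application cluster (the job must be on the classpath)
./bin/standalone-job.sh start --job-classname com.example.MyStreamingJob
./bin/taskmanager.sh start

# scale out: just start another TaskManager; the job rescales automatically
./bin/taskmanager.sh start
```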
### Bottleneck Detection, Backpressure and Idleness Monitoring

One of the most important metrics to investigate when a job does not consume records as fast as you would expect is the backpressure ratio. It lets you track down bottlenecks in your pipelines. The previous mechanism had two limitations:

* It was heavyweight, because it worked by repeatedly taking stack trace samples of your running tasks.
* It was difficult to find out which vertex was the source of backpressure.

In Flink 1.13, we reworked the mechanism to include new metrics for the time tasks spend being backpressured, along with a reworked graphical representation of the job (including the percentage of time particular vertices are backpressured).

<figure style="align-content: center">
  <img src="{{ site.baseurl }}/img/blog/2021-04-xx-release-1.13.0/bottleneck.png" style="width: 900px"/>
</figure>

### Support for CPU Flame Graphs in the Web UI

It is desirable to provide better visibility into the distribution of CPU resources while executing user code. One of the most visually effective means to do that are flame graphs. They make it easy to answer questions like: Which methods are currently consuming CPU resources? How does the consumption of one method compare to the others? Which series of calls on the stack led to executing a particular method?

Flame graphs are constructed by sampling stack traces a number of times. Every method call is represented by a bar, where the length of the bar is proportional to the number of times it is present in the samples. To prevent unintended impact on production environments, flame graphs are currently an opt-in feature that needs to be enabled in the configuration. Once enabled, they are accessible via a new component in the UI at the level of the selected operator:

<figure style="align-content: center">
  <img src="{{ site.baseurl }}/img/blog/2021-04-xx-release-1.13.0/7.png" style="display: block; margin-left: auto; margin-right: auto; width: 600px"/>
</figure>

### Access Latency Metrics for State

State interactions are a crucial part of most data pipelines. Especially when using RocksDB, they can be rather I/O intensive and therefore play an important role in the overall performance of a pipeline. It is thus important to get insight into what is going on under the hood, so we now expose latency tracking metrics.

The metrics are disabled by default, but you can enable them using the `state.backend.rocksdb.latency-track-enabled` option.

### Unified Binary Savepoint Format

All available state backends now produce a single, common binary format for their savepoints. This means that savepoints are mutually interchangeable: you are no longer locked into the state backend you chose when starting your application for the first time. This makes it easier to start with the heap backend and switch to RocksDB later on, if the JVM heap becomes too full (which you usually notice when the GC times start to go up too much).
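As an illustrative sketch of what that switch looks like (a fragment, not a full program; the class names follow the unified state backend API introduced in 1.13), a job that was started with the heap backend can be resubmitted from its savepoint with RocksDB configured instead:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// The savepoint was taken while running with the default heap (HashMap)
// state backend; thanks to the unified format it can be restored with RocksDB.
env.setStateBackend(new EmbeddedRocksDBStateBackend());

// ...build the same pipeline, then resubmit from the savepoint, e.g.:
// ./bin/flink run -s <savepoint-path> my-application.jar
```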
### Support for User-Specified Pod Templates in Active Kubernetes Deployments

The native Kubernetes deployment received an important update: it now supports custom pod templates. Flink now allows users to define the JobManager and TaskManager pods via template files, which makes it possible to use advanced features that are not directly exposed through Flink's Kubernetes config options.

### Major Observability Improvements

What runs on Flink are often critical workloads with SLAs, so it is important to have the right tools to understand what is happening inside the applications.

If your application does not progress as expected, or the latency is higher or the throughput lower than you would expect, these features help you figure out what is going on.

### Unaligned Checkpoints - Production Ready

Unaligned Checkpoints now support rescaling and adaptive triggering (checkpoints only become unaligned once backpressure builds up). This makes it possible to activate Unaligned Checkpoints by default without imposing any overhead on non-backpressured jobs and without giving up the possibility to scale jobs up or down based on a checkpoint.

### Machine Learning Library Moving to a Separate Repository

Flink's ML development has been slowed down by the tight coupling of the ML efforts with the core system.

ML will now follow the same approach as Stateful Functions and be developed as a separate library with independent release cycles, but still under the umbrella of the Flink project.

---

## Table API/SQL

### Table-Valued Functions for Windows

Defining time windows is one of the most frequent operations in streaming SQL queries. We updated the Flink SQL window syntax and semantics to be closer to the SQL standard and in sync with other vendors in the industry, by expressing windows as a kind of table-valued function.

TUMBLE and HOP windows have been ported to the new syntax; SESSION windows will follow in a subsequent release. A new CUMULATE window function assigns windows with an expanding step size until the maximum window size is reached:

```sql
SELECT window_time, window_start, window_end, SUM(price) AS total_price
  FROM TABLE(CUMULATE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '2' MINUTES, INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, window_time;
```

By referencing the window start and window end of the table-valued window functions, it is now possible to compute windowed Top-N aggregations next to regular windowed aggregations and windowed joins:

```sql
SELECT window_time, ...
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY window_start, window_end ORDER BY total_price DESC)
             AS rank
    FROM t
  )
WHERE rank <= 100;
```

### Improved Interoperability Between DataStream and Table API/SQL

The Table API is becoming equally important to the DataStream API, which is why good interoperability between the two APIs is crucial (FLIP-136). The core `Row` class has received a major update with new `toString`/`hashCode`/`equals` methods. Fields can be accessed not only by position but also by name, with support for sparse representations. The new methods `StreamTableEnvironment.toDataStream`/`fromDataStream` can model the DataStream API as a regular table source or sink, with a smooth type system and event-time integration.

```java
Table table = tableEnv.fromDataStream(
    dataStream,
    Schema.newBuilder()
        .columnByMetadata("rowtime", "TIMESTAMP(3)")
        .watermark("rowtime", "SOURCE_WATERMARK()")
        .build());

DataStream<Row> dataStream = tableEnv.toDataStream(table)
    .keyBy(r -> r.getField("user"))
    .window(...);
```
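FLIP-136 also covers changelog semantics in these conversions. As a small illustrative sketch (variable names are made up), a `DataStream<Row>` whose records carry a `RowKind` can be interpreted as an updating table, and an updating table can be turned back into such a stream:

```java
// Interpret a stream of Rows with RowKinds (INSERT, UPDATE_BEFORE,
// UPDATE_AFTER, DELETE) as an updating table:
Table updates = tableEnv.fromChangelogStream(changelogStream);

// Convert an updating table (e.g. the result of a non-windowed aggregation)
// back into a changelog stream:
DataStream<Row> changelog = tableEnv.toChangelogStream(updates);
```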
### Improved SQL Client

During the past releases, the SQL Client's feature set could not keep up with the evolution of the `TableEnvironment` in the Table API. In this release, the SQL Client has been updated for feature parity with the programmatic APIs.

*Easier Configuration and Code Sharing*

Support for YAML files to configure the SQL Client will be discontinued. Instead, the client accepts one or more initialization scripts to configure an empty session. Additionally, a user can declare a main SQL script for executing multiple statements in one file.

```
./sql-client.sh -i init1.sql init2.sql -f sqljob.sql
```

*More Commands and Support for Statement Sets*

For multi-statement execution, commands such as ADD JAR and STATEMENT SET have been added, along with improved SET/RESET commands with new config options. The following example shows how a full SQL script can be used as a command-line tool to submit Flink jobs in a unified way for both batch and streaming use cases:

```sql
-- set up a catalog
CREATE CATALOG hive_catalog WITH (
  'type' = 'hive'
);
USE CATALOG hive_catalog;

-- or use temporary objects
CREATE TEMPORARY TABLE users (
  user_id BIGINT,
  user_name STRING,
  user_level STRING,
  region STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'users',
  'properties.bootstrap.servers' = '...',
  'key.format' = 'csv',
  'value.format' = 'avro'
);

-- set the execution mode for jobs
SET execution.runtime-mode=streaming;

-- set the sync/async mode for INSERT INTOs
SET table.dml-sync=true;

-- set the job's parallelism
SET parallelism.default=100;

-- restore state from the specific savepoint path
SET execution.savepoint.path=/tmp/flink-savepoints/savepoint-bb0dab;

INSERT INTO user_stats
SELECT user_id, COUNT(*)
FROM users
GROUP BY user_id;
```

---

## PyFlink

### State Access in the DataStream API

Stateful stream processing has always been a major differentiator between Apache Flink and other data processing systems. While many operations in a dataflow simply look at one individual event at a time (for example an event parser), some operations remember information across multiple events (for example window operators). These operations are called stateful.

The 1.12.0 release reintroduced a rearchitected Python DataStream API, but it is the state access added in this release that fully opens up the potential of Apache Flink to Python users.

### Row-Based Operations in the Table API

Every Flink release brings the Python Table API closer to full feature parity with the Java API. In this release, support for row-based operations was added.

## Other Improvements

**Exception histories in the Web UI**

From now on, the Flink Web UI will present up to n of the most recent exceptions that caused a job to fail, not just the last one.

**Better exception / failure-cause reporting for unsuccessful checkpoints**

Starting from this release, Flink provides statistics for checkpoints that failed or were aborted. Prior versions reported metrics such as the size of persisted data or the trigger time only when a checkpoint succeeded. This change makes it easier to figure out the reason for a failure without looking into the logs.

**Exactly-once JDBC sink**

From 1.13, the JDBC sink can guarantee exactly-once delivery for XA-compliant databases. That means no data loss or duplication can occur in case of a failover. Double processing (by Flink operators) can still occur, but the results will only be delivered once.

The feature is available to DataStream programs via the new `JdbcSink.exactlyOnceSink` method (and via `JdbcXaSinkFunction` directly). The target database must be XA-compatible.
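A rough sketch of the wiring (the SQL statement, the record type, and the PostgreSQL `XADataSource` are illustrative placeholders, and the exact option builders may differ slightly):

```java
env.fromElements(Tuple2.of(1, "Bob"), Tuple2.of(2, "Alice"))
    .addSink(JdbcSink.exactlyOnceSink(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        (ps, user) -> {                       // fill the prepared statement
            ps.setInt(1, user.f0);
            ps.setString(2, user.f1);
        },
        JdbcExecutionOptions.builder().build(),
        JdbcExactlyOnceOptions.defaults(),
        () -> {                               // supply an XADataSource
            PGXADataSource ds = new PGXADataSource();
            ds.setUrl("jdbc:postgresql://localhost:5432/mydb");
            return ds;
        }));
```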
**Improved consistency of behavior between different SQL time functions**

**Hive Query syntax compatibility**

**General User-Defined Aggregate Function Support in PyFlink Table API**

Review comment:
@dianfu @HuangXingBo Could you help me fill this in? Also a comment from @morsapaes:

> Just a note to say that this was announced in Flink 1.12, so maybe it's important to mention what has improved from the initial implementation...? https://flink.apache.org/news/2020/12/10/release-1.12.0.html#other-improvements-to-pyflink