gyfora commented on code in PR #726: URL: https://github.com/apache/flink-web/pull/726#discussion_r1535467703
########## docs/content/posts/2024-03-21-release-kubernetes-operator-1.8.0.md: ########## @@ -0,0 +1,159 @@ +--- +title: "Apache Flink Kubernetes Operator 1.8.0 Release Announcement" +date: "2024-03-21T18:00:00.000Z" +authors: +- mxm: + name: "Maximilian Michels" + twitter: "stadtlegende" +- gyfora: + name: "Gyula Fora" + twitter: "GyulaFora" +- 1996fanrui: + name: "Rui Fan" + twitter: "1996fanrui" +aliases: +- /news/2024/03/21/release-kubernetes-operator-1.8.0.html +--- + +The Apache Flink community is excited to announce the release of Flink Kubernetes Operator 1.8.0! + +The release includes many improvements to the operator core, the autoscaler, and introduces new features +like TaskManager memory auto-tuning. + +We encourage you to [download the release](https://flink.apache.org/downloads.html) and share your experience with the +community through the Flink [mailing lists](https://flink.apache.org/community.html#mailing-lists) or +[JIRA](https://issues.apache.org/jira/browse/flink)! We're looking forward to your feedback! + +## Highlights + +### Flink Autotuning Review Comment: Should we call this `Flink Memory Autotuning` ? ########## docs/content/posts/2024-03-21-release-kubernetes-operator-1.8.0.md: ########## @@ -0,0 +1,159 @@ +--- +title: "Apache Flink Kubernetes Operator 1.8.0 Release Announcement" +date: "2024-03-21T18:00:00.000Z" +authors: +- mxm: + name: "Maximilian Michels" + twitter: "stadtlegende" +- gyfora: + name: "Gyula Fora" + twitter: "GyulaFora" +- 1996fanrui: + name: "Rui Fan" + twitter: "1996fanrui" +aliases: +- /news/2024/03/21/release-kubernetes-operator-1.8.0.html +--- + +The Apache Flink community is excited to announce the release of Flink Kubernetes Operator 1.8.0! + +The release includes many improvements to the operator core, the autoscaler, and introduces new features +like TaskManager memory auto-tuning. + +We encourage you to [download the release](https://flink.apache.org/downloads.html) and share your experience with the +community through the Flink [mailing lists](https://flink.apache.org/community.html#mailing-lists) or +[JIRA](https://issues.apache.org/jira/browse/flink)! We're looking forward to your feedback! + +## Highlights + +### Flink Autotuning + +We're excited to announce our latest addition to the autoscaling module: Flink Autotuning. + +Flink Autotuning complements Flink Autoscaling by auto-adjusting critical setttings of the Flink configuration. +For this release, we support auto-configuring Flink memory which is a huge source of pain for users. Flink has +various memory pools (e.g. heap memory, network memory, state backend memory, JVM metaspace) which all need to be +assigned fractions of the available memory upfront in order for a Flink job to run properly. + +Assigning too little memory results in pipeline failures, which is why most users end up assigning way too much memory. +Based on our experience, we've seen that heap memory is at least 50% over-provisioned, even after using Flink Autoscaling. +The reason is that Flink Autoscaling is primarily CPU-driven to optimize pipeline throughput, but doesn't change the +ratio between CPU/Memory on the containers. + +Resource savings are nice to have, but the real power of Flink Autotuning is the reduced time to production. + +With Flink Autoscaling and Flink Autotuning, all users need to do is set a max memory size for the TaskManagers, just +like they would normally configure TaskManager memory. Flink Autotuning then automatically adjusts the various memory +pools and brings down the total container memory size. It does that by observing the actual max memory usage on the +TaskMangers or by calculating the exact number of network buffers required for the job topology. The adjustments are +made together with Flink Autoscaling, so there is no extra downtime involved. + +Flink Autotuning can be enabled by setting: + +``` +# Autoscaling needs to be enabled +job.autoscaler.enabled: true +# Turn on Autotuning +job.autoscaler.memory.tuning.enabled: true +``` + +In the future, we are planning to auto-tune more aspects of the Flink configuration, e.g. the number of task slots. +Another room for improvement is how managed memory is configured. If none is used, it will be set to zero. If managed +memory is used, it will be kept constant. We also added an option to add all saved memory to the managed memory. This +is beneficial when running with RocksDB to maximize performance. + +### Improved Accuracy of Autoscaling Metrics + +So far, Flink Autoscaling relied on sampling scaling metrics within the current metric window. The resulting accuracy +depended on the number of samples and the sampling interval. For this release, whenever possible, we use Flink's +accumulated metrics which provide cumulative counters of metrics like records processed or time spent processing. +This allows us to derive the exact metric value for the window. + +For example, to calculate the average records processed per time unite, we measure the accumulated number of records +processed once at the start of the metric window, e.g. 1000 records. Then we measure a second time when the metric +window closes, e.g. 1500. By subtracting the former from the latter, we can calculate the exact amount of records +processed: 1500-1000 = 500. We can then divide by the metric window duration to get the average number of records +processed. + +### Rescale time estimation + +We now measure the actual required restart time for applying autoscaling decisions. Previously, users had to manually +configure the estimated maximum restart time via `job.autoscaler.restart.time`. If the new feature is enabled, this +setting is now only used for the first scaling. After the first scaling, the actual restart time has been observed +and will be taken into account for future scalings. + +This feature can be enabled via: + +``` +job.autoscaler.restart.time-tracking.enabled: true +``` + +For the next release we are thinking to enable it by default. + +### Autoscaling for Session Cluster Jobs + +Autoscaling used to be an application / job cluster only feature. Now it is also supported for session clusters. + +### Savepoint Trigger Nonce + +A common request is to support a streamlined, user-friendly way of redeploying from a target savepoint. Previously this +was only possible by deleting the CR and recreating it with initialSavepointPath. A big downside of this approach is a +loss of savepoint/checkpoint history in the status that some platforms may need, resulting in non-cleaned up savepoints. + +We introduced a `savepointRedeployNonce` field in the job spec similar to other action trigger nonces. + +If the nonce changes to a new non-null value the job will be redeployed from the path specified in the +initialSavepointPath (or empty state If the path is empty). + +### Cluster shutdown and resource cleanup improvements: Review Comment: Remove `:` ########## docs/content/posts/2024-03-21-release-kubernetes-operator-1.8.0.md: ########## @@ -0,0 +1,159 @@ +--- +title: "Apache Flink Kubernetes Operator 1.8.0 Release Announcement" +date: "2024-03-21T18:00:00.000Z" +authors: +- mxm: + name: "Maximilian Michels" + twitter: "stadtlegende" +- gyfora: + name: "Gyula Fora" + twitter: "GyulaFora" +- 1996fanrui: + name: "Rui Fan" + twitter: "1996fanrui" +aliases: +- /news/2024/03/21/release-kubernetes-operator-1.8.0.html +--- + +The Apache Flink community is excited to announce the release of Flink Kubernetes Operator 1.8.0! + +The release includes many improvements to the operator core, the autoscaler, and introduces new features +like TaskManager memory auto-tuning. + +We encourage you to [download the release](https://flink.apache.org/downloads.html) and share your experience with the +community through the Flink [mailing lists](https://flink.apache.org/community.html#mailing-lists) or +[JIRA](https://issues.apache.org/jira/browse/flink)! We're looking forward to your feedback! + +## Highlights + +### Flink Autotuning Review Comment: I see you have reference of other config tuning in the future, it is also okay to keep it like this :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org