[DISCUSS] Change the default restart-strategy to exponential-delay

2023-11-15 Post by Rui Fan
Hi dear Flink users and devs,

FLIP-364[1] proposes some improvements to the restart-strategy options.
It also raises two questions: whether some of the default values of
exponential-delay should be updated, and whether exponential-delay can
become the default restart-strategy. After discussing this on the dev
mailing list[2], we hope to collect more feedback from Flink users.

# Why does the default restart-strategy need to be updated?

If checkpointing is enabled, the default restart-strategy is fixed-delay
with Integer.MAX_VALUE restart attempts and a '1 s' delay[3]. This means
that a job that keeps failing will restart at a high frequency,
effectively forever.

For example, when a Kafka cluster fails, a large number of Flink jobs
will restart frequently. After the Kafka cluster recovers, these
high-frequency restarts may overload it and cause it to avalanche again.

We are considering exponential-delay as the default strategy for
a couple of reasons:

- The exponential-delay can reduce the restart frequency when
  a job continues to fail.
- It can restart a job quickly when a job fails occasionally.
- The restart-strategy.exponential-delay.jitter-factor can avoid
  restarting multiple jobs at the same time. It's useful to prevent
  avalanches.
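
To illustrate the jitter point, here is a minimal sketch of how a jitter
factor spreads out restarts. It is a simplified model and not Flink's
actual implementation: the function name jittered_delay and the uniform
distribution are assumptions made for this example.

```python
import random


def jittered_delay(base_delay_s, jitter_factor=0.1, rng=random):
    """Shift base_delay_s by a random offset of at most
    +/- jitter_factor * base_delay_s (hypothetical model, not Flink code)."""
    max_offset = base_delay_s * jitter_factor
    return base_delay_s + rng.uniform(-max_offset, max_offset)


# Ten hypothetical jobs that failed at the same moment with an 8 s backoff.
rng = random.Random(42)  # seeded only to make the example repeatable
delays = [jittered_delay(8.0, 0.1, rng) for _ in range(10)]

# Each delay falls somewhere between 7.2 s and 8.8 s instead of exactly 8 s.
print([round(d, 2) for d in delays])
```

With jitter-factor 0.1 the jobs no longer retry in lockstep, so the
recovering system sees the restarts spread over a window rather than
arriving all at once.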

# What are the current default values[4] of exponential-delay?

restart-strategy.exponential-delay.initial-backoff : 1s
restart-strategy.exponential-delay.backoff-multiplier : 2.0
restart-strategy.exponential-delay.jitter-factor : 0.1
restart-strategy.exponential-delay.max-backoff : 5 min
restart-strategy.exponential-delay.reset-backoff-threshold : 1h

backoff-multiplier=2 means that the delay time of each restart
will be doubled. The delay times are:
1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
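
The sequence above can be reproduced with a short sketch. It ignores
jitter and reset-backoff-threshold, and the function name
backoff_sequence is made up for this example, not a Flink API.

```python
def backoff_sequence(initial_s=1.0, multiplier=2.0, max_s=300.0, attempts=11):
    """Delays in seconds before each consecutive restart.

    Simplified model: no jitter, no reset-backoff-threshold.
    """
    delays = []
    current = initial_s
    for _ in range(attempts):
        delays.append(min(current, max_s))
        # Grow the backoff, but never beyond max-backoff (5 min = 300 s).
        current = min(current * multiplier, max_s)
    return delays


defaults = backoff_sequence()
print(defaults)
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0, 300.0]
```

The tenth consecutive restart already waits the full 5 minutes.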

The delay time increases rapidly, which will affect the recovery
time of Flink jobs.

# Option improvements

We think a backoff-multiplier between 1 and 2 is more sensible,
such as:

restart-strategy.exponential-delay.backoff-multiplier : 1.2
restart-strategy.exponential-delay.max-backoff : 1 min

After updating, the delay times are:

1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
22.186s, 26.623s, 31.948s, 38.337s, etc.

These values achieve the following goals:
- When restarts are infrequent in a short period of time, Flink can
  quickly restart the job. (For example: the delay before the 5th
  restart is 2.073s.)
- When restarts are frequent in a short period of time, Flink can
  slightly reduce the restart frequency to prevent avalanches.
  (For example: the delay before the 10th restart is about 5.2s, and
  the delay before the 20th restart is about 32s, which is not very
  large.)
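
As a rough comparison, the sketch below computes the delay before the
n-th consecutive restart under the current defaults and under the
proposed values. It counts the first restart as n = 1 and ignores
jitter. The helper name delay_before_restart is hypothetical.

```python
def delay_before_restart(n, initial_s=1.0, multiplier=2.0, max_s=300.0):
    """Delay in seconds before the n-th consecutive restart (n starts at 1).

    Simplified model: no jitter, no reset-backoff-threshold.
    """
    return min(initial_s * multiplier ** (n - 1), max_s)


for n in (5, 10, 20):
    current = delay_before_restart(n, multiplier=2.0, max_s=300.0)
    proposed = delay_before_restart(n, multiplier=1.2, max_s=60.0)
    print(f"restart {n:2d}: current {current:6.1f}s, proposed {proposed:5.1f}s")
```

With this indexing the 20th delay under the proposed values is about
31.9s and the 21st is about 38.3s. The 1 min cap is first reached at
the 24th consecutive restart.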

As @Mingliang Liu mentioned on the dev mailing list, one-size-fits-all
default values do not exist. So our goal is to choose default values
that are suitable for most jobs.

Looking forward to your thoughts and feedback, thanks~

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
[3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
[4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

Best,
Rui


[SUMMARY] Flink 1.19 Release Sync 11/14/2023

2023-11-15 Post by Lincoln Lee
Hi devs and users,

Yesterday we held the first release sync for Flink 1.19. I'd like to share
the summary:

- Sync meeting
We switched back to Google Meet because Zoom has account limitations in
some regions, and Google Meet remains available when the creator is not
online.
The meeting will happen every 2 weeks and switch to weekly after the
feature freeze.

- Feature freeze date
Jan 26, 2024

- Features & issues tracking
The community has collected many features on the 1.19 wiki page[1], and
contributors are encouraged to keep the page updated. There are also a
large number of Jira issues[2].
Please be aware that, for all `@Public` APIs that are intended to be
changed / removed in release 2.0, the deprecation work should be completed
in 1.19.
Another important point: since a lot of the work in 1.19 is also related
to the 2.0 release, tagging related issues with the '2.0-related' tag
will make it easier for the 2.0 release managers to track progress.

- Daily work divisions
In general, every release manager will be working on all daily issues. For
some major tasks, to make sure there is always at least one person taking
care of them, they have been assigned to specific release managers[1]. If
you need support in any of these areas, please don't hesitate to contact
us.

- Blockers
  - FLINK-31449 Remove DeclarativeSlotManager related logic: @Xintong will
    track it
  - FLINK-33531 Nightly Python fails: @Dian Fu will look at this
  - FLINK-18356 flink-table-planner Exit code 137 on ci pipeline: @Matthias,
    PR reviewing

- Retrospective of 1.18 release
Thanks to the previous release managers for their efforts and for several
valuable thoughts and suggestions:
  - The release process now has a Jira template, which will make the work
easier for new release managers. The overall steps will still be
documented on the wiki page and continuously updated in upcoming releases.
We'll also be looking at automation to continue to streamline releases.
  - 1.18 went through relatively long release testing. We found that
finding volunteers to join the testing after the RC is ready can be a long
wait. So in 1.19 we will try to find volunteers earlier (we added a new
column, volunteers for testing, on the wiki page[1]) and, before release
testing, ask the feature developers to describe the detailed testing
steps, so that subsequent testing can go faster.
  - The documentation build and flink-docker CI have been migrated to
GHA (GitHub Actions). There is still a lot of work to be done to migrate
the CI pipeline from Azure to GHA[3], and you are welcome to join in
toward our goal of improving the experience of our contributors!

The next release sync will be on November 28th, 2023.

Google Meet: https://meet.google.com/vcx-arzs-trv

[1] https://cwiki.apache.org/confluence/display/FLINK/1.19+Release
[2] https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=592
[3] https://issues.apache.org/jira/browse/FLINK-27075

Best regards,
Yun, Jing, Martijn and Lincoln