[
https://issues.apache.org/jira/browse/FLINK-23403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381938#comment-17381938
]
Till Rohrmann commented on FLINK-23403:
---------------------------------------
Thanks a lot for this input [~fly_in_gis]. I actually intend to start a public
discussion about the default values to gather more feedback because
practitioners' experience is super important for finding good default values.
Are you adjusting the timeout and interval settings for your deployment, or do
you find that 50s/10s works well in your setup?
Which Java version and garbage collector are you using when you experience
full GCs that take longer than 10s?
If your network is under high load, do you also see data connections between
{{TaskExecutors}} getting disconnected and, as a result, task restarts? Or is
it simply that messages get delivered very slowly? How does Flink behave in
this situation given that the default {{akka.ask.timeout}} is set to
{{10 s}}? I would assume that all kinds of RPCs would fail and that this would
cause a job restart.
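For reference, a minimal sketch of how the values under discussion could be
set in {{flink-conf.yaml}}, assuming {{heartbeat.interval}} and
{{heartbeat.timeout}} are given in milliseconds. The numbers are the proposed
15s/3s and are only an illustration, not a recommendation:

```yaml
# Interval at which heartbeat requests are sent (current default: 10000 ms)
heartbeat.interval: 3000
# Time without a heartbeat after which the peer is considered failed
# (current default: 50000 ms)
heartbeat.timeout: 15000
# Default RPC ask timeout mentioned above, left unchanged
akka.ask.timeout: 10 s
```

With these values the ask timeout (10 s) would still be shorter than the
heartbeat timeout (15 s), so slow RPCs would presumably still fail before a
heartbeat timeout fires.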
> Decrease default values for heartbeat timeout and interval
> ----------------------------------------------------------
>
> Key: FLINK-23403
> URL: https://issues.apache.org/jira/browse/FLINK-23403
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Configuration, Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.14.0
>
>
> In order to speed up failure detection, I suggest decreasing the default
> values for the heartbeat timeout and interval from 50s/10s to 15s/3s.