Maybe the Flink applications could run more stably if you configure enough
resources (e.g. memory, CPU, ephemeral-storage) for the JobManager and
TaskManager pods.
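
For example, in a plain Kubernetes container spec (or the corresponding
resources section of your operator's FlinkCluster custom resource) this is
just the usual requests/limits block; the values below are only placeholders
and need to be sized to your actual workload:

  resources:
    requests:
      memory: "4Gi"
      cpu: "2"
      ephemeral-storage: "10Gi"
    limits:
      memory: "4Gi"
      cpu: "2"
      ephemeral-storage: "10Gi"

Setting requests equal to limits also gives the pods the Guaranteed QoS
class, which should make them less likely to be evicted under node pressure.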

Best,
Yang

David Morávek <d...@apache.org> wrote on Wed, Jan 5, 2022, at 16:46:

> Hi Tianyi,
>
> this really depends on your Kubernetes setup (e.g. whether autoscaling is
> enabled, or you're using spot / preemptible instances). In general,
> applications that run on Kubernetes need to be resilient to these kinds of
> failures; Flink is no exception.
>
> In case of a failure, Flink needs to restart the job from the latest
> checkpoint to ensure consistency. In this kind of environment, you should
> be OK-ish with replaying one checkpoint interval's worth of data (and
> you're able to adjust the checkpointing interval).
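>
> For example, assuming a flink-conf.yaml based configuration (the exact
> keys depend on your Flink version), the interval can be tuned with
> something like:
>
>   execution.checkpointing.interval: 1min
>   execution.checkpointing.min-pause: 10s
>
> A shorter interval means less data to replay after a restart, at the cost
> of more frequent checkpointing overhead.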
>
> Still, it would be worth looking into why these disruptions happen and
> fixing the cause. Even though you should be able to recover from these
> types of failures, that doesn't mean it's a good thing to do so more often
> than necessary :) I think if you describe the pod / sts you should see the
> k8s events that resulted in the container being terminated.
>
> Also, we're currently working on several efforts to make the restarting
> experience smoother and the checkpointing interval shorter (e.g. FLIP-198 [1],
> FLINK-25277 [2], FLIP-158 [3], ...).
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-198%3A+Working+directory+for+Flink+processes
> [2] https://issues.apache.org/jira/browse/FLINK-25277
> [3]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
>
> Best,
> D.
>
> On Tue, Jan 4, 2022 at 7:23 PM Tianyi Deng <td...@blizzard.com> wrote:
>
>> Hello Flink community,
>>
>>
>>
>> We have a Flink cluster deployed to AWS EKS along with many other
>> applications. The cluster is managed by Spotify’s Flink operator. After
>> deployment I noticed that the StatefulSet pods of the job manager and task
>> managers intermittently received *SIGTERM* to terminate themselves. I
>> assume this has something to do with voluntary pod disruption from K8s’s
>> descheduler, perhaps because of node draining as other applications’ pods
>> scale up and down, or for other reasons. It seems like this is inevitable,
>> as K8s routinely moves pods around; however, it causes the Flink job to
>> restart every time. I feel this is quite unstable.
>>
>>
>>
>> Has anyone else seen this kind of voluntary pod disruption in a Flink
>> cluster on K8s? Is there any best practice or recommendation for operating
>> Flink on K8s?
>>
>>
>>
>> Thanks,
>>
>> Tianyi
>>
>
