[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808237#comment-17808237 ]

Matthias Pohl commented on FLINK-34007:
---------------------------------------

I checked the implementation (since we're getting close to the 1.19 feature 
freeze). We have the following options:
 # We could downgrade the fabric8io kubernetes client dependency back to 
v5.12.4 (essentially reverting FLINK-31997).
 # We could fix the issue in the fabric8io kubernetes client and update the 
dependency as soon as the fix is released. I'm not confident, though, that we 
would be able to bring such a fix into Flink before the 1.19 release, because 
we would be relying on the release cycle of another project.
 # We could refactor the k8s implementation to allow restarting the 
KubernetesLeaderElector within the KubernetesLeaderElectionDriver (see the 
sketch after this list). That would require updating the lockIdentity as well. 
The problem is that the lockIdentity is actually not owned by the 
KubernetesLeaderElector but by the k8s HighAvailabilityServices (because it's 
also used by the KubernetesStateHandleStore when checking the leadership). 
Even though moving the lockIdentity into the KubernetesLeaderElector makes 
sense (the KubernetesStateHandleStore should instead rely on the 
LeaderElectionService to detect whether leadership is acquired), it is a 
larger effort, and I am hesitant to work on it so close to the 1.19 feature 
freeze.
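
To illustrate option #3, here's a minimal sketch of a driver that owns both 
the elector and the lock identity and restarts the elector with a fresh 
identity whenever leadership is lost. The interfaces below are simplified, 
hypothetical stand-ins for illustration only, not the actual Flink or 
fabric8io classes:

{code:java}
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch of option #3: the driver owns the elector AND the lock identity,
 * so it can restart leader election with a fresh identity after the
 * underlying elector terminates. Hypothetical simplified types.
 */
public final class RestartableLeaderElectionDriver {

    /** Simplified stand-in for the fabric8io-backed elector. */
    interface LeaderElector extends AutoCloseable {
        void run(); // blocks until leadership is lost or the elector is closed

        @Override
        void close(); // narrows AutoCloseable#close to not throw
    }

    interface LeaderElectorFactory {
        LeaderElector create(String lockIdentity);
    }

    private final LeaderElectorFactory factory;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    private volatile String lockIdentity;
    private volatile LeaderElector elector;
    private volatile boolean closed;

    RestartableLeaderElectionDriver(LeaderElectorFactory factory) {
        this.factory = factory;
        startElector();
    }

    private void startElector() {
        // A fresh identity per elector instance. Components like the
        // KubernetesStateHandleStore would have to look the identity up
        // through the driver instead of caching it once at construction.
        lockIdentity = UUID.randomUUID().toString();
        elector = factory.create(lockIdentity);
        executor.execute(() -> {
            elector.run();      // returns once leadership is revoked
            if (!closed) {
                startElector(); // restart with a new lock identity
            }
        });
    }

    String getLockIdentity() {
        return lockIdentity;
    }

    void close() {
        closed = true;
        elector.close();
        executor.shutdownNow();
    }
}
{code}

The restart loop itself is simple; the hard part in the actual code base is 
the ownership change, since the lockIdentity currently lives in the k8s 
HighAvailabilityServices and is shared with the KubernetesStateHandleStore.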

I feel like we should apply all three options in the above order: option #1 
would end up in 1.19.0 and 1.18.3, with option #2 being the follow-up. Option 
#3 could be considered a dedicated refactoring effort in 1.20 or later. What's 
your view on that?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a 
> failed container unable to register itself with the ResourceManager before 
> the timeout. See the attached JM log.
>  


