Hi Gyula,

Thanks for the advice. I requested a Jira account and will open a ticket as 
soon as I get access.

Cheers,
Niklas

> On 13. Feb 2024, at 09:13, Gyula Fóra <gyula.f...@gmail.com> wrote:
> 
> Hi Niklas!
> 
> The best way to report the issue would be to open a JIRA ticket with the same 
> detailed information.
> 
> Otherwise I think your observations are correct; this is indeed a frequent 
> problem that comes up, and it would be good to improve on it. In addition to 
> improving the logging, we could also increase the default timeout, and if we 
> could actually do something on the timeout, that would be even better.
> 
> Please open the JIRA ticket, and if you have time to work on these 
> improvements, I will assign it to you.
> 
> Cheers
> Gyula
> 
> On Mon, Feb 12, 2024 at 11:59 PM Niklas Wilcke <niklas.wil...@uniberg.com 
> <mailto:niklas.wil...@uniberg.com>> wrote:
>> Hi Flink Kubernetes Operator Community,
>> 
>> I hope this is the right way to report an issue with the Apache Flink 
>> Kubernetes Operator. We are experiencing problems with some streaming job 
>> clusters that end up in a terminated state because the operator does not 
>> behave as expected. The problem is that the teardown of the Flink cluster 
>> by the operator does not complete within the default timeout of 1 minute. 
>> The operator then proceeds and tries to create a fresh cluster, which fails 
>> because parts of the old cluster still exist. It then tries to fully remove 
>> the cluster, including the HA metadata, and afterwards is stuck in an error 
>> loop stating that manual recovery is necessary, since the HA metadata is 
>> missing. At the very bottom of this mail you can find a condensed log, 
>> which hopefully gives a more detailed impression of the problem.
>> 
>> Our current workaround is to increase 
>> "kubernetes.operator.resource.cleanup.timeout" [0] to 10 minutes. Time will 
>> tell whether this workaround fixes the problem for us.
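>> 
>> To make the value explicit, here is a minimal sketch using Flink's 
>> Configuration API; where the key is actually set depends on the setup (for 
>> example the operator's own configuration), so treat it as illustration only:
>> 
>>     import org.apache.flink.configuration.Configuration;
>> 
>>     public class CleanupTimeoutExample {
>>         public static void main(String[] args) {
>>             // Sketch: the same key and value that we apply in the operator configuration.
>>             Configuration operatorConf = new Configuration();
>>             operatorConf.setString(
>>                     "kubernetes.operator.resource.cleanup.timeout", "10 min");
>>             System.out.println(operatorConf);
>>         }
>>     }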
>> 
>> The main problem I see is that the method 
>> AbstractFlinkService.waitForClusterShutdown(...) [1] doesn't handle the 
>> timeout at all. Please correct me in case I missed a detail, but this is how 
>> we experience the problem: if the service, the jobmanagers or the 
>> taskmanagers survive the cleanup timeout (of 1 minute), the operator seems 
>> to proceed as if these entities had been removed properly. That doesn't look 
>> right to me; at the very least an error should be logged.
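>> 
>> To make this concrete, here is a rough sketch of the behaviour I would 
>> expect. It is not the operator's actual code, and the helper 
>> remainingClusterResources() is made up for illustration:
>> 
>>     import java.time.Duration;
>>     import java.util.List;
>>     import org.apache.flink.api.common.time.Deadline;
>> 
>>     class ShutdownWaitSketch {
>> 
>>         // Track the deadline explicitly and report whether the shutdown
>>         // really completed, so the caller can react (retry, fail the
>>         // reconciliation, ...) instead of proceeding as if everything
>>         // had been removed.
>>         boolean waitForClusterShutdown(Duration timeout) throws InterruptedException {
>>             Deadline deadline = Deadline.fromNow(timeout);
>>             while (!remainingClusterResources().isEmpty()) {
>>                 if (!deadline.hasTimeLeft()) {
>>                     return false; // timed out, some resources still exist
>>                 }
>>                 Thread.sleep(1000);
>>             }
>>             return true; // service, jobmanagers and taskmanagers are gone
>>         }
>> 
>>         // Hypothetical helper: would list the service, jobmanager and
>>         // taskmanager resources that still exist for this cluster.
>>         private List<String> remainingClusterResources() {
>>             return List.of();
>>         }
>>     }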
>> 
>> Additionally, the current logging makes it difficult to analyse the problem 
>> and to get notified about the timeout. The following things could possibly 
>> be improved or implemented (see the sketch of the log statements below):
>> - Successful removal of the entities should be logged.
>> - Timing out isn't logged at all (an error should probably be logged here).
>> - The logging of the waited seconds is somehow incomplete for some reason 
>>   (L944, further analysis needed).
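>> 
>> As a sketch of the log statements I have in mind (the variable names are 
>> made up for illustration, these fragments would slot into the wait loop 
>> sketched above, and LOG would be the class's slf4j logger):
>> 
>>     // 1. log each successfully removed entity
>>     LOG.info("Deleted {} for cluster {}", resourceKind, clusterId);
>>     // 2. log an error when the cleanup timeout is exceeded
>>     LOG.error("Cluster cleanup did not finish within {}; remaining resources: {}",
>>             cleanupTimeout, remainingResources);
>>     // 3. log the elapsed wait time consistently on every iteration
>>     LOG.info("Waiting for cluster shutdown... ({}s)", elapsedSeconds);
>> 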
>> We use the following Flink and Operator versions:
>> 
>> Flink Image: flink:1.17.1 (from Dockerhub)
>> Operator Version: 1.6.1
>> 
>> I hope this description is good enough to get in touch and discuss the 
>> matter. I'm happy to provide additional information or, with some guidance, 
>> to provide a patch to resolve the issue.
>> Thanks for your work on the Operator. It is highly appreciated!
>> 
>> Cheers,
>> Niklas
>> 
>> 
>> [0] 
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/
>> [1] 
>> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L903
>> 
>> 
>> #############################################################################################
>> # The job in the cluster failed
>> Event  | Info    | JOBSTATUSCHANGED | Job status changed from RUNNING to 
>> FAILED
>> Stopping failed Flink job...
>> Status | Error   | FAILED          | 
>> {"type":"org.apache.flink.util.SerializedThrowable","message":"org.apache.flink.runtime.JobException: 
>> Recovery is suppressed by 
>> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5, 
>> backoffTimeMS=30000)","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.util.SerializedThrowable","message":"org.apache.flink.util.FlinkExpectedException: 
>> The TaskExecutor is shutting down.","additionalMetadata":{}}]}
>>  Deleting JobManager deployment while preserving HA metadata.
>> Deleting cluster with Foreground propagation
>> Waiting for cluster shutdown... (10s)
>> Waiting for cluster shutdown... (30s)
>> Waiting for cluster shutdown... (40s)
>> Waiting for cluster shutdown... (45s)
>> Waiting for cluster shutdown... (50s)
>> Resubmitting Flink job...
>> Cluster shutdown completed.
>> Deploying application cluster
>> Event  | Info    | SUBMIT          | Starting deployment
>> Submitting application in 'Application Mode'
>> Deploying application cluster
>> ...
>> Event  | Warning | CLUSTERDEPLOYMENTEXCEPTION | Could not create Kubernetes 
>> cluster "<cluster>"
>> Status | Error   | FAILED          | 
>> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: 
>> Could not create Kubernetes cluster 
>> \"<cluster>\".","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could 
>> not create Kubernetes cluster 
>> \"<cluster>\".","additionalMetadata":{}},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure 
>> executing: POST at: 
>> https://10.96.0.1/apis/apps/v1/namespaces/dapfs-basic/deployments. Message: 
>> object is being deleted: deployments.apps \"<cluster>\" already exists. 
>> Received status: Status(apiVersion=v1, code=409, 
>> details=StatusDetails(causes=[], group=apps, kind=deployments, 
>> name=<cluster>, retryAfterSeconds=null, uid=null, additionalProperties={}), 
>> kind=Status, message=object is being deleted: deployments.apps 
>> \"<cluster>\" already exists, metadata=ListMeta(_continue=null, 
>> remainingItemCount=null, resourceVersion=null, selfLink=null, 
>> additionalProperties={}), reason=AlreadyExists, status=Failure, 
>> additionalProperties={})","additionalMetadata":{}}]}
>> ...
>> Event  | Warning | RECOVERDEPLOYMENT | Recovering lost deployment
>> Deploying application cluster requiring last-state from HA metadata
>> Event  | Info    | SUBMIT          | Starting deployment
>> Flink recovery failed
>> Event  | Warning | RESTOREFAILED   | HA metadata not available to restore 
>> from last state. It is possible that the job has finished or terminally 
>> failed, or the configmaps have been deleted. Manual restore required.
>> # Here the deadlock / endless loop starts
>> #############################################################################################
