[jira] [Commented] (FLINK-34576) Flink deployment keep staying at RECONCILING/STABLE status

chenyuzhi (Jira) Wed, 06 Mar 2024 23:49:04 -0800


    [ https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824288#comment-17824288 ]


chenyuzhi commented on FLINK-34576:
-----------------------------------

"Not sure how to repro this in a test easily; you could try upgrading the JOSDK
version and testing with a custom build in your env to see if that solves the
issue."

 
This is a bit confusing. Does it mean only upgrading the JOSDK version to 4.5,
or upgrading it to 4.5 and also using a callback to set the leader fence (one
of the solutions I can think of)? Also, what is a "custom build"? I may not be
very familiar with the operator, but I am glad to work with the community to
solve this problem.
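To make the "use a callback to set the leader fence" option concrete, here is a minimal, self-contained Java sketch of the idea. It is hypothetical: `LeaderFence`, `onStartLeading`, `onStopLeading`, `currentToken` and `runIfValid` are illustrative names, not part of JOSDK 4.5 or flink-kubernetes-operator, and how these callbacks would be wired to the operator's leader election is left as an assumption. The reconcile loop captures a token before doing work and only writes status if leadership has not changed in the meantime.

```java
// Hypothetical fencing helper: NOT a JOSDK or flink-kubernetes-operator API.
// Assumes the leader-election hooks (however they are exposed) call
// onStartLeading()/onStopLeading() on this object.
final class LeaderFence {
    // Epoch is bumped on every leadership change, so old tokens become stale.
    private long epoch = 0L;
    private boolean leader = false;

    /** Callback for "became leader"; returns the fencing token for this term. */
    synchronized long onStartLeading() {
        leader = true;
        return ++epoch;
    }

    /** Callback for "lost leadership"; invalidates every token handed out so far. */
    synchronized void onStopLeading() {
        leader = false;
        epoch++;
    }

    /** Token to capture at the start of a reconcile, or -1 if not leader. */
    synchronized long currentToken() {
        return leader ? epoch : -1L;
    }

    /** A token is valid only while we are still leader in the same epoch. */
    synchronized boolean isValid(long token) {
        return leader && token == epoch;
    }

    /**
     * Runs the status update only if the token is still valid. Holding the lock
     * means onStopLeading() waits until an in-flight update finishes.
     */
    synchronized boolean runIfValid(long token, Runnable statusUpdate) {
        if (!isValid(token)) {
            return false; // stale leader: skip the write
        }
        statusUpdate.run();
        return true;
    }
}
```

This only fences writes within a single JVM; a stale leader in another pod would still need a server-side guard, such as optimistic concurrency on the resource's `resourceVersion`.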

 
[~gyfora] 

> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
>                 Key: FLINK-34576
>                 URL: https://issues.apache.org/jira/browse/FLINK-34576
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-03-05-15-13-11-032.png
>
>
> The HA mode of flink-kubernetes-operator is being used. When one of the
> flink-kubernetes-operator pods restarts, the operator switches leader.
> However, some flinkdeployments have stayed in the
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> The following error can be seen via "kubectl describe flinkdeployment xxx",
> but there are no exceptions in the flink-kubernetes-operator log.
>  
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision:             b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version:              1.14.0-GDC1.6.0
>     Total - Cpu:                  7.0
>     Total - Memory:               30064771072
>   Error:                          {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status:  READY
>   Job Status:
>     Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name:  noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp:  0
>       Savepoint History:
>     Start Time:     1705635107137
>     State:          RECONCILING
>     Update Time:    1709272530741
>   Lifecycle State:  STABLE {code}
>  
> !image-2024-03-05-15-13-11-032.png!
>  
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (1200+ flinkdeployments)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
