[jira] [Updated] (FLINK-34576) Flink deployment keep staying at RECONCILING/STABLE status

chenyuzhi (Jira) Tue, 05 Mar 2024 00:01:05 -0800


     [ https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


chenyuzhi updated FLINK-34576:
------------------------------
    Description: 
flink-kubernetes-operator is running in HA mode. When one of the 
flink-kubernetes-operator pods restarts, the operator switches leader. 
However, some FlinkDeployments then remain in the 
*JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.

Running "kubectl describe flinkdeployment xxx" shows the following error, 
but there are no exceptions in the flink-kubernetes-operator log.

 
{code:java}
Status:
  Cluster Info:
    Flink - Revision:             b6d20ed @ 2023-12-20T10:01:39+01:00
    Flink - Version:              1.14.0-GDC1.6.0
    Total - Cpu:                  7.0
    Total - Memory:               30064771072
  Error:                          
{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
  Job Manager Deployment Status:  READY
  Job Status:
    Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
    Job Name:  noah_stream_studio_1754211682_2218100380
    Savepoint Info:
      Last Periodic Savepoint Timestamp:  0
      Savepoint History:
    Start Time:     1705635107137
    State:          RECONCILING
    Update Time:    1709272530741
  Lifecycle State:  STABLE {code}
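
The Error field above is a single JSON document, so the exception chain can be read programmatically instead of by eye. The following is a minimal sketch, not part of the operator: the helper names exception_chain and root_cause are hypothetical, and it assumes that throwableList is ordered from the outermost cause to the innermost, as it appears to be in this output.

```python
import json

# Error string copied from the Status.Error field above, with the email
# line wrapping removed.
ERROR_JSON = (
    '{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException",'
    '"message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.'
    'UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration",'
    '"additionalMetadata":{},'
    '"throwableList":['
    '{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.'
    'UncheckedExecutionException",'
    '"message":"java.lang.RuntimeException: Failed to load configuration",'
    '"additionalMetadata":{}},'
    '{"type":"java.lang.RuntimeException",'
    '"message":"Failed to load configuration",'
    '"additionalMetadata":{}}'
    ']}'
)


def exception_chain(error_json):
    """Return (type, message) pairs, starting with the top-level exception.

    Assumption: throwableList lists causes from outermost to innermost.
    """
    err = json.loads(error_json)
    chain = [(err["type"], err["message"])]
    for cause in err.get("throwableList", []):
        chain.append((cause["type"], cause["message"]))
    return chain


def root_cause(error_json):
    """Return the last (type, message) pair in the chain."""
    return exception_chain(error_json)[-1]


if __name__ == "__main__":
    # Print the chain with one extra level of indentation per cause.
    for depth, (exc_type, message) in enumerate(exception_chain(ERROR_JSON)):
        print("  " * depth + exc_type + ": " + message)
```

For the error above, the innermost entry is a plain java.lang.RuntimeException with the message "Failed to load configuration", which carries no further detail about which configuration failed to load.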
 
!image-2024-03-05-15-13-11-032.png!

 

Versions:

flink-kubernetes-operator: 1.6.1

flink: 1.14.0/1.15.2 (flinkdeployment 1200+)
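
With 1200+ FlinkDeployments, describing each resource by hand is impractical. The sketch below lists the resources that are in the RECONCILING/STABLE combination. It is an illustration only: the function name stuck_deployments is hypothetical, and the JSON field names status.jobStatus.state, status.lifecycleState and status.error are assumed from the "kubectl describe" output above rather than confirmed against the CRD.

```python
import json
import subprocess


def stuck_deployments(deployment_list):
    """Return (namespace/name, error) pairs for RECONCILING/STABLE resources.

    deployment_list is the parsed output of
    `kubectl get flinkdeployments -o json`.
    The status field names are assumptions, see the note above.
    """
    stuck = []
    for item in deployment_list.get("items", []):
        status = item.get("status") or {}
        job_state = (status.get("jobStatus") or {}).get("state")
        lifecycle = status.get("lifecycleState")
        if job_state == "RECONCILING" and lifecycle == "STABLE":
            meta = item["metadata"]
            name = meta.get("namespace", "default") + "/" + meta["name"]
            stuck.append((name, status.get("error") or ""))
    return stuck


if __name__ == "__main__":
    # Requires kubectl access to the cluster that runs the operator.
    raw = subprocess.run(
        ["kubectl", "get", "flinkdeployments", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for name, error in stuck_deployments(json.loads(raw)):
        # Truncate the error so each resource stays on one line.
        print(name, error[:120])
```

Keeping the filter separate from the kubectl call means it can be checked against a saved JSON dump without cluster access.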

 

[~gyfora] 

  was:
flink-kubernetes-operator is running in HA mode. When one of the 
flink-kubernetes-operator pods restarts, the operator switches leader. 
However, some FlinkDeployments then remain in the 
*JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.

Running "kubectl describe flinkdeployment xxx" shows the following error, 
but there are no exceptions in the flink-kubernetes-operator log.

 
{code:java}
Status:
  Cluster Info:
    Flink - Revision:             b6d20ed @ 2023-12-20T10:01:39+01:00
    Flink - Version:              1.14.0-GDC1.6.0
    Total - Cpu:                  7.0
    Total - Memory:               30064771072
  Error:                          
{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
  Job Manager Deployment Status:  READY
  Job Status:
    Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
    Job Name:  noah_stream_studio_1754211682_2218100380
    Savepoint Info:
      Last Periodic Savepoint Timestamp:  0
      Savepoint History:
    Start Time:     1705635107137
    State:          RECONCILING
    Update Time:    1709272530741
  Lifecycle State:  STABLE {code}
 
!image-2024-03-05-15-13-11-032.png!

 

Versions:

flink-kubernetes-operator: 1.6.1

flink: 1.14.0/1.15.2

 

Job scale:

flinkdeployment 1200+

[~gyfora] 


> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
>                 Key: FLINK-34576
>                 URL: https://issues.apache.org/jira/browse/FLINK-34576
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-03-05-15-13-11-032.png
>
>
> flink-kubernetes-operator is running in HA mode. When one of the 
> flink-kubernetes-operator pods restarts, the operator switches leader. 
> However, some FlinkDeployments then remain in the 
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Running "kubectl describe flinkdeployment xxx" shows the following error, 
> but there are no exceptions in the flink-kubernetes-operator log.
>  
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision:             b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version:              1.14.0-GDC1.6.0
>     Total - Cpu:                  7.0
>     Total - Memory:               30064771072
>   Error:                          
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status:  READY
>   Job Status:
>     Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name:  noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp:  0
>       Savepoint History:
>     Start Time:     1705635107137
>     State:          RECONCILING
>     Update Time:    1709272530741
>   Lifecycle State:  STABLE {code}
>  
> !image-2024-03-05-15-13-11-032.png!
>  
> Versions:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)
>  
> [~gyfora] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
