[jira] [Updated] (FLINK-35145) Add timeout for cluster termination

2024-04-17 Thread Zhanghao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanghao Chen updated FLINK-35145:
--
Description: 
Currently, cluster termination may block forever because it has no timeout. 
For example, in an Application cluster with ZK HA enabled, when the ZK cluster 
is down, the cluster reaches termination status, but the termination process 
blocks while trying to clean up HA data on ZK, because the ZK client retries 
connecting to ZK forever. A similar phenomenon can be observed when an HDFS/S3 
outage occurs.

I propose adding a timeout for the cluster termination process in the 
ClusterEntryPoint#shutDownAsync method.

  was:
Currently, cluster termination may block forever because it has no timeout. 
For example, in an Application cluster with ZK HA enabled, when the ZK cluster 
is down, the cluster reaches termination status, but the termination process 
blocks while trying to clean up HA data on ZK. A similar phenomenon can be 
observed when an HDFS/S3 outage occurs.

I propose adding a timeout for the cluster termination process in the 
ClusterEntryPoint#shutDownAsync method.


> Add timeout for cluster termination
> ---
>
> Key: FLINK-35145
> URL: https://issues.apache.org/jira/browse/FLINK-35145
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.20.0
> Reporter: Zhanghao Chen
> Priority: Major
> Fix For: 1.20.0
>
>
> Currently, cluster termination may block forever because it has no timeout. 
> For example, in an Application cluster with ZK HA enabled, when the ZK 
> cluster is down, the cluster reaches termination status, but the termination 
> process blocks while trying to clean up HA data on ZK, because the ZK client 
> retries connecting to ZK forever. A similar phenomenon can be observed when 
> an HDFS/S3 outage occurs.
> I propose adding a timeout for the cluster termination process in the 
> ClusterEntryPoint#shutDownAsync method.
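
For illustration only, the idea of bounding the termination process could be 
sketched with plain java.util.concurrent as below. This is not the Flink 
implementation: the class TerminationTimeoutSketch, the Outcome enum, and the 
method terminateWithTimeout are hypothetical names, and the sketch assumes the 
cleanup step is exposed as a CompletableFuture and that Java 9 or later is 
available (for completeOnTimeout).

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only: none of these names are Flink APIs. */
final class TerminationTimeoutSketch {

    /** Hypothetical outcome of a shutdown attempt. */
    enum Outcome { CLEAN, TIMED_OUT }

    /**
     * Returns a future that completes with CLEAN when the cleanup finishes in
     * time, or with TIMED_OUT once the timeout elapses. The cleanup future
     * itself is not cancelled; the caller simply stops waiting for it.
     */
    static CompletableFuture<Outcome> terminateWithTimeout(
            CompletableFuture<Void> cleanup, Duration timeout) {
        return cleanup
                .thenApply(ignored -> Outcome.CLEAN)
                .completeOnTimeout(
                        Outcome.TIMED_OUT, timeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // Simulates HA cleanup that never returns, e.g. a client retrying forever.
        CompletableFuture<Void> stuckCleanup = new CompletableFuture<>();
        Outcome outcome =
                terminateWithTimeout(stuckCleanup, Duration.ofMillis(100)).join();
        System.out.println(outcome);
    }
}
```

In this sketch a timeout is reported as a regular value rather than an 
exception, so the caller can still proceed to exit the process and log that 
cleanup was incomplete. Whether leftover HA data should be tolerated in that 
case is a design decision the issue itself leaves open.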



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35145) Add timeout for cluster termination

2024-04-17 Thread Zhanghao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanghao Chen updated FLINK-35145:
--
Description: 
Currently, cluster termination may block forever because it has no timeout. 
For example, in an Application cluster with ZK HA enabled, when the ZK cluster 
is down, the cluster reaches termination status, but the termination process 
blocks while trying to clean up HA data on ZK, because the ZK client retries 
connecting to ZK forever. A similar phenomenon can be observed when an HDFS 
outage occurs.

I propose adding a timeout for the cluster termination process in the 
ClusterEntryPoint#shutDownAsync method.

  was:
Currently, cluster termination may block forever because it has no timeout. 
For example, in an Application cluster with ZK HA enabled, when the ZK cluster 
is down, the cluster reaches termination status, but the termination process 
blocks while trying to clean up HA data on ZK, because the ZK client retries 
connecting to ZK forever. A similar phenomenon can be observed when an HDFS/S3 
outage occurs.

I propose adding a timeout for the cluster termination process in the 
ClusterEntryPoint#shutDownAsync method.


> Add timeout for cluster termination
> ---
>
> Key: FLINK-35145
> URL: https://issues.apache.org/jira/browse/FLINK-35145
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.20.0
> Reporter: Zhanghao Chen
> Priority: Major
> Fix For: 1.20.0
>
>
> Currently, cluster termination may block forever because it has no timeout. 
> For example, in an Application cluster with ZK HA enabled, when the ZK 
> cluster is down, the cluster reaches termination status, but the termination 
> process blocks while trying to clean up HA data on ZK, because the ZK client 
> retries connecting to ZK forever. A similar phenomenon can be observed when 
> an HDFS outage occurs.
> I propose adding a timeout for the cluster termination process in the 
> ClusterEntryPoint#shutDownAsync method.


