[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2022-12-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-18229:
---
Labels: pull-request-available  (was: )

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.17.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2022-09-18 Thread Xingbo Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Huang updated FLINK-18229:
-
Fix Version/s: (was: 1.16.0)
   1.17.0

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Assignee: Weihua Hu
>Priority: Major
> Fix For: 1.17.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2022-03-30 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-18229:
-
Fix Version/s: 1.16.0
   (was: 1.15.0)

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Assignee: huweihua
>Priority: Major
> Fix For: 1.16.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2021-09-28 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-18229:
-
Fix Version/s: (was: 1.14.0)
   1.15.0

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.15.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2021-04-26 Thread Dawid Wysakowicz (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Wysakowicz updated FLINK-18229:
-
Fix Version/s: (was: 1.13.0)
   1.14.0

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.14.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2021-04-23 Thread Konstantin Knauf (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Knauf updated FLINK-18229:
-
Labels:   (was: stale-major)

Removed "stale-critical|major|minor" label in line with 
https://issues.apache.org/jira/browse/FLINK-22429. 

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2021-04-22 Thread Flink Jira Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flink Jira Bot updated FLINK-18229:
---
Labels: stale-major  (was: )

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
>  Labels: stale-major
> Fix For: 1.13.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2021-01-14 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-18229:
-
Parent: FLINK-20988
Issue Type: Sub-task  (was: Improvement)

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2020-10-25 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-18229:
-
Fix Version/s: (was: 1.12.0)
   1.13.0

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18229) Pending worker requests should be properly cleared

2020-06-09 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-18229:
-
Fix Version/s: 1.12.0

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.12.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirement, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requirements are never cleared until 
> either fulfilled or the Flink cluster is shutdown.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is then fulfilled by another slots that become available, or the 
> job failed due to slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and release 
> them. This would affect the underlying Kubernetes/Yarn cluster, especially 
> when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as earlier as 
> possible if it can discover that some of the pending workers are no longer 
> needed.
> There are several approaches potentially achieve this.
>  # We can always check whether there's a pending worker that can be canceled 
> when a \{{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting new worker. If the resource 
> cannot be allocated within the given time since requested, we should cancel 
> that resource request and claim a resource allocation failure.
>  # We can share the same timeout for starting new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)