[jira] [Created] (FLINK-36521) Introduce TtlAwareSerializer to resolve the compatibility check between ttlSerializer and original serializer

2024-10-13 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-36521:


 Summary: Introduce TtlAwareSerializer to resolve the compatibility 
check between ttlSerializer and original serializer 
 Key: FLINK-36521
 URL: https://issues.apache.org/jira/browse/FLINK-36521
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: xiangyu feng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35780) Support state migration between disabling and enabling ttl in RocksDBKeyedStateBackend

2024-09-15 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881913#comment-17881913
 ] 

xiangyu feng commented on FLINK-35780:
--

Also cc: [~tzulitai]  [~masteryhx] 

> Support state migration between disabling and enabling ttl in 
> RocksDBKeyedStateBackend
> --
>
> Key: FLINK-35780
> URL: https://issues.apache.org/jira/browse/FLINK-35780
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35780) Support state migration between disabling and enabling ttl in RocksDBKeyedStateBackend

2024-09-15 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881851#comment-17881851
 ] 

xiangyu feng commented on FLINK-35780:
--

Hi [~Zakelly], I've finished the first version of implementation. Would you 
kindly review the PR when you have time?

> Support state migration between disabling and enabling ttl in 
> RocksDBKeyedStateBackend
> --
>
> Key: FLINK-35780
> URL: https://issues.apache.org/jira/browse/FLINK-35780
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35780) Support state migration between disabling and enabling ttl in RocksDBKeyedStateBackend

2024-09-15 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-35780:
-
Summary: Support state migration between disabling and enabling ttl in 
RocksDBKeyedStateBackend  (was: Support state migration between disabling and 
enabling ttl in RocksDBState)

> Support state migration between disabling and enabling ttl in 
> RocksDBKeyedStateBackend
> --
>
> Key: FLINK-35780
> URL: https://issues.apache.org/jira/browse/FLINK-35780
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35780) Support state migration between disabling and enabling ttl in RocksDBState

2024-09-15 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-35780:
-
Summary: Support state migration between disabling and enabling ttl in 
RocksDBState  (was: Support state migration from disabling to enabling ttl in 
RocksDBState)

> Support state migration between disabling and enabling ttl in RocksDBState
> --
>
> Key: FLINK-35780
> URL: https://issues.apache.org/jira/browse/FLINK-35780
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35780) Support state migration from disabling to enabling ttl in RocksDBState

2024-07-07 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-35780:


 Summary: Support state migration from disabling to enabling ttl in 
RocksDBState
 Key: FLINK-35780
 URL: https://issues.apache.org/jira/browse/FLINK-35780
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: xiangyu feng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32955) Support state compatibility between enabling TTL and disabling TTL

2024-06-16 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855474#comment-17855474
 ] 

xiangyu feng commented on FLINK-32955:
--

[~lijinzhong] thx, [~Zakelly] [~masteryhx] would you kindly assign this issue 
to me? 

> Support state compatibility between enabling TTL and disabling TTL
> --
>
> Key: FLINK-32955
> URL: https://issues.apache.org/jira/browse/FLINK-32955
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Jinzhong Li
>Assignee: Jinzhong Li
>Priority: Major
>
> Currently, trying to restore state, which was previously configured without 
> TTL, using TTL enabled descriptor or vice versa will lead to compatibility 
> failure and StateMigrationException.
> In some scenario, user may enable state ttl and restore from old state which 
> was previously configured without TTL;  or vice versa.
> It would be useful for users if we support state compatibility between 
> enabling TTL and disabling TTL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32955) Support state compatibility between enabling TTL and disabling TTL

2024-06-16 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855415#comment-17855415
 ] 

xiangyu feng commented on FLINK-32955:
--

Hi [~lijinzhong] , this issue hasn't been updated for a long time. I would like 
to work on this if it is ok for you.

> Support state compatibility between enabling TTL and disabling TTL
> --
>
> Key: FLINK-32955
> URL: https://issues.apache.org/jira/browse/FLINK-32955
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Jinzhong Li
>Assignee: Jinzhong Li
>Priority: Major
>
> Currently, trying to restore state, which was previously configured without 
> TTL, using TTL enabled descriptor or vice versa will lead to compatibility 
> failure and StateMigrationException.
> In some scenario, user may enable state ttl and restore from old state which 
> was previously configured without TTL;  or vice versa.
> It would be useful for users if we support state compatibility between 
> enabling TTL and disabling TTL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-20 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng closed FLINK-34452.


> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Assignee: Weijie Guo
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.19.0
>
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** taskmanager.numberOfTaskSlots = 2,
>  ** slotmanager.number-of-slots.min = 7,
>  ** slot.idle.timeout = 5
>  # Setup a Flink session Cluster on Yarn or Native Kubernetes based on 
> following docs:
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn]
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-20 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng resolved FLINK-34452.
--
Resolution: Fixed

> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Assignee: Weijie Guo
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.19.0
>
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** taskmanager.numberOfTaskSlots = 2,
>  ** slotmanager.number-of-slots.min = 7,
>  ** slot.idle.timeout = 5
>  # Setup a Flink session Cluster on Yarn or Native Kubernetes based on 
> following docs:
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn]
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-20 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819097#comment-17819097
 ] 

xiangyu feng commented on FLINK-34452:
--

[~lincoln.86xy] Hi, I will close this ticket as resolved.

> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Assignee: Weijie Guo
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.19.0
>
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** taskmanager.numberOfTaskSlots = 2,
>  ** slotmanager.number-of-slots.min = 7,
>  ** slot.idle.timeout = 5
>  # Setup a Flink session Cluster on Yarn or Native Kubernetes based on 
> following docs:
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn]
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-20 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819096#comment-17819096
 ] 

xiangyu feng commented on FLINK-34452:
--

[~Weijie Guo] Thx for your work!

> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Assignee: Weijie Guo
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.19.0
>
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** taskmanager.numberOfTaskSlots = 2,
>  ** slotmanager.number-of-slots.min = 7,
>  ** slot.idle.timeout = 5
>  # Setup a Flink session Cluster on Yarn or Native Kubernetes based on 
> following docs:
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn]
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-18 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-34452:
-
Description: 
Test Suggestion:
 # Prepare configuration options:
 ** taskmanager.numberOfTaskSlots = 2,
 ** slotmanager.number-of-slots.min = 7,
 ** slot.idle.timeout = 5
 # Setup a Flink session Cluster on Yarn or Native Kubernetes based on 
following docs:
 ** 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn]
 ** 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.

  was:
Test Suggestion:
 # Prepare configuration options:
 ** taskmanager.numberOfTaskSlots = 2,
 ** slotmanager.number-of-slots.min = 7,
 ** slot.idle.timeout = 5
 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.


> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Priority: Blocker
>  Labels: release-testing
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** taskmanager.numberOfTaskSlots = 2,
>  ** slotmanager.number-of-slots.min = 7,
>  ** slot.idle.timeout = 5
>  # Setup a Flink session Cluster on Yarn or Native Kubernetes based on 
> following docs:
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn]
>  ** 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34391) Release Testing Instructions: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-17 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818219#comment-17818219
 ] 

xiangyu feng commented on FLINK-34391:
--

[~lincoln.86xy] You can close this ticket now. I've created this ticket for 
test: https://issues.apache.org/jira/browse/FLINK-34452

> Release Testing Instructions: Verify FLINK-15959 Add min number of slots 
> configuration to limit total number of slots
> -
>
> Key: FLINK-34391
> URL: https://issues.apache.org/jira/browse/FLINK-34391
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Affects Versions: 1.19.0
>Reporter: lincoln lee
>Assignee: xiangyu feng
>Priority: Blocker
> Fix For: 1.19.0
>
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-16 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-34452:
-
Description: 
Test Suggestion:
 # Prepare configuration options:
 ** taskmanager.numberOfTaskSlots = 2,
 ** slotmanager.number-of-slots.min = 7,
 ** slot.idle.timeout = 5
 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.

  was:
Test Suggestion:
 # Prepare configuration options: taskmanager.numberOfTaskSlots = 2, 
slotmanager.number-of-slots.min = 7, slot.idle.timeout = 5
 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.


> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Priority: Blocker
>  Labels: release-testing
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** taskmanager.numberOfTaskSlots = 2,
>  ** slotmanager.number-of-slots.min = 7,
>  ** slot.idle.timeout = 5
>  # Setup a Flink session Cluster according to: 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-16 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-34452:
-
Description: 
Test Suggestion:
 # Prepare configuration options:

 ** 
h5. taskmanager.numberOfTaskSlots = 2

 ** 
h5. slotmanager.number-of-slots.min = 7

 ** 
h5. slot.idle.timeout = 5

 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]

 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted

 # Verify that these TaskManagers will not be destroyed after 50 seconds.

  was:
Test Suggestion:
 # Prepare configuration options:
 ## 
h5. taskmanager.numberOfTaskSlots = 2

 ## 
h5. slotmanager.number-of-slots.min = 7

 ## 
h5. slot.idle.timeout = 5

 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.


> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Priority: Blocker
>  Labels: release-testing
>
> Test Suggestion:
>  # Prepare configuration options:
>  ** 
> h5. taskmanager.numberOfTaskSlots = 2
>  ** 
> h5. slotmanager.number-of-slots.min = 7
>  ** 
> h5. slot.idle.timeout = 5
>  # Setup a Flink session Cluster according to: 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-16 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-34452:
-
Description: 
Test Suggestion:
 # Prepare configuration options: taskmanager.numberOfTaskSlots = 2, 
slotmanager.number-of-slots.min = 7, slot.idle.timeout = 5
 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.

  was:
Test Suggestion:
 # Prepare configuration options:

 ** 
h5. taskmanager.numberOfTaskSlots = 2

 ** 
h5. slotmanager.number-of-slots.min = 7

 ** 
h5. slot.idle.timeout = 5

 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]

 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted

 # Verify that these TaskManagers will not be destroyed after 50 seconds.


> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Priority: Blocker
>  Labels: release-testing
>
> Test Suggestion:
>  # Prepare configuration options: taskmanager.numberOfTaskSlots = 2, 
> slotmanager.number-of-slots.min = 7, slot.idle.timeout = 5
>  # Setup a Flink session Cluster according to: 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-16 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-34452:
-
Description: 
Test Suggestion:
 # Prepare configuration options:
 ## 
h5. taskmanager.numberOfTaskSlots = 2

 ## 
h5. slotmanager.number-of-slots.min = 7

 ## 
h5. slot.idle.timeout = 5

 # Setup a Flink session Cluster according to: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
 # Verify that 4 TaskManagers will be registered even though no jobs has been 
submitted
 # Verify that these TaskManagers will not be destroyed after 50 seconds.

> Release Testing: Verify FLINK-15959 Add min number of slots configuration to 
> limit total number of slots
> 
>
> Key: FLINK-34452
> URL: https://issues.apache.org/jira/browse/FLINK-34452
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Priority: Blocker
>  Labels: release-testing
>
> Test Suggestion:
>  # Prepare configuration options:
>  ## 
> h5. taskmanager.numberOfTaskSlots = 2
>  ## 
> h5. slotmanager.number-of-slots.min = 7
>  ## 
> h5. slot.idle.timeout = 5
>  # Setup a Flink session Cluster according to: 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#basic-setup|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#session-mode]
>  # Verify that 4 TaskManagers will be registered even though no jobs has been 
> submitted
>  # Verify that these TaskManagers will not be destroyed after 50 seconds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34452) Release Testing: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-16 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-34452:


 Summary: Release Testing: Verify FLINK-15959 Add min number of 
slots configuration to limit total number of slots
 Key: FLINK-34452
 URL: https://issues.apache.org/jira/browse/FLINK-34452
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.19.0
Reporter: xiangyu feng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34391) Release Testing Instructions: Verify FLINK-15959 Add min number of slots configuration to limit total number of slots

2024-02-06 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815115#comment-17815115
 ] 

xiangyu feng commented on FLINK-34391:
--

[~lincoln.86xy] Hi, I think this feature needs cross-team testing because it is 
a user facing feature.

> Release Testing Instructions: Verify FLINK-15959 Add min number of slots 
> configuration to limit total number of slots
> -
>
> Key: FLINK-34391
> URL: https://issues.apache.org/jira/browse/FLINK-34391
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Affects Versions: 1.19.0
>Reporter: lincoln lee
>Assignee: xiangyu feng
>Priority: Blocker
> Fix For: 1.19.0
>
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34226) Add the backoff-multiplier configuration in ExponentialBackoffRetryStrategy

2024-01-24 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-34226:


 Summary: Add the backoff-multiplier configuration in 
ExponentialBackoffRetryStrategy
 Key: FLINK-34226
 URL: https://issues.apache.org/jira/browse/FLINK-34226
 Project: Flink
  Issue Type: Sub-task
Reporter: xiangyu feng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33683) FLIP-407 Improve Flink Client performance in interactive scenarios

2024-01-24 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810282#comment-17810282
 ] 

xiangyu feng commented on FLINK-33683:
--

[~guoyangze]  Hi, this 
Flip(https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+in+interactive+scenarios)
 has been approved, would you kindly assign this Jira to me?

> FLIP-407 Improve Flink Client performance in interactive scenarios
> --
>
> Key: FLINK-33683
> URL: https://issues.apache.org/jira/browse/FLINK-33683
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission, Table SQL / Client
>Reporter: xiangyu feng
>Priority: Major
>
> Now there are lots of unnecessary overhead involved in submitting jobs and 
> fetching results to a long-running flink cluster. This works well for 
> streaming and batch job, because in these scenarios users will not frequently 
> submit jobs and fetch result to a running cluster.
>  
> But in OLAP scenario, users will continuously submit lots of short-lived jobs 
> to the running cluster. In this situation, these overhead will have a huge 
> impact on the E2E performance.  Here are some examples of unnecessary 
> overhead:
>  * Each `RemoteExecutor` will create a new `StandaloneClusterDescriptor` when 
> executing a job on the same remote cluster
>  * `StandaloneClusterDescriptor` will always create a new `RestClusterClient` 
> when retrieving an existing Flink Cluster
>  * Each `RestClusterClient` will create a new 
> `ClientHighAvailabilityServices` which might contains a resource-consuming ha 
> client(ZKClient or KubeClient) and a time-consuming leader retrieval operation
>  * `RestClient` will create a new connection for every request which costs 
> extra connection establishment time
>  
> Therefore, I suggest creating this ticket and following subtasks to improve 
> this performance. This ticket is also relates to  FLINK-25318.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33683) FLIP-407 Improve Flink Client performance in interactive scenarios

2024-01-22 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33683:
-
Summary: FLIP-407 Improve Flink Client performance in interactive scenarios 
 (was: Improve the performance of submitting jobs and fetching results to a 
running flink cluster)

> FLIP-407 Improve Flink Client performance in interactive scenarios
> --
>
> Key: FLINK-33683
> URL: https://issues.apache.org/jira/browse/FLINK-33683
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission, Table SQL / Client
>Reporter: xiangyu feng
>Priority: Major
>
> Now there are lots of unnecessary overhead involved in submitting jobs and 
> fetching results to a long-running flink cluster. This works well for 
> streaming and batch job, because in these scenarios users will not frequently 
> submit jobs and fetch result to a running cluster.
>  
> But in OLAP scenario, users will continuously submit lots of short-lived jobs 
> to the running cluster. In this situation, these overhead will have a huge 
> impact on the E2E performance.  Here are some examples of unnecessary 
> overhead:
>  * Each `RemoteExecutor` will create a new `StandaloneClusterDescriptor` when 
> executing a job on the same remote cluster
>  * `StandaloneClusterDescriptor` will always create a new `RestClusterClient` 
> when retrieving an existing Flink Cluster
>  * Each `RestClusterClient` will create a new 
> `ClientHighAvailabilityServices` which might contains a resource-consuming ha 
> client(ZKClient or KubeClient) and a time-consuming leader retrieval operation
>  * `RestClient` will create a new connection for every request which costs 
> extra connection establishment time
>  
> Therefore, I suggest creating this ticket and following subtasks to improve 
> this performance. This ticket is also relates to  FLINK-25318.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33932) Support Retry Mechanism in RocksDBStateDataTransfer

2024-01-10 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805398#comment-17805398
 ] 

xiangyu feng commented on FLINK-33932:
--

[~masteryhx] Thx for ur reply, I've talked to [~dianer17] to update the 
description of this issue. Would you kindly assign this issue to me? Also, I 
would like to hear more from you about this issue in the discussion thread of 
FLIP-414: [https://lists.apache.org/thread/om4kgd6trx2lctwm6x92q2kdjngxtz9k]

> Support Retry Mechanism in RocksDBStateDataTransfer
> ---
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Currently, there is no retry mechanism for downloading and uploading RocksDB 
> state files. Any jittering of remote filesystem might lead to a checkpoint 
> failure. By supporting retry mechanism in RocksDBStateDataTransfer, we can 
> significantly reduce the failure rate of checkpoint during asynchronous phase.
> The exception is as below:
> {noformat}
>  
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.ja

[jira] [Commented] (FLINK-33932) Support retry mechanism for rocksdb uploader

2024-01-09 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804666#comment-17804666
 ] 

xiangyu feng commented on FLINK-33932:
--

[~masteryhx]  Hi Hangxiang, what do u think about this? I'd like to hear more 
from you.

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> {noformat}
>  
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.net.ConnectException: Connection timed out
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>     at sun.nio.ch.SocketChannelImp

[jira] [Commented] (FLINK-33932) Support retry mechanism for rocksdb uploader

2024-01-06 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803901#comment-17803901
 ] 

xiangyu feng commented on FLINK-33932:
--

[~pnowojski] Also, I will open a flip to discuss add this retry configuration.

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> {noformat}
>  
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.net.ConnectException: Connection timed out
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>     at sun.nio.ch.SocketChannelImpl.finishCo

[jira] [Commented] (FLINK-33932) Support retry mechanism for rocksdb uploader

2024-01-06 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803900#comment-17803900
 ] 

xiangyu feng commented on FLINK-33932:
--

[~pnowojski]  Hi Piotr, thx for ur feedback. After a closer look at the code, I 
think we also need retry mechanism in `RocksDBStateDownloader`. IMHO, we can 
add default retry configurations in `RocksDBOptions` and implement this in 
`RocksDBStateDataTransfer`. In this way, both  `RocksDBStateUploader` and 
`RocksDBStateDownloader` can reuse this retry mechanism. 

 

Can u kindly assign this Jira to me and grant me the permission to change the 
title and description of this Jira?

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> {noformat}
>  
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32

[jira] [Commented] (FLINK-33932) Support retry mechanism for rocksdb uploader

2023-12-29 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801162#comment-17801162
 ] 

xiangyu feng commented on FLINK-33932:
--

[~martijnvisser] AFAIK, this retry is not about "retry" checkpointing but to 
ensure that Checkpoint can successfully upload the local file to the remote 
storage during the asynchronous phase. For files that partially upload to 
remote but failed eventually, they won‘t be considered as part of this 
checkpoint because no file path will be returned. This retry mechanism works 
well in our production for the past two years without any correctness issues.

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-x

[jira] [Comment Edited] (FLINK-33932) Support retry mechanism for rocksdb uploader

2023-12-29 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800258#comment-17800258
 ] 

xiangyu feng edited comment on FLINK-33932 at 12/29/23 12:24 PM:
-

Hi [~dianer17], thanks for creating  this. IMHO, this retry is very useful to 
improve the success rate of checkpoint. We can introduce a default fixed delay 
retry mechanism here.

 

I have implemented a poc in this pr: 
[https://github.com/apache/flink/pull/23986], WDYT?


was (Author: JIRAUSER301129):
Hi [~dianer17], thanks for creating  this. IMHO, this retry is very useful to 
improve the success rate of checkpoint. We can introduce a default fixed delay 
retry mechanism without adding any configuration, thus we don't need a flip for 
this jira. 

 

I have implemented a poc in this pr: 
[https://github.com/apache/flink/pull/23986], WDYT?

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUpl

[jira] [Commented] (FLINK-33932) Support retry mechanism for rocksdb uploader

2023-12-27 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800910#comment-17800910
 ] 

xiangyu feng commented on FLINK-33932:
--

[~martijnvisser] Yes, like the default `ExponentialBackoffRetryStrategy` 
defined in `RpcGatewayRetriever` and 
`ExponentialWaitStrategy` defined in `RestClusterClient`.

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.net.ConnectException: Connection timed out
>     at sun.nio.ch.SocketChannelImpl.checkCo

[jira] [Comment Edited] (FLINK-33932) Support retry mechanism for rocksdb uploader

2023-12-25 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800258#comment-17800258
 ] 

xiangyu feng edited comment on FLINK-33932 at 12/25/23 8:55 AM:


Hi [~dianer17], thanks for creating  this. IMHO, this retry is very useful to 
improve the success rate of checkpoint. We can introduce a default fixed delay 
retry mechanism without adding any configuration, thus we don't need a flip for 
this jira. 

 

I have implemented a poc in this pr: 
[https://github.com/apache/flink/pull/23986], WDYT?


was (Author: JIRAUSER301129):
Hi [~dianer17], thanks for creating  this. IMHO, this retry is very useful to 
improve the success rate of checkpoint. We can introduce a default fixed delay 
retry mechanism without adding any configuration, thus we don't need a flip for 
this jira. 

 

I have implemented a poc in this pr: 
[https://github.com/apache/flink/pull/23986,] WDYT?

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apach

[jira] [Commented] (FLINK-33932) Support retry mechanism for rocksdb uploader

2023-12-24 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800258#comment-17800258
 ] 

xiangyu feng commented on FLINK-33932:
--

Hi [~dianer17], thanks for creating  this. IMHO, this retry is very useful to 
improve the success rate of checkpoint. We can introduce a default fixed delay 
retry mechanism without adding any configuration, thus we don't need a flip for 
this jira. 

 

I have implemented a poc in this pr: 
[https://github.com/apache/flink/pull/23986,] WDYT?

> Support retry mechanism for rocksdb uploader
> 
>
> Key: FLINK-33932
> URL: https://issues.apache.org/jira/browse/FLINK-33932
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Guojun Li
>Priority: Major
>  Labels: pull-request-available
>
> Rocksdb uploader will throw exception and decline the current checkpoint if 
> there are errors when uploading to remote file system like hdfs.
> The exception is as below:
> 2023-12-19 08:46:00,197 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline 
> checkpoint 2 by task 
> 5b008c2c048fa8534d648455e69f9497_fbcce2f96483f8f15e87dc6c9afd372f_183_0 of 
> job a025f19e at 
> application-6f1c6e3d-1702480803995-5093022-taskmanager-1-1 @ 
> fdbd:dc61:1a:101:0:0:0:36 (dataPort=38789).
> org.apache.flink.util.SerializedThrowable: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
> checkpoint failed.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Could not materialize checkpoint 2 for operator GlobalGroupAggregate[132] -> 
> Calc[133] (184/500)#0.
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: 
> java.util.concurrent.ExecutionException: java.io.IOException: Could not flush 
> to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at 
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: 
> Could not flush to file and close the file system output stream to 
> hdfs://xxx/shared/97329cbd-fb90-45ca-92de-edf52d6e6fc0 in order to obtain the 
> stream state handle
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:516)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:157)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:113)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
>  ~[flink-dist-1.17-xxx-SNAPSHOT.jar:1.17-xxx-SNAPSHOT]
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  

[jira] [Commented] (FLINK-33819) Support setting CompressType in RocksDBStateBackend

2023-12-14 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796977#comment-17796977
 ] 

xiangyu feng commented on FLINK-33819:
--

+1 for this. This will save lots of cpu usage for jobs with less state space 
usage but high state access frequency.

> Support setting CompressType in RocksDBStateBackend
> ---
>
> Key: FLINK-33819
> URL: https://issues.apache.org/jira/browse/FLINK-33819
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 1.18.0
>Reporter: Yue Ma
>Priority: Major
> Fix For: 1.19.0
>
> Attachments: image-2023-12-14-11-32-32-968.png, 
> image-2023-12-14-11-35-22-306.png
>
>
> Currently, RocksDBStateBackend does not support setting the compression 
> level, and Snappy is used for compression by default. But we have some 
> scenarios where compression will use a lot of CPU resources. Turning off 
> compression can significantly reduce CPU overhead. So we may need to support 
> a parameter for users to set the CompressType of Rocksdb.
>   !image-2023-12-14-11-35-22-306.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33686) Reuse ClusterDescriptor in AbstractSessionClusterExecutor when executing jobs on the same cluster

2023-12-13 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33686:
-
Summary: Reuse ClusterDescriptor in AbstractSessionClusterExecutor when 
executing jobs on the same cluster  (was: Reuse StandaloneClusterDescriptor in 
RemoteExecutor when executing jobs on the same cluster)

> Reuse ClusterDescriptor in AbstractSessionClusterExecutor when executing jobs 
> on the same cluster
> -
>
> Key: FLINK-33686
> URL: https://issues.apache.org/jira/browse/FLINK-33686
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: xiangyu feng
>Priority: Major
>
> Multiple `RemoteExecutor` instances can reuse the same 
> `StandaloneClusterDescriptor` when executing jobs to a same running cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33698) Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy

2023-12-02 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33698:
-
Description: 
The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` is 
problematic due to the lack of a reset when reusing the object.

 

Current Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return lastRetryDelay;
}
long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
} {code}
Fixed Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// reset to initialDelay
this.lastRetryDelay = initialDelay;
return lastRetryDelay;
}

long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
}{code}

  was:
The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
consider currentAttempts. 

 

Current Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return lastRetryDelay;
}
long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
} {code}
Fixed Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// reset to initialDelay
this.lastRetryDelay = initialDelay;
return lastRetryDelay;
}

long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
}{code}


> Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy
> 
>
> Key: FLINK-33698
> URL: https://issues.apache.org/jira/browse/FLINK-33698
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` is 
> problematic due to the lack of a reset when reusing the object.
>  
> Current Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // equivalent to initial delay
> return lastRetryDelay;
> }
> long backoff = Math.min((long) (lastRetryDelay * multiplier), 
> maxRetryDelay);
> this.lastRetryDelay = backoff;
> return backoff;
> } {code}
> Fixed Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // reset to initialDelay
> this.lastRetryDelay = initialDelay;
> return lastRetryDelay;
> }
> long backoff = Math.min((long) (lastRetryDelay * multiplier), 
> maxRetryDelay);
> this.lastRetryDelay = backoff;
> return backoff;
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33698) Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy

2023-12-02 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33698:
-
Description: 
The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
consider currentAttempts. 

 

Current Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return lastRetryDelay;
}
long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
} {code}
Fixed Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// reset to initialDelay
this.lastRetryDelay = initialDelay;
return lastRetryDelay;
}

long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
}{code}

  was:
The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
consider currentAttempts. 

 

Current Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return lastRetryDelay;
}
long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
} {code}
Fixed Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return initialDelay;
}
long backoff =
Math.min(
(long) (initialDelay * Math.pow(multiplier, currentAttempts 
- 1)),
maxRetryDelay);
return backoff;
} {code}


> Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy
> 
>
> Key: FLINK-33698
> URL: https://issues.apache.org/jira/browse/FLINK-33698
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
> consider currentAttempts. 
>  
> Current Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // equivalent to initial delay
> return lastRetryDelay;
> }
> long backoff = Math.min((long) (lastRetryDelay * multiplier), 
> maxRetryDelay);
> this.lastRetryDelay = backoff;
> return backoff;
> } {code}
> Fixed Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // reset to initialDelay
> this.lastRetryDelay = initialDelay;
> return lastRetryDelay;
> }
> long backoff = Math.min((long) (lastRetryDelay * multiplier), 
> maxRetryDelay);
> this.lastRetryDelay = backoff;
> return backoff;
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33702) Add IncrementalDelayRetryStrategy implementation of RetryStrategy

2023-12-01 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33702:
-
Summary: Add IncrementalDelayRetryStrategy implementation of RetryStrategy  
(was: Add IncrementalDelayRetryStrategy in AsyncRetryStrategies )

> Add IncrementalDelayRetryStrategy implementation of RetryStrategy
> -
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>
> RetryStrategy now supports FixedRetryStrategy and 
> ExponentialBackoffRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-12-01 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33702:
-
Description: 
RetryStrategy now supports FixedRetryStrategy and 
ExponentialBackoffRetryStrategy. 
In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce the 
retry count and perform the action more timely. 

IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate for 
each attempt.

  was:
AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy and 
ExponentialBackoffDelayRetryStrategy. 

In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce the 
retry count and perform the action more timely. 

IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate for 
each attempt.


> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>
> RetryStrategy now supports FixedRetryStrategy and 
> ExponentialBackoffRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-12-01 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng reopened FLINK-33702:
--

`RetryStrategy` in org.apache.flink.util.concurrent can be used in 
`CollectResultFetcher`.

> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>
> AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy 
> and ExponentialBackoffDelayRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-12-01 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng closed FLINK-33702.

Resolution: Not A Problem

`AsyncRetryStrategy` is designed for AsyncWaitOperator to use. It may not 
suitable for `CollectResultFetcher` to use.

> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>
> AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy 
> and ExponentialBackoffDelayRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-11-30 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791883#comment-17791883
 ] 

xiangyu feng commented on FLINK-33702:
--

Hi [~huwh] , would u kindly assign this to me and help review this pr?

> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>
> AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy 
> and ExponentialBackoffDelayRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-11-30 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791508#comment-17791508
 ] 

xiangyu feng commented on FLINK-33702:
--

Hi [~lincoln.86xy], I'll also work on this if it is ok from your side.

> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>
> AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy 
> and ExponentialBackoffDelayRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-11-30 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33702:
-
Issue Type: Improvement  (was: Bug)

> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>
> AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy 
> and ExponentialBackoffDelayRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-11-30 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33702:


 Summary: Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
 Key: FLINK-33702
 URL: https://issues.apache.org/jira/browse/FLINK-33702
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Reporter: xiangyu feng


AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy and 
ExponentialBackoffDelayRetryStrategy.  In certain scenarios, we also need 
IncrementalDelayRetryStrategy to reduce the retry count and perform the action 
more timely. 

 

IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate for 
each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33702) Add IncrementalDelayRetryStrategy in AsyncRetryStrategies

2023-11-30 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33702:
-
Description: 
AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy and 
ExponentialBackoffDelayRetryStrategy. 

In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce the 
retry count and perform the action more timely. 

IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate for 
each attempt.

  was:
AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy and 
ExponentialBackoffDelayRetryStrategy.  In certain scenarios, we also need 
IncrementalDelayRetryStrategy to reduce the retry count and perform the action 
more timely. 

 

IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate for 
each attempt.


> Add IncrementalDelayRetryStrategy in AsyncRetryStrategies 
> --
>
> Key: FLINK-33702
> URL: https://issues.apache.org/jira/browse/FLINK-33702
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>
> AsyncRetryStrategies now supports NoRetryStrategy, FixedDelayRetryStrategy 
> and ExponentialBackoffDelayRetryStrategy. 
> In certain scenarios, we also need IncrementalDelayRetryStrategy to reduce 
> the retry count and perform the action more timely. 
> IncrementalDelayRetryStrategy will increase the retry delay at a fixed rate 
> for each attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33684) Improve the retry strategy in CollectResultFetcher

2023-11-30 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33684:
-
Description: 
Currently  CollectResultFetcher use a fixed retry interval.
{code:java}
private void sleepBeforeRetry() {
if (retryMillis <= 0) {
return;
}

try {
// TODO a more proper retry strategy?
Thread.sleep(retryMillis);
} catch (InterruptedException e) {
LOG.warn("Interrupted when sleeping before a retry", e);
}
} {code}
This can be improved with a RetryStrategy.

  was:
Currently  CollectResultFetcher use a fixed retry interval.
{code:java}
private void sleepBeforeRetry() {
if (retryMillis <= 0) {
return;
}

try {
// TODO a more proper retry strategy?
Thread.sleep(retryMillis);
} catch (InterruptedException e) {
LOG.warn("Interrupted when sleeping before a retry", e);
}
} {code}
This can be improved with a WaitStrategy.


> Improve the retry strategy in CollectResultFetcher
> --
>
> Key: FLINK-33684
> URL: https://issues.apache.org/jira/browse/FLINK-33684
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>
> Currently  CollectResultFetcher use a fixed retry interval.
> {code:java}
> private void sleepBeforeRetry() {
> if (retryMillis <= 0) {
> return;
> }
> try {
> // TODO a more proper retry strategy?
> Thread.sleep(retryMillis);
> } catch (InterruptedException e) {
> LOG.warn("Interrupted when sleeping before a retry", e);
> }
> } {code}
> This can be improved with a RetryStrategy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33698) Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy

2023-11-29 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791405#comment-17791405
 ] 

xiangyu feng commented on FLINK-33698:
--

[~lincoln.86xy] Sure, would you kindly assign this to me?

> Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy
> 
>
> Key: FLINK-33698
> URL: https://issues.apache.org/jira/browse/FLINK-33698
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>
> The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
> consider currentAttempts. 
>  
> Current Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // equivalent to initial delay
> return lastRetryDelay;
> }
> long backoff = Math.min((long) (lastRetryDelay * multiplier), 
> maxRetryDelay);
> this.lastRetryDelay = backoff;
> return backoff;
> } {code}
> Fixed Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // equivalent to initial delay
> return initialDelay;
> }
> long backoff =
> Math.min(
> (long) (initialDelay * Math.pow(multiplier, 
> currentAttempts - 1)),
> maxRetryDelay);
> return backoff;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33698) Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy

2023-11-29 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791252#comment-17791252
 ] 

xiangyu feng commented on FLINK-33698:
--

Hi [~lincoln.86xy] , what do you think about this?

> Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy
> 
>
> Key: FLINK-33698
> URL: https://issues.apache.org/jira/browse/FLINK-33698
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Reporter: xiangyu feng
>Priority: Major
>
> The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
> consider currentAttempts. 
>  
> Current Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // equivalent to initial delay
> return lastRetryDelay;
> }
> long backoff = Math.min((long) (lastRetryDelay * multiplier), 
> maxRetryDelay);
> this.lastRetryDelay = backoff;
> return backoff;
> } {code}
> Fixed Version:
> {code:java}
> @Override
> public long getBackoffTimeMillis(int currentAttempts) {
> if (currentAttempts <= 1) {
> // equivalent to initial delay
> return initialDelay;
> }
> long backoff =
> Math.min(
> (long) (initialDelay * Math.pow(multiplier, 
> currentAttempts - 1)),
> maxRetryDelay);
> return backoff;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33698) Fix the backoff time calculation in ExponentialBackoffDelayRetryStrategy

2023-11-29 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33698:


 Summary: Fix the backoff time calculation in 
ExponentialBackoffDelayRetryStrategy
 Key: FLINK-33698
 URL: https://issues.apache.org/jira/browse/FLINK-33698
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Reporter: xiangyu feng


The backoff time calculation in `ExponentialBackoffDelayRetryStrategy` should 
consider currentAttempts. 

 

Current Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return lastRetryDelay;
}
long backoff = Math.min((long) (lastRetryDelay * multiplier), 
maxRetryDelay);
this.lastRetryDelay = backoff;
return backoff;
} {code}
Fixed Version:
{code:java}
@Override
public long getBackoffTimeMillis(int currentAttempts) {
if (currentAttempts <= 1) {
// equivalent to initial delay
return initialDelay;
}
long backoff =
Math.min(
(long) (initialDelay * Math.pow(multiplier, currentAttempts 
- 1)),
maxRetryDelay);
return backoff;
} {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33683) Improve the performance of submitting jobs and fetching results to a running flink cluster

2023-11-29 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33683:
-
Description: 
Now there are lots of unnecessary overhead involved in submitting jobs and 
fetching results to a long-running flink cluster. This works well for streaming 
and batch job, because in these scenarios users will not frequently submit jobs 
and fetch result to a running cluster.

 

But in OLAP scenario, users will continuously submit lots of short-lived jobs 
to the running cluster. In this situation, these overhead will have a huge 
impact on the E2E performance.  Here are some examples of unnecessary overhead:
 * Each `RemoteExecutor` will create a new `StandaloneClusterDescriptor` when 
executing a job on the same remote cluster
 * `StandaloneClusterDescriptor` will always create a new `RestClusterClient` 
when retrieving an existing Flink Cluster
 * Each `RestClusterClient` will create a new `ClientHighAvailabilityServices` 
which might contains a resource-consuming ha client(ZKClient or KubeClient) and 
a time-consuming leader retrieval operation
 * `RestClient` will create a new connection for every request which costs 
extra connection establishment time

 

Therefore, I suggest creating this ticket and following subtasks to improve 
this performance. This ticket is also relates to  FLINK-25318.

  was:
There is now a lot of unnecessary overhead involved in submitting jobs and 
fetching results to a long-running flink cluster. This works well for streaming 
and batch job, because in these scenarios users will not frequently submit jobs 
and fetch result to a running cluster.

 

But in OLAP scenario, users will continuously submit lots of short-lived jobs 
to the running cluster. In this situation, these overhead will have a huge 
impact on the E2E performance.  Here are some examples of unnecessary overhead:
 * Each `RemoteExecutor` will create a new `StandaloneClusterDescriptor` when 
executing a job on the same remote cluster
 * `StandaloneClusterDescriptor` will always create a new `RestClusterClient` 
when retrieving an existing Flink Cluster
 * Each `RestClusterClient` will create a new `ClientHighAvailabilityServices` 
which might contains a resource-consuming ha client(ZKClient or KubeClient) and 
a time-consuming leader retrieval operation
 * `RestClient` will create a new connection for every request which costs 
extra connection establishment time

 

Therefore, I suggest creating this ticket and following subtasks to improve 
this performance. This ticket is also relates to  FLINK-25318.


> Improve the performance of submitting jobs and fetching results to a running 
> flink cluster
> --
>
> Key: FLINK-33683
> URL: https://issues.apache.org/jira/browse/FLINK-33683
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission, Table SQL / Client
>Reporter: xiangyu feng
>Priority: Major
>
> Now there are lots of unnecessary overhead involved in submitting jobs and 
> fetching results to a long-running flink cluster. This works well for 
> streaming and batch job, because in these scenarios users will not frequently 
> submit jobs and fetch result to a running cluster.
>  
> But in OLAP scenario, users will continuously submit lots of short-lived jobs 
> to the running cluster. In this situation, these overhead will have a huge 
> impact on the E2E performance.  Here are some examples of unnecessary 
> overhead:
>  * Each `RemoteExecutor` will create a new `StandaloneClusterDescriptor` when 
> executing a job on the same remote cluster
>  * `StandaloneClusterDescriptor` will always create a new `RestClusterClient` 
> when retrieving an existing Flink Cluster
>  * Each `RestClusterClient` will create a new 
> `ClientHighAvailabilityServices` which might contains a resource-consuming ha 
> client(ZKClient or KubeClient) and a time-consuming leader retrieval operation
>  * `RestClient` will create a new connection for every request which costs 
> extra connection establishment time
>  
> Therefore, I suggest creating this ticket and following subtasks to improve 
> this performance. This ticket is also relates to  FLINK-25318.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32756) Reuse ClientHighAvailabilityServices when creating RestClusterClient

2023-11-28 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32756:
-
Parent: FLINK-33683
Issue Type: Sub-task  (was: Improvement)

> Reuse ClientHighAvailabilityServices when creating RestClusterClient
> 
>
> Key: FLINK-32756
> URL: https://issues.apache.org/jira/browse/FLINK-32756
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> Currently, every newly built RestClusterClient will create a new 
> ClientHighAvailabilityServices which is both unnecessary and resource 
> consuming. For example, each ZooKeeperClientHAServices contains a ZKClient 
> which holds a connection to ZK server and several related threads.
> By reusing ClientHighAvailabilityServices across multiple RestClusterClient 
> instances, we can save system resources(threads, connections), connection 
> establish time and leader retrieval time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32756) Reuse ClientHighAvailabilityServices when creating RestClusterClient

2023-11-28 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32756:
-
Parent: (was: FLINK-25318)
Issue Type: Improvement  (was: Sub-task)

> Reuse ClientHighAvailabilityServices when creating RestClusterClient
> 
>
> Key: FLINK-32756
> URL: https://issues.apache.org/jira/browse/FLINK-32756
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> Currently, every newly built RestClusterClient will create a new 
> ClientHighAvailabilityServices which is both unnecessary and resource 
> consuming. For example, each ZooKeeperClientHAServices contains a ZKClient 
> which holds a connection to ZK server and several related threads.
> By reusing ClientHighAvailabilityServices across multiple RestClusterClient 
> instances, we can save system resources(threads, connections), connection 
> establish time and leader retrieval time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33688) Reuse Channels in RestClient to save connection establish time

2023-11-28 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33688:


 Summary: Reuse Channels in RestClient to save connection establish 
time
 Key: FLINK-33688
 URL: https://issues.apache.org/jira/browse/FLINK-33688
 Project: Flink
  Issue Type: Sub-task
  Components: Client / Job Submission
Reporter: xiangyu feng


RestClient can reuse the connections to Dispatcher when submitting http 
requests to a long running Flink cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33687) Reuse RestClusterClient in StandaloneClusterDescriptor to avoid frequent thread create/destroy

2023-11-28 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33687:


 Summary: Reuse RestClusterClient in StandaloneClusterDescriptor to 
avoid frequent thread create/destroy
 Key: FLINK-33687
 URL: https://issues.apache.org/jira/browse/FLINK-33687
 Project: Flink
  Issue Type: Sub-task
  Components: Client / Job Submission
Reporter: xiangyu feng


`RestClusterClient` can also be reused when submitting programs to a 
long-running Flink Cluster



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33686) Reuse StandaloneClusterDescriptor in RemoteExecutor when executing jobs on the same cluster

2023-11-28 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33686:


 Summary: Reuse StandaloneClusterDescriptor in RemoteExecutor when 
executing jobs on the same cluster
 Key: FLINK-33686
 URL: https://issues.apache.org/jira/browse/FLINK-33686
 Project: Flink
  Issue Type: Sub-task
  Components: Client / Job Submission
Reporter: xiangyu feng


Multiple `RemoteExecutor` instances can reuse the same 
`StandaloneClusterDescriptor` when executing jobs to a same running cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33685) StandaloneClusterId need to distinguish different remote clusters

2023-11-28 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33685:


 Summary: StandaloneClusterId need to distinguish different remote 
clusters
 Key: FLINK-33685
 URL: https://issues.apache.org/jira/browse/FLINK-33685
 Project: Flink
  Issue Type: Sub-task
  Components: Client / Job Submission
Reporter: xiangyu feng


`StandaloneClusterId` is a singleton, which means `StandaloneClusterDescriptor` 
cannot distinguish different remote running clusters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33684) Improve the retry strategy in CollectResultFetcher

2023-11-28 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33684:


 Summary: Improve the retry strategy in CollectResultFetcher
 Key: FLINK-33684
 URL: https://issues.apache.org/jira/browse/FLINK-33684
 Project: Flink
  Issue Type: Sub-task
  Components: API / DataStream
Reporter: xiangyu feng


Currently  CollectResultFetcher use a fixed retry interval.
{code:java}
private void sleepBeforeRetry() {
if (retryMillis <= 0) {
return;
}

try {
// TODO a more proper retry strategy?
Thread.sleep(retryMillis);
} catch (InterruptedException e) {
LOG.warn("Interrupted when sleeping before a retry", e);
}
} {code}
This can be improved with a WaitStrategy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33683) Improve the performance of submitting jobs and fetching results to a running flink cluster

2023-11-28 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33683:


 Summary: Improve the performance of submitting jobs and fetching 
results to a running flink cluster
 Key: FLINK-33683
 URL: https://issues.apache.org/jira/browse/FLINK-33683
 Project: Flink
  Issue Type: Improvement
  Components: Client / Job Submission, Table SQL / Client
Reporter: xiangyu feng


There is now a lot of unnecessary overhead involved in submitting jobs and 
fetching results to a long-running flink cluster. This works well for streaming 
and batch job, because in these scenarios users will not frequently submit jobs 
and fetch result to a running cluster.

 

But in OLAP scenario, users will continuously submit lots of short-lived jobs 
to the running cluster. In this situation, these overhead will have a huge 
impact on the E2E performance.  Here are some examples of unnecessary overhead:
 * Each `RemoteExecutor` will create a new `StandaloneClusterDescriptor` when 
executing a job on the same remote cluster
 * `StandaloneClusterDescriptor` will always create a new `RestClusterClient` 
when retrieving an existing Flink Cluster
 * Each `RestClusterClient` will create a new `ClientHighAvailabilityServices` 
which might contains a resource-consuming ha client(ZKClient or KubeClient) and 
a time-consuming leader retrieval operation
 * `RestClient` will create a new connection for every request which costs 
extra connection establishment time

 

Therefore, I suggest creating this ticket and following subtasks to improve 
this performance. This ticket is also relates to  FLINK-25318.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32756) Reuse ClientHighAvailabilityServices when creating RestClusterClient

2023-11-28 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32756:
-
Summary: Reuse ClientHighAvailabilityServices when creating 
RestClusterClient  (was: Reuse ClientHighAvailabilityServices in 
RestClusterClient when submitting OLAP jobs)

> Reuse ClientHighAvailabilityServices when creating RestClusterClient
> 
>
> Key: FLINK-32756
> URL: https://issues.apache.org/jira/browse/FLINK-32756
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: xiangyu feng
>Priority: Major
>
> Currently, every newly built RestClusterClient will create a new 
> ClientHighAvailabilityServices which is both unnecessary and resource 
> consuming. For example, each ZooKeeperClientHAServices contains a ZKClient 
> which holds a connection to ZK server and several related threads.
> By reusing ClientHighAvailabilityServices across multiple RestClusterClient 
> instances, we can save system resources(threads, connections), connection 
> establish time and leader retrieval time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32756) Reuse ClientHighAvailabilityServices when creating RestClusterClient

2023-11-28 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790544#comment-17790544
 ] 

xiangyu feng commented on FLINK-32756:
--

[~guoyangze] Hi, would u kindly assign this Jira to me.

> Reuse ClientHighAvailabilityServices when creating RestClusterClient
> 
>
> Key: FLINK-32756
> URL: https://issues.apache.org/jira/browse/FLINK-32756
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: xiangyu feng
>Priority: Major
>
> Currently, every newly built RestClusterClient will create a new 
> ClientHighAvailabilityServices which is both unnecessary and resource 
> consuming. For example, each ZooKeeperClientHAServices contains a ZKClient 
> which holds a connection to ZK server and several related threads.
> By reusing ClientHighAvailabilityServices across multiple RestClusterClient 
> instances, we can save system resources(threads, connections), connection 
> establish time and leader retrieval time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32756) Reuse ClientHighAvailabilityServices in RestClusterClient when submitting OLAP jobs

2023-11-28 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32756:
-
Description: 
Currently, every newly built RestClusterClient will create a new 
ClientHighAvailabilityServices which is both unnecessary and resource 
consuming. For example, each ZooKeeperClientHAServices contains a ZKClient 
which holds a connection to ZK server and several related threads.

By reusing ClientHighAvailabilityServices across multiple RestClusterClient 
instances, we can save system resources(threads, connections), connection 
establish time and leader retrieval time.

  was:
In OLAP scenario, we submit queries to flink session cluster through the 
flink-sql-gateway service. When receiving queries, the gateway service will 
create sessions to handle the query, each session will create a new 
RestClusterClient and a new ClientHAServices.

 

In our production usage, we have enabled JobManager ZK HA and use 
ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices will 
establish a network connection with ZK and create four ZK related threads.

 

When QPS reaches 200, more than 1000 sessions are created in a single 
flink-sql-gateway instance, which means more than 1000 ZK connections and 4000 
ZK related threads are created simultaneously. This will raise a significant 
stability risk in production.

 

To address this problem, we have implemented SharedZKClientHAService for 
different sessions to share a ZK connection and ZKClient. This works well in 
our production.


> Reuse ClientHighAvailabilityServices in RestClusterClient when submitting 
> OLAP jobs
> ---
>
> Key: FLINK-32756
> URL: https://issues.apache.org/jira/browse/FLINK-32756
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: xiangyu feng
>Priority: Major
>
> Currently, every newly built RestClusterClient will create a new 
> ClientHighAvailabilityServices which is both unnecessary and resource 
> consuming. For example, each ZooKeeperClientHAServices contains a ZKClient 
> which holds a connection to ZK server and several related threads.
> By reusing ClientHighAvailabilityServices across multiple RestClusterClient 
> instances, we can save system resources(threads, connections), connection 
> establish time and leader retrieval time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33458) Add env.java.opts.gateway option in flink-conf.yaml

2023-11-05 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782983#comment-17782983
 ] 

xiangyu feng commented on FLINK-33458:
--

Hi [~agsharath], thx for reporting this. However, I think this is resolved in 
https://issues.apache.org/jira/browse/FLINK-33203 .

> Add env.java.opts.gateway option in flink-conf.yaml
> ---
>
> Key: FLINK-33458
> URL: https://issues.apache.org/jira/browse/FLINK-33458
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Gateway
>Affects Versions: 1.18.0
>Reporter: Sharath Avadoot Gururaj
>Priority: Minor
>  Labels: easyfix
>
> Currently, JobManager, TaskManager, HistoryServer, Client have their own 
> configuration keys to set java options in flink-conf.yaml
> However it is missing for SQL Gateway. It should be added for completeness. 
> It can be useful, for example, to add Remote Debugging options.
> I propose this config key 
>  
> {noformat}
> env.java.opts.gateway{noformat}
>  in flink-conf.yaml which will be applied to SQL Gateway. I already have a 
> working patch. Its a pretty small one. If the community is OK, I will raise a 
> PR



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-33235) Improve the Quickstart guide for Flink OLAP and translate to Chinese

2023-11-01 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng resolved FLINK-33235.
--
Fix Version/s: 1.19.0
   Resolution: Fixed

> Improve the Quickstart guide for Flink OLAP and translate to Chinese
> 
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add new introduced options;
> 5, Translate to Chinese;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP and translate to Chinese

2023-10-14 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Description: 
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add new introduced options;

5, Translate to Chinese;

  was:
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Translate to Chinese;


> Improve the Quickstart guide for Flink OLAP and translate to Chinese
> 
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add new introduced options;
> 5, Translate to Chinese;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33235) Improve the Quickstart guide for Flink OLAP and translate to Chinese

2023-10-13 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774866#comment-17774866
 ] 

xiangyu feng commented on FLINK-33235:
--

Hi [~martijnvisser], excuse me for a minutes. I will take your advice and 
continue this work if this is ok for you.

> Improve the Quickstart guide for Flink OLAP and translate to Chinese
> 
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Translate to Chinese;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33235) Improve the Quickstart guide for Flink OLAP and translate to Chinese

2023-10-12 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774466#comment-17774466
 ] 

xiangyu feng commented on FLINK-33235:
--

[~martijnvisser] Sorry I didn't make myself clear. I mean this document might 
be released in 1.19 along within its relying features. After then, we might add 
new features to master branch and then we should update this documents 
accordingly. 

> Improve the Quickstart guide for Flink OLAP and translate to Chinese
> 
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Translate to Chinese;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-33209) Translate Flink OLAP quick start guide to Chinese

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng closed FLINK-33209.

Resolution: Duplicate

duplicate with FLINK-33235

> Translate Flink OLAP quick start guide to Chinese
> -
>
> Key: FLINK-33209
> URL: https://issues.apache.org/jira/browse/FLINK-33209
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> Translate Flink OLAP quick start guide to Chinese



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP and translate to Chinese

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Summary: Improve the Quickstart guide for Flink OLAP and translate to 
Chinese  (was: Improve the Quickstart guide for Flink OLAP)

> Improve the Quickstart guide for Flink OLAP and translate to Chinese
> 
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Translate to Chinese;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774461#comment-17774461
 ] 

xiangyu feng commented on FLINK-33235:
--

[~martijnvisser] I see that now. When this documentation is released, all the 
referenced feature should also be released. I should only update this document 
in next release cycle, right?

> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add notes to explain that some OLAP features may have not been released 
> yet, user can choose to build from the master branch;
> 5, Translate to Chinese



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Description: 
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Translate to Chinese;

  was:
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add notes to explain that some OLAP features may have not been released yet, 
user can choose to build from the master branch;

5, Translate to Chinese


> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Translate to Chinese;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Description: 
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add notes to explain that some OLAP features may have not been released yet, 
user can choose to build from the master branch;

5, Translate to Chinese

  was:
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add notes to explain that some OLAP features may have not been released yet, 
user can choose to build from the master branch;


> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add notes to explain that some OLAP features may have not been released 
> yet, user can choose to build from the master branch;
> 5, Translate to Chinese



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774452#comment-17774452
 ] 

xiangyu feng edited comment on FLINK-33235 at 10/12/23 11:37 AM:
-

[~martijnvisser] Hi Martjin, thx for your comment. I see your point now and I 
agree with you on this. However, I still want to improve this doc in following 
ways:

1, After checking the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 we should use 'active voice' and 'you'  instead of 'passive voice' and 'we' in 
the doc;

2, Add architecture graphs in this doc;

3, Fix typos and misconfigured options in this doc;

4, Add notes to explain that some OLAP features may have not released yet, user 
can choose to build from the master branch;

I will change the tittle and description accordingly, WDYT?

 


was (Author: JIRAUSER301129):
[~martijnvisser] Hi Martjin, thx for your comment. I see your point now and I 
agree with you on this. However, I still want to improve this doc in following 
ways:

1, After checking the [Documentation 
Guide|[https://flink.apache.org/how-to-contribute/documentation-style-guide/],] 
we should use 'active voice' and 'you'  instead of 'passive voice' and 'we' in 
the doc;

2, Add architecture graphs in this doc;

3, Fix typos and misconfigured options in this doc;

4, Add notes to explain that some OLAP features may have not released yet, user 
can choose to build from the master branch;

I will change the tittle and description accordingly, WDYT?

 

> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the [Documentation 
> Guide|[https://flink.apache.org/how-to-contribute/documentation-style-guide/]],
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add notes to explain that some OLAP features may have not been released 
> yet, user can choose to build from the master branch;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Description: 
1, Follow the 
[https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
 use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add notes to explain that some OLAP features may have not been released yet, 
user can choose to build from the master branch;

  was:
1, Follow the [Documentation 
Guide|[https://flink.apache.org/how-to-contribute/documentation-style-guide/]], 
use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add notes to explain that some OLAP features may have not been released yet, 
user can choose to build from the master branch;


> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the 
> [https://flink.apache.org/how-to-contribute/documentation-style-guide/,|https://flink.apache.org/how-to-contribute/documentation-style-guide/]
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add notes to explain that some OLAP features may have not been released 
> yet, user can choose to build from the master branch;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Description: 
1, Follow the [Documentation 
Guide|[https://flink.apache.org/how-to-contribute/documentation-style-guide/]], 
use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;

2, Add architecture graphs in the doc;

3, Fix typos and misconfigured options in the doc;

4, Add notes to explain that some OLAP features may have not been released yet, 
user can choose to build from the master branch;

  was:Many features required by OLAP session cluster are still in master branch 
or in-progress and not released yet, for example: FLIP-295, FLIP-362, FLIP-374. 
We need to address this in the document and show users how to quickly build 
OLAP Session Cluster from master branch.


> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> 1, Follow the [Documentation 
> Guide|[https://flink.apache.org/how-to-contribute/documentation-style-guide/]],
>  use 'active voice' and 'you' instead of 'passive voice' and 'we' in the doc;
> 2, Add architecture graphs in the doc;
> 3, Fix typos and misconfigured options in the doc;
> 4, Add notes to explain that some OLAP features may have not been released 
> yet, user can choose to build from the master branch;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33235) Improve the Quickstart guide for Flink OLAP

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33235:
-
Summary: Improve the Quickstart guide for Flink OLAP  (was: Quickstart 
guide for Flink OLAP should support building from master branch)

> Improve the Quickstart guide for Flink OLAP
> ---
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> Many features required by OLAP session cluster are still in master branch or 
> in-progress and not released yet, for example: FLIP-295, FLIP-362, FLIP-374. 
> We need to address this in the document and show users how to quickly build 
> OLAP Session Cluster from master branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33235) Quickstart guide for Flink OLAP should support building from master branch

2023-10-12 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774452#comment-17774452
 ] 

xiangyu feng commented on FLINK-33235:
--

[~martijnvisser] Hi Martjin, thx for your comment. I see your point now and I 
agree with you on this. However, I still want to improve this doc in following 
ways:

1, After checking the [Documentation 
Guide|[https://flink.apache.org/how-to-contribute/documentation-style-guide/],] 
we should use 'active voice' and 'you'  instead of 'passive voice' and 'we' in 
the doc;

2, Add architecture graphs in this doc;

3, Fix typos and misconfigured options in this doc;

4, Add notes to explain that some OLAP features may have not released yet, user 
can choose to build from the master branch;

I will change the tittle and description accordingly, WDYT?

 

> Quickstart guide for Flink OLAP should support building from master branch
> --
>
> Key: FLINK-33235
> URL: https://issues.apache.org/jira/browse/FLINK-33235
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>
> Many features required by OLAP session cluster are still in master branch or 
> in-progress and not released yet, for example: FLIP-295, FLIP-362, FLIP-374. 
> We need to address this in the document and show users how to quickly build 
> OLAP Session Cluster from master branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-32746) Using ZGC in JDK17 to solve long time class unloading STW

2023-10-12 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng closed FLINK-32746.

Resolution: Done

JDK17 is already supported in 
https://issues.apache.org/jira/browse/FLINK-15736, user can enable ZGC by 
configuring by customizing the `env.java.opts.all` parameter. There already 
have a example in OLAP QuickStart 
docs:https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/olap_quickstart/

> Using ZGC in JDK17 to solve long time class unloading STW
> -
>
> Key: FLINK-32746
> URL: https://issues.apache.org/jira/browse/FLINK-32746
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Runtime
>Reporter: xiangyu feng
>Priority: Major
>
> In a OLAP session cluster, a TM need to frequently create new classloaders 
> and  generate new classes. These classes will be accumulated in metaspace. 
> When metaspace data usage reaches a threshold, a FullGC with a long time 
> Stop-the-World will be triggered. Currently, both SerialGC, ParallelGC and 
> G1GC are doing Stop-the-World class unloading. Only ZGC supports concurrent 
> class unload, see more in 
> [https://bugs.openjdk.org/browse/JDK-8218905|https://bugs.openjdk.org/browse/JDK-8218905).].
>  
> In our scenario, a class unloading for a 2GB metaspace with 5million classes 
> will stop the application more than 40 seconds. After switch to ZGC, the 
> maximum STW of the application has been reduced to less than 10ms.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33235) Quickstart guide for Flink OLAP should support building from master branch

2023-10-10 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33235:


 Summary: Quickstart guide for Flink OLAP should support building 
from master branch
 Key: FLINK-33235
 URL: https://issues.apache.org/jira/browse/FLINK-33235
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Reporter: xiangyu feng


Many features required by OLAP session cluster are still in master branch or 
in-progress and not released yet, for example: FLIP-295, FLIP-362, FLIP-374. We 
need to address this in the document and show users how to quickly build OLAP 
Session Cluster from master branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33228) Fix the total current resource calculation when fulfilling requirements

2023-10-10 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773597#comment-17773597
 ] 

xiangyu feng edited comment on FLINK-33228 at 10/10/23 9:01 AM:


Hi [~guoyangze], I've fixed this issue. Would u kindly assign this Jira to me?


was (Author: JIRAUSER301129):
Hi [~guoyangze], I fixed this issue by changing the totalCurrentResources from 
`ResourceProfile`  to `InternalResourceInfo`. Would u kindly assign this Jira 
to me?

> Fix the total current resource calculation when fulfilling requirements
> ---
>
> Key: FLINK-33228
> URL: https://issues.apache.org/jira/browse/FLINK-33228
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
> Attachments: image-2023-10-10-16-09-23-635.png
>
>
> Currently, the `totalCurrentResources` calculation in 
> `DefaultResourceAllocationStrategy#tryFulfillRequirements` is wrong.
> `ResourceProfile.merge` will not change the original `ResourceProfile`.
> !image-2023-10-10-16-09-23-635.png|width=564,height=286!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33228) Fix the total current resource calculation when fulfilling requirements

2023-10-10 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773597#comment-17773597
 ] 

xiangyu feng commented on FLINK-33228:
--

Hi [~guoyangze], I fixed this issue by changing the totalCurrentResources from 
`ResourceProfile`  to `InternalResourceInfo`. Would u kindly assign this Jira 
to me?

> Fix the total current resource calculation when fulfilling requirements
> ---
>
> Key: FLINK-33228
> URL: https://issues.apache.org/jira/browse/FLINK-33228
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2023-10-10-16-09-23-635.png
>
>
> Currently, the `totalCurrentResources` calculation in 
> `DefaultResourceAllocationStrategy#tryFulfillRequirements` is wrong.
> `ResourceProfile.merge` will not change the original `ResourceProfile`.
> !image-2023-10-10-16-09-23-635.png|width=564,height=286!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33228) Fix the total current resource calculation when fulfilling requirements

2023-10-10 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33228:
-
Description: 
Currently, the `totalCurrentResources` calculation in 
`DefaultResourceAllocationStrategy#tryFulfillRequirements` is wrong.

`ResourceProfile.merge` will not change the original `ResourceProfile`.

!image-2023-10-10-16-09-23-635.png|width=564,height=286!

> Fix the total current resource calculation when fulfilling requirements
> ---
>
> Key: FLINK-33228
> URL: https://issues.apache.org/jira/browse/FLINK-33228
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: xiangyu feng
>Priority: Major
> Attachments: image-2023-10-10-16-09-23-635.png
>
>
> Currently, the `totalCurrentResources` calculation in 
> `DefaultResourceAllocationStrategy#tryFulfillRequirements` is wrong.
> `ResourceProfile.merge` will not change the original `ResourceProfile`.
> !image-2023-10-10-16-09-23-635.png|width=564,height=286!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33228) Fix the total current resource calculation when fulfilling requirements

2023-10-10 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33228:
-
Attachment: image-2023-10-10-16-09-23-635.png

> Fix the total current resource calculation when fulfilling requirements
> ---
>
> Key: FLINK-33228
> URL: https://issues.apache.org/jira/browse/FLINK-33228
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: xiangyu feng
>Priority: Major
> Attachments: image-2023-10-10-16-09-23-635.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33228) Fix the total current resource calculation when fulfilling requirements

2023-10-10 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33228:
-
Summary: Fix the total current resource calculation when fulfilling 
requirements  (was: Fix the total current resource calculation in fulfill 
requirements)

> Fix the total current resource calculation when fulfilling requirements
> ---
>
> Key: FLINK-33228
> URL: https://issues.apache.org/jira/browse/FLINK-33228
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: xiangyu feng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33228) Fix the total current resource calculation in fulfill requirements

2023-10-10 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33228:


 Summary: Fix the total current resource calculation in fulfill 
requirements
 Key: FLINK-33228
 URL: https://issues.apache.org/jira/browse/FLINK-33228
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Reporter: xiangyu feng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33209) Translate Flink OLAP quick start guide to Chinese

2023-10-08 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773093#comment-17773093
 ] 

xiangyu feng commented on FLINK-33209:
--

Hi [~shengbo], thanks for your volunteering. Currently we are still improving 
this document both in English and Chinese version since OLAP is still in fast 
progress, so we'd rather keeping this work in house. Is that working for you?

> Translate Flink OLAP quick start guide to Chinese
> -
>
> Key: FLINK-33209
> URL: https://issues.apache.org/jira/browse/FLINK-33209
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation
>Reporter: xiangyu feng
>Priority: Major
>
> Translate Flink OLAP quick start guide to Chinese



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33209) Translate Flink OLAP quick start guide to Chinese

2023-10-08 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-33209:
-
Priority: Major  (was: Minor)

> Translate Flink OLAP quick start guide to Chinese
> -
>
> Key: FLINK-33209
> URL: https://issues.apache.org/jira/browse/FLINK-33209
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation
>Reporter: xiangyu feng
>Priority: Major
>
> Translate Flink OLAP quick start guide to Chinese



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33209) Translate Flink OLAP quick start guide to Chinese

2023-10-08 Thread xiangyu feng (Jira)
xiangyu feng created FLINK-33209:


 Summary: Translate Flink OLAP quick start guide to Chinese
 Key: FLINK-33209
 URL: https://issues.apache.org/jira/browse/FLINK-33209
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation
Reporter: xiangyu feng


Translate Flink OLAP quick start guide to Chinese



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-15959) Add min number of slots configuration to limit total number of slots

2023-09-28 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770291#comment-17770291
 ] 

xiangyu feng commented on FLINK-15959:
--

[~zhuzh] Hi, the 
[Flip-362|https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation]
 has been accepted. we will continue this work.

> Add min number of slots configuration to limit total number of slots
> 
>
> Key: FLINK-15959
> URL: https://issues.apache.org/jira/browse/FLINK-15959
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: YufeiLiu
>Assignee: xiangyu feng
>Priority: Major
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> auto-unassigned, pull-request-available
>
> Flink removed `-n` option after FLIP-6, change to ResourceManager start a new 
> worker when required. But I think maintain a certain amount of slots is 
> necessary. These workers will start immediately when ResourceManager starts 
> and would not release even if all slots are free.
> Here are some resons:
> # Users actually know how many resources are needed when run a single job, 
> initialize all workers when cluster starts can speed up startup process.
> # Job schedule in  topology order,  next operator won't schedule until prior 
> execution slot allocated. The TaskExecutors will start in several batchs in 
> some cases, it might slow down the startup speed.
> # Flink support 
> [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122] [Spread out 
> tasks evenly across all available registered TaskManagers], but it will only 
> effect if all TMs are registered. Start all TMs at begining can slove this 
> problem.
> *suggestion:*
> * Add config "taskmanager.minimum.numberOfTotalSlots" and 
> "taskmanager.maximum.numberOfTotalSlots", default behavior is still like 
> before.
> * Start plenty number of workers to satisfy minimum slots when 
> ResourceManager accept leadership(subtract recovered workers).
> * Don't comlete slot request until minimum number of slots are registered, 
> and throw exeception when exceed maximum.
> *update*
> Finally, we'd like to introduce three config options related to the minimum 
> resources restriction:
> * slotmanager.min-total-resource.cpu
> * slotmanager.min-total-resource.memory
> * slotmanager.number-of-slots.min 
> Note that these configuration do not take effect for standalone clusters, 
> where how many slots are allocated is not controlled by Flink. These config 
> are best effort and Flink will not block the job progress even if the min 
> resources are not guaranteed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-25015) job name should not always be `collect` submitted by sql client

2023-09-24 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768410#comment-17768410
 ] 

xiangyu feng edited comment on FLINK-25015 at 9/24/23 4:14 PM:
---

[~zjureel] [~libenchao]  Hi, I've created a pr for this issue and verified this 
change manually. Would you kindly help review this pr?

!https://user-images.githubusercontent.com/3021821/270177041-6ceea6af-b6ab-4424-8bdf-efce50feed9f.png|width=563,height=352!


was (Author: JIRAUSER301129):
[~zjureel] [~libenchao]  Hi, I've created a pr for this issue and verified this 
change manually. Would you kindly help review this pr?


!image-2023-09-25-00-13-08-681.png|width=736,height=459!

> job name should not always be `collect` submitted by sql client
> ---
>
> Key: FLINK-25015
> URL: https://issues.apache.org/jira/browse/FLINK-25015
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Client
>Affects Versions: 1.14.0
>Reporter: KevinyhZou
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available, stale-assigned
> Attachments: image-2021-11-23-20-15-32-459.png, 
> image-2021-11-23-20-16-21-932.png
>
>
> I use flink sql client to submitted different sql query to flink session 
> cluster,  and the sql job name is always `collect`,  as below
> !image-2021-11-23-20-16-21-932.png!
> which make no sence to users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-25015) job name should not always be `collect` submitted by sql client

2023-09-24 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768410#comment-17768410
 ] 

xiangyu feng commented on FLINK-25015:
--

[~zjureel] [~libenchao]  Hi, I've created a pr for this issue and verified this 
change manually. Would you kindly help review this pr?


!image-2023-09-25-00-13-08-681.png|width=736,height=459!

> job name should not always be `collect` submitted by sql client
> ---
>
> Key: FLINK-25015
> URL: https://issues.apache.org/jira/browse/FLINK-25015
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Client
>Affects Versions: 1.14.0
>Reporter: KevinyhZou
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available, stale-assigned
> Attachments: image-2021-11-23-20-15-32-459.png, 
> image-2021-11-23-20-16-21-932.png
>
>
> I use flink sql client to submitted different sql query to flink session 
> cluster,  and the sql job name is always `collect`,  as below
> !image-2021-11-23-20-16-21-932.png!
> which make no sence to users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32488) Introduce configuration to control ExecutionGraph cache in REST API

2023-09-11 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763989#comment-17763989
 ] 

xiangyu feng edited comment on FLINK-32488 at 9/12/23 2:51 AM:
---

Hi [~guoyangze] , [~hejufang001] has created a 
PR([https://github.com/apache/flink/pull/23387)|https://github.com/apache/flink/pull/23387,]
 for this issue would you kindly assign this issue to him?


was (Author: JIRAUSER301129):
Hi [~guoyangze] , [~hejufang001] has created a 
[PR|[https://github.com/apache/flink/pull/23387|https://github.com/apache/flink/pull/23387,]]
 for this issue would you kindly assign this issue to him?

> Introduce configuration to control ExecutionGraph cache in REST API
> ---
>
> Key: FLINK-32488
> URL: https://issues.apache.org/jira/browse/FLINK-32488
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Affects Versions: 1.16.2, 1.17.1
>Reporter: Hong Liang Teoh
>Assignee: Jufang He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> *What*
> Currently, REST handlers that inherit from AbstractExecutionGraphHandler 
> serve information derived from a cached ExecutionGraph.
> This ExecutionGraph cache currently derives it's timeout from 
> {*}web.refresh-interval{*}. The *web.refresh-interval* controls both the 
> refresh rate of the Flink dashboard and the ExecutionGraph cache timeout. 
> We should introduce a new configuration to control the ExecutionGraph cache, 
> namely {*}rest.cache.execution-graph.expiry{*}.
> *Why*
> Sharing configuration between REST handler and Flink dashboard is a sign that 
> we are coupling the two. 
> Ideally, we want our REST API behaviour to independent of the Flink dashboard 
> (e.g. supports programmatic access).
>  
> Mailing list discussion: 
> https://lists.apache.org/thread/7o330hfyoqqkkrfhtvz3kp448jcspjrm
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32488) Introduce configuration to control ExecutionGraph cache in REST API

2023-09-11 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763989#comment-17763989
 ] 

xiangyu feng commented on FLINK-32488:
--

Hi [~guoyangze] , [~hejufang001] has created a 
[PR|[https://github.com/apache/flink/pull/23387|https://github.com/apache/flink/pull/23387,]]
 for this issue would you kindly assign this issue to him?

> Introduce configuration to control ExecutionGraph cache in REST API
> ---
>
> Key: FLINK-32488
> URL: https://issues.apache.org/jira/browse/FLINK-32488
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Affects Versions: 1.16.2, 1.17.1
>Reporter: Hong Liang Teoh
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> *What*
> Currently, REST handlers that inherit from AbstractExecutionGraphHandler 
> serve information derived from a cached ExecutionGraph.
> This ExecutionGraph cache currently derives it's timeout from 
> {*}web.refresh-interval{*}. The *web.refresh-interval* controls both the 
> refresh rate of the Flink dashboard and the ExecutionGraph cache timeout. 
> We should introduce a new configuration to control the ExecutionGraph cache, 
> namely {*}rest.cache.execution-graph.expiry{*}.
> *Why*
> Sharing configuration between REST handler and Flink dashboard is a sign that 
> we are coupling the two. 
> Ideally, we want our REST API behaviour to independent of the Flink dashboard 
> (e.g. supports programmatic access).
>  
> Mailing list discussion: 
> https://lists.apache.org/thread/7o330hfyoqqkkrfhtvz3kp448jcspjrm
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-06 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762319#comment-17762319
 ] 

xiangyu feng commented on FLINK-32794:
--

[~Sergey Nuyanzin] Hi Sergey, IMHO it's ok to confirm that this feature was 
tested.

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png, 
> image-2023-09-04-15-48-19-665.png, image-2023-09-04-15-51-17-161.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761726#comment-17761726
 ] 

xiangyu feng commented on FLINK-32794:
--

Hi [~renqs], I have verified this feature by using with both jdbc tool(SqlLine) 
and java application.

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png, 
> image-2023-09-04-15-48-19-665.png, image-2023-09-04-15-51-17-161.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32794:
-
Attachment: image-2023-09-04-15-51-17-161.png

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png, 
> image-2023-09-04-15-48-19-665.png, image-2023-09-04-15-51-17-161.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761722#comment-17761722
 ] 

xiangyu feng commented on FLINK-32794:
--

Verified using the Datasource Connection

!image-2023-09-04-15-51-17-161.png|width=1017,height=540!

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png, 
> image-2023-09-04-15-48-19-665.png, image-2023-09-04-15-51-17-161.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761720#comment-17761720
 ] 

xiangyu feng commented on FLINK-32794:
--

Verified by using with Java application
!image-2023-09-04-15-48-19-665.png|width=818,height=403!

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png, 
> image-2023-09-04-15-48-19-665.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32794:
-
Attachment: image-2023-09-04-15-48-19-665.png

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png, 
> image-2023-09-04-15-48-19-665.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761715#comment-17761715
 ] 

xiangyu feng commented on FLINK-32794:
--

Verified by using with SqlLine

!image-2023-09-04-15-33-33-074.png|width=726,height=615!

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32794) Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway

2023-09-04 Thread xiangyu feng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32794:
-
Attachment: image-2023-09-04-15-33-33-074.png

> Release Testing: Verify FLIP-293: Introduce Flink Jdbc Driver For Sql Gateway
> -
>
> Key: FLINK-32794
> URL: https://issues.apache.org/jira/browse/FLINK-32794
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.18.0
>Reporter: Qingsheng Ren
>Assignee: xiangyu feng
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-09-04-15-33-33-074.png
>
>
> Document for jdbc driver: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/jdbcdriver/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   >