[jira] [Commented] (FLINK-35130) Simplify AvailabilityNotifierImpl to support speculative scheduler and improve performance

2024-05-21 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848153#comment-17848153
 ] 

Yuxin Tan commented on FLINK-35130:
---

master(1.20.0): 46aaea8083047fc86c35491336d795ddcd565128

> Simplify AvailabilityNotifierImpl to support speculative scheduler and 
> improve performance
> --
>
> Key: FLINK-35130
> URL: https://issues.apache.org/jira/browse/FLINK-35130
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> The AvailabilityNotifierImpl in SingleInputGate keeps maps that store the 
> channel ids. The map key, however, is the result partition id, which changes 
> with the attempt number when speculation is enabled. This can be resolved by 
> looking up the channel through `inputChannels`, whose map key does not vary 
> across attempts. 
> In addition, reusing that map can also improve performance for large-scale 
> jobs because no extra maps are created.
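
A minimal sketch of the keying problem described above. It uses simplified, hypothetical types (`PartitionId`, `ChannelLookupSketch`) rather than Flink's actual classes, and it assumes the partition index is the part of the key that stays the same across attempts. It only illustrates why a map keyed by an attempt-dependent id misses when a speculative attempt appears, while a lookup through an attempt-independent key still succeeds.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a result partition id whose value includes the attempt number. */
final class PartitionId {
    final int partitionIndex;
    final int attemptNumber;

    PartitionId(int partitionIndex, int attemptNumber) {
        this.partitionIndex = partitionIndex;
        this.attemptNumber = attemptNumber;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PartitionId)) return false;
        PartitionId other = (PartitionId) o;
        return partitionIndex == other.partitionIndex && attemptNumber == other.attemptNumber;
    }

    @Override
    public int hashCode() {
        return Objects.hash(partitionIndex, attemptNumber);
    }
}

final class ChannelLookupSketch {
    /** Extra map keyed by the attempt-dependent id: a lookup with a new attempt misses. */
    static Integer lookupByPartitionId(Map<PartitionId, Integer> channelIds, PartitionId id) {
        return channelIds.get(id);
    }

    /** Lookup keyed by a value assumed to be stable across attempts (the partition index). */
    static Integer lookupByStableKey(Map<Integer, Integer> inputChannels, PartitionId id) {
        return inputChannels.get(id.partitionIndex);
    }

    public static void main(String[] args) {
        Map<PartitionId, Integer> byPartitionId = new HashMap<>();
        Map<Integer, Integer> byStableKey = new HashMap<>();

        PartitionId firstAttempt = new PartitionId(7, 0);
        byPartitionId.put(firstAttempt, 3);
        byStableKey.put(firstAttempt.partitionIndex, 3);

        // Same partition, produced by a speculative attempt.
        PartitionId speculativeAttempt = new PartitionId(7, 1);
        System.out.println(lookupByPartitionId(byPartitionId, speculativeAttempt)); // null
        System.out.println(lookupByStableKey(byStableKey, speculativeAttempt));     // 3
    }
}
```

Dropping the attempt-dependent map also removes one map per gate, which is the memory saving the description refers to.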



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-35130) Simplify AvailabilityNotifierImpl to support speculative scheduler and improve performance

2024-05-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan closed FLINK-35130.
-
Fix Version/s: 1.20.0
   Resolution: Fixed

> Simplify AvailabilityNotifierImpl to support speculative scheduler and 
> improve performance
> --
>
> Key: FLINK-35130
> URL: https://issues.apache.org/jira/browse/FLINK-35130
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> The AvailabilityNotifierImpl in SingleInputGate keeps maps that store the 
> channel ids. The map key, however, is the result partition id, which changes 
> with the attempt number when speculation is enabled. This can be resolved by 
> looking up the channel through `inputChannels`, whose map key does not vary 
> across attempts. 
> In addition, reusing that map can also improve performance for large-scale 
> jobs because no extra maps are created.





[jira] [Updated] (FLINK-35533) FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-05 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35533:
--
Labels: Umbrella  (was: )

> FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn
> ---
>
> Key: FLINK-35533
> URL: https://issues.apache.org/jira/browse/FLINK-35533
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: Umbrella
>
> This is the umbrella jira for 
> [FLIP-459|https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn].





[jira] [Created] (FLINK-35533) FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-05 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-35533:
-

 Summary: FLIP-459: Support Flink hybrid shuffle integration with 
Apache Celeborn
 Key: FLINK-35533
 URL: https://issues.apache.org/jira/browse/FLINK-35533
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.20.0
Reporter: Yuxin Tan
Assignee: Yuxin Tan


This is the umbrella jira for 
[FLIP-459|https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn].





[jira] [Updated] (FLINK-35533) FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-05 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35533:
--
Labels:   (was: Umbrella)

> FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn
> ---
>
> Key: FLINK-35533
> URL: https://issues.apache.org/jira/browse/FLINK-35533
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> This is the umbrella jira for 
> [FLIP-459|https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn].





[jira] [Updated] (FLINK-35533) FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-05 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35533:
--
Description: This is the jira for 
[FLIP-459|https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn].
  (was: This is the umbrella jira for 
[FLIP-459|https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn].)

> FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn
> ---
>
> Key: FLINK-35533
> URL: https://issues.apache.org/jira/browse/FLINK-35533
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> This is the jira for 
> [FLIP-459|https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn].





[jira] [Created] (FLINK-35658) Hybrid shuffle external tier can not work with UnknownInputChannel

2024-06-20 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-35658:
-

 Summary: Hybrid shuffle external tier can not work with 
UnknownInputChannel
 Key: FLINK-35658
 URL: https://issues.apache.org/jira/browse/FLINK-35658
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.20.0
Reporter: Yuxin Tan
Assignee: Yuxin Tan


Currently, hybrid shuffle cannot work with UnknownInputChannel; we should 
fix it.





[jira] [Updated] (FLINK-35658) Hybrid shuffle external tier can not work with UnknownInputChannel

2024-06-20 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35658:
--
Labels: pull-request-available  (was: )

> Hybrid shuffle external tier can not work with UnknownInputChannel
> --
>
> Key: FLINK-35658
> URL: https://issues.apache.org/jira/browse/FLINK-35658
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> Currently, hybrid shuffle cannot work with UnknownInputChannel; we 
> should fix it.





[jira] [Commented] (FLINK-35587) Job fails with "The read buffer is null in credit-based input channel" on TPC-DS 10TB benchmark

2024-06-21 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856695#comment-17856695
 ] 

Yuxin Tan commented on FLINK-35587:
---

After some offline discussions with [~Weijie Guo] and [~JunRuiLi], we prefer to 
revert the change from FLINK-33668. I have prepared the PR; PTAL.

> Job fails with "The read buffer is null in credit-based input channel" on 
> TPC-DS 10TB benchmark
> ---
>
> Key: FLINK-35587
> URL: https://issues.apache.org/jira/browse/FLINK-35587
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Junrui Li
>Assignee: Yuxin Tan
>Priority: Blocker
> Attachments: image-2024-06-13-13-48-37-162.png
>
>
> While running TPC-DS 10TB benchmark on the latest master branch locally, I've 
> encountered a failure in certain queries, specifically query 75, resulting in 
> the error "The read buffer is null in credit-based input channel".
> Using a binary search approach, I identified the offending commit as 
> FLINK-33668. After reverting FLINK-33668 and subsequent commits, the issue 
> disappears. Re-applying FLINK-33668 to the branch re-introduces the error.
> Please see the attached image for the error stack trace.
> !image-2024-06-13-13-48-37-162.png|width=846,height=555!
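
The binary search over commits mentioned above can be sketched as follows. This is an illustration only, not the reporter's actual procedure: the commit names and the `fails` check are hypothetical, and the sketch assumes commits are ordered oldest to newest, the first one passes, the last one fails, and every commit after the offending one also fails.

```java
import java.util.List;
import java.util.function.Predicate;

final class CommitBisectSketch {
    /**
     * Returns the index of the first failing commit.
     * Preconditions (assumed): commits.get(0) passes, the last commit fails,
     * and failures are monotonic (once a commit fails, all later ones fail).
     */
    static int firstFailing(List<String> commits, Predicate<String> fails) {
        int good = 0;                 // index of a commit known to pass
        int bad = commits.size() - 1; // index of a commit known to fail
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (fails.test(commits.get(mid))) {
                bad = mid;   // failure already present at mid
            } else {
                good = mid;  // failure introduced after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> commits = List.of("c0", "c1", "c2", "c3", "c4", "c5");
        // Hypothetical check: pretend that "c3" introduced the failure.
        Predicate<String> fails = c -> c.compareTo("c3") >= 0;
        System.out.println(commits.get(firstFailing(commits, fails))); // c3
    }
}
```

In practice the `fails` check would be a full benchmark run (here, the failing TPC-DS query), so halving the candidate range on every step keeps the number of runs logarithmic in the number of commits.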





[jira] [Updated] (FLINK-35603) Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35603:
--
Description: 
Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we propose integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. Celeborn is added to the 
tiered storage as one tier of hybrid shuffle.
To verify this feature, follow these two steps.

1. Implement the Celeborn tier. 
* Implement a new tier factory and tier for Celeborn.
* Support granular data management at the Segment level on both the client and 
server sides.




  was:Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533


> Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink 
> hybrid shuffle integration with Apache Celeborn
> -
>
> Key: FLINK-35603
> URL: https://issues.apache.org/jira/browse/FLINK-35603
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Rui Fan
>Assignee: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we propose integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. Celeborn is added to the 
> tiered storage as one tier of hybrid shuffle.
> To verify this feature, follow these two steps.
> 1. Implement the Celeborn tier. 
> * Implement a new tier factory and tier for Celeborn.
> * Support granular data management at the Segment level on both the client 
> and server sides.





[jira] [Updated] (FLINK-35603) Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35603:
--
Description: 
Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, 
follow these two main steps.

1. Implement the Celeborn tier.

Implement a new tier factory and tier for Celeborn.

The implementation should cover the APIs 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.

The implementations should support granular data management at the Segment 
level on both the client and server sides.

2. 

  was:
Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20,  we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, you 
should reference these main two steps.

1. Implement Celeborn tier.

Implement a new tier factory and tier for Celeborn.

The implement should include these APIs, including 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.


The implementations should support granular data management at the Segment 
level for both client and server sides.


> Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink 
> hybrid shuffle integration with Apache Celeborn
> -
>
> Key: FLINK-35603
> URL: https://issues.apache.org/jira/browse/FLINK-35603
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Rui Fan
>Assignee: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
> Implement a new tier factory and tier for Celeborn.
> The implementation should cover the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
> The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. 





[jira] [Updated] (FLINK-35603) Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35603:
--
Description: 
Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, 
follow these two main steps.

1. Implement the Celeborn tier.

Implement a new tier factory and tier for Celeborn.

The implementation should cover the APIs 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.

The implementations should support granular data management at the Segment 
level on both the client and server sides.

  was:
Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we propose integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. Celeborn, as a tier of 
hybrid shuffle, is added to the tiered storage.
To verify this feature, you should reference these two steps.

1. Implement Celeborn tier. 
* Implement a new tier factory and tier for Celeborn.
* Support granular data management at the Segment level for both client and 
server sides.





> Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink 
> hybrid shuffle integration with Apache Celeborn
> -
>
> Key: FLINK-35603
> URL: https://issues.apache.org/jira/browse/FLINK-35603
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Rui Fan
>Assignee: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
> Implement a new tier factory and tier for Celeborn.
> The implementation should cover the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
> The implementations should support granular data management at the Segment 
> level on both the client and server sides.
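
To make the shape of step 1 more concrete, here is a rough sketch of a factory that hands out producer-side and consumer-side agents sharing segment-granular storage. Everything in it is hypothetical: `SegmentStore`, `SketchTierFactory`, `SketchProducerAgent` and `SketchConsumerAgent` are illustrative names with made-up signatures, not Flink's TierFactory/TierProducerAgent/TierConsumerAgent interfaces, and an in-memory map stands in for the remote Celeborn service. The master-side agent is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory store standing in for a remote shuffle service. */
final class SegmentStore {
    private final Map<String, List<byte[]>> segments = new HashMap<>();

    private static String key(int subpartition, int segmentId) {
        return subpartition + "/" + segmentId;
    }

    void append(int subpartition, int segmentId, byte[] buffer) {
        segments.computeIfAbsent(key(subpartition, segmentId), k -> new ArrayList<>()).add(buffer);
    }

    List<byte[]> read(int subpartition, int segmentId) {
        return segments.getOrDefault(key(subpartition, segmentId), List.of());
    }
}

/** Hypothetical producer-side agent: writes buffers into a given segment. */
final class SketchProducerAgent {
    private final SegmentStore store;

    SketchProducerAgent(SegmentStore store) {
        this.store = store;
    }

    void write(int subpartition, int segmentId, byte[] buffer) {
        store.append(subpartition, segmentId, buffer);
    }
}

/** Hypothetical consumer-side agent: reads all buffers of one segment. */
final class SketchConsumerAgent {
    private final SegmentStore store;

    SketchConsumerAgent(SegmentStore store) {
        this.store = store;
    }

    List<byte[]> readSegment(int subpartition, int segmentId) {
        return store.read(subpartition, segmentId);
    }
}

/** Hypothetical factory that wires both agents to the same store. */
final class SketchTierFactory {
    private final SegmentStore store = new SegmentStore();

    SketchProducerAgent createProducerAgent() {
        return new SketchProducerAgent(store);
    }

    SketchConsumerAgent createConsumerAgent() {
        return new SketchConsumerAgent(store);
    }
}

final class TierSketchDemo {
    public static void main(String[] args) {
        SketchTierFactory factory = new SketchTierFactory();
        SketchProducerAgent producer = factory.createProducerAgent();
        SketchConsumerAgent consumer = factory.createConsumerAgent();

        producer.write(0, 0, new byte[] {1, 2});
        producer.write(0, 0, new byte[] {3});
        producer.write(0, 1, new byte[] {4});

        System.out.println(consumer.readSegment(0, 0).size()); // 2
        System.out.println(consumer.readSegment(0, 1).size()); // 1
    }
}
```

Addressing data by (subpartition, segment) is what lets the writer and reader work at segment granularity instead of treating a subpartition as one opaque stream.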





[jira] [Updated] (FLINK-35603) Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35603:
--
Description: 
Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, 
follow these two main steps.

1. Implement the Celeborn tier.
 * Implement a new tier factory and tier for Celeborn, covering the APIs 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
 * The implementations should support granular data management at the Segment 
level on both the client and server sides.

2. Use the implemented tier to shuffle data.
 * Compile Flink and Celeborn.
 * Deploy the Celeborn service.
 ** Deploy a new Celeborn service with the newly compiled packages. See the 
docs (https://celeborn.apache.org/docs/latest/) for how to deploy the cluster.
 * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
Flink classpath.
 * Configure the options to enable the feature.
 ** Set the option 
taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
new Celeborn tier factory class. In addition to this option, the following 
options should also be set.

 
{code:java}
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
celeborn.master.endpoints: 
celeborn.client.shuffle.partition.type: MAP{code}
 * Run some test examples (e.g., WordCount) to verify the feature.

 

  was:
Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20,  we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, you 
should reference these main two steps.

1. Implement Celeborn tier.

Implement a new tier factory and tier for Celeborn.

The implement should include these APIs, including 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.

The implementations should support granular data management at the Segment 
level for both client and server sides.

2. 


> Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink 
> hybrid shuffle integration with Apache Celeborn
> -
>
> Key: FLINK-35603
> URL: https://issues.apache.org/jira/browse/FLINK-35603
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Rui Fan
>Assignee: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, covering the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy the Celeborn service.
>  ** Deploy a new Celeborn service with the newly compiled packages. See the 
> docs (https://celeborn.apache.org/docs/latest/) for how to deploy the cluster.
>  * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
> Flink classpath.
>  * Configure the options to enable the feature.
>  ** Set the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier factory class. In addition to this option, the following 
> options should also be set.
>  
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples (e.g., WordCount) to verify the feature.
>  





[jira] [Created] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-35690:
-

 Summary: Release Testing: Verify FLIP-459: Support Flink hybrid 
shuffle integration with Apache Celeborn
 Key: FLINK-35690
 URL: https://issues.apache.org/jira/browse/FLINK-35690
 Project: Flink
  Issue Type: Sub-task
Reporter: Yuxin Tan
 Fix For: 1.20.0


Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, 
follow these two main steps.

1. Implement the Celeborn tier.
 * Implement a new tier factory and tier for Celeborn, covering the APIs 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
 * The implementations should support granular data management at the Segment 
level on both the client and server sides.

2. Use the implemented tier to shuffle data.
 * Compile Flink and Celeborn.
 * Deploy the Celeborn service.
 ** Deploy a new Celeborn service with the newly compiled packages. See the 
docs (https://celeborn.apache.org/docs/latest/) for how to deploy the cluster.
 * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
Flink classpath.
 * Configure the options to enable the feature.
 ** Set the option 
taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
new Celeborn tier factory class. In addition to this option, the following 
options should also be set.

 
{code:java}
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
celeborn.master.endpoints: 
celeborn.client.shuffle.partition.type: MAP{code}
 * Run some test examples (e.g., WordCount) to verify the feature.
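
As a convenience for testers, the options listed above can be collected in one place and rendered as `key: value` lines. The sketch below is only an illustration: the class and method names are hypothetical, and both arguments are placeholders because the description does not give a concrete tier factory class name or master endpoint.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HybridShuffleConfigSketch {
    /** Collects the options named in the instructions, in the order they are listed. */
    static Map<String, String> options(String tierFactoryClass, String masterEndpoints) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(
                "taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class",
                tierFactoryClass);
        conf.put("execution.batch-shuffle-mode", "ALL_EXCHANGES_HYBRID_FULL");
        conf.put("celeborn.master.endpoints", masterEndpoints);
        conf.put("celeborn.client.shuffle.partition.type", "MAP");
        return conf;
    }

    /** Renders the options as "key: value" lines, one per option. */
    static String render(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        conf.forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholders only: substitute the real factory class and endpoints of your deployment.
        System.out.print(render(options("<tier-factory-class>", "<master-endpoints>")));
    }
}
```

The rendered lines can then be pasted into the Flink configuration used for the test run.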

 





[jira] [Updated] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35690:
--
Labels: release-testing  (was: )

> Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration 
> with Apache Celeborn
> ---
>
> Key: FLINK-35690
> URL: https://issues.apache.org/jira/browse/FLINK-35690
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, covering the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy the Celeborn service.
>  ** Deploy a new Celeborn service with the newly compiled packages. See the 
> docs (https://celeborn.apache.org/docs/latest/) for how to deploy the cluster.
>  * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
> Flink classpath.
>  * Configure the options to enable the feature.
>  ** Set the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier factory class. In addition to this option, the following 
> options should also be set.
>  
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples (e.g., WordCount) to verify the feature.
>  





[jira] [Closed] (FLINK-35603) Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan closed FLINK-35603.
-
Resolution: Fixed

Closed due to FLINK-35690.

> Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink 
> hybrid shuffle integration with Apache Celeborn
> -
>
> Key: FLINK-35603
> URL: https://issues.apache.org/jira/browse/FLINK-35603
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Rui Fan
>Assignee: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, covering the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy the Celeborn service.
>  ** Deploy a new Celeborn service with the newly compiled packages. See the 
> docs (https://celeborn.apache.org/docs/latest/) for how to deploy the cluster.
>  * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
> Flink classpath.
>  * Configure the options to enable the feature.
>  ** Set the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier factory class. In addition to this option, the following 
> options should also be set.
>  
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples (e.g., WordCount) to verify the feature.
>  





[jira] [Updated] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35690:
--
Component/s: Runtime / Network

> Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration 
> with Apache Celeborn
> ---
>
> Key: FLINK-35690
> URL: https://issues.apache.org/jira/browse/FLINK-35690
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yuxin Tan
>Priority: Blocker
> Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, covering the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy the Celeborn service.
>  ** Deploy a new Celeborn service with the newly compiled packages. See the 
> docs (https://celeborn.apache.org/docs/latest/) for how to deploy the cluster.
>  * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
> Flink classpath.
>  * Configure the options to enable the feature.
>  ** Set the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier factory class. In addition to this option, the following 
> options should also be set.
>  
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples (e.g., WordCount) to verify the feature.
>  





[jira] [Updated] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35690:
--
Description: 
Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, 
follow these two main steps.

1. Implement the Celeborn tier.
 * Implement a new tier factory and tier for Celeborn, covering the APIs 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
 * The implementations should support granular data management at the Segment 
level on both the client and server sides.

2. Use the implemented tier to shuffle data.
 * Compile Flink and Celeborn.
 * Deploy the Celeborn service.
 ** Deploy a new Celeborn service with the newly compiled packages. See the 
docs ([https://celeborn.apache.org/docs/latest/]) for how to deploy the 
cluster.
 * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
Flink classpath.
 * Configure the options to enable the feature.
 ** Set the option 
taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
new Celeborn tier factory class. In addition to this option, the following 
options should also be set.

{code:java}
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
celeborn.master.endpoints: 
celeborn.client.shuffle.partition.type: MAP{code}
 * Run some test examples (e.g., WordCount) to verify the feature.

 

  was:
Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20,  we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, you 
should reference these main two steps.

1. Implement Celeborn tier.
 * Implement a new tier factory and tier for Celeborn, including these APIs, 
including TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
 * The implementations should support granular data management at the Segment 
level for both client and server sides.

2. Use the implemented tier to shuffle data.
 * Compile Flink and Celeborn.
 * Deploy Celeborn service
 * Deploy a new Celeborn service with the new compiled packages. You can 
reference the doc ([https://celeborn.apache.org/docs/latest/]) to deploy the 
cluster.
 * Add the compiled flink plugin jar (celeborn-client-flink-xxx.jar) to Flink 
classpath.
 * Configure the options to enable the feature.
 ** Configure the option 
taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
new Celeborn tier classes. Except for this option, the following options should 
also be added.

{code:java}
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
celeborn.master.endpoints: 
celeborn.client.shuffle.partition.type: MAP{code}
 * Run some test examples(e.g., WordCount) to verify the feature.

 


> Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration 
> with Apache Celeborn
> ---
>
> Key: FLINK-35690
> URL: https://issues.apache.org/jira/browse/FLINK-35690
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20,  we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> you should reference these main two steps.
> 1. Implement Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, including these APIs, 
> including TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level for both client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy Celeborn service
>  ** Deploy a new Celeborn service with the new compiled packages. You can 
> reference the doc ([https://celeborn.apache.org/docs/latest/]) to deploy the 
> cluster.
>  * Add the compiled flink plugin jar (celeborn-client-flink-xxx.jar) to Flink 
> classpath.
>  * Configure the options to enable the feature.
>  ** Configure the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier classes. Except for this option, the following options 
> should also be added.
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples(e.g., WordCount) to verify the feature.
>  



--

[jira] [Updated] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-24 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35690:
--
Description: 
Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, 
follow these two main steps.

1. Implement the Celeborn tier.
 * Implement a new tier factory and tier for Celeborn, covering the APIs 
TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
 * The implementations should support granular data management at the Segment 
level on both the client and server sides.

2. Use the implemented tier to shuffle data.
 * Compile Flink and Celeborn.
 * Deploy the Celeborn service.
 * Deploy a new Celeborn service with the newly compiled packages. You can 
refer to the doc ([https://celeborn.apache.org/docs/latest/]) to deploy the 
cluster.
 * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
Flink classpath.
 * Configure the options to enable the feature.
 ** Set the option 
taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
new Celeborn tier classes. In addition to this option, the following options 
should also be added.

{code:java}
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
celeborn.master.endpoints: 
celeborn.client.shuffle.partition.type: MAP{code}
 * Run some test examples (e.g., WordCount) to verify the feature.

 

  was:
Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533

In Flink 1.20,  we proposed integrating Flink's Hybrid Shuffle with Apache 
Celeborn through a pluggable remote tier interface. To verify this feature, you 
should reference these main two steps.

1. Implement Celeborn tier.
 * Implement a new tier factory and tier for Celeborn, including these APIs, 
including TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
 * The implementations should support granular data management at the Segment 
level for both client and server sides.

2. Use the implemented tier to shuffle data.
 * Compile Flink and Celeborn.
 * Deploy Celeborn service
 ** Deploy a new Celeborn service with the new compiled packages. You can 
reference the doc (https://celeborn.apache.org/docs/latest/) to deploy the 
cluster. 
 * Add the compiled flink plugin jar (celeborn-client-flink-xxx.jar) to Flink 
classpaths.
 * Configure the options to enable the feature.
 ** Configure the option 
taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
new Celeborn tier classes. Except for this option, the following options should 
also be added.

 
{code:java}
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
celeborn.master.endpoints: 
celeborn.client.shuffle.partition.type: MAP\{code}
 * Run some test examples(e.g., WordCount) to verify the feature.

 


> Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration 
> with Apache Celeborn
> ---
>
> Key: FLINK-35690
> URL: https://issues.apache.org/jira/browse/FLINK-35690
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yuxin Tan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20,  we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> you should reference these main two steps.
> 1. Implement Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, including these APIs, 
> including TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level for both client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy Celeborn service
>  * Deploy a new Celeborn service with the new compiled packages. You can 
> reference the doc ([https://celeborn.apache.org/docs/latest/]) to deploy the 
> cluster.
>  * Add the compiled flink plugin jar (celeborn-client-flink-xxx.jar) to Flink 
> classpath.
>  * Configure the options to enable the feature.
>  ** Configure the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier classes. Except for this option, the following options 
> should also be added.
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples(e.g., WordCount) to verify the feature.
>  




[jira] [Commented] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-07-01 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861325#comment-17861325
 ] 

Yuxin Tan commented on FLINK-35690:
---

[~fanrui] Thanks for the reminder. We have assigned this Jira to [~yyhx], and 
he will help test it.

> Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration 
> with Apache Celeborn
> ---
>
> Key: FLINK-35690
> URL: https://issues.apache.org/jira/browse/FLINK-35690
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yuxin Tan
>Assignee: xuhuang
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, covering the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy the Celeborn service.
>  ** Deploy a new Celeborn service with the newly compiled packages. You can 
> refer to the doc ([https://celeborn.apache.org/docs/latest/]) to deploy the 
> cluster.
>  * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
> Flink classpath.
>  * Configure the options to enable the feature.
>  ** Set the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier classes. In addition to this option, the following options 
> should also be added.
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples (e.g., WordCount) to verify the feature.
>  
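
For anyone following the configuration step, the options listed in the description can be gathered in one place as in the minimal sketch below. This is only an illustration in plain Java, not Flink's or Celeborn's configuration API. The class and method names are hypothetical, and the tier factory class and master endpoints are left as caller-supplied placeholders because the description does not give concrete values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class HybridShuffleCelebornOptions {

    /** Collects the four options named in the description; both values come from the caller. */
    static Map<String, String> build(String tierFactoryClass, String masterEndpoints) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put(
                "taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class",
                tierFactoryClass);
        options.put("execution.batch-shuffle-mode", "ALL_EXCHANGES_HYBRID_FULL");
        options.put("celeborn.master.endpoints", masterEndpoints);
        options.put("celeborn.client.shuffle.partition.type", "MAP");
        return options;
    }

    /** Renders the options as "key: value" lines, one per option. */
    static String render(Map<String, String> options) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholders: replace with the real tier factory class and Celeborn master address.
        System.out.print(render(build("<tier-factory-class>", "<master-host>:<port>")));
    }
}
```

The rendered lines can then be copied into the Flink configuration file used for the test run.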



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35690) Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-07-03 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862917#comment-17862917
 ] 

Yuxin Tan commented on FLINK-35690:
---

Thanks [~yyhx] for helping with the test.

> Release Testing: Verify FLIP-459: Support Flink hybrid shuffle integration 
> with Apache Celeborn
> ---
>
> Key: FLINK-35690
> URL: https://issues.apache.org/jira/browse/FLINK-35690
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: xuhuang
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.20.0
>
>
> Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533
> In Flink 1.20, we proposed integrating Flink's Hybrid Shuffle with Apache 
> Celeborn through a pluggable remote tier interface. To verify this feature, 
> follow these two main steps.
> 1. Implement the Celeborn tier.
>  * Implement a new tier factory and tier for Celeborn, covering the APIs 
> TierFactory/TierMasterAgent/TierProducerAgent/TierConsumerAgent.
>  * The implementations should support granular data management at the Segment 
> level on both the client and server sides.
> 2. Use the implemented tier to shuffle data.
>  * Compile Flink and Celeborn.
>  * Deploy the Celeborn service.
>  ** Deploy a new Celeborn service with the newly compiled packages. You can 
> refer to the doc ([https://celeborn.apache.org/docs/latest/]) to deploy the 
> cluster.
>  * Add the compiled Flink plugin jar (celeborn-client-flink-xxx.jar) to the 
> Flink classpath.
>  * Configure the options to enable the feature.
>  ** Set the option 
> taskmanager.network.hybrid-shuffle.external-remote-tier-factory.class to the 
> new Celeborn tier classes. In addition to this option, the following options 
> should also be added.
> {code:java}
> execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL 
> celeborn.master.endpoints: 
> celeborn.client.shuffle.partition.type: MAP{code}
>  * Run some test examples (e.g., WordCount) to verify the feature.
>  





[jira] [Updated] (FLINK-35130) Simplify AvailabilityNotifierImpl to support speculative scheduler and improve performance

2024-04-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35130:
--
Description: The AvailabilityNotifierImpl in SingleInputGate has maps 
storing the channel ids. But the map key is the result partition id, which will 
change according to the different attempt numbers when speculation is enabled.  
This can be resolved by using `inputChannels` to get channel and the map key of 
inputChannels will not vary with the attempts. In addition, using that map 
instead can also improve performance for large scale jobs because   (was: The 
AvailabilityNotifierImpl in SingleInputGate has maps storing the channel ids. 
But the map key is the result partition id, which will change according to the 
different attempt numbers when speculation is enabled.  This can be resolved by 
using `inputChannels` to get channel and the map key of inputChannels will not 
vary with the attempts.)

> Simplify AvailabilityNotifierImpl to support speculative scheduler and 
> improve performance
> --
>
> Key: FLINK-35130
> URL: https://issues.apache.org/jira/browse/FLINK-35130
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> The AvailabilityNotifierImpl in SingleInputGate has maps storing the channel 
> ids. But the map key is the result partition id, which will change according 
> to the different attempt numbers when speculation is enabled.  This can be 
> resolved by using `inputChannels` to get channel and the map key of 
> inputChannels will not vary with the attempts. In addition, using that map 
> instead can also improve performance for large scale jobs because 





[jira] [Updated] (FLINK-35130) Simplify AvailabilityNotifierImpl to support speculative scheduler and improve performance

2024-04-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35130:
--
Description: 
The AvailabilityNotifierImpl in SingleInputGate has maps storing the channel 
ids, but the map key is the result partition id, which changes with the 
attempt number when speculation is enabled.  This can be resolved by using 
`inputChannels` to get the channel, because the map key of inputChannels does 
not vary with the attempts. 
In addition, using that map instead can also improve performance for 
large-scale jobs because no extra maps are created.

  was:The AvailabilityNotifierImpl in SingleInputGate has maps storing the 
channel ids. But the map key is the result partition id, which will change 
according to the different attempt numbers when speculation is enabled.  This 
can be resolved by using `inputChannels` to get channel and the map key of 
inputChannels will not vary with the attempts. In addition, using that map 
instead can also improve performance for large scale jobs because 


> Simplify AvailabilityNotifierImpl to support speculative scheduler and 
> improve performance
> --
>
> Key: FLINK-35130
> URL: https://issues.apache.org/jira/browse/FLINK-35130
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> The AvailabilityNotifierImpl in SingleInputGate has maps storing the channel 
> ids. But the map key is the result partition id, which will change according 
> to the different attempt numbers when speculation is enabled.  This can be 
> resolved by using `inputChannels` to get channel and the map key of 
> inputChannels will not vary with the attempts. 
> In addition, using that map instead can also improve performance for large 
> scale jobs because no extra maps are created.





[jira] [Created] (FLINK-35130) Simplify AvailabilityNotifierImpl to support speculative scheduler and improve performance

2024-04-17 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-35130:
-

 Summary: Simplify AvailabilityNotifierImpl to support speculative 
scheduler and improve performance
 Key: FLINK-35130
 URL: https://issues.apache.org/jira/browse/FLINK-35130
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.20.0
Reporter: Yuxin Tan


The AvailabilityNotifierImpl in SingleInputGate has maps storing the channel 
ids, but the map key is the result partition id, which changes with the 
attempt number when speculation is enabled.  This can be resolved by using 
`inputChannels` to get the channel, because the map key of inputChannels does 
not vary with the attempts.
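
To make the keying problem concrete, here is a minimal sketch. It is a toy model and not Flink's actual classes: AttemptScopedPartitionId, the use of a plain channel index as the stable key, and the "channel-N" strings are all hypothetical stand-ins chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy model only; names are hypothetical and do not mirror Flink's classes.
class StableChannelLookup {

    /** Stand-in for a partition id that embeds the producer attempt number. */
    static final class AttemptScopedPartitionId {
        final int partitionIndex;
        final int attemptNumber;

        AttemptScopedPartitionId(int partitionIndex, int attemptNumber) {
            this.partitionIndex = partitionIndex;
            this.attemptNumber = attemptNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AttemptScopedPartitionId)) {
                return false;
            }
            AttemptScopedPartitionId other = (AttemptScopedPartitionId) o;
            return partitionIndex == other.partitionIndex
                    && attemptNumber == other.attemptNumber;
        }

        @Override
        public int hashCode() {
            return Objects.hash(partitionIndex, attemptNumber);
        }
    }

    /** Extra map keyed by an id that changes whenever the producer attempt changes. */
    static Map<AttemptScopedPartitionId, String> indexByPartitionId(int numChannels, int attempt) {
        Map<AttemptScopedPartitionId, String> map = new HashMap<>();
        for (int i = 0; i < numChannels; i++) {
            map.put(new AttemptScopedPartitionId(i, attempt), "channel-" + i);
        }
        return map;
    }

    /** Map keyed by something that is identical for every attempt. */
    static Map<Integer, String> indexByChannelIndex(int numChannels) {
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < numChannels; i++) {
            map.put(i, "channel-" + i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<AttemptScopedPartitionId, String> byPartitionId = indexByPartitionId(2, 0);
        Map<Integer, String> byChannelIndex = indexByChannelIndex(2);
        // A speculative attempt (attempt 1) yields a different key, so this lookup misses.
        System.out.println(byPartitionId.get(new AttemptScopedPartitionId(1, 1)));
        // The stable key is unaffected by the attempt, so this lookup still hits.
        System.out.println(byChannelIndex.get(1));
    }
}
```

The point of the sketch is only that a lookup keyed by an attempt-dependent id misses once a speculative attempt takes over, while a lookup keyed by an attempt-independent value keeps working without any extra map.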





[jira] [Assigned] (FLINK-35130) Simplify AvailabilityNotifierImpl to support speculative scheduler and improve performance

2024-04-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-35130:
-

Assignee: Yuxin Tan

> Simplify AvailabilityNotifierImpl to support speculative scheduler and 
> improve performance
> --
>
> Key: FLINK-35130
> URL: https://issues.apache.org/jira/browse/FLINK-35130
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> The AvailabilityNotifierImpl in SingleInputGate has maps storing the channel 
> ids. But the map key is the result partition id, which will change according 
> to the different attempt numbers when speculation is enabled.  This can be 
> resolved by using `inputChannels` to get channel and the map key of 
> inputChannels will not vary with the attempts.





[jira] [Created] (FLINK-35169) Recycle buffers to freeSegments before releasing data buffer for sort accumulator

2024-04-18 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-35169:
-

 Summary: Recycle buffers to freeSegments before releasing data 
buffer for sort accumulator
 Key: FLINK-35169
 URL: https://issues.apache.org/jira/browse/FLINK-35169
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.20.0
Reporter: Yuxin Tan
Assignee: Yuxin Tan


When using sortBufferAccumulator, we should recycle the buffers to freeSegments 
before releasing the data buffer. The reason is that when getting buffers from 
the DataBuffer, it may require more buffers than the current quantity available 
in freeSegments. Consequently, to ensure adequate buffers from DataBuffer, the 
flushed and recycled buffers should also be added to freeSegments for reuse.
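
The effect described above can be shown with a small simulation. This is a toy model under stated assumptions, not the sort accumulator's real code: the class, the drain method, and the idea that each drained chunk needs exactly one free segment are all hypothetical simplifications.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy simulation; names and the one-segment-per-chunk rule are assumptions.
class RecycleBeforeRelease {

    /** Creates a pool of free segments of the given size. */
    static Deque<byte[]> newFreeSegments(int count, int segmentSize) {
        Deque<byte[]> free = new ArrayDeque<>();
        for (int i = 0; i < count; i++) {
            free.add(new byte[segmentSize]);
        }
        return free;
    }

    /**
     * Drains pendingChunks chunks from an imaginary data buffer. Each chunk needs one
     * free segment. Returns how many chunks were drained before the pool ran dry.
     */
    static int drain(int pendingChunks, Deque<byte[]> freeSegments, boolean recycleToFreeSegments) {
        int drained = 0;
        while (drained < pendingChunks) {
            byte[] segment = freeSegments.poll();
            if (segment == null) {
                break; // no free segment left, the drain cannot continue
            }
            // Here the chunk would be copied into the segment and flushed downstream.
            if (recycleToFreeSegments) {
                freeSegments.add(segment); // the flushed segment becomes reusable
            }
            drained++;
        }
        return drained;
    }

    public static void main(String[] args) {
        System.out.println(drain(5, newFreeSegments(2, 16), true));
        System.out.println(drain(5, newFreeSegments(2, 16), false));
    }
}
```

With two free segments and five pending chunks, the drain only completes when flushed segments are returned to the pool, which is the situation the ticket describes.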





[jira] [Assigned] (FLINK-35166) Improve the performance of Hybrid Shuffle when enable memory decoupling

2024-04-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-35166:
-

Assignee: Jiang Xin

> Improve the performance of Hybrid Shuffle when enable memory decoupling
> ---
>
> Key: FLINK-35166
> URL: https://issues.apache.org/jira/browse/FLINK-35166
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Jiang Xin
>Assignee: Jiang Xin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> Currently, the tiered result partition creates the SortBufferAccumulator with 
> the number of expected buffers set to min(numSubpartitions+1, 512), so the 
> SortBufferAccumulator may obtain very few buffers when the parallelism is 
> small. We can simply make the number of expected buffers 512 by default to 
> get better performance when enough buffers are available.
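
As a quick illustration of the arithmetic, here is a minimal sketch. The class and method names are hypothetical and are not Flink's code; only the formula min(numSubpartitions+1, 512) and the proposed default of 512 come from the description.

```java
// Hypothetical names; only the formula and the value 512 come from the ticket.
class ExpectedBuffers {

    static final int MAX_EXPECTED_BUFFERS = 512;

    /** Current behaviour as described: capped by the number of subpartitions plus one. */
    static int current(int numSubpartitions) {
        return Math.min(numSubpartitions + 1, MAX_EXPECTED_BUFFERS);
    }

    /** Proposed behaviour as described: always expect 512 buffers. */
    static int proposed(int numSubpartitions) {
        return MAX_EXPECTED_BUFFERS;
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 100, 1000}) {
            System.out.println(n + " subpartitions: current=" + current(n)
                    + ", proposed=" + proposed(n));
        }
    }
}
```

With 4 subpartitions the current formula expects only 5 buffers, while the proposal expects 512. The two agree once numSubpartitions reaches 511 or more.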





[jira] [Created] (FLINK-35214) Update result partition id for remote input channel when unknown input channel is updated

2024-04-22 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-35214:
-

 Summary: Update result partition id for remote input channel when 
unknown input channel is updated
 Key: FLINK-35214
 URL: https://issues.apache.org/jira/browse/FLINK-35214
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.20.0
Reporter: Yuxin Tan


In [FLINK-29768|https://issues.apache.org/jira/browse/FLINK-29768], the result 
partition in the local input channel has been updated to support speculation. 
It is necessary to similarly update the result partition ID in the remote input 
channel.





[jira] [Assigned] (FLINK-35214) Update result partition id for remote input channel when unknown input channel is updated

2024-04-22 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-35214:
-

Assignee: Yuxin Tan

> Update result partition id for remote input channel when unknown input 
> channel is updated
> -
>
> Key: FLINK-35214
> URL: https://issues.apache.org/jira/browse/FLINK-35214
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> In [FLINK-29768|https://issues.apache.org/jira/browse/FLINK-29768], the 
> result partition in the local input channel has been updated to support 
> speculation. It is necessary to similarly update the result partition ID in 
> the remote input channel.





[jira] [Updated] (FLINK-35214) Update result partition id for remote input channel when unknown input channel is updated

2024-04-23 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-35214:
--
Fix Version/s: 1.20.0

> Update result partition id for remote input channel when unknown input 
> channel is updated
> -
>
> Key: FLINK-35214
> URL: https://issues.apache.org/jira/browse/FLINK-35214
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> In [FLINK-29768|https://issues.apache.org/jira/browse/FLINK-29768], the 
> result partition in the local input channel has been updated to support 
> speculation. It is necessary to similarly update the result partition ID in 
> the remote input channel.





[jira] [Commented] (FLINK-24954) Reset read buffer request timeout on buffer recycling for sort-shuffle

2021-11-26 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449442#comment-17449442
 ] 

Yuxin Tan commented on FLINK-24954:
---

Thanks [~kevin.cyj] for assigning the issue to me. I will take a look and 
submit a PR soon.

> Reset read buffer request timeout on buffer recycling for sort-shuffle
> --
>
> Key: FLINK-24954
> URL: https://issues.apache.org/jira/browse/FLINK-24954
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Assignee: Yuxin Tan
>Priority: Major
> Fix For: 1.15.0
>
>
> Currently, the read buffer request timeout implementation of sort-shuffle is 
> a little aggressive, as reported on the mailing list: 
> [https://lists.apache.org/thread/bd3s5bqfg9oxlb1g1gg3pxs3577lhf88]. The 
> TimeoutException may be triggered if there is data skew and the downstream 
> task is slow. We can further improve this case by resetting the 
> request timeout on buffer recycling.
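
The idea of resetting the timeout on recycling can be sketched as follows. This is a minimal illustration, not the sort-shuffle implementation: the class name, its methods, and the explicit nowMillis parameters (used instead of a real clock so the behaviour is easy to follow) are all assumptions.

```java
// Hypothetical sketch; time is passed in explicitly instead of read from a clock.
class BufferRequestDeadline {

    private final long timeoutMillis;
    private long lastProgressMillis;

    BufferRequestDeadline(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = nowMillis;
    }

    /** A recycled buffer counts as progress, so the timeout window starts over. */
    void onBufferRecycled(long nowMillis) {
        lastProgressMillis = nowMillis;
    }

    /** True once more than timeoutMillis has passed since the last progress. */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastProgressMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        BufferRequestDeadline deadline = new BufferRequestDeadline(1000, 0);
        deadline.onBufferRecycled(900);
        // 1500 ms after the request started, but only 600 ms after the last recycle.
        System.out.println(deadline.isExpired(1500));
    }
}
```

Without the reset, a slow downstream task that is still returning buffers would hit the timeout at 1000 ms. With the reset, the request only times out when no buffer has been recycled for a full timeout period.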



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-22096) ServerTransportErrorHandlingTest.testRemoteClose fail

2021-12-13 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458237#comment-17458237
 ] 

Yuxin Tan commented on FLINK-22096:
---

Added two PRs for the release-1.13 and release-1.14 branches.

> ServerTransportErrorHandlingTest.testRemoteClose fail 
> --
>
> Key: FLINK-22096
> URL: https://issues.apache.org/jira/browse/FLINK-22096
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.13.0, 1.14.0, 1.15.0
>Reporter: Guowei Ma
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available, stale-assigned, test-stability
> Fix For: 1.15.0, 1.14.1, 1.13.4
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15966&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0&l=6580
> {code:java}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.415 
> s <<< FAILURE! - in 
> org.apache.flink.runtime.io.network.netty.ServerTransportErrorHandlingTest
> [ERROR] 
> testRemoteClose(org.apache.flink.runtime.io.network.netty.ServerTransportErrorHandlingTest)
>   Time elapsed: 1.338 s  <<< ERROR!
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
>  bind(..) failed: Address already in use
> {code}





[jira] [Commented] (FLINK-30563) Update training exercises to use Flink 1.16

2023-01-04 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654714#comment-17654714
 ] 

Yuxin Tan commented on FLINK-30563:
---

Thanks [~danderson] for reporting this. I would like to take a look at it.

> Update training exercises to use Flink 1.16
> ---
>
> Key: FLINK-30563
> URL: https://issues.apache.org/jira/browse/FLINK-30563
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation / Training / Exercises
>Affects Versions: 1.16.0
>Reporter: David Anderson
>Priority: Major
>
> The training exercises in the 
> [flink-training|https://github.com/apache/flink-training] repo need to be 
> updated to use Flink 1.16.
>  





[jira] [Updated] (FLINK-30471) Optimize the enriching network memory process in SsgNetworkMemoryCalculationUtils

2023-01-08 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30471:
--
Description: 
In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting PartitionTypes 
is run in a separate loop, which hurts performance.  If we want to add 
inputPartitionTypes in a subsequent PR, another separate loop may be 
introduced too, which I think is not a good choice.

Using a separate loop to get each collection looks simpler in code style, 
but it affects performance. We can get all the results of 
maxSubpartitionNums and partitionTypes in one loop instead of multiple 
loops, which will be faster. In this way, when we need to add 
inputPartitionTypes later, we do not need to add a new loop.

  was:
In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting PartitionTypes 
is run in a separate loop, which is not friendly to performance. If we want to 
get inputPartitionTypes, a new separate loop may be introduced too. 

It just looks simpler in code, but it will affect the performance. We can get 
all the results through one loop instead of multiple loops, which will be 
faster.


> Optimize the enriching network memory process in 
> SsgNetworkMemoryCalculationUtils
> -
>
> Key: FLINK-30471
> URL: https://issues.apache.org/jira/browse/FLINK-30471
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting 
> PartitionTypes is run in a separate loop, which hurts performance.  If we 
> want to add inputPartitionTypes in a subsequent PR, another separate loop 
> may be introduced too, which I think is not a good choice.
> Using a separate loop to get each collection looks simpler in code 
> style, but it affects performance. We can get all the results of 
> maxSubpartitionNums and partitionTypes in one loop instead of multiple 
> loops, which will be faster. In this way, when we need to add 
> inputPartitionTypes later, we do not need to add a new loop.
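
The single-loop idea can be sketched like this. It is an illustration only, not the actual enrichNetworkMemory code: the Output and Collected types, their fields, and the string-valued partition types are hypothetical; only the names maxSubpartitionNums and partitionTypes are taken from the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types; only the two collection names come from the ticket.
class SinglePassCollect {

    /** Stand-in for one produced data set of a vertex. */
    static final class Output {
        final String dataSetId;
        final int maxSubpartitions;
        final String partitionType;

        Output(String dataSetId, int maxSubpartitions, String partitionType) {
            this.dataSetId = dataSetId;
            this.maxSubpartitions = maxSubpartitions;
            this.partitionType = partitionType;
        }
    }

    /** Holds both collections so one pass can fill them together. */
    static final class Collected {
        final Map<String, Integer> maxSubpartitionNums = new HashMap<>();
        final Map<String, String> partitionTypes = new HashMap<>();
    }

    /** Fills both maps in a single iteration instead of one loop per map. */
    static Collected collect(List<Output> outputs) {
        Collected collected = new Collected();
        for (Output output : outputs) {
            collected.maxSubpartitionNums.put(output.dataSetId, output.maxSubpartitions);
            collected.partitionTypes.put(output.dataSetId, output.partitionType);
            // A later collection such as inputPartitionTypes could be filled here too.
        }
        return collected;
    }

    public static void main(String[] args) {
        Collected c = collect(Arrays.asList(
                new Output("ds-1", 8, "PIPELINED"),
                new Output("ds-2", 16, "BLOCKING")));
        System.out.println(c.maxSubpartitionNums);
        System.out.println(c.partitionTypes);
    }
}
```

Adding another collection then means adding one more put inside the existing loop rather than another traversal of the outputs.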





[jira] [Assigned] (FLINK-30471) Optimize the enriching network memory process in SsgNetworkMemoryCalculationUtils

2023-01-09 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-30471:
-

Assignee: Yuxin Tan

> Optimize the enriching network memory process in 
> SsgNetworkMemoryCalculationUtils
> -
>
> Key: FLINK-30471
> URL: https://issues.apache.org/jira/browse/FLINK-30471
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting 
> PartitionTypes is run in a separate loop, which hurts performance.  If we 
> want to add inputPartitionTypes in a subsequent PR, another separate loop 
> may be introduced too, which I think is not a good choice.
> Using a separate loop to get each collection looks simpler in code 
> style, but it affects performance. We can get all the results of 
> maxSubpartitionNums and partitionTypes in one loop instead of multiple 
> loops, which will be faster. In this way, when we need to add 
> inputPartitionTypes later, we do not need to add a new loop.





[jira] [Assigned] (FLINK-30472) Modify the default value of the max network memory config option

2023-01-09 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-30472:
-

Assignee: Yuxin Tan

> Modify the default value of the max network memory config option
> 
>
> Key: FLINK-30472
> URL: https://issues.apache.org/jira/browse/FLINK-30472
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> This issue mainly focuses on the second issue in 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager],
>  modifying the default value of taskmanager.memory.network.max





[jira] [Assigned] (FLINK-30473) Optimize the InputGate network memory management for TaskManager

2023-01-09 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-30473:
-

Assignee: Yuxin Tan

> Optimize the InputGate network memory management for TaskManager
> 
>
> Key: FLINK-30473
> URL: https://issues.apache.org/jira/browse/FLINK-30473
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> Based on the 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager],
>  this issue mainly focuses on the first issue.
> This change proposes a method to control the maximum required memory buffers 
> in an inputGate according to parallelism size.





[jira] [Resolved] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2023-01-16 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan resolved FLINK-30469.
---
Resolution: Resolved

master (1.17): ae8de97ef2acc798dae34ad2096ece8886bcf308

> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].
> This is the umbrella ticket to track all the changes of this feature.





[jira] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2023-01-16 Thread Yuxin Tan (Jira)


[ https://issues.apache.org/jira/browse/FLINK-30469 ]


Yuxin Tan deleted comment on FLINK-30469:
---

was (Author: tanyuxin):
master (1.17): ae8de97ef2acc798dae34ad2096ece8886bcf308

> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].
> This is the umbrella ticket to track all the changes of this feature.





[jira] [Closed] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2023-01-16 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan closed FLINK-30469.
-

> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].
> This is the umbrella ticket to track all the changes of this feature.





[jira] [Created] (FLINK-30712) Update network memory configuration docs

2023-01-16 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-30712:
-

 Summary: Update network memory configuration docs 
 Key: FLINK-30712
 URL: https://issues.apache.org/jira/browse/FLINK-30712
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.17.0
Reporter: Yuxin Tan


After https://issues.apache.org/jira/browse/FLINK-30469, the network memory 
configuration docs for TaskManager should also be updated.
The configuration descriptions to be updated mainly include 
`taskmanager.network.memory.buffers-per-channel`
and `taskmanager.network.memory.floating-buffers-per-gate`.
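
To show why these two options matter for the docs, here is a rough sketch of how they feed into the buffer demand of one input gate. It assumes the relation given in the Flink network memory tuning guide (channels times buffers-per-channel, plus floating-buffers-per-gate) and the documented defaults of 2, 8 and a 32 KB segment size; the function names are hypothetical and the sketch is in Python rather than Flink's Java.

```python
# Assumed defaults, taken from the Flink configuration docs as I understand them.
BUFFERS_PER_CHANNEL = 2          # taskmanager.network.memory.buffers-per-channel
FLOATING_BUFFERS_PER_GATE = 8    # taskmanager.network.memory.floating-buffers-per-gate
SEGMENT_SIZE_BYTES = 32 * 1024   # taskmanager.memory.segment-size


def input_gate_buffers(num_channels: int,
                       buffers_per_channel: int = BUFFERS_PER_CHANNEL,
                       floating_buffers: int = FLOATING_BUFFERS_PER_GATE) -> int:
    """Buffers one input gate asks for: exclusive buffers plus the floating pool."""
    return num_channels * buffers_per_channel + floating_buffers


def input_gate_memory_bytes(num_channels: int,
                            segment_size: int = SEGMENT_SIZE_BYTES) -> int:
    """Network memory implied by that buffer count, using the default options."""
    return input_gate_buffers(num_channels) * segment_size


# Hypothetical gate with 1000 input channels.
buffers = input_gate_buffers(1000)
memory_mib = input_gate_memory_bytes(1000) / (1024 * 1024)
```

The demand grows linearly with the number of channels, which is why large-parallelism jobs are the ones that run into the "Insufficient number of network buffers" exception.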





[jira] [Assigned] (FLINK-30712) Update network memory configuration docs

2023-01-16 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-30712:
-

Assignee: Yuxin Tan

> Update network memory configuration docs 
> -
>
> Key: FLINK-30712
> URL: https://issues.apache.org/jira/browse/FLINK-30712
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>
> After [FLINK-30469|https://issues.apache.org/jira/browse/FLINK-30469], the 
> network memory configuration docs for TaskManager should also be updated.
> The configuration descriptions to be updated mainly include 
> `taskmanager.network.memory.buffers-per-channel`
> and `taskmanager.network.memory.floating-buffers-per-gate`.





[jira] [Updated] (FLINK-30712) Update network memory configuration docs

2023-01-16 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30712:
--
Description: 
After [FLINK-30469|https://issues.apache.org/jira/browse/FLINK-30469], the 
network memory configuration docs for TaskManager should also be updated.
The configuration descriptions to be updated mainly include 
`taskmanager.network.memory.buffers-per-channel`
and `taskmanager.network.memory.floating-buffers-per-gate`.

  was:
After https://issues.apache.org/jira/browse/FLINK-30469, the network memory 
configuration docs for TaskManager should also be updated.
The configuration descriptions to be updated mainly include 
`taskmanager.network.memory.buffers-per-channel`
and `taskmanager.network.memory.floating-buffers-per-gate`.


> Update network memory configuration docs 
> -
>
> Key: FLINK-30712
> URL: https://issues.apache.org/jira/browse/FLINK-30712
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> After [FLINK-30469|https://issues.apache.org/jira/browse/FLINK-30469], the 
> network memory configuration docs for TaskManager should also be updated.
> The configuration descriptions to be updated mainly include 
> `taskmanager.network.memory.buffers-per-channel`
> and `taskmanager.network.memory.floating-buffers-per-gate`.





[jira] [Commented] (FLINK-30627) Refactor FileSystemTableSink to use FileSink instead of StreamingFileSink

2023-02-10 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686995#comment-17686995
 ] 

Yuxin Tan commented on FLINK-30627:
---

[~martijnvisser] Hi Martijn, may I take a look at this issue? I will prepare 
a PR soon.

> Refactor FileSystemTableSink to use FileSink instead of StreamingFileSink
> -
>
> Key: FLINK-30627
> URL: https://issues.apache.org/jira/browse/FLINK-30627
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem
>Reporter: Martijn Visser
>Priority: Major
>
> In order to be able to remove the StreamingFileSink, the FileSystemTableSink 
> needs to be refactored to no longer depend on StreamingFileSink but on 
> FileSink





[jira] (FLINK-30627) Refactor FileSystemTableSink to use FileSink instead of StreamingFileSink

2023-02-10 Thread Yuxin Tan (Jira)


[ https://issues.apache.org/jira/browse/FLINK-30627 ]


Yuxin Tan deleted comment on FLINK-30627:
---

was (Author: tanyuxin):
[~martijnvisser] Hi Martijn, may I take a look at this issue? I will prepare 
a PR soon.

> Refactor FileSystemTableSink to use FileSink instead of StreamingFileSink
> -
>
> Key: FLINK-30627
> URL: https://issues.apache.org/jira/browse/FLINK-30627
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem
>Reporter: Martijn Visser
>Priority: Major
>
> In order to be able to remove the StreamingFileSink, the FileSystemTableSink 
> needs to be refactored to no longer depend on StreamingFileSink but on 
> FileSink





[jira] [Commented] (FLINK-30627) Refactor FileSystemTableSink to use FileSink instead of StreamingFileSink

2023-02-12 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687680#comment-17687680
 ] 

Yuxin Tan commented on FLINK-30627:
---

[~martijnvisser] Sorry that I underestimated the complexity of this problem at 
the beginning. I don't have a good solution yet.

> Refactor FileSystemTableSink to use FileSink instead of StreamingFileSink
> -
>
> Key: FLINK-30627
> URL: https://issues.apache.org/jira/browse/FLINK-30627
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem
>Reporter: Martijn Visser
>Priority: Major
>
> {{FileSystemTableSink}} currently depends on most of the capabilities from 
> {{StreamingFileSink}}, for example 
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/table/FileSystemTableSink.java#L223-L243
> This is necessary to complete FLINK-28641





[jira] [Commented] (FLINK-30983) the security.ssl.algorithms configuration does not take effect in rest ssl

2023-02-14 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688842#comment-17688842
 ] 

Yuxin Tan commented on FLINK-30983:
---

[~lyssg] Hi, I will take a look at the issue.

> the security.ssl.algorithms configuration does not take effect in rest ssl
> --
>
> Key: FLINK-30983
> URL: https://issues.apache.org/jira/browse/FLINK-30983
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.16.0
>Reporter: luyuan
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-02-09-15-58-36-254.png, 
> image-2023-02-09-15-58-43-963.png
>
>
> The security.ssl.algorithms configuration does not take effect for REST SSL.
>
> SSLUtils#createRestNettySSLContext does not call SslContextBuilder#ciphers, 
> whereas SSLUtils#createInternalNettySSLContext does.
> !image-2023-02-09-15-58-36-254.png!
>  
> !image-2023-02-09-15-58-43-963.png!





[jira] [Commented] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-12-08 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644656#comment-17644656
 ] 

Yuxin Tan commented on FLINK-29329:
---

I have not been able to reproduce this again, so I am closing it for now. If I 
encounter it again, I will reopen the issue and give more details.

> Checkpoint can not be triggered if encountering OOM
> ---
>
> Key: FLINK-29329
> URL: https://issues.apache.org/jira/browse/FLINK-29329
> Project: Flink
>  Issue Type: Bug
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.13.7
>
> Attachments: job-exceptions-1.txt
>
>
> When writing a checkpoint, an OOM error is thrown. The JM does not fail; it 
> appears to be restored instead, because I found the log message "No master 
> state to restore". After that, the JM never triggers checkpoints again. The 
> root cause is not yet clear. This may be a bug, and we should handle OOM and 
> other exceptions when making checkpoints.





[jira] [Closed] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-12-08 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan closed FLINK-29329.
-
Resolution: Cannot Reproduce

> Checkpoint can not be triggered if encountering OOM
> ---
>
> Key: FLINK-29329
> URL: https://issues.apache.org/jira/browse/FLINK-29329
> Project: Flink
>  Issue Type: Bug
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.13.7
>
> Attachments: job-exceptions-1.txt
>
>
> When writing a checkpoint, an OOM error is thrown. The JM does not fail; it 
> appears to be restored instead, because I found the log message "No master 
> state to restore". After that, the JM never triggers checkpoints again. The 
> root cause is not yet clear. This may be a bug, and we should handle OOM and 
> other exceptions when making checkpoints.





[jira] [Created] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2022-12-20 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-30469:
-

 Summary: FLIP-266: Simplify network memory configurations for 
TaskManager
 Key: FLINK-30469
 URL: https://issues.apache.org/jira/browse/FLINK-30469
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.17.0
Reporter: Yuxin Tan


When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.


When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|[https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].]





[jira] [Updated] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2022-12-20 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30469:
--
Description: 
When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.

When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|[https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager]].

  was:
When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.


When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|[https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].]


> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|[https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager]].





[jira] [Updated] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2022-12-20 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30469:
--
Description: 
When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.

When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].

  was:
When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.

When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|[https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager]].


> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].





[jira] [Created] (FLINK-30471) Optimize the enriching network memory process in SsgNetworkMemoryCalculationUtils

2022-12-21 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-30471:
-

 Summary: Optimize the enriching network memory process in 
SsgNetworkMemoryCalculationUtils
 Key: FLINK-30471
 URL: https://issues.apache.org/jira/browse/FLINK-30471
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.17.0
 Environment: In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, 
getting PartitionTypes is run in a separate loop, which is not friendly to 
performance. If we want to get inputPartitionTypes, a new separate loop may be 
introduced too. 

It just looks simpler in code, but it will affect the performance. We can get 
all the results through one loop instead of multiple loops, which will be 
faster.
Reporter: Yuxin Tan








[jira] [Created] (FLINK-30472) Modify the default value of the max network memory config option

2022-12-21 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-30472:
-

 Summary: Modify the default value of the max network memory config 
option
 Key: FLINK-30472
 URL: https://issues.apache.org/jira/browse/FLINK-30472
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.17.0
Reporter: Yuxin Tan








[jira] [Updated] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2022-12-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30469:
--
Fix Version/s: (was: 1.17.0)

> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].





[jira] [Updated] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2022-12-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30469:
--
Fix Version/s: 1.17.0

> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.17.0
>
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].





[jira] [Created] (FLINK-30473) Optimize the InputGate network memory management for TaskManager

2022-12-21 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-30473:
-

 Summary: Optimize the InputGate network memory management for 
TaskManager
 Key: FLINK-30473
 URL: https://issues.apache.org/jira/browse/FLINK-30473
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.17.0
Reporter: Yuxin Tan


Based on the 
[FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager],
 this issue mainly focuses on the first issue.

This change proposes a method to control the maximum required memory buffers in 
an inputGate according to parallelism size.
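
As a rough illustration of what "controlling the maximum required buffers according to parallelism" could look like, here is a minimal sketch. The policy, the function name `gate_buffer_plan` and the `max_required` threshold are all assumptions made for this example; they are not taken from FLIP-266 or from the Flink code, and the sketch is in Python rather than Java.

```python
from typing import Tuple


def gate_buffer_plan(num_channels: int,
                     buffers_per_channel: int,
                     max_required: int) -> Tuple[int, int]:
    """Return (exclusive buffers per channel, required buffers for the gate).

    Hypothetical policy: keep the configured exclusive buffers while their
    total fits under max_required, otherwise lower the per-channel count.
    """
    exclusive = buffers_per_channel
    # Lower the per-channel exclusive count until the total fits under the cap.
    while exclusive > 0 and num_channels * exclusive > max_required:
        exclusive -= 1
    if exclusive > 0:
        required = num_channels * exclusive
    else:
        # Not even one exclusive buffer per channel fits: the gate would rely
        # on a shared pool instead, and the requirement is held at the cap.
        required = max_required
    return exclusive, required


# Hypothetical gates at three parallelism levels, with a cap of 1000 buffers.
small = gate_buffer_plan(100, 2, 1000)
medium = gate_buffer_plan(600, 2, 1000)
large = gate_buffer_plan(5000, 2, 1000)
```

Under this assumed policy the required buffers stop growing once the parallelism pushes the exclusive total past the cap, so a large job no longer needs a proportionally large guaranteed pool to start.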





[jira] [Updated] (FLINK-30472) Modify the default value of the max network memory config option

2022-12-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30472:
--
Description: This sub-task addresses the second issue described in 
[FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager],
 namely modifying the default value of taskmanager.memory.network.max.

> Modify the default value of the max network memory config option
> 
>
> Key: FLINK-30472
> URL: https://issues.apache.org/jira/browse/FLINK-30472
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> This sub-task addresses the second issue described in 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager],
>  namely modifying the default value of taskmanager.memory.network.max.





[jira] [Updated] (FLINK-30471) Optimize the enriching network memory process in SsgNetworkMemoryCalculationUtils

2022-12-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30471:
--
Environment: (was: In 
SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting PartitionTypes is 
run in a separate loop, which is not friendly to performance. If we want to get 
inputPartitionTypes, a new separate loop may be introduced too. 

It just looks simpler in code, but it will affect the performance. We can get 
all the results through one loop instead of multiple loops, which will be 
faster.)

> Optimize the enriching network memory process in 
> SsgNetworkMemoryCalculationUtils
> -
>
> Key: FLINK-30471
> URL: https://issues.apache.org/jira/browse/FLINK-30471
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting 
> PartitionTypes is run in a separate loop, which is not friendly to 
> performance. If we want to get inputPartitionTypes, a new separate loop may 
> be introduced too. 
> It just looks simpler in code, but it will affect the performance. We can get 
> all the results through one loop instead of multiple loops, which will be 
> faster.





[jira] [Updated] (FLINK-30471) Optimize the enriching network memory process in SsgNetworkMemoryCalculationUtils

2022-12-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30471:
--
Description: 
In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, the PartitionTypes 
are collected in a separate loop, which hurts performance. If we also want to 
collect inputPartitionTypes, yet another separate loop may be introduced. 

Separate loops only look simpler in code. Collecting all the results in a 
single loop avoids traversing the same data repeatedly and will be faster.
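
The single-loop idea can be illustrated with a small sketch. It is written in Python rather than Flink's Java, and the `Partition` record, its fields and both function names are hypothetical; this is not the actual SsgNetworkMemoryCalculationUtils code, only the general pattern of gathering several collections in one traversal.

```python
from typing import Dict, NamedTuple, Sequence, Tuple


class Partition(NamedTuple):
    """Hypothetical stand-in for one produced result partition."""
    result_id: str
    num_subpartitions: int
    partition_type: str


Collected = Tuple[Dict[str, int], Dict[str, str]]


def collect_in_separate_loops(partitions: Sequence[Partition]) -> Collected:
    # First traversal: largest subpartition count per result.
    max_subpartition_nums: Dict[str, int] = {}
    for p in partitions:
        current = max_subpartition_nums.get(p.result_id, 0)
        max_subpartition_nums[p.result_id] = max(current, p.num_subpartitions)
    # Second traversal over the same data: partition type per result.
    partition_types: Dict[str, str] = {}
    for p in partitions:
        partition_types[p.result_id] = p.partition_type
    return max_subpartition_nums, partition_types


def collect_in_one_loop(partitions: Sequence[Partition]) -> Collected:
    # Single traversal that fills both collections at once.
    max_subpartition_nums: Dict[str, int] = {}
    partition_types: Dict[str, str] = {}
    for p in partitions:
        current = max_subpartition_nums.get(p.result_id, 0)
        max_subpartition_nums[p.result_id] = max(current, p.num_subpartitions)
        partition_types[p.result_id] = p.partition_type
    return max_subpartition_nums, partition_types


partitions = [
    Partition("r1", 4, "PIPELINED_BOUNDED"),
    Partition("r1", 8, "PIPELINED_BOUNDED"),
    Partition("r2", 2, "BLOCKING"),
]
one_loop_result = collect_in_one_loop(partitions)
```

Both functions return the same collections. In the single-loop version, adding a third collection means adding one more line inside the existing loop rather than a new traversal.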

> Optimize the enriching network memory process in 
> SsgNetworkMemoryCalculationUtils
> -
>
> Key: FLINK-30471
> URL: https://issues.apache.org/jira/browse/FLINK-30471
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.17.0
> Environment: In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, 
> getting PartitionTypes is run in a separate loop, which is not friendly to 
> performance. If we want to get inputPartitionTypes, a new separate loop may 
> be introduced too. 
> It just looks simpler in code, but it will affect the performance. We can get 
> all the results through one loop instead of multiple loops, which will be 
> faster.
>Reporter: Yuxin Tan
>Priority: Major
>
> In SsgNetworkMemoryCalculationUtils#enrichNetworkMemory, getting 
> PartitionTypes is run in a separate loop, which is not friendly to 
> performance. If we want to get inputPartitionTypes, a new separate loop may 
> be introduced too. 
> It just looks simpler in code, but it will affect the performance. We can get 
> all the results through one loop instead of multiple loops, which will be 
> faster.





[jira] [Updated] (FLINK-30469) FLIP-266: Simplify network memory configurations for TaskManager

2022-12-21 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-30469:
--
Description: 
When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.

When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].

This is the umbrella ticket to track all the changes of this feature.

  was:
When using Flink, users may encounter the following issues that affect 
usability.
1. The job may fail with an "Insufficient number of network buffers" exception.
2. Flink network memory size adjustment is complex.

When encountering these issues, users can solve some problems by adding or 
adjusting parameters. However, multiple memory config options should be 
changed. The config option adjustment requires understanding the detailed 
internal implementation, which is impractical for most users.

To resolve the issues, we propose some improvement solutions. For more details 
see 
[FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].


> FLIP-266: Simplify network memory configurations for TaskManager
> 
>
> Key: FLINK-30469
> URL: https://issues.apache.org/jira/browse/FLINK-30469
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.17.0
>Reporter: Yuxin Tan
>Priority: Major
>
> When using Flink, users may encounter the following issues that affect 
> usability.
> 1. The job may fail with an "Insufficient number of network buffers" 
> exception.
> 2. Flink network memory size adjustment is complex.
> When encountering these issues, users can solve some problems by adding or 
> adjusting parameters. However, multiple memory config options should be 
> changed. The config option adjustment requires understanding the detailed 
> internal implementation, which is impractical for most users.
> To resolve the issues, we propose some improvement solutions. For more 
> details see 
> [FLIP-266|https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager].
> This is the umbrella ticket to track all the changes of this feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-28446) Expose more information in PartitionDescriptor to support more optimized Shuffle Service

2022-07-20 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan closed FLINK-28446.
-

> Expose more information in PartitionDescriptor to support more optimized 
> Shuffle Service
> 
>
> Key: FLINK-28446
> URL: https://issues.apache.org/jira/browse/FLINK-28446
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Critical
>  Labels: pull-request-available
>
> To improve shuffle performance, more detailed fields should be added to 
> PartitionDescriptor, for example, whether the partition is a broadcast 
> partition or the distribution pattern of the intermediate result.
> After the detailed information is added, the Shuffle Service can use 
> different Shuffle strategies in different situations, which may lead to 
> better shuffle performance.
> In addition, we only add fields to PartitionDescriptor, so it stays 
> compatible with the previous version, and the added fields are transparent 
> to existing users.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28823) Enlarge the max requested buffers for SortMergeResultPartitionReadScheduler

2022-08-04 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-28823:
-

 Summary: Enlarge the max requested buffers for 
SortMergeResultPartitionReadScheduler
 Key: FLINK-28823
 URL: https://issues.apache.org/jira/browse/FLINK-28823
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.16.0
Reporter: Yuxin Tan


The maximum number of requested buffers in 
SortMergeResultPartitionReadScheduler is determined by the number of 
subpartitions, which may be insufficient in some scenarios, such as 
speculative execution and partition reuse.

Therefore, we should increase the maximum number of requested buffers to make 
full use of available memory when TM memory is sufficient.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28826) Avoid notifying too frequently when recycling buffers for BatchShuffleReadBufferPool

2022-08-05 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-28826:
-

 Summary: Avoid notifying too frequently when recycling buffers for 
BatchShuffleReadBufferPool
 Key: FLINK-28826
 URL: https://issues.apache.org/jira/browse/FLINK-28826
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.16.0
Reporter: Yuxin Tan


When recycling buffers in BatchShuffleReadBufferPool, the number of buffers may 
be larger than numBuffersPerRequest, which may cause too frequent notifications 
that the buffers are already available.

So we should change the condition under which the buffer-available 
notification is sent to solve this problem.
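
One possible reading of the problem, sketched below, is that a check of the
form "available >= numBuffersPerRequest" fires on every recycle once the pool
is above the threshold. The sketch compares that with notifying only when the
count crosses the threshold. The class and its counters are hypothetical and
are not taken from BatchShuffleReadBufferPool; the actual fix may use a
different condition.

```java
/** Illustration only: a hypothetical model, not BatchShuffleReadBufferPool. */
class RecycleNotifySketch {
    private final int numBuffersPerRequest;
    private int availableBuffers;
    private int naiveNotifications;
    private int crossingNotifications;

    RecycleNotifySketch(int numBuffersPerRequest) {
        this.numBuffersPerRequest = numBuffersPerRequest;
    }

    /** Recycles one buffer and counts what each notification condition would do. */
    void recycleOne() {
        boolean wasBelow = availableBuffers < numBuffersPerRequest;
        availableBuffers++;
        if (availableBuffers >= numBuffersPerRequest) {
            // Naive condition: notify whenever enough buffers are available.
            naiveNotifications++;
            // Assumed alternative: notify only when the threshold is first reached.
            if (wasBelow) {
                crossingNotifications++;
            }
        }
    }

    int naiveNotifications() {
        return naiveNotifications;
    }

    int crossingNotifications() {
        return crossingNotifications;
    }

    public static void main(String[] args) {
        RecycleNotifySketch pool = new RecycleNotifySketch(4);
        for (int i = 0; i < 10; i++) {
            pool.recycleOne();
        }
        System.out.println(pool.naiveNotifications() + " vs " + pool.crossingNotifications());
    }
}
```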



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28828) Sorting all unfinished readers in batches at one time in SortMergeResultPartitionReadScheduler

2022-08-05 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-28828:
-

 Summary: Sorting all unfinished readers in batches at one time in 
SortMergeResultPartitionReadScheduler
 Key: FLINK-28828
 URL: https://issues.apache.org/jira/browse/FLINK-28828
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.16.0
Reporter: Yuxin Tan


Currently, when reading data in SortMergeResultPartitionReadScheduler, the 
reader is added back to the priority queue immediately. However, the data read 
from this reader may not have been consumed yet, so the reader is ranked later 
in the queue, which is unfavorable to sequential reading.

To solve the issue, after reading the data we should sort all unfinished 
readers in one batch, that is, add all unfinished readers to the priority 
queue together, which is more conducive to sequential reading.
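
The batch re-queue idea can be sketched roughly as follows. This is a
simplified model under assumptions: the Reader type, its offset-based ordering
and the readRound method are hypothetical, and the real scheduler also deals
with buffers, I/O and locking that are left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustration only: a hypothetical model of re-queueing readers in one batch. */
class BatchRequeueSketch {

    /** Hypothetical reader, ordered by its current file offset. */
    static final class Reader implements Comparable<Reader> {
        final String name;
        final long end;
        long offset;

        Reader(String name, long offset, long end) {
            this.name = name;
            this.offset = offset;
            this.end = end;
        }

        void readChunk(long bytes) {
            offset = Math.min(end, offset + bytes);
        }

        boolean finished() {
            return offset >= end;
        }

        @Override
        public int compareTo(Reader other) {
            return Long.compare(offset, other.offset);
        }
    }

    /**
     * One scheduling round: poll readers in offset order, read from each, and
     * add the unfinished ones back only after the whole round is done.
     */
    static List<String> readRound(PriorityQueue<Reader> queue, long chunkBytes) {
        List<String> readOrder = new ArrayList<>();
        List<Reader> unfinished = new ArrayList<>();
        Reader reader;
        while ((reader = queue.poll()) != null) {
            readOrder.add(reader.name);
            reader.readChunk(chunkBytes);
            if (!reader.finished()) {
                unfinished.add(reader);
            }
        }
        // Offsets changed while the readers were outside the queue, so the
        // queue orders them by their new positions when they are added back.
        queue.addAll(unfinished);
        return readOrder;
    }

    public static void main(String[] args) {
        PriorityQueue<Reader> queue = new PriorityQueue<>();
        queue.add(new Reader("b", 50, 100));
        queue.add(new Reader("a", 0, 100));
        queue.add(new Reader("c", 200, 250));
        System.out.println(readRound(queue, 60) + ", left in queue: " + queue.size());
    }
}
```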



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28789) TPC-DS tests failed due to release input gate for task failure

2022-08-07 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576508#comment-17576508
 ] 

Yuxin Tan commented on FLINK-28789:
---

The commit was reverted on Aug 05, and the issue has been resolved.

> TPC-DS tests failed due to release input gate for task failure
> 
>
> Key: FLINK-28789
> URL: https://issues.apache.org/jira/browse/FLINK-28789
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.16.0
>Reporter: Leonard Xu
>Assignee: Yuxin Tan
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.16.0
>
>
> {code:java}
> switched from CANCELING to CANCELED.
> 2022-08-03 08:03:02,776 INFO  org.apache.flink.runtime.taskmanager.Task   
>  [] - Freeing task resources for MultipleInput[2212] -> 
> Calc[2191] -> HashAggregate[2192] (8/8)#1 
> (cf5f33b100f0efb21b9ff8d27a78cd8e_d806bb3f5ea308ac3f1df304a96163b4_7_1).
> 2022-08-03 08:03:02,776 ERROR org.apache.flink.runtime.taskmanager.Task   
>  [] - Failed to release input gate for task MultipleInput[2212] 
> -> Calc[2191] -> HashAggregate[2192] (8/8)#1.
> org.apache.flink.shaded.netty4.io.netty.util.IllegalReferenceCountException: 
> refCnt: 0, decrement: 1
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:74)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:138)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.io.network.buffer.ReadOnlySlicedNetworkBuffer.recycleBuffer(ReadOnlySlicedNetworkBuffer.java:123)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.io.network.buffer.CompositeBuffer.recycleBuffer(CompositeBuffer.java:70)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_332]
>   at 
> org.apache.flink.runtime.io.network.partition.SortMergeSubpartitionReader.releaseInternal(SortMergeSubpartitionReader.java:181)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.io.network.partition.SortMergeSubpartitionReader.releaseAllResources(SortMergeSubpartitionReader.java:163)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.releaseAllResources(LocalInputChannel.java:341)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.close(SingleInputGate.java:667)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.close(InputGateWithMetrics.java:140)
>  ~[flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskmanager.Task.closeAllInputGates(Task.java:1010) 
> [flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at 
> org.apache.flink.runtime.taskmanager.Task.releaseResources(Task.java:975) 
> [flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:820) 
> [flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) 
> [flink-dist-1.16-SNAPSHOT.jar:1.16-SNAPSHOT]
>   at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
> 2022-08-03 08:03:02,778 WARN  org.apache.flink.metrics.MetricGroup 
> {code}
> The failed CI link: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=39152&view=results



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28844) YARNHighAvailabilityITCase fails with NoSuchMethod of org.apache.curator

2022-08-07 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576551#comment-17576551
 ] 

Yuxin Tan commented on FLINK-28844:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=39534&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=29842

> YARNHighAvailabilityITCase fails with NoSuchMethod of org.apache.curator
> 
>
> Key: FLINK-28844
> URL: https://issues.apache.org/jira/browse/FLINK-28844
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.16.0
>Reporter: Jark Wu
>Assignee: Chesnay Schepler
>Priority: Blocker
> Fix For: 1.16.0
>
>
> This keeps failing on master since the commit 
> https://github.com/flink-ci/flink-mirror/commit/6335b573863af2b30a6541f910be96ddf61f9c84
>  which removes the curator-test dependency from the flink-test-utils module. 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=39394&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461
> {code}
> 2022-08-05T18:31:47.0438160Z Aug 05 18:31:47 [ERROR] Tests run: 1, Failures: 
> 0, Errors: 1, Skipped: 0, Time elapsed: 5.237 s <<< FAILURE! - in 
> org.apache.flink.yarn.YARNHighAvailabilityITCase
> 2022-08-05T18:31:47.0439564Z Aug 05 18:31:47 [ERROR] 
> org.apache.flink.yarn.YARNHighAvailabilityITCase  Time elapsed: 5.237 s  <<< 
> ERROR!
> 2022-08-05T18:31:47.0440370Z Aug 05 18:31:47 java.lang.NoSuchMethodError: 
> org.apache.curator.test.InstanceSpec.getHostname()Ljava/lang/String;
> 2022-08-05T18:31:47.0441582Z Aug 05 18:31:47  at 
> org.apache.flink.runtime.testutils.ZooKeeperTestUtils.getZookeeperInstanceSpecWithIncreasedSessionTimeout(ZooKeeperTestUtils.java:71)
> 2022-08-05T18:31:47.0442643Z Aug 05 18:31:47  at 
> org.apache.flink.runtime.testutils.ZooKeeperTestUtils.createAndStartZookeeperTestingServer(ZooKeeperTestUtils.java:49)
> 2022-08-05T18:31:47.0443461Z Aug 05 18:31:47  at 
> org.apache.flink.yarn.YARNHighAvailabilityITCase.setup(YARNHighAvailabilityITCase.java:114)
> 2022-08-05T18:31:47.0444094Z Aug 05 18:31:47  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2022-08-05T18:31:47.0444717Z Aug 05 18:31:47  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2022-08-05T18:31:47.0445424Z Aug 05 18:31:47  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2022-08-05T18:31:47.0446063Z Aug 05 18:31:47  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2022-08-05T18:31:47.0446818Z Aug 05 18:31:47  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
> 2022-08-05T18:31:47.0447822Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2022-08-05T18:31:47.0448657Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2022-08-05T18:31:47.0449692Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2022-08-05T18:31:47.0450637Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptLifecycleMethod(TimeoutExtension.java:126)
> 2022-08-05T18:31:47.0451443Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptBeforeAllMethod(TimeoutExtension.java:68)
> 2022-08-05T18:31:47.0452304Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2022-08-05T18:31:47.0453162Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2022-08-05T18:31:47.0454013Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2022-08-05T18:31:47.0454882Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2022-08-05T18:31:47.0455716Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2022-08-05T18:31:47.0456525Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2022-08-05T18:31:47.0457512Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2022-08-05T18:31:47.0458637Z Aug 05 18:31:47  at 
> org.junit.jupiter.engine.execution.Executabl

[jira] [Created] (FLINK-28942) Deadlock may occur when releasing readers for SortMergeResultPartition

2022-08-11 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-28942:
-

 Summary: Deadlock may occur when releasing readers for 
SortMergeResultPartition
 Key: FLINK-28942
 URL: https://issues.apache.org/jira/browse/FLINK-28942
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Yuxin Tan


After the logic of recycling buffers in CompositeBuffer was added in 
https://issues.apache.org/jira/browse/FLINK-28373, a deadlock between the lock 
of SortMergeResultPartition and the lock of SingleInputGate may occur when 
data is read and buffers are recycled simultaneously.

In short, the deadlock may occur as follows.

1. SingleInputGate.getNextBufferOrEvent (SingleInputGate lock) -> 
CompositeBuffer.getFullBufferData -> CompositeBuffer.recycleBuffer -> waiting 
for SortMergeResultPartition lock;

2. ResultPartitionManager.releasePartition (SortMergeResultPartition lock) -> 
SortMergeSubpartitionReader.notifyDataAvailable -> 
SingleInputGate.notifyChannelNonEmpty -> waiting for SingleInputGate lock.

The possibility of this deadlock is very small, but we should fix the bug as 
soon as possible.
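
The opposite lock order described above can be reproduced in a small model.
The sketch below is an assumption-laden illustration: gateLock and
partitionLock are plain ReentrantLocks standing in for the SingleInputGate and
SortMergeResultPartition locks, and tryLock is used so that the program
reports the conflict instead of actually hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Illustration only: models two threads taking the same two locks in opposite order. */
class LockOrderSketch {

    /** Returns how many of the two threads could not obtain their second lock. */
    static int blockedThreads() {
        ReentrantLock gateLock = new ReentrantLock();      // stand-in for the SingleInputGate lock
        ReentrantLock partitionLock = new ReentrantLock(); // stand-in for the SortMergeResultPartition lock
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger blocked = new AtomicInteger();

        // Path 1: holds the gate lock, then needs the partition lock.
        Thread consumer = new Thread(
                () -> acquireInOrder(gateLock, partitionLock, barrier, blocked));
        // Path 2: holds the partition lock, then needs the gate lock.
        Thread releaser = new Thread(
                () -> acquireInOrder(partitionLock, gateLock, barrier, blocked));

        consumer.start();
        releaser.start();
        try {
            consumer.join();
            releaser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked.get();
    }

    private static void acquireInOrder(
            ReentrantLock first, ReentrantLock second, CyclicBarrier barrier, AtomicInteger blocked) {
        first.lock();
        try {
            barrier.await(); // both threads now hold their first lock
            if (second.tryLock()) {
                second.unlock();
            } else {
                blocked.incrementAndGet(); // with lock() instead of tryLock() this thread would wait forever
            }
            barrier.await(); // keep the first lock until the other thread has tried as well
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads unable to take their second lock: " + blockedThreads());
    }
}
```

Both threads fail to take their second lock in this model, which is the
situation that turns into a deadlock when blocking acquisition is used.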



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28446) Expose more information in PartitionDescriptor to support more optimized Shuffle Service

2022-07-07 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-28446:
-

 Summary: Expose more information in PartitionDescriptor to support 
more optimized Shuffle Service
 Key: FLINK-28446
 URL: https://issues.apache.org/jira/browse/FLINK-28446
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Yuxin Tan


To improve shuffle performance, more detailed fields should be added to 
PartitionDescriptor, for example, whether the partition is a broadcast 
partition or the distribution pattern of the intermediate result.

After the detailed information is added, the Shuffle Service can use different 
Shuffle strategies in different situations, which may lead to better shuffle 
performance.

In addition, we only add fields to PartitionDescriptor, so it stays compatible 
with the previous version, and the added fields are transparent to existing 
users.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-09-18 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-29329:
-

 Summary: Checkpoint can not be triggered if encountering OOM
 Key: FLINK-29329
 URL: https://issues.apache.org/jira/browse/FLINK-29329
 Project: Flink
  Issue Type: Bug
Reporter: Yuxin Tan
 Fix For: 1.13.7
 Attachments: job-exceptions.txt

When writing a checkpoint, an OOM error is thrown. However, the JM does not 
fail; it appears to be restored instead, because I found the log message "No 
master state to restore".

After that, the JM never triggers checkpoints again. The root cause is not yet 
clear. This may be a bug, and we should handle the OOM and other exceptions 
when making checkpoints.

 

[^job-exceptions.txt]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-09-18 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-29329:
--
Description: 
When writing a checkpoint, an OOM error is thrown. However, the JM does not 
fail; it appears to be restored instead, because I found the log message "No 
master state to restore".

After that, the JM never triggers checkpoints again. The root cause is not yet 
clear. This may be a bug, and we should handle the OOM and other exceptions 
when making checkpoints.

  was:
When writing a checkpoint, an OOM error is thrown. However, the JM does not 
fail; it appears to be restored instead, because I found the log message "No 
master state to restore".

After that, the JM never triggers checkpoints again. The root cause is not yet 
clear. This may be a bug, and we should handle the OOM and other exceptions 
when making checkpoints.

 

[^job-exceptions.txt]


> Checkpoint can not be triggered if encountering OOM
> ---
>
> Key: FLINK-29329
> URL: https://issues.apache.org/jira/browse/FLINK-29329
> Project: Flink
>  Issue Type: Bug
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.13.7
>
>
> When writing a checkpoint, an OOM error is thrown. However, the JM does not 
> fail; it appears to be restored instead, because I found the log message "No 
> master state to restore".
> After that, the JM never triggers checkpoints again. The root cause is not 
> yet clear. This may be a bug, and we should handle the OOM and other 
> exceptions when making checkpoints.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-09-18 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-29329:
--
Attachment: (was: job-exceptions.txt)

> Checkpoint can not be triggered if encountering OOM
> ---
>
> Key: FLINK-29329
> URL: https://issues.apache.org/jira/browse/FLINK-29329
> Project: Flink
>  Issue Type: Bug
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.13.7
>
>
> When writing a checkpoint, an OOM error is thrown. However, the JM does not 
> fail; it appears to be restored instead, because I found the log message "No 
> master state to restore".
> After that, the JM never triggers checkpoints again. The root cause is not 
> yet clear. This may be a bug, and we should handle the OOM and other 
> exceptions when making checkpoints.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-09-18 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-29329:
--
Attachment: job-exceptions-1.txt

> Checkpoint can not be triggered if encountering OOM
> ---
>
> Key: FLINK-29329
> URL: https://issues.apache.org/jira/browse/FLINK-29329
> Project: Flink
>  Issue Type: Bug
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.13.7
>
> Attachments: job-exceptions-1.txt
>
>
> When writing a checkpoint, an OOM error is thrown. However, the JM does not 
> fail; it appears to be restored instead, because I found the log message "No 
> master state to restore".
> After that, the JM never triggers checkpoints again. The root cause is not 
> yet clear. This may be a bug, and we should handle the OOM and other 
> exceptions when making checkpoints.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29329) Checkpoint can not be triggered if encountering OOM

2022-09-19 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606476#comment-17606476
 ] 

Yuxin Tan commented on FLINK-29329:
---

[~yunta] Thanks for the quick reply. I mean that no checkpoints are triggered 
by the JM once the OOM occurs.

> Checkpoint can not be triggered if encountering OOM
> ---
>
> Key: FLINK-29329
> URL: https://issues.apache.org/jira/browse/FLINK-29329
> Project: Flink
>  Issue Type: Bug
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: 1.13.7
>
> Attachments: job-exceptions-1.txt
>
>
> When writing a checkpoint, an OOM error is thrown. However, the JM does not 
> fail; it appears to be restored instead, because I found the log message "No 
> master state to restore".
> After that, the JM never triggers checkpoints again. The root cause is not 
> yet clear. This may be a bug, and we should handle the OOM and other 
> exceptions when making checkpoints.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29134) fetch metrics may cause oom(ThreadPool task pile up)

2022-10-20 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620935#comment-17620935
 ] 

Yuxin Tan commented on FLINK-29134:
---

[~xtsong] Thanks for your idea. OK, I will take a look at this issue.

> fetch metrics may cause oom(ThreadPool task pile up)
> 
>
> Key: FLINK-29134
> URL: https://issues.apache.org/jira/browse/FLINK-29134
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.11.0, 1.15.2
>Reporter: Sitan Pang
>Assignee: Yuxin Tan
>Priority: Major
> Attachments: dump-queueTask.png, dump-threadPool.png
>
>
> When we call queryMetrics, we use a thread pool to process the data returned 
> by the TMs. 
> {code:java}
> private void queryMetrics(final MetricQueryServiceGateway queryServiceGateway) {
>     LOG.debug("Query metrics for {}.", queryServiceGateway.getAddress());
>     queryServiceGateway
>             .queryMetrics(timeout)
>             .whenCompleteAsync(
>                     (MetricDumpSerialization.MetricSerializationResult result, Throwable t) -> {
>                         if (t != null) {
>                             LOG.debug("Fetching metrics failed.", t);
>                         } else {
>                             metrics.addAll(deserializer.deserialize(result));
>                         }
>                     },
>                     executor);
> } {code}
> The only condition for fetching metrics is that the time since the last 
> update is larger than updateInterval
> {code:java}
> public void update() {
>     synchronized (this) {
>         long currentTime = System.currentTimeMillis();
>         if (currentTime - lastUpdateTime > updateInterval) {
>             lastUpdateTime = currentTime;
>             fetchMetrics();
>         }
>     }
> } {code}
> Therefore, if we cannot process the data within the update interval, 
> metrics data will accumulate.
> Besides, webMonitorEndpoint, restHandlers and metrics share the thread pool. 
> When we open the UI, it may be even worse.
> {code:java}
> final ScheduledExecutorService executor =
>         WebMonitorEndpoint.createExecutorService(
>                 configuration.getInteger(RestOptions.SERVER_NUM_THREADS),
>                 configuration.getInteger(RestOptions.SERVER_THREAD_PRIORITY),
>                 "DispatcherRestEndpoint");
> final long updateInterval =
>         configuration.getLong(MetricOptions.METRIC_FETCHER_UPDATE_INTERVAL);
> final MetricFetcher metricFetcher =
>         updateInterval == 0
>                 ? VoidMetricFetcher.INSTANCE
>                 : MetricFetcherImpl.fromConfiguration(
>                         configuration,
>                         metricQueryServiceRetriever,
>                         dispatcherGatewayRetriever,
>                         executor);
> webMonitorEndpoint =
>         restEndpointFactory.createRestEndpoint(
>                 configuration,
>                 dispatcherGatewayRetriever,
>                 resourceManagerGatewayRetriever,
>                 blobServer,
>                 executor,
>                 metricFetcher,
>                 highAvailabilityServices.getClusterRestEndpointLeaderElectionService(),
>                 fatalErrorHandler); {code}
>  
>  
>  
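
The pile-up described in the quoted issue can be modelled in a few lines. The
sketch below is not Flink code: MetricPileUpSketch, its injected clock value
and the latch that simulates slow processing are all assumptions made for the
illustration. It keeps only the time-based gate from update() and shows that
tasks queue up when processing takes longer than the interval.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustration only: a hypothetical fetcher gated purely by elapsed time. */
class MetricPileUpSketch {
    private final ThreadPoolExecutor executor;
    private final long updateInterval;
    private final CountDownLatch slowProcessing;
    private long lastUpdateTime;

    MetricPileUpSketch(
            ThreadPoolExecutor executor, long updateInterval, CountDownLatch slowProcessing) {
        this.executor = executor;
        this.updateInterval = updateInterval;
        this.slowProcessing = slowProcessing;
    }

    /** Same kind of gate as update() in the issue; the clock is passed in to stay deterministic. */
    synchronized void update(long currentTime) {
        if (currentTime - lastUpdateTime > updateInterval) {
            lastUpdateTime = currentTime;
            executor.execute(() -> {
                try {
                    slowProcessing.await(); // simulates processing slower than the interval
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    /** Issues the given number of updates and returns how many tasks are left waiting in the queue. */
    static int queuedAfter(int updates) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch slowProcessing = new CountDownLatch(1);
        MetricPileUpSketch fetcher = new MetricPileUpSketch(executor, 10L, slowProcessing);
        for (int i = 1; i <= updates; i++) {
            fetcher.update(i * 11L); // every call is just past the interval
        }
        int queued = executor.getQueue().size();
        slowProcessing.countDown();
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return queued;
    }

    public static void main(String[] args) {
        System.out.println("queued tasks: " + queuedAfter(5));
    }
}
```

In this model the first task occupies the only worker thread and every later
update adds another queued task, because the gate looks only at the clock and
not at whether earlier work has finished.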



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-26555:
--
Attachment: image-2022-03-31-16-39-44-189.png

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-39-44-189.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-26555:
--
Attachment: image-2022-03-31-16-42-56-530.png

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-39-44-189.png, 
> image-2022-03-31-16-42-56-530.png, image-2022-03-31-16-43-56-322.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-26555:
--
Attachment: image-2022-03-31-16-43-56-322.png

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-39-44-189.png, 
> image-2022-03-31-16-42-56-530.png, image-2022-03-31-16-43-56-322.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515188#comment-17515188
 ] 

Yuxin Tan commented on FLINK-26555:
---

[~mapohl] Hello, regarding this issue, we encountered the exception below. 
Could you please help take a look?

When a job finishes in a session cluster, flushing the job result may fail.
!image-2022-03-31-16-43-56-322.png|width=819,height=423!
{code:java}
mapper.writeValue(os, new JsonJobResultEntry(jobResultEntry)); {code}
Regarding this line in the patch, I checked the source code and found that it 
calls the method 
{code:java}
this._writeValueAndClose(this.createGenerator(out, JsonEncoding.UTF8), value); 
{code}
 and a _UTF8JsonGenerator_ is initialized and used. 
{code:java}
protected final void _writeValueAndClose(JsonGenerator g, Object value) throws IOException {
    SerializationConfig cfg = this.getSerializationConfig();
    if (cfg.isEnabled(SerializationFeature.CLOSE_CLOSEABLE) && value instanceof Closeable) {
        this._writeCloseable(g, value, cfg);
    } else {
        try {
            this._serializerProvider(cfg).serializeValue(g, value);
        } catch (Exception var5) {
            ClassUtil.closeOnFailAndThrowAsIOE(g, var5);
            return;
        }

        g.close();
    }
} {code}
_UTF8JsonGenerator#close_ is called at the end, and I found that the 
OutputStream may be closed in that method when certain features of the JSON 
generator are enabled.
{code:java}
public void close() throws IOException {
    ...
    if (this._outputStream != null) {
        if (!this._ioContext.isResourceManaged() && !this.isEnabled(Feature.AUTO_CLOSE_TARGET)) {
            if (this.isEnabled(Feature.FLUSH_PASSED_TO_STREAM)) {
                this._outputStream.flush();
            }
        } else {
            this._outputStream.close();
        }
    }
    ...
} {code}
If the output stream is closed after {_}writeValue{_}, the above 
_ClosedChannelException_ may be thrown when the newly added _flush_ call runs. 
{code:java}
os.flush(); {code}
I found that this patch changed the initialization of the output stream to the 
try-with-resources style. Generally, the data will be flushed before the file 
system is closed. Could we delete the line _os.flush();_ to avoid the 
exception?

[~mapohl], WDYT about the exception, and could you take a look when you have 
time? If I missed something, please correct me. Thanks very 
much.
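
To make the failure mode concrete, here is a minimal sketch that uses only the
JDK. StrictStream and writeAndClose are hypothetical stand-ins: the first
models a stream that rejects flush() once it has been closed, the second
models a writer that closes its target after writing, as described above for
the JSON generator. Neither is the Flink or Jackson code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.ClosedChannelException;
import java.nio.charset.StandardCharsets;

/** Illustration only: models a flush() call on a stream that a writer has already closed. */
class FlushAfterCloseSketch {

    /** Hypothetical stream that rejects write() and flush() after close(). */
    static final class StrictStream extends OutputStream {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        private boolean closed;

        @Override
        public void write(int b) throws IOException {
            ensureOpen();
            sink.write(b);
        }

        @Override
        public void flush() throws IOException {
            ensureOpen();
        }

        @Override
        public void close() {
            closed = true; // idempotent, so try-with-resources may close it again
        }

        private void ensureOpen() throws IOException {
            if (closed) {
                throw new ClosedChannelException();
            }
        }
    }

    /** Stand-in for a writer that closes the target stream when it is done. */
    static void writeAndClose(OutputStream os, String value) throws IOException {
        os.write(value.getBytes(StandardCharsets.UTF_8));
        os.close();
    }

    /** Returns true if the extra flush() after writing fails with ClosedChannelException. */
    static boolean extraFlushFails() {
        try (StrictStream os = new StrictStream()) {
            writeAndClose(os, "{}");
            os.flush(); // the stream is already closed at this point
            return false;
        } catch (ClosedChannelException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if writing without the extra flush() completes normally. */
    static boolean withoutExtraFlushSucceeds() {
        try (StrictStream os = new StrictStream()) {
            writeAndClose(os, "{}");
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(extraFlushFails() + " " + withoutExtraFlushSucceeds());
    }
}
```

Whether the real stream behaves like StrictStream depends on the file system
implementation in use, so this only shows why an explicit flush after the
write call can be risky when the writer may close its target.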

 

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-39-44-189.png, 
> image-2022-03-31-16-42-56-530.png, image-2022-03-31-16-43-56-322.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-26555:
--
Attachment: (was: image-2022-03-31-16-42-56-530.png)

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-43-56-322.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}





[jira] [Updated] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-26555:
--
Attachment: (was: image-2022-03-31-16-39-44-189.png)

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-43-56-322.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}





[jira] [Commented] (FLINK-26555) Missing close in FileSystemJobResultStore

2022-03-31 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515289#comment-17515289
 ] 

Yuxin Tan commented on FLINK-26555:
---

[~mapohl] Thanks for your quick reply. We are looking forward to the follow-up 
issue being resolved. 

> Missing close in FileSystemJobResultStore
> -
>
> Key: FLINK-26555
> URL: https://issues.apache.org/jira/browse/FLINK-26555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-03-31-16-43-56-322.png
>
>
> {{FileSystemJobResultStore.createDirtyResultInternal}} does not close the 
> opened {{OutputStream}}





[jira] [Commented] (FLINK-22096) ServerTransportErrorHandlingTest.testRemoteClose fail

2021-10-08 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426446#comment-17426446
 ] 

Yuxin Tan commented on FLINK-22096:
---

Hi, I want to take a look at this test issue. Could someone help assign it to 
me? :D

> ServerTransportErrorHandlingTest.testRemoteClose fail 
> --
>
> Key: FLINK-22096
> URL: https://issues.apache.org/jira/browse/FLINK-22096
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.13.0, 1.14.0, 1.15.0
>Reporter: Guowei Ma
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.3, 1.15.0, 1.14.1
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15966&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0&l=6580
> {code:java}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.415 
> s <<< FAILURE! - in 
> org.apache.flink.runtime.io.network.netty.ServerTransportErrorHandlingTest
> [ERROR] 
> testRemoteClose(org.apache.flink.runtime.io.network.netty.ServerTransportErrorHandlingTest)
>   Time elapsed: 1.338 s  <<< ERROR!
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
>  bind(..) failed: Address already in use
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22096) ServerTransportErrorHandlingTest.testRemoteClose fail

2021-10-20 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432202#comment-17432202
 ] 

Yuxin Tan commented on FLINK-22096:
---

In the test case, a {{BindException}} may be thrown when initializing the Netty 
server. To solve this problem, a retry is added around the call to the 
{{initServerAndClient}} method, and I have submitted a PR with this change. 
What do you think, [~maguowei] [~kevin.cyj]? Please correct me if I missed 
anything, thanks.
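
The retry idea can be sketched with the JDK alone. This is only an illustration under stated assumptions, not the code from the PR: it uses a plain ServerSocket instead of the Netty server, and BindRetrySketch, bindWithRetry and the demo method are hypothetical names. The helper tries each candidate port in order and moves on to the next one when the bind fails with a BindException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.ServerSocket;

final class BindRetrySketch {
    /**
     * Hypothetical helper: binds to the first candidate port that is free.
     * A BindException ("Address already in use") triggers a retry with the next port.
     */
    static ServerSocket bindWithRetry(int... candidatePorts) throws IOException {
        BindException last = null;
        for (int port : candidatePorts) {
            try {
                return new ServerSocket(port);
            } catch (BindException e) {
                last = e; // port is taken, try the next candidate
            }
        }
        throw last != null ? last : new BindException("no candidate ports given");
    }

    /** Demo: the first candidate is deliberately occupied, the second (0) lets the OS pick a free port. */
    static boolean retrySucceedsWhenFirstPortIsTaken() {
        try (ServerSocket occupied = new ServerSocket(0);
                ServerSocket server = bindWithRetry(occupied.getLocalPort(), 0)) {
            return server.isBound() && server.getLocalPort() != occupied.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only BindException is retried here; any other IOException is treated as a real failure and propagates to the caller.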

> ServerTransportErrorHandlingTest.testRemoteClose fail 
> --
>
> Key: FLINK-22096
> URL: https://issues.apache.org/jira/browse/FLINK-22096
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.13.0, 1.14.0, 1.15.0
>Reporter: Guowei Ma
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.15.0, 1.14.1, 1.13.4
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15966&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0&l=6580
> {code:java}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.415 
> s <<< FAILURE! - in 
> org.apache.flink.runtime.io.network.netty.ServerTransportErrorHandlingTest
> [ERROR] 
> testRemoteClose(org.apache.flink.runtime.io.network.netty.ServerTransportErrorHandlingTest)
>   Time elapsed: 1.338 s  <<< ERROR!
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
>  bind(..) failed: Address already in use
> {code}





[jira] [Commented] (FLINK-32094) startScheduling.BATCH performance regression since May 11th

2023-05-15 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722696#comment-17722696
 ] 

Yuxin Tan commented on FLINK-32094:
---

Thanks [~martijnvisser] and [~Thesharing].

I reproduced the issue locally. 
To address it, I have submitted a 
PR (https://github.com/apache/flink/pull/22578) that prevents the creation of 
the tiered internal shuffle master until tiered storage is enabled. 
Furthermore, to ensure that the issue does not recur, we will address the 
regression before enabling tiered storage.

> startScheduling.BATCH performance regression since May 11th
> ---
>
> Key: FLINK-32094
> URL: https://issues.apache.org/jira/browse/FLINK-32094
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: Martijn Visser
>Assignee: Yuxin Tan
>Priority: Blocker
> Attachments: image-2023-05-14-22-58-00-886.png, 
> image-2023-05-15-12-33-56-319.png
>
>
> http://codespeed.dak8s.net:8000/timeline/#/?exe=5&ben=startScheduling.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32101) FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

2023-05-15 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-32101:
-

 Summary: FlinkKafkaInternalProducerITCase.testInitTransactionId 
test failed
 Key: FLINK-32101
 URL: https://issues.apache.org/jira/browse/FLINK-32101
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.18.0
Reporter: Yuxin Tan


FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973

{code:java}
Caused by: org.apache.kafka.common.KafkaException: 
org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
at 
java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.abortTransactions(FlinkKafkaProducer.java:1290)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.initializeState(FlinkKafkaProducer.java:1216)
at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:189)
at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:171)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:95)
at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747)
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
at 
org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1418)
at 
org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1322)
at 
org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
at 
org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
at 
org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:418)
at 
org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:316)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
... 1 more

{code}







[jira] [Updated] (FLINK-32101) FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

2023-05-15 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32101:
--
Description: 
FlinkKafkaInternalProducerITCase.testInitTransactionId test failed
```
Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
```

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973

{code:java}
Caused by: org.apache.kafka.common.KafkaException: 
org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
at 
java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.abortTransactions(FlinkKafkaProducer.java:1290)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.initializeState(FlinkKafkaProducer.java:1216)
at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:189)
at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:171)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:95)
at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747)
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
at 
org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1418)
at 
org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1322)
at 
org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
at 
org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
at 
org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:418)
at 
org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:316)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
... 1 more

{code}



  was:
FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973

{code:java}
Caused by: org.apache.kafka.common.KafkaException: 
org.apache.kafka.common.KafkaException: 

[jira] [Updated] (FLINK-32101) FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

2023-05-15 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32101:
--
Description: 
FlinkKafkaInternalProducerITCase.testInitTransactionId test failed.

  Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973

{code:java}
Caused by: org.apache.kafka.common.KafkaException: 
org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
at 
java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.abortTransactions(FlinkKafkaProducer.java:1290)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.initializeState(FlinkKafkaProducer.java:1216)
at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:189)
at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:171)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:95)
at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747)
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
at 
org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1418)
at 
org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1322)
at 
org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
at 
org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
at 
org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:418)
at 
org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:316)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
... 1 more

{code}



  was:
FlinkKafkaInternalProducerITCase.testInitTransactionId test failed
```
Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
InitProducerIdResponse; The request timed out.
```

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c

[jira] [Updated] (FLINK-32101) FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

2023-05-15 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32101:
--
Labels: test  (was: )

> FlinkKafkaInternalProducerITCase.testInitTransactionId test failed
> --
>
> Key: FLINK-32101
> URL: https://issues.apache.org/jira/browse/FLINK-32101
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.18.0
>Reporter: Yuxin Tan
>Priority: Major
>  Labels: test
>
> FlinkKafkaInternalProducerITCase.testInitTransactionId test failed.
>   Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973
> {code:java}
> Caused by: org.apache.kafka.common.KafkaException: 
> org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
>   at 
> java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
>   at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>   at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.abortTransactions(FlinkKafkaProducer.java:1290)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.initializeState(FlinkKafkaProducer.java:1216)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:189)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:171)
>   at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:95)
>   at 
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
>   at 
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688)
>   at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
>   at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1418)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1322)
>   at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
>   at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:418)
>  

[jira] [Updated] (FLINK-32101) FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

2023-05-15 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32101:
--
Labels: test-stability  (was: test)

> FlinkKafkaInternalProducerITCase.testInitTransactionId test failed
> --
>
> Key: FLINK-32101
> URL: https://issues.apache.org/jira/browse/FLINK-32101
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.18.0
>Reporter: Yuxin Tan
>Priority: Major
>  Labels: test-stability
>
> FlinkKafkaInternalProducerITCase.testInitTransactionId test failed.
>   Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973
> {code:java}
> Caused by: org.apache.kafka.common.KafkaException: 
> org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
>   at 
> java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
>   at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>   at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.abortTransactions(FlinkKafkaProducer.java:1290)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.initializeState(FlinkKafkaProducer.java:1216)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:189)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:171)
>   at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:95)
>   at 
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
>   at 
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688)
>   at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
>   at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1418)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1322)
>   at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
>   at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(

[jira] [Commented] (FLINK-32101) FlinkKafkaInternalProducerITCase.testInitTransactionId test failed

2023-05-15 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722966#comment-17722966
 ] 

Yuxin Tan commented on FLINK-32101:
---

[~mapohl] Yes, exactly. Thanks for tracking it.

> FlinkKafkaInternalProducerITCase.testInitTransactionId test failed
> --
>
> Key: FLINK-32101
> URL: https://issues.apache.org/jira/browse/FLINK-32101
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.18.0
>Reporter: Yuxin Tan
>Priority: Major
>  Labels: test
>
> FlinkKafkaInternalProducerITCase.testInitTransactionId test failed.
>   Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48990&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=22973
> {code:java}
> Caused by: org.apache.kafka.common.KafkaException: 
> org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
>   at 
> java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
>   at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>   at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.abortTransactions(FlinkKafkaProducer.java:1290)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.initializeState(FlinkKafkaProducer.java:1216)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:189)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:171)
>   at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:95)
>   at 
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
>   at 
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688)
>   at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
>   at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1418)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1322)
>   at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
>   at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
>   at 
> org.apache.kafka.clients.produce

[jira] [Commented] (FLINK-31418) SortMergeResultPartitionReadSchedulerTest.testRequestBufferTimeout failed on CI

2023-05-16 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723027#comment-17723027
 ] 

Yuxin Tan commented on FLINK-31418:
---

[~Sergey Nuyanzin] [~Weijie Guo], the patch is not in release-1.16. I created a 
PR to backport it: https://github.com/apache/flink/pull/22589

> SortMergeResultPartitionReadSchedulerTest.testRequestBufferTimeout failed on 
> CI
> ---
>
> Key: FLINK-31418
> URL: https://issues.apache.org/jira/browse/FLINK-31418
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.17.0, 1.16.1
>Reporter: Qingsheng Ren
>Assignee: Yuxin Tan
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.17.0, 1.18.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47077&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=8756]
> Error message:
> {code:java}
> Mar 13 05:22:10 [ERROR] Failures: 
> Mar 13 05:22:10 [ERROR]   
> SortMergeResultPartitionReadSchedulerTest.testRequestBufferTimeout:278 
> Mar 13 05:22:10 Expecting value to be true but was false{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-31418) SortMergeResultPartitionReadSchedulerTest.testRequestBufferTimeout failed on CI

2023-05-16 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723027#comment-17723027
 ] 

Yuxin Tan edited comment on FLINK-31418 at 5/16/23 7:31 AM:


[~Sergey Nuyanzin] [~Weijie Guo], the patch is not in release-1.16. I created 
a PR to backport it: https://github.com/apache/flink/pull/22589


was (Author: tanyuxin):
[~Sergey Nuyanzin][~Weijie Guo], The patch is not in release-1.16. I created a 
PR to backport it.https://github.com/apache/flink/pull/22589

> SortMergeResultPartitionReadSchedulerTest.testRequestBufferTimeout failed on 
> CI
> ---
>
> Key: FLINK-31418
> URL: https://issues.apache.org/jira/browse/FLINK-31418
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.17.0, 1.16.1
>Reporter: Qingsheng Ren
>Assignee: Yuxin Tan
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.17.0, 1.18.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47077&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=8756]
> Error message:
> {code:java}
> Mar 13 05:22:10 [ERROR] Failures: 
> Mar 13 05:22:10 [ERROR]   
> SortMergeResultPartitionReadSchedulerTest.testRequestBufferTimeout:278 
> Mar 13 05:22:10 Expecting value to be true but was false{code}





[jira] [Commented] (FLINK-32121) Avro Confluent Schema Registry nightly end-to-end test failed due to timeout

2023-05-17 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723437#comment-17723437
 ] 

Yuxin Tan commented on FLINK-32121:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49103&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3804

> Avro Confluent Schema Registry nightly end-to-end test failed due to timeout
> 
>
> Key: FLINK-32121
> URL: https://issues.apache.org/jira/browse/FLINK-32121
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.18.0
>Reporter: Wencong Liu
>Priority: Blocker
> Attachments: temp1.jpg, temp2.jpg
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49102&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d#:~:text=%5BFAIL%5D%20%27Avro%20Confluent%20Schema%20Registry]
>  
>  





[jira] [Commented] (FLINK-32123) Fix E2E kafka download timeout

2023-05-17 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723774#comment-17723774
 ] 

Yuxin Tan commented on FLINK-32123:
---

[~pgaref] Thanks for reporting this. I will take a look at this issue.

> Fix E2E kafka download timeout
> --
>
> Key: FLINK-32123
> URL: https://issues.apache.org/jira/browse/FLINK-32123
> Project: Flink
>  Issue Type: Bug
>Reporter: Panagiotis Garefalakis
>Priority: Major
>
> For the past few hours, E2E tests have been failing with: 'Avro Confluent Schema Registry 
> nightly end-to-end test' failed after 9 minutes and 53 seconds! Test exited 
> with exit code 1
> It looks like the [https://archive.apache.org/dist/kafka/] mirror is overloaded: 
> a local download took more than 30 minutes.
> Let's switch to the [https://downloads.apache.org|https://downloads.apache.org/] 
> mirror.
>  





[jira] [Assigned] (FLINK-32123) Fix E2E kafka download timeout

2023-05-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-32123:
-

Assignee: Yuxin Tan

> Fix E2E kafka download timeout
> --
>
> Key: FLINK-32123
> URL: https://issues.apache.org/jira/browse/FLINK-32123
> Project: Flink
>  Issue Type: Bug
>Reporter: Panagiotis Garefalakis
>Assignee: Yuxin Tan
>Priority: Major
>
> For the past few hours, E2E tests have been failing with: 'Avro Confluent Schema Registry 
> nightly end-to-end test' failed after 9 minutes and 53 seconds! Test exited 
> with exit code 1
> It looks like the [https://archive.apache.org/dist/kafka/] mirror is overloaded: 
> a local download took more than 30 minutes.
> Let's switch to the [https://downloads.apache.org|https://downloads.apache.org/] 
> mirror.
>  





[jira] [Commented] (FLINK-32121) Avro Confluent Schema Registry nightly end-to-end test failed due to timeout

2023-05-17 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723775#comment-17723775
 ] 

Yuxin Tan commented on FLINK-32121:
---

This looks like a duplicate of 
https://issues.apache.org/jira/browse/FLINK-32123

> Avro Confluent Schema Registry nightly end-to-end test failed due to timeout
> 
>
> Key: FLINK-32121
> URL: https://issues.apache.org/jira/browse/FLINK-32121
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.18.0
>Reporter: Wencong Liu
>Priority: Blocker
> Attachments: temp1.jpg, temp2.jpg
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49102&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d#:~:text=%5BFAIL%5D%20%27Avro%20Confluent%20Schema%20Registry]
>  
>  





[jira] [Updated] (FLINK-32123) Fix E2E kafka download timeout

2023-05-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32123:
--
Labels: test-stability  (was: )

> Fix E2E kafka download timeout
> --
>
> Key: FLINK-32123
> URL: https://issues.apache.org/jira/browse/FLINK-32123
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.18.0
>Reporter: Panagiotis Garefalakis
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: test-stability
>
> For the past few hours, E2E tests have been failing with: 'Avro Confluent Schema Registry 
> nightly end-to-end test' failed after 9 minutes and 53 seconds! Test exited 
> with exit code 1
> It looks like the [https://archive.apache.org/dist/kafka/] mirror is overloaded: 
> a local download took more than 30 minutes.
> Let's switch to the [https://downloads.apache.org|https://downloads.apache.org/] 
> mirror.
>  





[jira] [Updated] (FLINK-32123) Fix E2E kafka download timeout

2023-05-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32123:
--
Affects Version/s: 1.18.0

> Fix E2E kafka download timeout
> --
>
> Key: FLINK-32123
> URL: https://issues.apache.org/jira/browse/FLINK-32123
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.18.0
>Reporter: Panagiotis Garefalakis
>Assignee: Yuxin Tan
>Priority: Major
>
> For the past few hours, E2E tests have been failing with: 'Avro Confluent Schema Registry 
> nightly end-to-end test' failed after 9 minutes and 53 seconds! Test exited 
> with exit code 1
> It looks like the [https://archive.apache.org/dist/kafka/] mirror is overloaded: 
> a local download took more than 30 minutes.
> Let's switch to the [https://downloads.apache.org|https://downloads.apache.org/] 
> mirror.
>  





[jira] [Updated] (FLINK-32123) Fix E2E kafka download timeout

2023-05-17 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32123:
--
Component/s: Connectors / Kafka

> Fix E2E kafka download timeout
> --
>
> Key: FLINK-32123
> URL: https://issues.apache.org/jira/browse/FLINK-32123
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.18.0
>Reporter: Panagiotis Garefalakis
>Assignee: Yuxin Tan
>Priority: Major
>
> For the past few hours, E2E tests have been failing with: 'Avro Confluent Schema Registry 
> nightly end-to-end test' failed after 9 minutes and 53 seconds! Test exited 
> with exit code 1
> It looks like the [https://archive.apache.org/dist/kafka/] mirror is overloaded: 
> a local download took more than 30 minutes.
> Let's switch to the [https://downloads.apache.org|https://downloads.apache.org/] 
> mirror.
>  




