[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks

2024-01-17 Thread dizhou cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808008#comment-17808008
 ] 

dizhou cao commented on FLINK-34105:


[~zhuzh] Great suggestion. We have tried moving the JobMaster-side 
serialization of the shuffleDescriptor to the futureExecutor so that it runs 
asynchronously. Moving the deserialization on the TM side to a separate thread 
pool would disrupt the original synchronous logic of the submission interface 
and could introduce additional risks. Without the serialization change on the 
TM side, our tests under OLAP scenarios show about a 30% performance 
degradation. We also plan to push forward batch-submission optimizations for 
the task submission stage in OLAP scenarios, and we intend to test the 
asynchronous serialization optimization internally, since its performance is 
roughly on par with running the serialization in the Akka remote thread pool. 
Therefore, for this fix, we plan to move the serialization on the JobMaster 
side to an asynchronous thread pool, while keeping the deserialization on the 
TM side on the main thread. WDYT?
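
A minimal sketch of the proposed split, assuming plain Java serialization and 
two hypothetical executors: the expensive serialization runs on a separate 
pool, and the result is handed back to a single "main thread" executor for the 
actual submission. The class and method names below are illustrative and are 
not Flink APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

/** Hypothetical sketch; the class and method names are not Flink APIs. */
final class AsyncSerializationSketch {

    /** Serializes a value with plain Java serialization. */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /**
     * Serializes the descriptor on serializationExecutor, then runs the
     * submit step on mainThreadExecutor with the serialized bytes.
     */
    static <R> CompletableFuture<R> serializeThenSubmit(
            Serializable descriptor,
            Executor serializationExecutor,
            Executor mainThreadExecutor,
            Function<byte[], R> submit) {
        return CompletableFuture
                // The expensive part runs off the main thread.
                .supplyAsync(() -> serialize(descriptor), serializationExecutor)
                // The submission itself is handed back to the main thread.
                .thenApplyAsync(submit, mainThreadExecutor);
    }
}
```

Because the submit step always runs on the second executor, code that assumes 
single-threaded access on the main thread does not need extra locking in this 
sketch.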

> Akka timeout happens in TPC-DS benchmarks
> -
>
> Key: FLINK-34105
> URL: https://issues.apache.org/jira/browse/FLINK-34105
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Zhu Zhu
>Assignee: Yangze Guo
>Priority: Critical
> Attachments: image-2024-01-16-13-59-45-556.png
>
>
> We noticed that Akka timeouts happen in 10TB TPC-DS benchmarks in 1.19. The 
> problem did not happen in 1.18.0.
> After bisecting, we found that the problem was introduced in FLINK-33532.
>  !image-2024-01-16-13-59-45-556.png|width=800! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33467) Support concurrent serialization in akka

2023-11-06 Thread dizhou cao (Jira)
dizhou cao created FLINK-33467:
--

 Summary: Support concurrent serialization in akka
 Key: FLINK-33467
 URL: https://issues.apache.org/jira/browse/FLINK-33467
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / RPC
Reporter: dizhou cao


In OLAP high-QPS scenarios, there are a large number of RPC requests for 
deploying tasks and updating task status. In these scenarios, the 
serialization and deserialization performed by Akka threads become a 
bottleneck. Supporting parallel serialization in Akka would improve its 
performance when accepting and processing a large number of RPC requests.
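
The idea can be sketched outside of Akka as follows, assuming plain Java 
serialization and a caller-supplied thread pool: each message is serialized as 
its own task, and the results are collected in the original order. The class 
and method names are hypothetical and do not correspond to Akka or Flink APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical sketch of serializing many RPC payloads in parallel. */
final class ParallelSerializationSketch {

    /** Serializes a single message with plain Java serialization. */
    static byte[] serialize(Serializable message) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Serializes all messages on the pool; output order matches input order. */
    static List<byte[]> serializeAll(List<? extends Serializable> messages, Executor pool) {
        List<CompletableFuture<byte[]>> futures = new ArrayList<>(messages.size());
        for (Serializable message : messages) {
            // One task per message, so a large payload does not block the others.
            futures.add(CompletableFuture.supplyAsync(() -> serialize(message), pool));
        }
        List<byte[]> payloads = new ArrayList<>(futures.size());
        for (CompletableFuture<byte[]> future : futures) {
            payloads.add(future.join());
        }
        return payloads;
    }
}
```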





[jira] [Commented] (FLINK-33160) Log the remote address when an exception occurs in the PartitionRequestQueue

2023-10-24 Thread dizhou cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779081#comment-17779081
 ] 

dizhou cao commented on FLINK-33160:


[~huweihua] I have sent a pull request. Could you please take a look at it? 

> Log the remote address when an exception occurs in the PartitionRequestQueue
> 
>
> Key: FLINK-33160
> URL: https://issues.apache.org/jira/browse/FLINK-33160
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>  Labels: pull-request-available
>
> Add the information of the remote address in the exception handling logs of 
> the PartitionRequestQueue, so that network issues can be located through the 
> network quintuple.





[jira] [Updated] (FLINK-33160) Log the remote address when an exception occurs in the PartitionRequestQueue

2023-09-28 Thread dizhou cao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dizhou cao updated FLINK-33160:
---
Summary: Log the remote address when an exception occurs in the 
PartitionRequestQueue  (was: Print the remote address when an exception occurs 
in the PartitionRequestQueue)

> Log the remote address when an exception occurs in the PartitionRequestQueue
> 
>
> Key: FLINK-33160
> URL: https://issues.apache.org/jira/browse/FLINK-33160
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>
> Add the information of the remote address in the exception handling logs of 
> the PartitionRequestQueue, so that network issues can be located through the 
> network quintuple.





[jira] [Updated] (FLINK-33160) Print the remote address when an exception occurs in the PartitionRequestQueue

2023-09-26 Thread dizhou cao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dizhou cao updated FLINK-33160:
---
Description: Add the information of the remote address in the exception 
handling logs of the PartitionRequestQueue, so that network issues can be 
located through the network quintuple.  (was: Add the information of the remote 
address in the exception handling of the PartitionRequestQueue, so that network 
issues can be located through the network quintuple.)

> Print the remote address when an exception occurs in the PartitionRequestQueue
> --
>
> Key: FLINK-33160
> URL: https://issues.apache.org/jira/browse/FLINK-33160
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>
> Add the information of the remote address in the exception handling logs of 
> the PartitionRequestQueue, so that network issues can be located through the 
> network quintuple.





[jira] [Created] (FLINK-33160) Print the remote address when an exception occurs in the PartitionRequestQueue

2023-09-26 Thread dizhou cao (Jira)
dizhou cao created FLINK-33160:
--

 Summary: Print the remote address when an exception occurs in the 
PartitionRequestQueue
 Key: FLINK-33160
 URL: https://issues.apache.org/jira/browse/FLINK-33160
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: dizhou cao


Add the information of the remote address in the exception handling of the 
PartitionRequestQueue, so that network issues can be located through the 
network quintuple.
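
A minimal sketch of such a log message, using only java.net types. In a Netty 
handler the two addresses would typically come from 
ctx.channel().localAddress() and ctx.channel().remoteAddress(). The helper 
class, the message wording, and the "TCP" protocol label are assumptions for 
illustration, not the actual change.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;

/** Hypothetical helper that renders a connection for exception logs. */
final class ConnectionLogSketch {

    /** Formats host:port without triggering a reverse DNS lookup. */
    static String format(SocketAddress address) {
        if (address instanceof InetSocketAddress) {
            InetSocketAddress inet = (InetSocketAddress) address;
            return inet.getHostString() + ":" + inet.getPort();
        }
        // Covers null and non-IP address types.
        return String.valueOf(address);
    }

    /** Protocol, local IP and port, remote IP and port. */
    static String describe(SocketAddress local, SocketAddress remote) {
        return "TCP " + format(local) + " -> " + format(remote);
    }

    /** Builds the message that would be passed to the logger. */
    static String exceptionMessage(SocketAddress local, SocketAddress remote, Throwable cause) {
        return "Encountered error on connection [" + describe(local, remote) + "]: " + cause;
    }
}
```

For a local address 10.0.0.1:40000 and a remote address 10.0.0.2:6121, 
describe returns "TCP 10.0.0.1:40000 -> 10.0.0.2:6121".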





[jira] [Updated] (FLINK-32191) Support for configuring tcp keepalive related parameters.

2023-05-25 Thread dizhou cao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dizhou cao updated FLINK-32191:
---
Summary: Support for configuring tcp keepalive related parameters.  (was: 
Support for configuring keepalive related parameters.)

> Support for configuring tcp keepalive related parameters.
> -
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>
> We encountered a case in our production environment where the Netty client 
> was unable to send data to the server due to an abnormality in the switch 
> link. However, the client can detect the abnormality only after 
> retransmission fails on RTO timeout, which takes about 15 minutes in our 
> production environment. This may result in 15 minutes of job unavailability. 
> We hope to perform failover and reschedule the job more quickly. Flink 
> already enables keepalive, but the default keepalive idle time is 2 hours. We 
> can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, 
> TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by 
> Netty.





[jira] [Updated] (FLINK-32191) Support for configuring keepalive related parameters.

2023-05-25 Thread dizhou cao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dizhou cao updated FLINK-32191:
---
Description: We encountered a case in our production environment where the 
Netty client was unable to send data to the server due to an abnormality in the 
switch link. However, the client can detect the abnormality only after 
retransmission fails on RTO timeout, which takes about 15 minutes in our 
production environment. This may result in 15 minutes of job unavailability. We 
hope to perform failover and reschedule the job more quickly. Flink already 
enables keepalive, but the default keepalive idle time is 2 hours. We can 
adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, 
and TCP_KEEPCOUNT. These options are already supported by Netty.  (was: 
We encountered a case in our production environment where the Netty client was 
unable to send data downstream due to an abnormality in the switch link. 
However, the client can detect the abnormality only after retransmission fails 
on RTO timeout, which takes about 15 minutes in our production environment. 
This may result in 15 minutes of job unavailability. We hope to perform 
failover and reschedule the job more quickly. Flink already enables keepalive, 
but the default keepalive idle time is 2 hours. We can adjust the TCP keepalive 
timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These 
options are already supported by Netty.)

> Support for configuring keepalive related parameters.
> -
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>
> We encountered a case in our production environment where the Netty client 
> was unable to send data to the server due to an abnormality in the switch 
> link. However, the client can detect the abnormality only after 
> retransmission fails on RTO timeout, which takes about 15 minutes in our 
> production environment. This may result in 15 minutes of job unavailability. 
> We hope to perform failover and reschedule the job more quickly. Flink 
> already enables keepalive, but the default keepalive idle time is 2 hours. We 
> can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, 
> TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by 
> Netty.





[jira] [Updated] (FLINK-32191) Support for configuring keepalive related parameters.

2023-05-25 Thread dizhou cao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dizhou cao updated FLINK-32191:
---
Description: We encountered a case in our production environment where the 
Netty client was unable to send data downstream due to an abnormality in the 
switch link. However, the client can detect the abnormality only after 
retransmission fails on RTO timeout, which takes about 15 minutes in our 
production environment. This may result in 15 minutes of job unavailability. We 
hope to perform failover and reschedule the job more quickly. Flink already 
enables keepalive, but the default keepalive idle time is 2 hours. We can 
adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, 
and TCP_KEEPCOUNT. These options are already supported by Netty.  (was: 
We encountered a case in our production environment where the upstream was 
unable to send data downstream due to an abnormality in the switch link. 
However, the upstream can detect the abnormality only after retransmission 
fails on RTO timeout, which takes about 15 minutes in our production 
environment. This may result in 15 minutes of job unavailability. We hope to 
perform failover and reschedule the job more quickly. Flink already enables 
keepalive, but the default keepalive idle time is 2 hours. We can adjust the 
TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and 
TCP_KEEPCOUNT. These options are already supported by Netty.)

> Support for configuring keepalive related parameters.
> -
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>
> We encountered a case in our production environment where the Netty client 
> was unable to send data downstream due to an abnormality in the switch link. 
> However, the client can detect the abnormality only after retransmission 
> fails on RTO timeout, which takes about 15 minutes in our production 
> environment. This may result in 15 minutes of job unavailability. We hope to 
> perform failover and reschedule the job more quickly. Flink already enables 
> keepalive, but the default keepalive idle time is 2 hours. We can adjust the 
> TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and 
> TCP_KEEPCOUNT. These options are already supported by Netty.





[jira] [Created] (FLINK-32191) Support for configuring keepalive related parameters.

2023-05-25 Thread dizhou cao (Jira)
dizhou cao created FLINK-32191:
--

 Summary: Support for configuring keepalive related parameters.
 Key: FLINK-32191
 URL: https://issues.apache.org/jira/browse/FLINK-32191
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: dizhou cao


We encountered a case in our production environment where the upstream was 
unable to send data downstream due to an abnormality in the switch link. 
However, the upstream can detect the abnormality only after retransmission 
fails on RTO timeout, which takes about 15 minutes in our production 
environment. This may result in 15 minutes of job unavailability. We hope to 
perform failover and reschedule the job more quickly. Flink already enables 
keepalive, but the default keepalive idle time is 2 hours. We can adjust the 
TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and 
TCP_KEEPCOUNT. These options are already supported by Netty.
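
As a rough illustration of the three parameters, the sketch below sets them on 
a plain JDK SocketChannel through jdk.net.ExtendedSocketOptions (Java 11 or 
later) rather than through Netty, and computes how long an idle, dead 
connection can go unnoticed: idle time plus interval times probe count. The 
class name and the values 60/10/3 are assumptions chosen for illustration, and 
the extended options are applied only where the platform supports them.

```java
import java.io.IOException;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import jdk.net.ExtendedSocketOptions;

/** Hypothetical sketch; not the configuration code proposed for Flink. */
final class KeepAliveSketch {

    /** Seconds until keepalive reports an idle, dead connection. */
    static int detectionSeconds(int idleSeconds, int intervalSeconds, int probeCount) {
        return idleSeconds + intervalSeconds * probeCount;
    }

    /**
     * Enables keepalive and applies the tuning options.
     * Returns true only if all three extended options were applied.
     */
    static boolean configure(SocketChannel channel, int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        boolean tuned = trySet(channel, ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        tuned &= trySet(channel, ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        tuned &= trySet(channel, ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        return tuned;
    }

    /** Sets the option only if the platform advertises support for it. */
    private static boolean trySet(SocketChannel channel, SocketOption<Integer> option, int value)
            throws IOException {
        if (!channel.supportedOptions().contains(option)) {
            return false;
        }
        channel.setOption(option, value);
        return true;
    }
}
```

With an idle time of 60 seconds, an interval of 10 seconds, and 3 probes, 
detectionSeconds returns 90, compared with the 2-hour default idle time 
mentioned above.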





[jira] [Commented] (FLINK-18112) Approximate Task-Local Recovery -- Milestone One

2023-05-22 Thread dizhou cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724841#comment-17724841
 ] 

dizhou cao commented on FLINK-18112:


Hi [~ym], we are very interested in this feature and greatly appreciate your 
work on this ticket. However, it has not been updated for over a year. Has 
there been any recent progress?

> Approximate Task-Local Recovery -- Milestone One
> 
>
> Key: FLINK-18112
> URL: https://issues.apache.org/jira/browse/FLINK-18112
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Coordination, Runtime 
> / Network
>Affects Versions: 1.12.0
>Reporter: Yuan Mei
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-unassigned
> Attachments: image-2021-12-06-16-39-21-604.png, 
> image-2021-12-14-10-30-26-486.png
>
>
> This is the Jira ticket for Milestone One of [FLIP-135 Approximate Task-Local 
> Recovery|https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery]
> In short, in Approximate Task-Local Recovery, if a task fails, only the 
> failed task restarts without affecting the rest of the job. To ease 
> discussion, we divide the problem of approximate task-local recovery into 
> three parts, each of which addresses one set of problems. 
> This Jira ticket focuses on addressing the first milestone.
> Milestone One: sink recovery. Here, a sink task is a task with no consumers 
> reading data from it. In this scenario, if a sink vertex fails, the sink is 
> restarted from the last successfully completed checkpoint and data loss is 
> expected. If a non-sink vertex fails, a regional failover strategy takes 
> place. In milestone one, we focus on issues related to task failure handling 
> and upstream reconnection.
>  
> Milestone one includes two parts of change:
> *Part 1*: Network part: how the failed task is able to link to the upstream 
> Result(Sub)Partitions and continue processing data.
> *Part 2*: Scheduling part: a new failover strategy that restarts only the 
> sink when the sink fails.
>  





[jira] [Created] (FLINK-27107) Typo in Task

2022-04-07 Thread dizhou cao (Jira)
dizhou cao created FLINK-27107:
--

 Summary: Typo in Task
 Key: FLINK-27107
 URL: https://issues.apache.org/jira/browse/FLINK-27107
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Affects Versions: 1.15.0
Reporter: dizhou cao


Two small typos in Task:
 * TaskCancelerWatchDog/TaskInterrupter field: executerThread -> executorThread 
 * TaskCanceler field: executer -> executor

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)