[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808008#comment-17808008 ]

dizhou cao commented on FLINK-34105:

[~zhuzh] Great suggestion. We've tried moving the shuffleDescriptor serialization logic on the JobMaster side to the futureExecutor so that it runs asynchronously. Moving the deserialization on the TM side to a separate thread pool would disrupt the original synchronous logic of the submission interface and could introduce additional risks. Not changing serialization on the TM side results in about a 30% performance degradation in our tests under OLAP scenarios. Separately, we plan to pursue batch submission optimizations for the task submission stage in OLAP scenarios. We intend to test the asynchronous serialization optimization internally, as its performance is roughly on par with running it in the Akka remote thread pool. Therefore, for this fix, we plan to move serialization on the JobMaster side to an asynchronous thread pool, while putting deserialization on the TM side back on the main thread. WDYT?

> Akka timeout happens in TPC-DS benchmarks
>
> Key: FLINK-34105
> URL: https://issues.apache.org/jira/browse/FLINK-34105
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Zhu Zhu
> Assignee: Yangze Guo
> Priority: Critical
> Attachments: image-2024-01-16-13-59-45-556.png
>
> We noticed Akka timeouts in 10TB TPC-DS benchmarks on 1.19. The
> problem did not happen in 1.18.0.
> After bisecting, we found that the problem was introduced in FLINK-33532.
> !image-2024-01-16-13-59-45-556.png|width=800!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33467) Support concurrent serialization in akka
dizhou cao created FLINK-33467:

Summary: Support concurrent serialization in akka
Key: FLINK-33467
URL: https://issues.apache.org/jira/browse/FLINK-33467
Project: Flink
Issue Type: Sub-task
Components: Runtime / RPC
Reporter: dizhou cao

In high-QPS OLAP scenarios, there are a large number of RPC requests for deploying tasks and updating task status. Under this load, the serialization and deserialization performed by Akka threads become a bottleneck. Parallelizing Akka's serialization operations would improve its ability to accept and process a large number of RPC requests.
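To make the idea in this ticket concrete, here is a minimal sketch of serializing several RPC payloads concurrently on a dedicated thread pool rather than on a single thread. It is not Flink's or Akka's actual implementation: the class name `ParallelSerializationSketch`, the helper methods, the pool size, and the use of plain Java serialization are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; not part of Flink or Akka.
public class ParallelSerializationSketch {

    /** Serializes one payload with plain Java serialization (an assumption). */
    static byte[] serialize(Serializable payload) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(payload);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Submits every payload to the pool so serialization runs concurrently,
     * then collects the results in the original submission order.
     */
    static List<byte[]> serializeAll(List<? extends Serializable> payloads, ExecutorService pool) {
        List<CompletableFuture<byte[]>> futures = new ArrayList<>();
        for (Serializable payload : payloads) {
            futures.add(CompletableFuture.supplyAsync(() -> serialize(payload), pool));
        }
        List<byte[]> results = new ArrayList<>();
        for (CompletableFuture<byte[]> future : futures) {
            results.add(future.join()); // blocks until this payload is serialized
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // illustrative size
        try {
            List<byte[]> serialized =
                    serializeAll(List.of("deploy-task-request", "update-task-status-request"), pool);
            System.out.println(serialized.size() + " payloads serialized");
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the results aligned with the requests, which matters if the caller relies on message ordering.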
[jira] [Commented] (FLINK-33160) Log the remote address when an exception occurs in the PartitionRequestQueue
[ https://issues.apache.org/jira/browse/FLINK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779081#comment-17779081 ]

dizhou cao commented on FLINK-33160:

[~huweihua] I have sent a pull request. Could you please take a look at it?

> Log the remote address when an exception occurs in the PartitionRequestQueue
>
> Key: FLINK-33160
> URL: https://issues.apache.org/jira/browse/FLINK-33160
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
> Labels: pull-request-available
>
> Add the remote address to the exception-handling logs of the
> PartitionRequestQueue, so that network issues can be located via the
> network 5-tuple.
[jira] [Updated] (FLINK-33160) Log the remote address when an exception occurs in the PartitionRequestQueue
[ https://issues.apache.org/jira/browse/FLINK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dizhou cao updated FLINK-33160:

Summary: Log the remote address when an exception occurs in the PartitionRequestQueue (was: Print the remote address when an exception occurs in the PartitionRequestQueue)

> Log the remote address when an exception occurs in the PartitionRequestQueue
>
> Key: FLINK-33160
> URL: https://issues.apache.org/jira/browse/FLINK-33160
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
>
> Add the remote address to the exception-handling logs of the
> PartitionRequestQueue, so that network issues can be located via the
> network 5-tuple.
[jira] [Updated] (FLINK-33160) Print the remote address when an exception occurs in the PartitionRequestQueue
[ https://issues.apache.org/jira/browse/FLINK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dizhou cao updated FLINK-33160:

Description: Add the remote address to the exception-handling logs of the PartitionRequestQueue, so that network issues can be located via the network 5-tuple. (was: Add the remote address to the exception handling of the PartitionRequestQueue, so that network issues can be located via the network 5-tuple.)

> Print the remote address when an exception occurs in the PartitionRequestQueue
>
> Key: FLINK-33160
> URL: https://issues.apache.org/jira/browse/FLINK-33160
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
>
> Add the remote address to the exception-handling logs of the
> PartitionRequestQueue, so that network issues can be located via the
> network 5-tuple.
[jira] [Created] (FLINK-33160) Print the remote address when an exception occurs in the PartitionRequestQueue
dizhou cao created FLINK-33160:

Summary: Print the remote address when an exception occurs in the PartitionRequestQueue
Key: FLINK-33160
URL: https://issues.apache.org/jira/browse/FLINK-33160
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: dizhou cao

Add the remote address to the exception handling of the PartitionRequestQueue, so that network issues can be located via the network 5-tuple.
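As a rough illustration of what this ticket asks for, the sketch below builds an exception log message that carries both endpoints of the connection. It is not the PartitionRequestQueue code: the class `RemoteAddressLogSketch`, the `describe` helper, the message wording, and the addresses are hypothetical. In a Netty handler, the remote endpoint would typically come from `ctx.channel().remoteAddress()`; plain `InetSocketAddress` values stand in for it here so the example needs no Netty dependency.

```java
import java.io.IOException;
import java.net.InetSocketAddress;

// Hypothetical sketch; not the actual PartitionRequestQueue implementation.
public class RemoteAddressLogSketch {

    /**
     * Formats a log line that includes protocol, local address and port, and
     * remote address and port, so a failure can be matched to a connection.
     */
    static String describe(
            String protocol, InetSocketAddress local, InetSocketAddress remote, Throwable cause) {
        return String.format(
                "Exception while handling partition requests (%s %s:%d -> %s:%d): %s",
                protocol,
                local.getHostString(), local.getPort(),
                remote.getHostString(), remote.getPort(),
                cause);
    }

    public static void main(String[] args) {
        // createUnresolved avoids any DNS lookup; the addresses are made up.
        InetSocketAddress local = InetSocketAddress.createUnresolved("10.0.0.1", 6121);
        InetSocketAddress remote = InetSocketAddress.createUnresolved("10.0.0.2", 45678);
        System.out.println(
                describe("tcp", local, remote, new IOException("Connection reset by peer")));
    }
}
```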
[jira] [Updated] (FLINK-32191) Support for configuring tcp keepalive related parameters.
[ https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dizhou cao updated FLINK-32191:

Summary: Support for configuring tcp keepalive related parameters. (was: Support for configuring keepalive related parameters.)

> Support for configuring tcp keepalive related parameters.
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
>
> We encountered a case in our production environment where the Netty client
> was unable to send data to the server because of a fault in the switch
> link. However, the client can only detect the fault after retransmissions
> fail at the RTO timeout, which takes about 15 minutes in our production
> environment. This can leave the job unavailable for 15 minutes. We would
> like to fail over and reschedule the job more quickly. Flink already
> enables keepalive, but the default keepalive idle time is 2 hours. We can
> adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE,
> TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported
> by Netty.
[jira] [Updated] (FLINK-32191) Support for configuring keepalive related parameters.
[ https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dizhou cao updated FLINK-32191:

Description: We encountered a case in our production environment where the Netty client was unable to send data to the server because of a fault in the switch link. However, the client can only detect the fault after retransmissions fail at the RTO timeout, which takes about 15 minutes in our production environment. This can leave the job unavailable for 15 minutes. We would like to fail over and reschedule the job more quickly. Flink already enables keepalive, but the default keepalive idle time is 2 hours. We can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by Netty.

(was: We encountered a case in our production environment where the Netty client was unable to send data downstream because of a fault in the switch link. However, the client can only detect the fault after retransmissions fail at the RTO timeout, which takes about 15 minutes in our production environment. This can leave the job unavailable for 15 minutes. We would like to fail over and reschedule the job more quickly. Flink already enables keepalive, but the default keepalive idle time is 2 hours. We can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by Netty.)

> Support for configuring keepalive related parameters.
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
>
> We encountered a case in our production environment where the Netty client
> was unable to send data to the server because of a fault in the switch
> link. However, the client can only detect the fault after retransmissions
> fail at the RTO timeout, which takes about 15 minutes in our production
> environment. This can leave the job unavailable for 15 minutes. We would
> like to fail over and reschedule the job more quickly. Flink already
> enables keepalive, but the default keepalive idle time is 2 hours. We can
> adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE,
> TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported
> by Netty.
[jira] [Updated] (FLINK-32191) Support for configuring keepalive related parameters.
[ https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dizhou cao updated FLINK-32191:

Description: We encountered a case in our production environment where the Netty client was unable to send data downstream because of a fault in the switch link. However, the client can only detect the fault after retransmissions fail at the RTO timeout, which takes about 15 minutes in our production environment. This can leave the job unavailable for 15 minutes. We would like to fail over and reschedule the job more quickly. Flink already enables keepalive, but the default keepalive idle time is 2 hours. We can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by Netty.

(was: We encountered a case in our production environment where the upstream was unable to send data downstream because of a fault in the switch link. However, the upstream can only detect the fault after retransmissions fail at the RTO timeout, which takes about 15 minutes in our production environment. This can leave the job unavailable for 15 minutes. We would like to fail over and reschedule the job more quickly. Flink already enables keepalive, but the default keepalive idle time is 2 hours. We can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by Netty.)

> Support for configuring keepalive related parameters.
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
>
> We encountered a case in our production environment where the Netty client
> was unable to send data downstream because of a fault in the switch link.
> However, the client can only detect the fault after retransmissions fail
> at the RTO timeout, which takes about 15 minutes in our production
> environment. This can leave the job unavailable for 15 minutes. We would
> like to fail over and reschedule the job more quickly. Flink already
> enables keepalive, but the default keepalive idle time is 2 hours. We can
> adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE,
> TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported
> by Netty.
[jira] [Created] (FLINK-32191) Support for configuring keepalive related parameters.
dizhou cao created FLINK-32191:

Summary: Support for configuring keepalive related parameters.
Key: FLINK-32191
URL: https://issues.apache.org/jira/browse/FLINK-32191
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: dizhou cao

We encountered a case in our production environment where the upstream was unable to send data downstream because of a fault in the switch link. However, the upstream can only detect the fault after retransmissions fail at the RTO timeout, which takes about 15 minutes in our production environment. This can leave the job unavailable for 15 minutes. We would like to fail over and reschedule the job more quickly. Flink already enables keepalive, but the default keepalive idle time is 2 hours. We can adjust the TCP keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by Netty.
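The three options named in the ticket can be illustrated with the JDK's own socket API, where they exist as `jdk.net.ExtendedSocketOptions` constants (Java 11+, platform-dependent). The sketch below is not Flink's configuration mechanism: the class `KeepaliveSketch`, both helper methods, and the values 60/10/3 are assumptions chosen for illustration. It also shows the arithmetic behind the tuning: an idle dead peer is detected after roughly the idle time plus one probe interval per unanswered probe.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

// Hypothetical sketch; not Flink's or Netty's configuration code.
public class KeepaliveSketch {

    /** Approximate time until an idle dead peer is detected, in seconds. */
    static int approxDetectionSeconds(int idleSeconds, int intervalSeconds, int probeCount) {
        return idleSeconds + intervalSeconds * probeCount;
    }

    /**
     * Enables keepalive and applies the three tuning options if the platform
     * supports them. Returns false when the extended options are unavailable.
     */
    static boolean tryConfigure(Socket socket, int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        if (!socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false;
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Typical Linux defaults: 7200 s idle, 75 s interval, 9 probes.
        System.out.println("defaults: ~" + approxDetectionSeconds(7200, 75, 9) + " s");
        // Illustrative tuned values, not a recommendation.
        System.out.println("tuned:    ~" + approxDetectionSeconds(60, 10, 3) + " s");
        try (Socket socket = new Socket()) { // unconnected; options only
            System.out.println("extended options applied: " + tryConfigure(socket, 60, 10, 3));
        }
    }
}
```

With the illustrative values, detection drops from about 7875 seconds to about 90 seconds for an idle connection.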
[jira] [Commented] (FLINK-18112) Approximate Task-Local Recovery -- Milestone One
[ https://issues.apache.org/jira/browse/FLINK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724841#comment-17724841 ]

dizhou cao commented on FLINK-18112:

Hi [~ym], we are very interested in this feature and greatly appreciate your work on this ticket. However, it has not been updated for over a year. Has there been any recent progress?

> Approximate Task-Local Recovery -- Milestone One
>
> Key: FLINK-18112
> URL: https://issues.apache.org/jira/browse/FLINK-18112
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing, Runtime / Coordination, Runtime / Network
> Affects Versions: 1.12.0
> Reporter: Yuan Mei
> Priority: Minor
> Labels: auto-deprioritized-major, auto-unassigned
> Attachments: image-2021-12-06-16-39-21-604.png, image-2021-12-14-10-30-26-486.png
>
> This is the Jira ticket for Milestone One of [FLIP-135 Approximate Task-Local
> Recovery|https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery]
> In short, in Approximate Task-Local Recovery, if a task fails, only the
> failed task restarts without affecting the rest of the job. To ease
> discussion, we divide the problem of approximate task-local recovery into
> three parts, with each part focusing on a specific set of problems.
> This Jira ticket focuses on addressing the first milestone.
> Milestone One: sink recovery. Here a sink task is a task with no consumers
> reading data from it. In this scenario, if a sink vertex fails, the sink is
> restarted from the last successfully completed checkpoint and data loss is
> expected. If a non-sink vertex fails, a regional failover strategy takes
> place. In milestone one, we focus on issues related to task failure handling
> and upstream reconnection.
>
> Milestone one includes two parts of change:
> *Part 1*: Network part: how the failed task is able to link to the upstream
> Result(Sub)Partitions and continue processing data.
> *Part 2*: Scheduling part: a new failover strategy to restart only the sink
> when the sink fails.
[jira] [Created] (FLINK-27107) Typo in Task
dizhou cao created FLINK-27107:

Summary: Typo in Task
Key: FLINK-27107
URL: https://issues.apache.org/jira/browse/FLINK-27107
Project: Flink
Issue Type: Improvement
Components: Runtime / Task
Affects Versions: 1.15.0
Reporter: dizhou cao

Two small typos in Task:
* TaskCancelerWatchDog/TaskInterrupter field: executerThread -> executorThread
* TaskCanceler field: executer -> executor

-- This message was sent by Atlassian Jira (v8.20.1#820001)