[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808008#comment-17808008 ]
dizhou cao commented on FLINK-34105:
------------------------------------

[~zhuzh] Great suggestion. We have tried moving the shuffleDescriptor serialization on the JobMaster side to the futureExecutor so that it runs asynchronously. Moving the deserialization on the TM side to a separate thread pool, however, would disrupt the original synchronous logic of the submission interface and could introduce additional risks. The cost is that, without the corresponding change on the TM side, we see about a 30% performance degradation in our tests under OLAP scenarios.

Separately, we plan to advance batch submission optimizations for the task submission stage in OLAP scenarios. We also intend to test the asynchronous serialization optimization internally, since its performance is roughly consistent with running the serialization in the Akka remote thread pool.

Therefore, for this fix, we plan to move the serialization on the JobMaster side to an asynchronous thread pool, while putting the deserialization on the TM side back on the main thread. WDYT?

> Akka timeout happens in TPC-DS benchmarks
> -----------------------------------------
>
>                 Key: FLINK-34105
>                 URL: https://issues.apache.org/jira/browse/FLINK-34105
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0
>            Reporter: Zhu Zhu
>            Assignee: Yangze Guo
>            Priority: Critical
>         Attachments: image-2024-01-16-13-59-45-556.png
>
>
> We noticed akka timeout happens in 10TB TPC-DS benchmarks in 1.19. The
> problem did not happen in 1.18.0.
> After bisecting, we find the problem was introduced in FLINK-33532.
> !image-2024-01-16-13-59-45-556.png|width=800!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)