[ 
https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808008#comment-17808008
 ] 

dizhou cao commented on FLINK-34105:
------------------------------------

[~zhuzh] Great suggestion. We tried moving the shuffleDescriptor serialization 
logic on the JobMaster side to the futureExecutor so that it runs 
asynchronously. Moving the deserialization on the TM side to a separate thread 
pool would break the original synchronous logic of the submission interface 
and could introduce additional risks. In our tests under OLAP scenarios, 
leaving the TM side unmodified results in a performance degradation of about 
30%. Separately, we plan to work on batch submission optimizations for the 
task submission stage in OLAP scenarios. We intend to test the asynchronous 
serialization optimization internally, as its performance is roughly 
consistent with placing the serialization in the Akka remote thread pool. 
Therefore, for this fix, we plan to move the serialization on the JobMaster 
side to an asynchronous thread pool and to put the deserialization on the TM 
side back on the main thread. WDYT?
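
To make the proposal concrete, here is a rough sketch of the split: the sender 
serializes on a separate executor and the receiver deserializes synchronously 
on the calling thread. It is only an illustration under assumptions, not the 
actual patch. The Descriptor class, serializeAsync, and deserializeSync are 
hypothetical names, plain Java serialization stands in for Flink's own 
serialization utilities, and a fixed thread pool stands in for the 
futureExecutor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncSerializationSketch {

    /** Hypothetical stand-in for a shuffle descriptor; not a Flink class. */
    static final class Descriptor implements Serializable {
        private static final long serialVersionUID = 1L;
        final String partitionId;

        Descriptor(String partitionId) {
            this.partitionId = partitionId;
        }
    }

    /** Sender side: serialize on the given executor instead of the calling thread. */
    static CompletableFuture<byte[]> serializeAsync(Descriptor descriptor, Executor executor) {
        return CompletableFuture.supplyAsync(() -> {
            try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                 ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(descriptor);
                out.flush();
                return bytes.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, executor);
    }

    /** Receiver side: deserialize synchronously on the calling thread. */
    static Descriptor deserializeSync(byte[] payload) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (Descriptor) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a small fixed pool plays the role of the futureExecutor.
        ExecutorService futureExecutor = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<byte[]> payload =
                    serializeAsync(new Descriptor("partition-0"), futureExecutor);
            // The caller is free to do other work until the payload is needed.
            Descriptor received = deserializeSync(payload.get());
            System.out.println(received.partitionId);
        } finally {
            futureExecutor.shutdown();
        }
    }
}
```

In this sketch only the sender returns a future, so the receiving call stays 
synchronous from the point of view of its caller, which is the property we 
want to preserve for the submission interface.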

> Akka timeout happens in TPC-DS benchmarks
> -----------------------------------------
>
>                 Key: FLINK-34105
>                 URL: https://issues.apache.org/jira/browse/FLINK-34105
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0
>            Reporter: Zhu Zhu
>            Assignee: Yangze Guo
>            Priority: Critical
>         Attachments: image-2024-01-16-13-59-45-556.png
>
>
> We noticed that Akka timeouts happen in 10TB TPC-DS benchmarks on 1.19. The 
> problem did not happen in 1.18.0.
> After bisecting, we found that the problem was introduced in FLINK-33532.
>  !image-2024-01-16-13-59-45-556.png|width=800! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
