[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823929#comment-17823929 ]

Yangze Guo commented on FLINK-34105:

master: 7a709bf

> Akka timeout happens in TPC-DS benchmarks
>
> Key: FLINK-34105
> URL: https://issues.apache.org/jira/browse/FLINK-34105
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Zhu Zhu
> Assignee: Yangze Guo
> Priority: Critical
> Attachments: image-2024-01-16-13-59-45-556.png
>
> We noticed akka timeout happens in 10TB TPC-DS benchmarks in 1.19. The
> problem did not happen in 1.18.0.
> After bisecting, we find the problem was introduced in FLINK-33532.
> !image-2024-01-16-13-59-45-556.png|width=800!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812123#comment-17812123 ]

Zhu Zhu commented on FLINK-34105:

Thanks for the updates! [~guoyangze]
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812121#comment-17812121 ]

Yangze Guo commented on FLINK-34105:

[~lsdy] has some personal matters to attend to and will be back this Thursday.
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811867#comment-17811867 ]

Zhu Zhu commented on FLINK-34105:

Hi [~lsdy], what's the status of the fix?
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808457#comment-17808457 ]

Zhu Zhu commented on FLINK-34105:

[~lsdy] Sounds good to me. Feel free to open a PR for it.
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808008#comment-17808008 ]

dizhou cao commented on FLINK-34105:

[~zhuzh] Great suggestion. We have tried moving the shuffleDescriptor serialization logic on the JobMaster side to the futureExecutor so that it runs asynchronously. Moving the deserialization on the TM side to a separate thread pool would break the original synchronous logic of the submission interface and could introduce additional risk. Without the serialization change on the TM side, we see about a 30% performance degradation in our tests under OLAP scenarios. We also plan to push forward batch submission optimizations for the task submission stage in OLAP scenarios. We intend to test the asynchronous serialization optimization internally, as its performance is roughly on par with placing it in the Akka remote thread pool.

Therefore, for this fix, we plan to move the serialization on the JobMaster side to an asynchronous thread pool and move the deserialization on the TM side back to the main thread. WDYT?
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807554#comment-17807554 ]

Zhu Zhu commented on FLINK-34105:

I think it is a regression because existing jobs can become unstable.
Could we use a thread pool to parallelize the serialization before task submission, so that it is not counted as part of the pekko RPC process?
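For illustration only, the idea of serializing in parallel on a thread pool before the submission call could look roughly like the sketch below. It is not Flink code: the class and method names (`ParallelPreSerialization`, `serialize`, `serializeAll`) are hypothetical, plain Java serialization stands in for the real ShuffleDescriptor serialization and compression, and the pool size is arbitrary.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelPreSerialization {

    /** Serializes one payload with plain Java serialization (a stand-in for the real logic). */
    static byte[] serialize(Object payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Serializes all payloads on the given pool and returns the results in input order. */
    static List<byte[]> serializeAll(List<?> payloads, ExecutorService pool) {
        List<CompletableFuture<byte[]>> futures = new ArrayList<>();
        for (Object payload : payloads) {
            futures.add(CompletableFuture.supplyAsync(() -> serialize(payload), pool));
        }
        List<byte[]> result = new ArrayList<>();
        for (CompletableFuture<byte[]> future : futures) {
            result.add(future.join()); // wait for every payload before submitting
        }
        return result;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // arbitrary size for the sketch
        try {
            List<String> descriptors = List.of("descriptor-a", "descriptor-b", "descriptor-c");
            List<byte[]> blobs = serializeAll(descriptors, pool);
            // Only at this point would a (hypothetical) submission RPC be issued,
            // carrying bytes that are already serialized.
            System.out.println("serialized " + blobs.size() + " payloads before submission");
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the RPC would only carry ready-made bytes, the serialization cost is paid before any ask timer starts, while the pool still allows the payloads to be processed in parallel.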
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807526#comment-17807526 ]

Yangze Guo commented on FLINK-34105:

[~zhuzh] I encountered the same Akka timeout issue in my testing with a two-level join job using the same concurrency configuration. Adjusting pekko.ask.timeout indeed resolved this problem. I believe the root cause is that we moved the serialization and compression of ShuffleDescriptorGroup from the RPC main thread to Akka's serialization thread. The time spent on this operation is now included in the window monitored by pekko.ask.timeout.

Personally, I consider this an optimization rather than a problem for users, because the serialization thread is pooled, which allows the serialization and compression to run in parallel. Otherwise, each compression is executed sequentially in the main thread. This change will speed up job deployment, although for very large jobs users may need to adjust the configuration manually. WDYT?
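The root-cause explanation above can be modeled with a toy example. This is a sketch under loud assumptions, not Pekko or Flink code: `AskTimeoutWindow`, `slowSerialize`, and `askTimesOut` are hypothetical names, `Thread.sleep` stands in for serialization and compression, a `CompletableFuture` with `orTimeout` stands in for an ask with a timeout, and the millisecond values are illustrative only.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AskTimeoutWindow {

    /** Stand-in for expensive serialization and compression. */
    static byte[] slowSerialize(long costMillis) {
        try {
            Thread.sleep(costMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new byte[] {1};
    }

    /**
     * Simulates an ask whose timer starts when the request is handed to the pool.
     * If serializeInsideAsk is true, the serialization cost is paid after the timer starts;
     * otherwise it is paid up front, outside the timed window.
     * Returns true if the simulated ask timed out.
     */
    static boolean askTimesOut(boolean serializeInsideAsk, long costMillis,
                               long timeoutMillis, ExecutorService remotePool) {
        byte[] ready = serializeInsideAsk ? null : slowSerialize(costMillis);
        CompletableFuture<Integer> ask = CompletableFuture
                .supplyAsync(() -> {
                    byte[] bytes = (ready != null) ? ready : slowSerialize(costMillis);
                    return bytes.length; // pretend the receiver acknowledged immediately
                }, remotePool)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            ask.join();
            return false;
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return true;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        ExecutorService remotePool = Executors.newFixedThreadPool(2);
        try {
            System.out.println("serialization inside the ask window, timed out: "
                    + askTimesOut(true, 600, 200, remotePool));
            System.out.println("serialization before the ask, timed out: "
                    + askTimesOut(false, 600, 200, remotePool));
        } finally {
            remotePool.shutdown();
        }
    }
}
```

The same amount of work is done in both cases. The only difference is whether it falls inside the timed window, which is why raising the timeout and moving the serialization out of the RPC path both make the symptom disappear.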
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807505#comment-17807505 ]

Zhu Zhu commented on FLINK-34105:

We tried increasing {{pekko.ask.timeout}} to {{1min}} (from the default {{10s}}), and the problem did not happen again. So I guess it is not related to the framesize.
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807238#comment-17807238 ]

Yangze Guo commented on FLINK-34105:

We'll try to reproduce this issue in our environment.
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807234#comment-17807234 ]

Yangze Guo commented on FLINK-34105:

[~zhuzh] My gut feeling is that the change leads to payload overhead. Could you adjust pekko.framesize to 1g and rerun the benchmark?

BTW, the default value of pekko.framesize might be too small. AFAIK, many large companies such as Alibaba and ByteDance increase it in their production environments. WDYT?
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807203#comment-17807203 ]

Zhu Zhu commented on FLINK-34105:

[~guoyangze] pekko.framesize is not configured, so the default value should have been used. Here is the full configuration:
```
parallelism.default: 1500
slotmanager.number-of-slots.max: 1500
taskmanager.numberOfTaskSlots: 10
jobmanager.memory.process.size: 24000m
taskmanager.memory.process.size: 24000m
resourcemanager.taskmanager-timeout: 90
taskmanager.memory.network.fraction: 0.2
cluster.evenly-spread-out-slots: true
table.optimizer.join-reorder-enabled: true
table.optimizer.join.broadcast-threshold: 10485760
table.exec.operator-fusion-codegen.enabled: true
```
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807164#comment-17807164 ]

Yangze Guo commented on FLINK-34105:

Thanks for the pointer, [~zhuzh]. How is pekko.framesize configured in this cluster?

> Akka timeout happens in TPC-DS benchmarks
>
> Key: FLINK-34105
> URL: https://issues.apache.org/jira/browse/FLINK-34105
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Zhu Zhu
> Priority: Critical
> Attachments: image-2024-01-16-13-59-45-556.png
>
> We noticed akka timeout happens in 10TB TPC-DS benchmarks in 1.19. The
> problem did not happen in 1.18.0.
> After bisecting, we find the problem was introduced in FLINK-33532.
> !image-2024-01-16-13-59-45-556.png|width=800!
[jira] [Commented] (FLINK-34105) Akka timeout happens in TPC-DS benchmarks
[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807052#comment-17807052 ]

Zhu Zhu commented on FLINK-34105:

[~lsdy] [~guoyangze] could you help take a look?

> Akka timeout happens in TPC-DS benchmarks
>
> Key: FLINK-34105
> URL: https://issues.apache.org/jira/browse/FLINK-34105
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Zhu Zhu
> Priority: Major
> Attachments: image-2024-01-16-13-59-45-556.png
>
> We noticed akka timeout happens in 10TB TPC-DS benchmarks in 1.19. The
> problem did not happen in 1.18.0.
> After bisecting, we find the problem was introduced in FLINK-33532.
> !image-2024-01-16-13-59-45-556.png!