[jira] [Commented] (FLINK-32027) Batch jobs could hang at shuffle phase when max parallelism is really large

Weijie Guo (Jira) Mon, 08 May 2023 04:53:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-32027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720481#comment-17720481
 ]


Weijie Guo commented on FLINK-32027:
------------------------------------

Thanks [~yunta] for reporting this! 

After some analysis, [~kevin.cyj] and I found that this is indeed a bug 
caused by multiple threads simultaneously moving the position of the index 
file's {{FileChannel}}. I will fix this ASAP.
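
To illustrate this class of bug, here is a minimal sketch. It is not the actual Flink code or the eventual fix. The class name, the fixed 8-byte entry layout, and all method names are hypothetical. It contrasts a seek-then-read on a shared {{FileChannel}}, where another thread can move the position between the two calls, with the positional {{FileChannel.read(ByteBuffer, long)}}, which leaves the channel position untouched.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the "index file" here is just a sequence of 8-byte longs.
class IndexFileReadSketch {
    static final int ENTRY_SIZE = Long.BYTES;

    // Racy on a shared channel: position() and read() are two separate steps,
    // so another thread may move the position in between.
    static long readEntryBySeeking(FileChannel channel, int index) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(ENTRY_SIZE);
        channel.position((long) index * ENTRY_SIZE);
        while (buf.hasRemaining()) {
            if (channel.read(buf) < 0) throw new IOException("unexpected end of index file");
        }
        buf.flip();
        return buf.getLong();
    }

    // Positional read: the offset is passed explicitly and the shared
    // channel position is neither used nor modified.
    static long readEntryAt(FileChannel channel, int index) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(ENTRY_SIZE);
        long offset = (long) index * ENTRY_SIZE;
        while (buf.hasRemaining()) {
            if (channel.read(buf, offset + buf.position()) < 0) {
                throw new IOException("unexpected end of index file");
            }
        }
        buf.flip();
        return buf.getLong();
    }

    // Writes a temp file whose i-th entry holds the value i.
    static Path writeIndexFile(int entries) throws IOException {
        Path file = Files.createTempFile("index-sketch", ".bin");
        ByteBuffer buf = ByteBuffer.allocate(entries * ENTRY_SIZE);
        for (int i = 0; i < entries; i++) buf.putLong(i);
        buf.flip();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            while (buf.hasRemaining()) ch.write(buf);
        }
        return file;
    }

    // Several threads read every entry through one shared channel using
    // positional reads; returns how many reads saw an unexpected value.
    static int countPositionalReadMismatches(int entries, int threads) {
        AtomicInteger mismatches = new AtomicInteger();
        try {
            Path file = writeIndexFile(entries);
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                Thread[] workers = new Thread[threads];
                for (int t = 0; t < threads; t++) {
                    workers[t] = new Thread(() -> {
                        for (int i = 0; i < entries; i++) {
                            try {
                                if (readEntryAt(channel, i) != i) mismatches.incrementAndGet();
                            } catch (IOException e) {
                                mismatches.incrementAndGet();
                            }
                        }
                    });
                    workers[t].start();
                }
                for (Thread w : workers) w.join();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return mismatches.get();
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countPositionalReadMismatches(1024, 4));
    }
}
```

Guarding the seek-then-read pair with a lock on the channel would be another way to avoid the race. Which approach the real fix takes is not decided in this comment.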

> Batch jobs could hang at shuffle phase when max parallelism is really large
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-32027
>                 URL: https://issues.apache.org/jira/browse/FLINK-32027
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.0
>            Reporter: Yun Tang
>            Priority: Critical
>             Fix For: 1.17.1
>
>         Attachments: image-2023-05-08-11-12-58-361.png
>
>
> In batch execution mode with the adaptive batch scheduler, if we set the max 
> parallelism as large as 32768 (pipeline.max-parallelism), the job could hang at 
> the shuffle phase.
> It would hang for a long time and show "No bytes sent":
>  !image-2023-05-08-11-12-58-361.png! 
> After some debugging, we found that the downstream operator did not receive 
> the end-of-partition event.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
