[ 
https://issues.apache.org/jira/browse/FLINK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhilong Hong updated FLINK-24300:
---------------------------------
    Description: 
When running TPCDS on release 1.14, we find that jobs with a 
{{MultipleInputOperator}} run much more slowly than before. A binary search 
among the commits suggests that the issue was introduced by FLINK-23408.
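
For readers unfamiliar with the technique, the binary search over commits can be sketched roughly as follows. This is only an illustration: the class name {{CommitBisect}}, the method {{firstBad}} and the {{isBad}} check are hypothetical and not part of Flink or of the benchmark. It assumes the commits are ordered from oldest to newest, the first one is known to be good, the last one is known to be bad, and every commit after the first bad one is bad as well.

```java
import java.util.List;
import java.util.function.Predicate;

final class CommitBisect {

    /**
     * Returns the index of the first bad commit.
     *
     * Assumptions (hypothetical setup, see the text above): commits are ordered
     * oldest to newest, commits.get(0) is good, the last commit is bad, and
     * once a commit is bad all later commits are bad too.
     */
    static <T> int firstBad(List<T> commits, Predicate<T> isBad) {
        int good = 0;                 // newest commit known to be good
        int bad = commits.size() - 1; // oldest commit known to be bad
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(commits.get(mid))) {
                bad = mid;            // regression is at mid or earlier
            } else {
                good = mid;           // regression is after mid
            }
        }
        return bad;
    }
}
```

In practice {{isBad}} would build the given commit and run the slow query, so each probe is expensive; the point of the search is that it needs only about log2(n) probes for n commits.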

At the commit 64570e4c56955713ca599fd1d7ae7be891a314c6, the job in TPCDS runs 
normally, as the image below illustrates:

!64570e4c56955713ca599fd1d7ae7be891a314c6.png|width=600!

At the commit e3010c16947ed8da2ecb7d89a3aa08dacecc524a, the job q2.sql gets 
stuck for more than half an hour, as the image below illustrates:

!e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png|width=600!

The detail of the job is illustrated below:

!detail-of-the-job.png|width=600!

The job uses a {{MultipleInputOperator}} with one normal input and two chained 
FileSources. It has finished reading the normal input and has started to read 
the chained sources. Each chained source has one source data fetcher.

We captured a jstack dump of the stuck tasks and attached it below. 
[^jstack.txt] shows that the main thread is blocked waiting for a lock that is 
held by a source data fetcher. The source data fetcher is still running, but 
its stack stays in {{CompletableFuture.cleanStack}}.
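
The same kind of information that jstack shows (which thread is blocked, on which lock, and who holds it) can also be collected from inside a JVM through the standard {{java.lang.management}} API. The following is a minimal sketch, not something taken from Flink; the class name {{BlockedThreadReport}} is hypothetical, and it only reports threads blocked on a monitor whose owner is known.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class BlockedThreadReport {

    /** Lists every thread that is BLOCKED on a monitor, together with the lock owner. */
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // false/false: skip the lists of locked monitors and synchronizers;
        // the lock a thread waits for and its owner are still reported.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockOwnerName() != null) {
                report.add(info.getThreadName()
                        + " is blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }
}
```

Taking such a report several times in a row, like taking several jstack dumps, helps to tell a thread that holds the lock while making slow progress from one that is truly deadlocked.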

This issue happens in a batch job. However, judging from where it gets 
blocked, it seems to affect streaming jobs as well.

For reference, the TPCDS code we are running is located at 
[https://github.com/ververica/flink-sql-benchmark/tree/dev].

  was:
When we are running TPCDS with release 1.14 we find that the job with 
MultipleInputOperator is running much more slowly than before. With a binary 
search among the commits, we find that the issue may be introduced by 
FLINK-23408. 

At the commit 64570e4c56955713ca599fd1d7ae7be891a314c6, the job runs normally 
in TPCDS, as the image below illustrates:

!64570e4c56955713ca599fd1d7ae7be891a314c6.png|width=600!

At the commit e3010c16947ed8da2ecb7d89a3aa08dacecc524a, the job q2.sql gets 
stuck for a pretty long time (longer than half an hour), as the image below 
illustrates:

!e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png|width=600!

The detail of the job is illustrated below:

!detail-of-the-job.png|width=600!

The job uses a {{MultipleInputOperator}} with one normal input and two chained 
FileSources. It has finished reading the normal input and has started to read 
the chained sources. Each chained source has one source data fetcher.

We captured a jstack dump of the stuck tasks and attached it below. 
[^jstack.txt] shows that the main thread is blocked waiting for a lock that is 
held by a source data fetcher. The source data fetcher is still running, but 
its stack stays in {{CompletableFuture.cleanStack}}.

This issue happens in a batch job. However, judging from where it gets 
blocked, it seems to affect streaming jobs as well.

For reference, the TPCDS code we are running is located at 
[https://github.com/ververica/flink-sql-benchmark/tree/dev].


> MultipleInputOperator is running much more slowly in TPCDS
> ----------------------------------------------------------
>
>                 Key: FLINK-24300
>                 URL: https://issues.apache.org/jira/browse/FLINK-24300
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: Zhilong Hong
>            Priority: Major
>         Attachments: 64570e4c56955713ca599fd1d7ae7be891a314c6.png, 
> detail-of-the-job.png, e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png, 
> jstack-2.txt, jstack.txt
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
