Yingjie Cao created FLINK-12329:
-----------------------------------

             Summary: Netty thread deadlock bug of the SpilledSubpartitionView
                 Key: FLINK-12329
                 URL: https://issues.apache.org/jira/browse/FLINK-12329
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.8.0
            Reporter: Yingjie Cao
             Fix For: 1.9.0


The Netty thread may be blocked when using the blocking batch mode of FLINK. In 
my opinion, the combination of several designs, including request buffer 
blocking, zero coy (the buffer is recycled only after it is sent out to 
network), limited number of buffers (only two buffers and not configurable) and 
backlog data is not ready (must request buffer and read, pipeline mode dose not 
have the problem), leads to this bug.

The flowing processing flow of Netty thread can block itself (note that 
writeAndFlush dose not mean the buffer is send out to the network).

1. request and read the first buffer -> write and flush the first buffer -> 
send the first buffer to network -> request and read the second buffer ->  
write and flush the first buffer  -> no credit -> add credit -> request and 
read buffer -> blocking (the second buffer is not sent out)

2. no credit -> add credit -> request and read buffer -> write and flush the 
buffer -> no credit -> add credit ->  request and read buffer -> blocking (the 
previous read buffer is not sent out)

How to reproduce?

The bug is easy to be reproduced, two vertices with a blocking edge can 
reproduce it. Large parallelism, small number of slots per TM and large data 
volume make it easy to reproduce the bug.Setting the parallelism to 100, the 
number of slots per TM to 1 and more than 10M data per subpartition will be ok.

How to fix?

The new mmappartition implementation can fix this problem because the number of 
buffers is not limited and the data is not loaded until sent.

The bug can be also fixed based on the old implementation. Firstly, the buffer 
request should not be blocking. Besides, the NetworkSequenceViewReader should 
enqueue as available reader when it is available for read and is not registered 
currently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to