[ https://issues.apache.org/jira/browse/FLINK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang updated FLINK-16403:
-----------------------------
    Release Note:   (was: Duplicated with FLINK-16404)

> Solve the potential deadlock problem when reducing exclusive buffers to zero
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-16403
>                 URL: https://issues.apache.org/jira/browse/FLINK-16403
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: Zhijiang
>            Priority: Critical
>
> One motivation for this issue is to reduce the in-flight data under back
> pressure in order to speed up checkpoints. The current default number of
> exclusive buffers per channel is 2. If we reduce it to 0 and increase the
> number of floating buffers somewhat to compensate, a deadlock might occur,
> because all the floating buffers might be taken by blocked input channels
> and not recycled until barrier alignment.
> To address this deadlock concern, we can change the logic on both the sender
> and receiver sides.
>  * Sender side: it should revoke the previously received credit after
> sending a checkpoint barrier, which means it would not send any subsequent
> buffers until it receives new credits.
>  * Receiver side: after processing the barrier from one channel and marking
> it blocked, it should release the available floating buffers of this blocked
> channel, and resume requesting floating buffers only after barrier
> alignment. That means the receiver would only announce new credits to the
> sender side after barrier alignment.
> Another possible benefit is that the floating buffers might be put to better
> use before barrier alignment. We can further verify the performance impact
> via the existing micro-benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
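
The sender-side and receiver-side changes proposed in the issue description can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names (`FloatingBufferPool`, `Sender`, `InputChannel`, `onBarrier`, `onAlignmentDone`) are hypothetical and are not Flink APIs. Each channel is assumed to have zero exclusive buffers, and the model tracks only unused floating buffers and the matching sender credit.

```java
/** Hypothetical toy model of the proposal; none of these classes are Flink APIs. */
public class CreditRevocationSketch {

    /** Receiver-side pool of floating buffers shared by all input channels. */
    static final class FloatingBufferPool {
        private int available;

        FloatingBufferPool(int size) { this.available = size; }

        /** Grants up to 'wanted' buffers and returns how many were granted. */
        int request(int wanted) {
            int granted = Math.min(wanted, available);
            available -= granted;
            return granted;
        }

        void recycle(int count) { available += count; }

        int available() { return available; }
    }

    /** Sender-side view of one channel: it may send one data buffer per credit. */
    static final class Sender {
        private int credit;

        void addCredit(int delta) { credit += delta; }

        /** Sends a data buffer only if credit is left; returns whether it was sent. */
        boolean trySendBuffer() {
            if (credit == 0) {
                return false;
            }
            credit--;
            return true;
        }

        /** Assumption: the barrier itself needs no credit; all remaining credit is revoked. */
        void sendBarrier() { credit = 0; }

        int credit() { return credit; }
    }

    /** Receiver-side input channel with zero exclusive buffers. */
    static final class InputChannel {
        private final FloatingBufferPool pool;
        private final Sender sender;
        private int unusedFloatingBuffers;
        private boolean blocked;

        InputChannel(FloatingBufferPool pool, Sender sender) {
            this.pool = pool;
            this.sender = sender;
        }

        /** Requests floating buffers and announces them as credit, unless blocked. */
        void requestBuffers(int wanted) {
            if (blocked) {
                return; // no new credit is announced before barrier alignment
            }
            int granted = pool.request(wanted);
            unusedFloatingBuffers += granted;
            sender.addCredit(granted);
        }

        /** Barrier processed: block the channel and give unused floating buffers back. */
        void onBarrier() {
            blocked = true;
            pool.recycle(unusedFloatingBuffers);
            unusedFloatingBuffers = 0;
        }

        /** Barrier alignment finished: unblock and resume requesting buffers. */
        void onAlignmentDone(int wanted) {
            blocked = false;
            requestBuffers(wanted);
        }
    }

    public static void main(String[] args) {
        FloatingBufferPool pool = new FloatingBufferPool(2);
        Sender senderA = new Sender();
        InputChannel a = new InputChannel(pool, senderA);
        InputChannel b = new InputChannel(pool, new Sender());

        a.requestBuffers(2);   // channel A takes every floating buffer
        senderA.sendBarrier(); // sender revokes its remaining credit
        a.onBarrier();         // receiver releases A's unused floating buffers
        b.requestBuffers(2);   // channel B can still make progress
        System.out.println("pool after B's request: " + pool.available());
    }
}
```

In this model the blocked channel A holds no floating buffers and its sender holds no credit, so the unblocked channel B can still obtain buffers and deliver its barrier. Without the release in `onBarrier`, A would keep both buffers until alignment while B could never receive the data preceding its own barrier.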