[ 
https://issues.apache.org/jira/browse/FLINK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16403:
-----------------------------
    Release Note:   (was: Duplicated with FLINK-16404)

> Solve the potential deadlock problem when reducing exclusive buffers to zero
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-16403
>                 URL: https://issues.apache.org/jira/browse/FLINK-16403
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: Zhijiang
>            Priority: Critical
>
> One motivation for this issue is to reduce the amount of in-flight data 
> under back pressure, which speeds up checkpoints. The current default is 2 
> exclusive buffers per channel. If we reduce it to 0 and add some floating 
> buffers to compensate, a deadlock becomes possible: all the floating 
> buffers might be taken by blocked input channels and never recycled until 
> barrier alignment.
> To address this deadlock concern, we can change the logic on both the 
> sender and receiver sides.
>  * Sender side: it should revoke the previously received credit after 
> sending a checkpoint barrier, meaning it would not send any subsequent 
> buffers until it receives new credit.
>  * Receiver side: after processing the barrier from a channel and marking 
> that channel as blocked, it should release the floating buffers available 
> to the blocked channel, and resume requesting floating buffers only after 
> barrier alignment. That means the receiver would announce new credit to 
> the sender only after barrier alignment.
> Another possible benefit is that the floating buffers might be put to 
> better use before barrier alignment. We can further check the performance 
> impact with the existing micro-benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
