[ https://issues.apache.org/jira/browse/FLINK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113714#comment-17113714 ]
Zhijiang edited comment on FLINK-17823 at 6/2/20, 8:03 AM:
-----------------------------------------------------------

Merged in release-1.11: 3eb1075ded64da20e6f7a5bc268f455eaf6573eb

Merged in master: 8c7c7267be95cddd7122d2b97e5334f5db4cc37c


was (Author: zjwang):
Merged in release-1.11: 3eb1075ded64da20e6f7a5bc268f455eaf6573eb

Will merge to master later and update the info.

> Resolve the race condition while releasing RemoteInputChannel
> -------------------------------------------------------------
>
>                 Key: FLINK-17823
>                 URL: https://issues.apache.org/jira/browse/FLINK-17823
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.11.0
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> RemoteInputChannel#releaseAllResources might be called by the canceler thread.
> Meanwhile, the task thread can also call RemoteInputChannel#getNextBuffer.
> This can cause two problems:
>  * The task thread might get a null buffer after the canceler thread has
> already released all the buffers, which then causes a misleading NPE in
> getNextBuffer.
>  * The task thread and the canceler thread might pull the same buffer
> concurrently, which causes an unexpected exception when the same buffer is
> recycled twice.
> The solution is to properly synchronize on the buffer queue in the release
> method, so that the same buffer cannot be pulled by both the canceler thread
> and the task thread. In the getNextBuffer method, we add explicit checks to
> avoid the misleading NPE and report a meaningful exception instead.
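For readers of the archive, the synchronization described in the issue can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the merged Flink code: SketchBuffer, SketchInputChannel,
onBuffer, the isReleased flag and the choice of IllegalStateException are all
hypothetical names and choices made for this example. The idea is that every
buffer leaves the queue exactly once while the queue's lock is held, so only
one thread can ever own and recycle it, and getNextBuffer checks the released
state explicitly instead of dereferencing a null buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a network buffer; not a Flink class.
final class SketchBuffer {
    private boolean recycled;

    // Recycling twice is a bug, so it fails loudly.
    synchronized void recycle() {
        if (recycled) {
            throw new IllegalStateException("Buffer recycled twice.");
        }
        recycled = true;
    }

    synchronized boolean isRecycled() {
        return recycled;
    }
}

// Hypothetical, heavily simplified channel; not Flink's RemoteInputChannel.
final class SketchInputChannel {
    private final ArrayDeque<SketchBuffer> receivedBuffers = new ArrayDeque<>();
    private final AtomicBoolean isReleased = new AtomicBoolean(false);

    // Called by the (assumed) network thread when a buffer arrives.
    void onBuffer(SketchBuffer buffer) {
        synchronized (receivedBuffers) {
            if (isReleased.get()) {
                // Channel already released: nobody will read this buffer.
                buffer.recycle();
                return;
            }
            receivedBuffers.add(buffer);
        }
    }

    // Called by the task thread.
    Optional<SketchBuffer> getNextBuffer() {
        final SketchBuffer next;
        synchronized (receivedBuffers) {
            next = receivedBuffers.poll();
        }
        if (next == null) {
            // Explicit check instead of a later NullPointerException.
            if (isReleased.get()) {
                throw new IllegalStateException(
                        "Queried for a buffer after the channel was released.");
            }
            return Optional.empty();
        }
        return Optional.of(next);
    }

    // Called by the canceler thread; safe to call more than once.
    void releaseAllResources() {
        if (!isReleased.compareAndSet(false, true)) {
            return;
        }
        final List<SketchBuffer> toRecycle;
        synchronized (receivedBuffers) {
            // Drain under the same lock that getNextBuffer polls under.
            toRecycle = new ArrayList<>(receivedBuffers);
            receivedBuffers.clear();
        }
        for (SketchBuffer buffer : toRecycle) {
            buffer.recycle();
        }
    }
}
```

In this sketch a buffer that the task thread has already polled stays owned
by the task thread, while buffers still queued at release time are recycled
once by the releasing thread. A getNextBuffer call after release fails with a
descriptive exception rather than an NPE.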