[ https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226640#comment-17226640 ]
Roman Khachatryan commented on FLINK-19964:
-------------------------------------------

I assumed at first that it was caused by my recent addition of waiting for the EndOfChannelState event, but git bisect pointed to a rather old commit:

{code:java}
e17dbab24f4f71c5472d27267e938791686e45c3 is the first bad commit
commit e17dbab24f4f71c5472d27267e938791686e45c3
Author: Arvid Heise <ar...@ververica.com>
Date:   Fri Sep 25 14:39:17 2020 +0200

    [FLINK-16972][network] LocalBufferPool eagerly fetches global segments to ensure proper availability.

    Before this commit, availability of LocalBufferPool depended on the availability of a shared NetworkBufferPool. However, if multiple LocalBufferPools are simultaneously available only because the NetworkBufferPool becomes available with one segment, only one of the LocalBufferPools is truly available (the one that actually acquires this segment).

    The solution in this commit is to define availability only through the guaranteed ability to provide a memory segment to the consumer. If a LocalBufferPool runs out of local segments, it becomes unavailable until it receives a segment from the NetworkBufferPool. To minimize unavailability, LocalBufferPool first tries to eagerly fetch new segments before declaring unavailability, and if that fails, the local pool subscribes to the availability of the network pool to restore availability as soon as possible.

    Additionally, LocalBufferPool would switch to unavailable only after it could not serve a requested memory segment. For requestBufferBuilderBlocking that is too late, as it has already entered the blocking loop.

    Finally, LocalBufferPool now permanently holds at least one buffer. To reflect that, the number of required segments needs to be at least one, which matches all usages in production code. A few tests needed to be adjusted to properly capture the new requirement.

:040000 040000 1331ab5652c4bfbdbed02576f4e57a87ccaa1170 4f767acd0262ba07eefbdee6b8bd717ef1957765 M	flink-runtime
{code}

Reverting it on master solves the problem. The failure itself doesn't happen every time; adding logging or running in debug mode also prevents it. Given that, and that the commit is quite old, I'd lower the priority. WDYT [~rmetzger], [~pnowojski]?

> Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph
> ----------------------------------------------------------------
>
>                 Key: FLINK-19964
>                 URL: https://issues.apache.org/jira/browse/FLINK-19964
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / Graph Processing (Gelly), Runtime / Network, Tests
>    Affects Versions: 1.12.0
>            Reporter: Chesnay Schepler
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.12.0
>
>
> The HITSITCase has gotten stuck on Azure. Chances are that something in the scheduling or network has broken it.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8919&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
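As an aside, the availability scheme the bisected commit describes can be sketched roughly as follows. This is a simplified, hypothetical illustration only: {{GlobalPool}} and {{LocalPool}} are invented names for this sketch, not Flink classes, and real segments, locking, and the availability future are omitted. The point is that the local pool's availability is defined by its guaranteed ability to serve the next request, with an eager fetch from the shared pool before it declares itself unavailable.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-in for the shared NetworkBufferPool: a bounded supply of segments.
class GlobalPool {
    private int freeSegments;
    GlobalPool(int segments) { this.freeSegments = segments; }
    synchronized Integer requestSegment() {
        if (freeSegments == 0) {
            return null; // shared pool exhausted
        }
        freeSegments--;
        return 1; // placeholder for a real memory segment
    }
}

// Stand-in for a LocalBufferPool with the post-FLINK-16972 semantics.
class LocalPool {
    private final GlobalPool global;
    private final Queue<Integer> localSegments = new ArrayDeque<>();
    private boolean available = true;

    LocalPool(GlobalPool global) { this.global = global; }

    boolean isAvailable() { return available; }

    /** Returns a segment, or null if neither pool can provide one. */
    Integer requestSegment() {
        Integer segment = localSegments.poll();
        if (segment == null) {
            // Eagerly fetch from the shared pool before giving up.
            segment = global.requestSegment();
        }
        // Availability means the *next* request is guaranteed to succeed,
        // so try to keep one segment in reserve before returning.
        if (localSegments.isEmpty()) {
            Integer next = global.requestSegment();
            if (next != null) {
                localSegments.add(next);
            } else {
                // Flink proper would now subscribe to the global pool's
                // availability future; this sketch just flips a flag.
                available = false;
            }
        }
        return segment;
    }

    void recycle(Integer segment) {
        localSegments.add(segment);
        available = true;
    }
}

public class AvailabilitySketch {
    public static void main(String[] args) {
        GlobalPool global = new GlobalPool(2);
        LocalPool local = new LocalPool(global);
        Integer s1 = local.requestSegment(); // takes one segment, prefetches the other
        System.out.println("got segment: " + (s1 != null));
        System.out.println("still available: " + local.isAvailable());
        local.requestSegment(); // consumes the prefetched segment
        System.out.println("available after exhaustion: " + local.isAvailable());
    }
}
{code}

With a global pool of two segments, the first request succeeds and the pool stays available because a second segment was prefetched; after that second segment is consumed, the local pool turns unavailable instead of letting a blocking requester spin on an exhausted shared pool.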