[ https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226640#comment-17226640 ]

Roman Khachatryan commented on FLINK-19964:
-------------------------------------------

At first I assumed it was caused by my recent addition of waiting for the 
EndOfChannelState event.

But git bisect gave a rather old commit:
{code}
e17dbab24f4f71c5472d27267e938791686e45c3 is the first bad commit
commit e17dbab24f4f71c5472d27267e938791686e45c3
Author: Arvid Heise <ar...@ververica.com>
Date:   Fri Sep 25 14:39:17 2020 +0200

    [FLINK-16972][network] LocalBufferPool eagerly fetches global segments to ensure proper availability.

    Before this commit, the availability of a LocalBufferPool depended on the
    availability of the shared NetworkBufferPool. However, if multiple
    LocalBufferPools are simultaneously available only because the
    NetworkBufferPool becomes available with a single segment, only one of
    them is truly available (the one that actually acquires that segment).

    The solution in this commit is to define availability solely through the
    guaranteed ability to provide a memory segment to the consumer. If a
    LocalBufferPool runs out of local segments, it becomes unavailable until
    it receives a segment from the NetworkBufferPool. To minimize
    unavailability, the LocalBufferPool first tries to eagerly fetch new
    segments before declaring itself unavailable; if that fails, it
    subscribes to the availability of the network pool to restore its own
    availability as soon as possible.

    Additionally, LocalBufferPool used to switch to unavailable only after it
    failed to serve a requested memory segment. For requestBufferBuilderBlocking
    that is too late, as it has already entered the blocking loop by then.

    Finally, LocalBufferPool now permanently holds at least one buffer. To
    reflect that, the number of required segments needs to be at least one,
    which matches all usages in production code. A few tests needed to be
    adjusted to properly capture the new requirement.

:040000 040000 1331ab5652c4bfbdbed02576f4e57a87ccaa1170 4f767acd0262ba07eefbdee6b8bd717ef1957765 M      flink-runtime

{code}
Reverting it on master solves the problem.
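
To make the availability scheme described in that commit message easier to follow, here is a minimal, single-threaded sketch of the idea. The class and method names below are illustrative placeholders, not the actual flink-runtime API:
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/**
 * Single-threaded toy model of the availability scheme from the commit
 * message above. These are NOT the real Flink classes (which also need
 * proper locking); the names are just illustrative.
 */
class ToyNetworkPool {
    private final Queue<byte[]> segments = new ArrayDeque<>();
    private CompletableFuture<Void> available = CompletableFuture.completedFuture(null);

    ToyNetworkPool(int count, int segmentSize) {
        for (int i = 0; i < count; i++) {
            segments.add(new byte[segmentSize]);
        }
    }

    /** Hands out a segment or null; flips to "unavailable" once drained. */
    byte[] requestSegment() {
        byte[] seg = segments.poll();
        if (segments.isEmpty() && available.isDone()) {
            available = new CompletableFuture<>();
        }
        return seg;
    }

    void recycle(byte[] seg) {
        segments.add(seg);
        available.complete(null); // wakes up every subscribed local pool
    }

    CompletableFuture<Void> getAvailableFuture() {
        return available;
    }
}

class ToyLocalPool {
    private final ToyNetworkPool networkPool;
    private final Queue<byte[]> localSegments = new ArrayDeque<>();
    // "Available" means: the next request is guaranteed to get a segment.
    private CompletableFuture<Void> available = new CompletableFuture<>();

    ToyLocalPool(ToyNetworkPool networkPool) {
        this.networkPool = networkPool;
        refill(); // always try to hold at least one segment
    }

    boolean isAvailable() {
        return available.isDone();
    }

    byte[] requestSegment() {
        byte[] seg = localSegments.poll();
        refill(); // eagerly fetch *now*, before anyone blocks on the next request
        return seg;
    }

    private void refill() {
        if (!localSegments.isEmpty()) {
            return; // can still serve the next request
        }
        byte[] seg = networkPool.requestSegment();
        if (seg != null) {
            localSegments.add(seg);
            if (!available.isDone()) {
                available.complete(null);
            }
        } else {
            // Could not refill: become unavailable and subscribe to the network
            // pool so that availability is restored as soon as a segment returns.
            if (available.isDone()) {
                available = new CompletableFuture<>();
            }
            networkPool.getAvailableFuture().thenRun(this::refill);
        }
    }
}
{code}
The crucial change is that the local pool refills eagerly the moment it hands out its last segment, so it can flip to unavailable and subscribe to the network pool before any caller blocks on the next request.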

The failure itself doesn't happen every time, and adding logging or running 
in debug mode prevents it.

Given that, and that it's quite old, I'd lower the priority. WDYT [~rmetzger], 
[~pnowojski]?

> Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph
> ----------------------------------------------------------------
>
>                 Key: FLINK-19964
>                 URL: https://issues.apache.org/jira/browse/FLINK-19964
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / Graph Processing (Gelly), Runtime / Network, 
> Tests
>    Affects Versions: 1.12.0
>            Reporter: Chesnay Schepler
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.12.0
>
>
> The HITSITCase has gotten stuck on Azure. Chances are that something in the 
> scheduling or network has broken it.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8919&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
