[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229465#comment-17229465
 ] 

Arvid Heise edited comment on FLINK-19964 at 11/10/20, 7:27 PM:
----------------------------------------------------------------

Okay, here is what happens:
* Two {{LocalBufferPool}}s are destroyed concurrently.
* The second {{LocalBufferPool}} still gets buffers assigned while the first 
one is being destroyed in {{NetworkBufferPool}} (this has always been the case).
* With the change in e17dbab24f4f71c5472d27267e938791686e45c3, the second 
{{LocalBufferPool}} proactively acquires one of the assigned buffers.
* Since that pool has already been destroyed, this buffer is never returned to 
the {{NetworkBufferPool}} and is effectively leaked.
* With each repeated test run, the {{NetworkBufferPool}} of the same 
mini-cluster has fewer buffers available (one fewer for each concurrent release).
* Eventually, there are too few buffers available to make progress.

So there is a serious bug here that needs to be fixed asap. The two obvious 
choices are either to not assign buffers to destroyed pools (which is 
expensive to detect because different threads are involved) or to not 
proactively take a buffer in a destroyed local pool (easy to implement, as it 
restores the old behavior).
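
To make the interleaving and the second option concrete, here is a minimal, single-threaded sketch. It is a toy model under stated assumptions, not Flink code: {{GlobalPool}}, {{LocalPool}}, {{assignQuota}} and the {{skipWhenDestroyed}} flag are hypothetical names, and the race is replayed as a fixed sequence of calls instead of real threads.

```java
// Toy model only: GlobalPool and LocalPool are hypothetical stand-ins,
// not Flink's NetworkBufferPool / LocalBufferPool.
final class BufferLeakSketch {

    /** Shared pool, reduced to a counter of free buffers. */
    static final class GlobalPool {
        private int free;

        GlobalPool(int total) {
            this.free = total;
        }

        boolean tryTake() {
            if (free == 0) {
                return false;
            }
            free--;
            return true;
        }

        void giveBack() {
            free++;
        }

        int free() {
            return free;
        }
    }

    /** Per-task pool that eagerly grabs one buffer when it is assigned a quota. */
    static final class LocalPool {
        private final GlobalPool global;
        private final boolean skipWhenDestroyed; // assumed fix: guard on destroyed pools
        private int held;
        private boolean destroyed;

        LocalPool(GlobalPool global, boolean skipWhenDestroyed) {
            this.global = global;
            this.skipWhenDestroyed = skipWhenDestroyed;
        }

        /** Returns everything currently held; nothing is returned after this point. */
        void destroy() {
            destroyed = true;
            while (held > 0) {
                held--;
                global.giveBack();
            }
        }

        /** Called when the global pool redistributes buffers among local pools. */
        void assignQuota(int quota) {
            if (destroyed && skipWhenDestroyed) {
                return; // guard: a destroyed pool must not acquire anything
            }
            if (held == 0 && quota > 0 && global.tryTake()) {
                held++; // proactive acquisition; leaks if the pool is already destroyed
            }
        }
    }

    /** Replays the racy interleaving several times against one shared pool. */
    static int leakedAfterRaces(int races, boolean skipWhenDestroyed) {
        final int total = 10;
        GlobalPool global = new GlobalPool(total);
        for (int i = 0; i < races; i++) {
            LocalPool first = new LocalPool(global, skipWhenDestroyed);
            LocalPool second = new LocalPool(global, skipWhenDestroyed);
            second.destroy();       // the second pool is already destroyed ...
            first.destroy();        // ... while the first one is being torn down,
            second.assignQuota(5);  // and redistribution still reaches the second pool
        }
        return total - global.free();
    }

    public static void main(String[] args) {
        System.out.println("leaked without guard: " + leakedAfterRaces(3, false));
        System.out.println("leaked with guard: " + leakedAfterRaces(3, true));
    }
}
```

In this model, three replays without the guard leave three buffers missing from the shared pool, and with the guard none are missing. The guard only needs the local pool's own destroyed flag, which is why this option looks cheaper than detecting destroyed pools on the assigning side.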

There could also be follow-up work on why having fewer available buffers 
causes a deadlock instead of a fast failure.

Note that the recent changes might have caused more concurrent releases of 
pools and thus made the issue visible.
Note 2: probably no dev has executed one of the tests locally in the last 3 
months. (Yes, the hangs happen that often locally!)


> Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph
> ----------------------------------------------------------------
>
>                 Key: FLINK-19964
>                 URL: https://issues.apache.org/jira/browse/FLINK-19964
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / Graph Processing (Gelly), Runtime / Network, 
> Tests
>    Affects Versions: 1.12.0
>            Reporter: Chesnay Schepler
>            Assignee: Arvid Heise
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.12.0
>
>
> The HITSITCase has gotten stuck on Azure. Chances are that something in the 
> scheduling or network has broken it.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8919&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
