[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229801#comment-17229801
 ] 

Robert Metzger commented on FLINK-19964:


Thanks a lot for the fix!

> Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph
> 
>
> Key: FLINK-19964
> URL: https://issues.apache.org/jira/browse/FLINK-19964
> Project: Flink
> Issue Type: Bug
> Components: Library / Graph Processing (Gelly), Runtime / Network, Tests
> Affects Versions: 1.12.0
> Reporter: Chesnay Schepler
> Assignee: Arvid Heise
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.12.0
>
>
> The HITSITCase has gotten stuck on Azure. Chances are that something in the 
> scheduling or network has broken it.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8919&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229777#comment-17229777
 ] 

Arvid Heise commented on FLINK-19964:
-

Yes, sorry about that, but I realized that exactly that question was still 
unanswered ;).

Merged the fix into master as 18ffebb3dbecc21d2d33c436628176c4971cebbd. 
Backport not applicable.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229467#comment-17229467
 ] 

Robert Metzger commented on FLINK-19964:


Okay, it seems that you've edited your comment while I posted :) 



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229466#comment-17229466
 ] 

Robert Metzger commented on FLINK-19964:


Thanks a lot for your analysis. I still don't understand why the issue didn't 
occur earlier. The change was merged at the beginning of October, but we only 
saw the first failures in November, and then quite frequently. 



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229465#comment-17229465
 ] 

Arvid Heise commented on FLINK-19964:
-

Okay here is what happens:
* Two {{LocalBufferPool}}s are destroyed concurrently.
* The second {{LocalBufferPool}} still gets buffers assigned while the first one 
is being destroyed in {{NetworkBufferPool}} (it has always been like this).
* With the change in e17dbab24f4f71c5472d27267e938791686e45c3, the second 
{{LocalBufferPool}} proactively acquires one of the assigned buffers.
* Since it has already been destroyed, this buffer is never returned to the 
{{NetworkBufferPool}} and simply vanishes from the heap.
* With each repeated test run, the {{NetworkBufferPool}} of the same 
mini-cluster has fewer buffers available (one fewer for each concurrent release).
* Eventually, there are too few buffers available to make progress.
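
The sequence above can be modelled with a small toy program. This is only a sketch under stated assumptions: {{GlobalPool}}, {{LocalPool}}, {{onSegmentOffered}} and {{availableAfter}} are hypothetical names, not Flink classes or methods, and the concurrent destruction is written out as one fixed sequential interleaving instead of real threads.

```java
/** Toy model of the leak described above. These are NOT Flink's classes. */
class BufferLeakSketch {

    /** Hypothetical stand-in for the shared NetworkBufferPool: a counter of free segments. */
    static final class GlobalPool {
        private int available;

        GlobalPool(int total) {
            this.available = total;
        }

        synchronized boolean request() {
            if (available == 0) {
                return false;
            }
            available--;
            return true;
        }

        synchronized void recycle() {
            available++;
        }

        synchronized int available() {
            return available;
        }
    }

    /** Hypothetical stand-in for a LocalBufferPool that eagerly takes offered segments. */
    static final class LocalPool {
        private final GlobalPool global;
        private int held;

        LocalPool(GlobalPool global) {
            this.global = global;
        }

        /** Invoked when the global pool redistributes segments to its local pools. */
        void onSegmentOffered() {
            // Modelled bug: the pool takes the segment without checking
            // whether it has already been destroyed.
            if (global.request()) {
                held++;
            }
        }

        /** Returns every segment held at the time of the call. */
        void destroy() {
            while (held > 0) {
                held--;
                global.recycle();
            }
        }
    }

    /** One test run with the unlucky interleaving, written out sequentially. */
    static void runOnce(GlobalPool global) {
        LocalPool first = new LocalPool(global);
        LocalPool second = new LocalPool(global);
        first.onSegmentOffered();
        second.onSegmentOffered();
        second.destroy();           // second gives back its segment
        first.destroy();            // first gives back its segment ...
        second.onSegmentOffered();  // ... and the destroyed second pool grabs one again
    }

    /** Returns the number of free segments left after the given number of runs. */
    static int availableAfter(int totalSegments, int runs) {
        GlobalPool global = new GlobalPool(totalSegments);
        for (int i = 0; i < runs; i++) {
            runOnce(global);
        }
        return global.available();
    }

    public static void main(String[] args) {
        for (int runs = 0; runs <= 10; runs += 5) {
            System.out.println(runs + " runs -> " + availableAfter(10, runs) + " free segments");
        }
    }
}
```

In this model every run on the same global pool loses exactly one segment, so a pool of 10 segments is empty after 10 runs. The real numbers depend on timing and pool sizes, which the sketch does not try to reproduce.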

So there is a serious bug that needs to be fixed asap. The two obvious 
choices are either to not assign buffers to destroyed pools (which is 
expensive to detect because different threads are involved) or to not 
proactively take a buffer in a destroyed local pool (easy to implement, and 
the same as the old behavior).
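
A rough illustration of the second option, again as a toy model and not the actual patch: the local pool checks a destroyed flag under its own lock before it takes anything from the global pool. All names ({{GuardedLocalPool}}, {{onSegmentOffered}}, {{availableAfter}}) are hypothetical.

```java
/** Sketch of the "do not take buffers once destroyed" option. NOT Flink's actual fix. */
class GuardedLocalPoolSketch {

    /** Hypothetical stand-in for the shared NetworkBufferPool. */
    static final class GlobalPool {
        private int available;

        GlobalPool(int total) {
            this.available = total;
        }

        synchronized boolean request() {
            if (available == 0) {
                return false;
            }
            available--;
            return true;
        }

        synchronized void recycle() {
            available++;
        }

        synchronized int available() {
            return available;
        }
    }

    /** Hypothetical local pool that refuses offered segments after destroy(). */
    static final class GuardedLocalPool {
        private final GlobalPool global;
        private final Object lock = new Object();
        private int held;
        private boolean destroyed;

        GuardedLocalPool(GlobalPool global) {
            this.global = global;
        }

        boolean onSegmentOffered() {
            synchronized (lock) {
                if (destroyed) {
                    return false; // leave the segment in the global pool
                }
                if (!global.request()) {
                    return false;
                }
                held++;
                return true;
            }
        }

        void destroy() {
            synchronized (lock) {
                // Flag and release happen under the same lock as the eager fetch,
                // so no segment can be taken after this point.
                destroyed = true;
                while (held > 0) {
                    held--;
                    global.recycle();
                }
            }
        }
    }

    /** Replays the interleaving from the analysis and returns the free segments left. */
    static int availableAfter(int totalSegments, int runs) {
        GlobalPool global = new GlobalPool(totalSegments);
        for (int i = 0; i < runs; i++) {
            GuardedLocalPool first = new GuardedLocalPool(global);
            GuardedLocalPool second = new GuardedLocalPool(global);
            first.onSegmentOffered();
            second.onSegmentOffered();
            second.destroy();
            first.destroy();
            second.onSegmentOffered(); // rejected: second is already destroyed
        }
        return global.available();
    }

    public static void main(String[] args) {
        System.out.println("free segments after 12 runs: " + availableAfter(10, 12));
    }
}
```

With the guard in place the global pool keeps all of its segments across runs in this model. The check is cheap because it only touches state owned by the local pool itself.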

There could also be follow-up work on why fewer available buffers cause a 
deadlock instead of a fast failure.
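
To illustrate what a fast failure could look like in general, here is a minimal sketch that uses a plain {{java.util.concurrent.Semaphore}} as a stand-in for the free buffers. It does not show how Flink requests buffers; {{requestOrFail}} and the timeout value are assumptions made up for this example.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Generic fail-fast pattern; the semaphore permits stand in for free buffers. */
class FailFastRequestSketch {

    /** Takes one permit (one "buffer") or throws if none becomes available in time. */
    static void requestOrFail(Semaphore freeBuffers, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = freeBuffers.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a buffer", e);
        }
        if (!acquired) {
            throw new IllegalStateException(
                    "No buffer became available within " + timeoutMillis + " ms");
        }
    }

    public static void main(String[] args) {
        Semaphore freeBuffers = new Semaphore(1);
        requestOrFail(freeBuffers, 50); // succeeds, the pool is now empty
        try {
            requestOrFail(freeBuffers, 50); // nobody returns a buffer, so this times out
        } catch (IllegalStateException e) {
            System.out.println("Failed fast: " + e.getMessage());
        }
    }
}
```

A bounded wait like this turns a silent hang into an error that shows up in the test log, at the cost of having to pick a timeout that is not too short for slow CI machines.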



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229290#comment-17229290
 ] 

Arvid Heise commented on FLINK-19964:
-

I can confirm [~roman_khachatryan]'s finding that this is caused by 
e17dbab24f4f71c5472d27267e938791686e45c3. It's easy to reproduce locally.
I'm assuming recent commits just changed the timing and made it more likely to 
appear.

However, it's also interesting that locally the test only gets stuck after >10 
tests have run on the same mini cluster. So it may also be related to a 
memory leak.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-10 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229274#comment-17229274
 ] 

Robert Metzger commented on FLINK-19964:


You are right. Given the frequency of these deadlocks, something must have been 
merged in the last week that triggers this.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-08 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228381#comment-17228381
 ] 

Zhu Zhu commented on FLINK-19964:
-

The change FLINK-19189 to "enable pipelined region scheduling by default" was 
already merged on 09/24, which is even older.
So there may be another cause, or possibly there are multiple causes.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-08 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228354#comment-17228354
 ] 

Robert Metzger commented on FLINK-19964:


The iteration deadlocks are happening quite frequently now. The commit Roman 
found by bisecting is quite old. I believe a more recent change to the pipelined 
region scheduling is more likely to cause this instability.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-06 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227368#comment-17227368
 ] 

Robert Metzger commented on FLINK-19964:


Another Gelly test is affected too: 
https://issues.apache.org/jira/browse/FLINK-20011



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-05 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227153#comment-17227153
 ] 

Zhu Zhu commented on FLINK-19964:
-

We recently noticed the issue FLINK-19994, in which pipelined region scheduling 
eagerly schedules all the vertices in a DataSet iteration job.
[~roman_khachatryan] Is it possible that the problem is caused by a downstream 
task allocating all available network buffers from the global pool, so that the 
upstream task cannot obtain any buffer and gets stuck? If so, I think 
FLINK-19994 can fix this problem.
However, I could not reproduce the problem in 1700+ local runs, so I'm not 
sure whether my guess is correct.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-05 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227032#comment-17227032
 ] 

Roman Khachatryan commented on FLINK-19964:
---

I'm not sure, probably not.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-05 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226793#comment-17226793
 ] 

Robert Metzger commented on FLINK-19964:


[~roman_khachatryan] Is this blocker expected to be resolved before the feature 
freeze on Sunday?




[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-05 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226649#comment-17226649
 ] 

Robert Metzger commented on FLINK-19964:


Yes, let's keep it as a release blocker and fix it asap. All iterative batch 
jobs are probably affected by this.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-05 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226642#comment-17226642
 ] 

Piotr Nowojski commented on FLINK-19964:


Thanks for the investigation, [~roman_khachatryan]. FLINK-16972 is not that 
old; it was only merged in 1.12, so we should keep this ticket as a release 
blocker. Otherwise we might release 1.12 with a serious new bug.



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-05 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226640#comment-17226640
 ] 

Roman Khachatryan commented on FLINK-19964:
---

I assumed at first that it was caused by my recent addition of waiting for 
the EndOfChannelState event.

But git bisect gave a rather old commit:
{code:java}
e17dbab24f4f71c5472d27267e938791686e45c3 is the first bad commit
commit e17dbab24f4f71c5472d27267e938791686e45c3
Author: Arvid Heise 
Date:   Fri Sep 25 14:39:17 2020 +0200

[FLINK-16972][network] LocalBufferPool eagerly fetches global segments to 
ensure proper availability.

Before this commit, availability of LocalBufferPool depended on a the 
availability of a shared NetworkBufferPool. However, if multiple 
LocalBufferPools simultaneously are available only because the 
NetworkBufferPool becomes available with one segment, only one of the 
LocalBufferPools is truly available (the one that actually acquires this 
segment).

The solution in this commit is to define availability only through the 
guaranteed ability to provide a memory segment to the consumer. If a 
LocalBufferPool runs out of local segments it will become unavailable until it 
receives a segment from the NetworkBufferPool. To minimize unavailability, 
LocalBufferPool first tries to eagerly fetch new segments before declaring 
unavailability and if that fails, the local pool subscribes to the availability 
to the network pool to restore availability asap.

Additionally, LocalBufferPool would switch to unavailable only after it 
could not serve a requested memory segment. For requestBufferBuilderBlocking 
that is too late as it entered the blocking loop already.

Finally, LocalBufferPool now permanently holds at least one buffer. To 
reflect that, the number of required segments needs to be at least one, which 
matches all usages in production code. A few test needed to be adjusted to 
properly capture the new requirement.

:04 04 1331ab5652c4bfbdbed02576f4e57a87ccaa1170 
4f767acd0262ba07eefbdee6b8bd717ef1957765 M  flink-runtime

{code}
Reverting it on master solves the problem.

The failure itself doesn't happen all the time. Also, adding logging or running 
in debug mode prevents it.

Given that, and given that the commit is quite old, I'd lower the priority. 
WDYT [~rmetzger], [~pnowojski]?



[jira] [Commented] (FLINK-19964) Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph

2020-11-04 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226077#comment-17226077
 ] 

Arvid Heise commented on FLINK-19964:
-

[~roman_khachatryan] is looking into it.
