[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463638#comment-17463638
 ] 

David Morávek commented on FLINK-25185:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28447&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907396Z 0x00

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Yingjie Cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463554#comment-17463554
 ] 

Yingjie Cao commented on FLINK-25185:
-

I think I now understand: it should be the interrupt exception that led us to 
reach the _catch_ block. I will try to prepare a fix for FLINK-25407 soon.
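
To illustrate how an interrupt can steer a blocking request into its _catch_ block, here is a minimal sketch in plain Java. It is not Flink code: the class {{InterruptToCatchSketch}}, the method {{requestSegment}} and the returned strings are hypothetical names chosen for this example. It only assumes the standard behaviour of {{BlockingQueue#poll(long, TimeUnit)}}, which throws {{InterruptedException}} when the calling thread is interrupted.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a blocking segment request; not the Flink implementation.
class InterruptToCatchSketch {

    // Returns "acquired", "timeout", or "cleanup-after-interrupt".
    static String requestSegment(BlockingQueue<Object> availableSegments) {
        try {
            // Blocks for up to one second waiting for a free segment.
            Object segment = availableSegments.poll(1, TimeUnit.SECONDS);
            return segment != null ? "acquired" : "timeout";
        } catch (InterruptedException e) {
            // A task cancellation typically interrupts the task thread, so the
            // cleanup code in this block runs on that (still lock-holding) thread.
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return "cleanup-after-interrupt";
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Object> empty = new ArrayBlockingQueue<>(1);
        Thread.currentThread().interrupt(); // simulate a cancellation
        System.out.println(requestSegment(empty));
        Thread.interrupted(); // clear the flag again
    }
}
```

If the cleanup in the _catch_ block needs a further lock, the interrupted thread has to acquire it while still holding whatever it locked before the request.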

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463325#comment-17463325
 ] 

Piotr Nowojski commented on FLINK-25185:


After an offline discussion with [~roman] and some further analysis, this is 
what we think is happening on the 1.15 branch.

# The test hits {{FileNotFoundException}}, probably caused by FLINK-25395
# The test ends up in an infinite restart loop, where each restart attempt hits 
{{FileNotFoundException}}
# After tens of thousands of restart attempts and cancellations (for example in 
attempt #14176, as [commented in Roman's post|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462834]), 
this endless cycle of restarts and cancellations causes the FLINK-25407 
deadlock to surface
# From then on, StreamFaultToleranceTestBase ends up in yet another infinite 
restart loop, but this time scheduling fails with "Could not acquire 
the minimum required resources."

We have extracted those two issues into independent tickets: FLINK-25395 
(affects only 1.15, after merging FLINK-24611 a couple of days ago; it is a 
release blocker) and FLINK-25407 (affects 1.14.x and 1.15.x, but is not as 
severe an issue). For the time being we will disable changelog state backend 
randomisation until FLINK-25395 is fixed, to reduce the number of test failures.

However, the first report was from the 1.13 branch, and I cannot see the same 
deadlock there. I cannot verify the logs from that failure, because the log 
upload failed. So most likely there is still another issue present in the 
code base (at least in the 1.13.x branch) that we have no way of analysing at 
the moment, and we will have to wait for another failure, this time with a 
successful log upload.

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Yingjie Cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463222#comment-17463222
 ] 

Yingjie Cao commented on FLINK-25185:
-

After looking into the code and the deadlock stacks, I can confirm that 
FLINK-24035 caused the deadlock. NetworkBufferPool#internalRequestMemorySegments 
may also need to acquire the 'factoryLock' in the {{catch}} block; I did not 
realize that previously. Apart from this, I am still trying to understand why we 
reached the {{catch}} block. Maybe there is another issue. Anyway, I will first 
try to fix the issue caused by FLINK-24035 and post an update if I have any new 
findings.
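
The hazard described here, a cleanup path that takes a second lock while the first is still held, can be sketched in plain Java as follows. This is an illustration under assumptions, not the Flink code: {{LockOrderSketch}}, {{poolLock}}, {{destroyPath}}, {{reservePath}} and {{isInverted}} are hypothetical names, and only {{factoryLock}} is borrowed from the comment above. The sketch runs both paths one after another on a single thread, so it never blocks; it merely records the order in which each path takes the two locks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of two code paths that take the same two locks in opposite order.
class LockOrderSketch {
    static final ReentrantLock factoryLock = new ReentrantLock(); // global pool lock
    static final ReentrantLock poolLock = new ReentrantLock();    // per-pool lock (assumed name)

    // Path 1: destroying a pool takes the global lock, then a local pool's lock.
    static List<String> destroyPath() {
        List<String> order = new ArrayList<>();
        factoryLock.lock();
        try {
            order.add("factoryLock");
            poolLock.lock();
            try {
                order.add("poolLock");
            } finally {
                poolLock.unlock();
            }
        } finally {
            factoryLock.unlock();
        }
        return order;
    }

    // Path 2: reserving segments holds the pool lock; when the request fails,
    // the catch block recycles segments, which needs the global lock.
    static List<String> reservePath() {
        List<String> order = new ArrayList<>();
        poolLock.lock();
        try {
            order.add("poolLock");
            try {
                throw new InterruptedException("simulated interrupt during the request");
            } catch (InterruptedException e) {
                factoryLock.lock();
                try {
                    order.add("factoryLock");
                } finally {
                    factoryLock.unlock();
                }
            }
        } finally {
            poolLock.unlock();
        }
        return order;
    }

    // True if the two paths acquire the same two locks in opposite order.
    static boolean isInverted(List<String> a, List<String> b) {
        return a.size() == 2 && b.size() == 2
                && a.get(0).equals(b.get(1)) && a.get(1).equals(b.get(0));
    }

    public static void main(String[] args) {
        System.out.println(destroyPath() + " vs " + reservePath()
                + " inverted=" + isInverted(destroyPath(), reservePath()));
    }
}
```

When two threads run these paths concurrently, each can end up holding the lock the other one needs, which is the classic precondition for a deadlock.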

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463093#comment-17463093
 ] 

Piotr Nowojski commented on FLINK-25185:


{quote}
I don't think so: the decision whether to re-use some state or not is made by 
the State backend, not runtime (not 
AsyncCheckpointRunnable/SubtaskCheckpointCoordinatorImpl).
(...){quote}
Ok. I thought that the {{lastUploadedSstFiles.putAll(sstFiles);}} in 
{{uploadSstFiles()}} happens in the sync part of the checkpoint process. Now I 
see it's in the async phase and it happens only once the files are actually 
uploaded.

Let's chat offline about what exactly is happening here and what your proposal 
to fix it is.

Regarding the deadlock that you posted, is it the primary issue causing those 
test failures? It looks like the deadlock might have been introduced in 
FLINK-24035. CC [~kevin.cyj]

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463056#comment-17463056
 ] 

Till Rohrmann commented on FLINK-25185:
---

Deadlock: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28393&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463055#comment-17463055
 ] 

Till Rohrmann commented on FLINK-25185:
---

Deadlock instance: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28393&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462834#comment-17462834
 ] 

Roman Khachatryan commented on FLINK-25185:
---


In later (1.15) logs, I see a deadlock in the network stack:
{code}
Java stack information for the threads listed above:
===
"Canceler for Source: Custom Source -> Filter (7/12)#14176 (0fbb8a89616ca7a40e473adad51f236f).":
   at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).":
   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585)
   - waiting to lock <0x97108898> (a java.util.ArrayDeque)
   at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544)
   at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424)
   - locked <0x82937f28> (a java.lang.Object)
   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Map -> Sink: Unnamed (7/12)#14176":
   at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja
   at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja
   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
   - locked <0x97108898> (a java.util.ArrayDeque)
   at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497)
   at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276)
   at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105)
   at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965)
   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
   at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
{code}
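
The two stacks above show a lock-order inversion: the canceler thread holds the {{java.lang.Object}} monitor and waits for the {{java.util.ArrayDeque}} monitor, while the task thread holds the {{ArrayDeque}} monitor and waits for the {{Object}} monitor. A minimal sketch of one generic remedy, acquiring both monitors in the same order on every path, is given below. All class, field, and method names in it are hypothetical stand-ins, not Flink APIs, and it is not meant to describe the fix actually applied in Flink.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a buffer pool; none of these names are Flink APIs.
final class LockOrderSketch {
    // Plays the role of the java.lang.Object monitor <0x82937f28> in the dump.
    private final Object globalLock = new Object();
    // Plays the role of the java.util.ArrayDeque monitor <0x97108898> in the dump.
    private final ArrayDeque<Integer> segments = new ArrayDeque<>();

    // Canceler-like path: global lock first, then the per-pool monitor.
    void redistribute() {
        synchronized (globalLock) {
            synchronized (segments) {
                segments.clear();
            }
        }
    }

    // Setup-like path. In the dump this path held the per-pool monitor while
    // waiting for the global one; here it uses the same order as redistribute().
    void reserve(int count) {
        synchronized (globalLock) {
            synchronized (segments) {
                for (int i = 0; i < count; i++) {
                    segments.add(i);
                }
            }
        }
    }

    // Runs both paths concurrently; returns true if both threads finish in time.
    static boolean runBothPaths(int rounds) {
        LockOrderSketch pool = new LockOrderSketch();
        CountDownLatch done = new CountDownLatch(2);
        Thread canceler = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                pool.redistribute();
            }
            done.countDown();
        });
        Thread task = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                pool.reserve(4);
            }
            done.countDown();
        });
        canceler.setDaemon(true);
        task.setDaemon(true);
        canceler.start();
        task.start();
        try {
            return done.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("finished: " + runBothPaths(10_000));
    }
}
```

With a single acquisition order, neither thread can hold one monitor while waiting for the other in the opposite order, so the cycle reported by the JVM cannot form.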


> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462822#comment-17462822
 ] 

Roman Khachatryan commented on FLINK-25185:
---

I've just noticed that the issue is also reported for 1.13, which doesn't use 
the changelog state backend.
I created FLINK-25395 for the issue discussed above (removing incremental state 
in AsyncCheckpointRunnable).

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462709#comment-17462709
 ] 

Roman Khachatryan commented on FLINK-25185:
---

> Roman Khachatryan, do you mean 
> SubtaskCheckpointCoordinatorImpl#cancelAsyncCheckpointRunnable being invoked 
> and the uploads being cancelled?
Yes, or any other case in which AsyncCheckpointRunnable.cleanup() is invoked; for 
example, when reporting to the JM fails.

> Doesn't it point to a larger problem? That future checkpoints in general can 
> be deemed as completed, even if previous async phases are still uploading 
> some of the files that those future checkpoints are referencing?

I don't think so: the decision whether to re-use some state or not is made by 
the State backend, not runtime (not 
AsyncCheckpointRunnable/SubtaskCheckpointCoordinatorImpl).
Both RocksDB and Changelog consider state as re-usable only once the upload 
finishes (RocksIncrementalSnapshotStrategy.lastUploadedSstFiles or 
FsStateChangelogWriter.uploaded is updated).
For a concurrent checkpoint, RocksDB will re-upload the state, and 
changelog will wait for upload completion in UploadCompletionListener.
Does this make sense?
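
To illustrate the rule described above (state becomes re-usable only once its upload has finished), here is a minimal sketch. {{UploadTrackerSketch}} and its methods are hypothetical; they only mimic the idea behind {{lastUploadedSstFiles}} / {{uploaded}} and are not the actual backend code.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: a file is re-usable only after its upload completed normally.
final class UploadTrackerSketch {
    private final Set<String> uploaded = ConcurrentHashMap.newKeySet();

    // Registers the file once the upload future completes without error.
    // A cancelled or failed upload never reaches the "uploaded" set.
    CompletableFuture<Void> track(String file, CompletableFuture<Void> upload) {
        return upload.thenRun(() -> uploaded.add(file));
    }

    // A concurrent checkpoint may reference the file only if this returns true;
    // otherwise it has to re-upload or wait for completion.
    boolean canReuse(String file) {
        return uploaded.contains(file);
    }

    public static void main(String[] args) {
        UploadTrackerSketch tracker = new UploadTrackerSketch();
        CompletableFuture<Void> upload = new CompletableFuture<>();
        tracker.track("sst-1", upload);
        System.out.println("before completion: " + tracker.canReuse("sst-1"));
        upload.complete(null);
        System.out.println("after completion: " + tracker.canReuse("sst-1"));
    }
}
```

In this sketch a cancelled upload leaves the set untouched, so a later checkpoint would not reference the file.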

> It seems like neither of those problems will be easy to fix?
The only easy fix I see is to never discard shared state in 
IncrementalRemoteKeyedStateHandle.discardState and rely on checkpoint 
subsumption or job termination for the cleanup. In any case where the state 
didn't reach the JM, it will be left orphaned (e.g. checkpoint aborted and backend 
materialized, not reporting this state again).
The orphaned-files problem should be mitigated by FLINK-24852.

A proper fix, I think, would be private/shared state separation and a TM-side 
registry (FLINK-23139 and related tickets).
As the proper fix is much more invasive, I'd choose the easy fix for the upcoming 
release (I still need to confirm this is the root cause).

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462658#comment-17462658
 ] 

Piotr Nowojski commented on FLINK-25185:


{quote}
When a checkpoint is aborted, TM will try to discard in-progress uploads.
{quote}
[~roman], do you mean 
{{SubtaskCheckpointCoordinatorImpl#cancelAsyncCheckpointRunnable}} being 
invoked and the uploads being cancelled? 

Doesn't it point to a larger problem? That future checkpoints in general can be 
deemed as completed, even if previous async phases are still uploading some of 
the files that those future checkpoints are referencing? 
{quote}
This state can't be re-used for future checkpoints.
{quote}
It's probably not only about "future" checkpoints in the sense of not-yet-triggered 
ones, but about any subsequent checkpoints, some of which might already be in 
progress.

It seems like neither of those problems will be easy to fix?

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462639#comment-17462639
 ] 

Roman Khachatryan commented on FLINK-25185:
---

I think it might be related to checkpoint abortion and incremental state:
When a checkpoint is aborted, TM will try to discard in-progress uploads. This 
state can't be re-used for future checkpoints.
 
Prior to FLINK-24611, this worked for RocksDB, because the RocksDB backend would 
wait for JM confirmation before trying to reuse the state.
After FLINK-24611, this breaks any incremental backend (rocksdb and changelog, 
though the latter is more likely to fail).
WDYT [~pnowojski]?

I'll try to validate this locally. If the cause is different, I'll open a new 
bug for the above FLINK-24611 related issue.
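
The suspected sequence can be sketched as follows, assuming a later checkpoint already references a shared file when the earlier checkpoint is aborted and its upload is discarded. This is only a toy model using temporary local files; {{AbortedUploadSketch}} is a hypothetical name and none of it is Flink code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Toy model of the suspected race; a temp file stands in for shared state.
final class AbortedUploadSketch {
    // Returns true if reading the state referenced by the later checkpoint
    // fails because the aborted checkpoint already discarded the file.
    static boolean restoreFailsAfterAbort() {
        try {
            // Checkpoint 1 uploads a shared file.
            Path shared = Files.createTempFile("shared-state-", ".bin");
            Files.write(shared, new byte[] {1, 2, 3});

            // Checkpoint 2 re-uses the same file instead of uploading again.
            Path referencedByCheckpoint2 = shared;

            // Checkpoint 1 is aborted and its state is discarded.
            Files.delete(shared);

            // Restoring from checkpoint 2 now hits a missing file.
            try {
                Files.readAllBytes(referencedByCheckpoint2);
                return false;
            } catch (NoSuchFileException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("restore fails: " + restoreFailsAfterAbort());
    }
}
```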

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462554#comment-17462554
 ] 

Piotr Nowojski commented on FLINK-25185:


It looks like those tests were stuck in an endless loop, unable to 
allocate enough slots to run the job:

{noformat}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not acquire the minimum required resources.
06:42:22,189 [flink-akka.actor.default-dispatcher-7] WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Could not fulfill resource requirements of job 
5a5ac441318e8085606c78b40c3a2f25.
06:42:22,189 [flink-akka.actor.default-dispatcher-7] WARN  
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - 
Could not acquire the minimum required resources, failing slot requests. 
Acquired: 
[ResourceRequirement{resourceProfile=ResourceProfile{taskHeapMemory=256.000gb 
(274877906944 bytes), taskOffHeapMemory=256.000gb (274877906944 bytes), 
managedMemory=20.000mb (20971520 bytes), networkMemory=16.000mb (16777216 
bytes)}, numberOfRequiredSlots=8}]. Current slot pool status: Registered TMs: 
2, registered slots: 8 free slots: 0
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not acquire the minimum required resources.
06:42:22,259 [flink-akka.actor.default-dispatcher-9] WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Could not fulfill resource requirements of job 
5a5ac441318e8085606c78b40c3a2f25.
06:42:22,259 [flink-akka.actor.default-dispatcher-9] WARN  
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - 
Could not acquire the minimum required resources, failing slot requests. 
Acquired: 
[ResourceRequirement{resourceProfile=ResourceProfile{taskHeapMemory=256.000gb 
(274877906944 bytes), taskOffHeapMemory=256.000gb (274877906944 bytes), 
managedMemory=20.000mb (20971520 bytes), networkMemory=16.000mb (16777216 
bytes)}, numberOfRequiredSlots=8}]. Current slot pool status: Registered TMs: 
2, registered slots: 8 free slots: 0
org.apache.flink.runtime.j
{noformat}

It's very hard to say, but it looks like (one of?) the first failures was this 
one:

{noformat}
04:06:26,659 [Map -> Sink: Unnamed (9/12)#1] WARN  
org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - 
Exception while restoring keyed state backend for 
StreamMap_dc2290bb6f8f5cd2bd425368843494fe_(9/12) from alternative (1/1), will 
retry while more alternatives are available.
java.lang.RuntimeException: java.io.FileNotFoundException: 
/tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
 (No such file or directory)
at 
org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:319) 
~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:87)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.hasNext(StateChangelogHandleStreamHandleReader.java:69)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.readBackendHandle(ChangelogBackendRestoreOperation.java:92)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:74)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:221)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.ChangelogStateBackend.createKeyedStateBackend(ChangelogStateBackend.java:145)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitiali

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462475#comment-17462475
 ] 

Till Rohrmann commented on FLINK-25185:
---

cc [~pnowojski]

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-17 Thread Yun Gao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461471#comment-17461471
 ] 

Yun Gao commented on FLINK-25185:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=19832

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-16 Thread Yun Gao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461237#comment-17461237
 ] 

Yun Gao commented on FLINK-25185:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=19003

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Test Infrastructure
>Affects Versions: 1.13.3
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907396Z 0x7f21f8004000, 0x7f2304012800, 
> 0x000