[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-29 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Component/s: Runtime / Coordination
 (was: Runtime / State Backends)
 (was: Test Infrastructure)

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907396Z 0x7f21f8004000, 0x7f2304012800, 
> 0x7f230001b000, 0x7f223c011000,
> 2021-12-06T04:24:49.19

[jira] [Commented] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs

2021-12-29 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466423#comment-17466423
 ] 

Piotr Nowojski commented on FLINK-22643:


Currently it looks like nobody is actively working on this feature, [~fanrui]. 
Maybe [~Thesharing] would like to pick up this work? If not, judging by the 
POC provided by [~Thesharing], the change itself is quite simple and anyone 
could pick it up. Is this something you would like to work on, [~fanrui]?

From my perspective (assuming the linked POC works properly), the only 
remaining thing to do would be picking the default value. It would be great 
to run some benchmarks with different settings (1, 5, Integer.MAX_VALUE) to 
select it. However, this can always be split off into a follow-up ticket, and 
as a first step we can preserve the current behaviour by selecting 
{{Integer.MAX_VALUE}} as the default option.

> Too many TCP connections among TaskManagers for large scale jobs
> 
>
> Key: FLINK-22643
> URL: https://issues.apache.org/jira/browse/FLINK-22643
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.14.0, 1.13.2
>Reporter: Zhilong Hong
>Priority: Minor
>  Labels: auto-deprioritized-major
>
> For large scale jobs, there will be too many TCP connections among 
> TaskManagers. Let's take an example.
> Consider a streaming job with 20 JobVertices, each with a parallelism of 500. 
> We divide the vertices into 5 slot sharing groups and each TaskManager has 5 
> slots, so the job needs 400 TaskManagers. Let's assume the job runs on a 
> cluster with 20 machines.
> If all the job edges are all-to-all edges, there will be 19 * 20 * 399 * 2 = 
> 303,240 TCP connections on each machine. If we run several jobs on this 
> cluster, the number of TCP connections may exceed the Linux maximum of 
> 1,048,576. This stops the TaskManagers from creating new TCP connections and 
> causes task failovers.
> As we run our production jobs on a K8s cluster, the jobs keep failing over 
> due to network-related exceptions such as {{Sending the partition request to 
> 'null' failed}}.
> We think we can decrease the number of connections by letting tasks reuse the 
> same connection. We implemented a POC that makes all tasks on the same 
> TaskManager reuse one TCP connection. For the example job mentioned above, 
> the number of connections decreases from 303,240 to 15,960. With the POC, the 
> frequency of network-related exceptions in our production jobs drops 
> significantly.
> The POC is illustrated in: 
> https://github.com/wsry/flink/commit/bf1c09e80450f40d018a1d1d4fe3dfd2de777fdc
>  
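The connection arithmetic in the description above can be sketched as follows. This is only an illustration of the example numbers (20 TMs per machine, 399 remote TMs, 19 all-to-all edges); the class and method names are hypothetical, not Flink code:

```java
// Sketch of the connection arithmetic from the example above (not Flink code).
public class ConnectionEstimate {

    // Each TM on a machine talks to every other TM over every all-to-all
    // job edge, with separate inbound and outbound connections.
    static long connectionsPerMachine(int jobEdges, int tmsPerMachine, int otherTms) {
        return (long) jobEdges * tmsPerMachine * otherTms * 2;
    }

    public static void main(String[] args) {
        // 19 all-to-all edges, 20 TMs per machine, 399 remote TMs
        System.out.println(connectionsPerMachine(19, 20, 399)); // 303240
        // With one shared TCP connection per TM pair, the edge factor drops out
        System.out.println(connectionsPerMachine(1, 20, 399));  // 15960
    }
}
```

This matches the 303,240 → 15,960 reduction claimed for the POC.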



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-25417) Too many connections for TM

2021-12-29 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-25417.
--
Resolution: Duplicate

> Too many connections for TM
> ---
>
> Key: FLINK-25417
> URL: https://issues.apache.org/jira/browse/FLINK-25417
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.15.0, 1.13.5, 1.14.2
>Reporter: fanrui
>Priority: Major
> Attachments: image-2021-12-22-19-17-59-486.png, 
> image-2021-12-22-19-18-23-138.png
>
>
> Hi masters, when the number of tasks exceeds 10, some TMs have more than 
> 4000 TCP connections.
> !image-2021-12-22-19-17-59-486.png|width=1388,height=307!
>  
> h2. Reason:
> When a task is initialized, the downstream InputChannel connects to the 
> upstream ResultPartition.
> PartitionRequestClientFactory#createPartitionRequestClient keeps a cache of 
> clients ({{ConcurrentMap<ConnectionID, 
> CompletableFuture<NettyPartitionRequestClient>> clients}}) to avoid repeated 
> TCP connections. But ConnectionID contains a connectionIndex field.
> The connectionIndex comes from the IntermediateResult and is a random number. 
> When multiple tasks are running in a TM, other TMs need to establish multiple 
> connections to this TM, and each task has its own NettyPartitionRequestClient.
> Assume that the parallelism of the Flink job is 100, each TM has 20 tasks, 
> and the partition strategy between tasks is rebalance or hash. Then the 
> number of connections for a single TM is (20-1) * 100 * 2 = 3800. If multiple 
> such TMs are running on a single node, there is a risk.
>  
> I want to know whether it is risky to change the cache key to 
> connectionID.address, i.e. to share one TCP connection between all tasks of 
> a TM. 
> I guess it is feasible because:
>  # I have tested it and the tasks can run normally.
>  # Each message contains the InputChannelID, which is used to distinguish 
> which channel a NettyMessage belongs to.
>  
> !image-2021-12-22-19-18-23-138.png|width=2953,height=686!
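A minimal sketch of the proposed cache-key change, keying the client cache by remote address only so all tasks talking to the same TM share one connection. The class and method names here are hypothetical stand-ins for illustration, not actual Flink code:

```java
import java.net.InetSocketAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: cache clients by remote address instead of by the
// full ConnectionID (address + connectionIndex).
public class ClientCacheSketch {

    // Stand-in for NettyPartitionRequestClient
    static class Client {}

    // Keyed by address only, so the random connectionIndex no longer
    // multiplies the number of connections per TM pair.
    private final ConcurrentMap<InetSocketAddress, Client> clients = new ConcurrentHashMap<>();

    Client getOrCreate(InetSocketAddress address) {
        return clients.computeIfAbsent(address, a -> new Client());
    }

    public static void main(String[] args) {
        ClientCacheSketch cache = new ClientCacheSketch();
        InetSocketAddress tm = InetSocketAddress.createUnresolved("tm-1", 6121);
        // Two "tasks" requesting the same remote TM now share a single client
        System.out.println(cache.getOrCreate(tm) == cache.getOrCreate(tm)); // true
    }
}
```

Demultiplexing by InputChannelID, as the description notes, is what makes sharing one physical connection feasible.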





[jira] [Updated] (FLINK-25414) Provide metrics to measure how long task has been blocked

2021-12-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25414:
---
Description: 
Currently the back-pressured/busy metrics tell the user whether a task is 
blocked/busy and what percentage of the time it is blocked/busy. But they do 
not tell how long a single blocking event lasts. It can be 1ms or 1h and back 
pressure/busy would still report 100%.

In order to improve this, we could provide two new metrics:
# maxSoftBackPressureTime
# maxHardBackPressureTime

The max would be reset to 0 periodically or on every access to the metric (via 
the metric reporter). Soft back pressure would be when a task is back pressured 
in a non-blocking fashion (StreamTask detected unavailability of the output). 
Hard back pressure would measure the time the task is actually blocked.

In order to calculate those metrics I'm proposing to split the already existing 
backPressuredTimeMsPerSecond into soft and hard versions as well.

Unfortunately I don't know how to efficiently provide a similar metric for busy 
time without impacting max throughput.

  was:
Currently the back-pressured/busy metrics tell the user whether a task is 
blocked/busy and what percentage of the time it is blocked/busy. But they do 
not tell how long a single blocking event lasts. It can be 1ms or 1h and back 
pressure/busy would still report 100%.

In order to improve this, we could provide two new metrics:
# maxSoftBackPressureDuration
# maxHardBackPressureDuration

The max would be reset to 0 periodically or on every access to the metric (via 
the metric reporter). Soft back pressure would be when a task is back pressured 
in a non-blocking fashion (StreamTask detected unavailability of the output). 
Hard back pressure would measure the time the task is actually blocked.

Unfortunately I don't know how to efficiently provide a similar metric for busy 
time without impacting max throughput.


> Provide metrics to measure how long task has been blocked
> -
>
> Key: FLINK-25414
> URL: https://issues.apache.org/jira/browse/FLINK-25414
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.14.2
>Reporter: Piotr Nowojski
>Assignee: Piotr Nowojski
>Priority: Major
>  Labels: pull-request-available
>
> Currently the back-pressured/busy metrics tell the user whether a task is 
> blocked/busy and what percentage of the time it is blocked/busy. But they do 
> not tell how long a single blocking event lasts. It can be 1ms or 1h and 
> back pressure/busy would still report 100%.
> In order to improve this, we could provide two new metrics:
> # maxSoftBackPressureTime
> # maxHardBackPressureTime
> The max would be reset to 0 periodically or on every access to the metric 
> (via the metric reporter). Soft back pressure would be when a task is back 
> pressured in a non-blocking fashion (StreamTask detected unavailability of 
> the output). Hard back pressure would measure the time the task is actually 
> blocked.
> In order to calculate those metrics I'm proposing to split the already 
> existing backPressuredTimeMsPerSecond into soft and hard versions as well.
> Unfortunately I don't know how to efficiently provide a similar metric for 
> busy time without impacting max throughput.
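The reset-on-access maximum described above could be tracked roughly as follows. This is a hypothetical sketch of the idea, not Flink's actual implementation; the class and method names are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a "max blocked time" gauge that resets on access,
// as proposed for maxSoftBackPressureTime / maxHardBackPressureTime.
public class MaxBackPressureGauge {

    private final AtomicLong maxMillis = new AtomicLong();

    // Called by the task thread after each blocking event
    void recordBlockedMillis(long millis) {
        maxMillis.accumulateAndGet(millis, Math::max);
    }

    // Called by the metric reporter; returns the max seen since the last call
    long getValueAndReset() {
        return maxMillis.getAndSet(0L);
    }

    public static void main(String[] args) {
        MaxBackPressureGauge gauge = new MaxBackPressureGauge();
        gauge.recordBlockedMillis(5);
        gauge.recordBlockedMillis(120);
        System.out.println(gauge.getValueAndReset()); // 120
        System.out.println(gauge.getValueAndReset()); // 0
    }
}
```

Atomic accumulate-and-get keeps the hot path to a single CAS loop, which is why a max (unlike a busy-time duration histogram) can be cheap enough for the task thread.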





[jira] [Assigned] (FLINK-25414) Provide metrics to measure how long task has been blocked

2021-12-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-25414:
--

Assignee: Piotr Nowojski

> Provide metrics to measure how long task has been blocked
> -
>
> Key: FLINK-25414
> URL: https://issues.apache.org/jira/browse/FLINK-25414
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.14.2
>Reporter: Piotr Nowojski
>Assignee: Piotr Nowojski
>Priority: Major
>
> Currently the back-pressured/busy metrics tell the user whether a task is 
> blocked/busy and what percentage of the time it is blocked/busy. But they do 
> not tell how long a single blocking event lasts. It can be 1ms or 1h and 
> back pressure/busy would still report 100%.
> In order to improve this, we could provide two new metrics:
> # maxSoftBackPressureDuration
> # maxHardBackPressureDuration
> The max would be reset to 0 periodically or on every access to the metric 
> (via the metric reporter). Soft back pressure would be when a task is back 
> pressured in a non-blocking fashion (StreamTask detected unavailability of 
> the output). Hard back pressure would measure the time the task is actually 
> blocked.
> Unfortunately I don't know how to efficiently provide a similar metric for 
> busy time without impacting max throughput.





[jira] [Assigned] (FLINK-25407) Network stack deadlock when cancellation happens during initialisation

2021-12-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-25407:
--

Assignee: Yingjie Cao

> Network stack deadlock when cancellation happens during initialisation
> --
>
> Key: FLINK-25407
> URL: https://issues.apache.org/jira/browse/FLINK-25407
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Piotr Nowojski
>Assignee: Yingjie Cao
>Priority: Critical
>
> This issue was extracted from and initially reported in FLINK-25185. It is 
> most likely caused by FLINK-24035.
> {noformat}
> Java stack information for the threads listed above:
> ===
> "Canceler for Source: Custom Source -> Filter (7/12)#14176 
> (0fbb8a89616ca7a40e473adad51f236f).":
>at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
>- waiting to lock <0x82937f28> (a java.lang.Object)
>at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
>at 
> org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
>at 
> org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
>at 
> org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
>at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
>at 
> org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
>at java.lang.Thread.run(Thread.java:748)
> "Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).":
>at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585)
>- waiting to lock <0x97108898> (a java.util.ArrayDeque)
>at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544)
>at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424)
>- locked <0x82937f28> (a java.lang.Object)
>at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
>at 
> org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
>at 
> org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
>at 
> org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
>at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
>at 
> org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
>at java.lang.Thread.run(Thread.java:748)
> "Map -> Sink: Unnamed (7/12)#14176":
>at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256)
>- waiting to lock <0x82937f28> (a java.lang.Object)
>at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja
>at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja
>at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
>- locked <0x97108898> (a java.util.ArrayDeque)
>at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497)
>at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276)
>at 
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105)
>at 
> org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965)
>at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652)
>at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
>at java.lang.Thread.run(Thread.java:748)
> Found 1 deadlock.
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=19003
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=19832
> CC [~kevin.cyj]
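The trace above is a classic lock-order inversion: the canceler path takes the NetworkBufferPool's lock before the LocalBufferPool's buffer-queue lock, while the setup path takes them in the opposite order. A minimal illustration of that pattern (not Flink code; the lock objects and method names are stand-ins):

```java
// Minimal illustration of the lock-order inversion from the stack trace above.
// The lock objects are stand-ins, not Flink code.
public class LockOrderSketch {

    static final Object poolLock = new Object();        // NetworkBufferPool's internal lock
    static final Object bufferQueueLock = new Object(); // LocalBufferPool's ArrayDeque

    // Canceler: destroyBufferPool() -> redistributeBuffers() -> setNumBuffers()
    static void cancelerPath() {
        synchronized (poolLock) {
            synchronized (bufferQueueLock) {
                // redistribute buffers
            }
        }
    }

    // Task setup: reserveSegments() -> requestMemorySegmentsBlocking()
    //             -> recycleMemorySegments()
    static void setupPath() {
        synchronized (bufferQueueLock) {
            synchronized (poolLock) {
                // recycle segments
            }
        }
    }

    static boolean runSequentially() {
        // Run one after the other: fine. But two threads that each enter the
        // FIRST synchronized block of their path and then wait for the other
        // lock deadlock, exactly as in the thread dump.
        cancelerPath();
        setupPath();
        return true;
    }

    public static void main(String[] args) {
        System.out.println(runSequentially()); // true
    }
}
```

The usual fix is to acquire the two locks in one globally consistent order, or to release the inner lock before calling back into the pool.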





[jira] [Created] (FLINK-25414) Provide metrics to measure how long task has been blocked

2021-12-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25414:
--

 Summary: Provide metrics to measure how long task has been blocked
 Key: FLINK-25414
 URL: https://issues.apache.org/jira/browse/FLINK-25414
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics, Runtime / Task
Affects Versions: 1.14.2
Reporter: Piotr Nowojski


Currently the back-pressured/busy metrics tell the user whether a task is 
blocked/busy and what percentage of the time it is blocked/busy. But they do 
not tell how long a single blocking event lasts. It can be 1ms or 1h and back 
pressure/busy would still report 100%.

In order to improve this, we could provide two new metrics:
# maxSoftBackPressureDuration
# maxHardBackPressureDuration

The max would be reset to 0 periodically or on every access to the metric (via 
the metric reporter). Soft back pressure would be when a task is back pressured 
in a non-blocking fashion (StreamTask detected unavailability of the output). 
Hard back pressure would measure the time the task is actually blocked.

Unfortunately I don't know how to efficiently provide a similar metric for busy 
time without impacting max throughput.





[jira] [Updated] (FLINK-25395) FileNotFoundException during recovery caused by Incremental shared state being discarded by TM

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25395:
---
Priority: Blocker  (was: Critical)

> FileNotFoundException during recovery caused by Incremental shared state 
> being discarded by TM
> --
>
> Key: FLINK-25395
> URL: https://issues.apache.org/jira/browse/FLINK-25395
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.15.0
>Reporter: Roman Khachatryan
>Priority: Blocker
> Fix For: 1.15.0
>
>
> Extracting from [FLINK-25185 
> discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]
> On checkpoint abortion or any failure in AsyncCheckpointRunnable,
> it discards the state, in particular shared (incremental) state.
> Since FLINK-24611, this creates a problem because shared state can be re-used 
> for future checkpoints. 
> Needs confirmation.
> A likely symptom of this failure is the following exception during recovery:
> {noformat}
> Caused by: java.io.FileNotFoundException: 
> /tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
> at java.io.FileInputStream.open(FileInputStream.java:195) 
> ~[?:1.8.0_292]
> at java.io.FileInputStream.<init>(FileInputStream.java:138) 
> ~[?:1.8.0_292]
> at 
> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>  ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
> ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
>  ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
>  ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92)
>  ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
>  ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> {noformat}





[jira] [Comment Edited] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463325#comment-17463325
 ] 

Piotr Nowojski edited comment on FLINK-25185 at 12/21/21, 4:03 PM:
---

After an offline discussion with [~roman] and some further analysis, this is 
what we think is happening on the 1.15 branch.

# The test hits a {{FileNotFoundException}}, probably caused by FLINK-25395.
# The test ends up in an infinite restart loop, where each restart attempt hits 
{{FileNotFoundException}}.
# After tens of thousands of restart attempts and cancellations (for example in 
attempt #14176, as [commented in Roman's post|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462834]), 
this endless cycle of restarts and cancellations causes the FLINK-25407 
deadlock to surface. 
# From then on, StreamFaultToleranceTestBase ends up in yet another infinite 
restart loop, but this time scheduling fails with "Could not acquire the 
minimum required resources.", because one TaskManager is stuck in the deadlock 
and we are therefore missing the resources to restart the job.

We have extracted those two issues into independent tickets: FLINK-25395 
(affects only 1.15, after FLINK-24611 was merged a couple of days ago; it's a 
release blocker) and FLINK-25407 (affects 1.14.x and 1.15.x, but is not as 
severe). For the time being we will disable changelog state backend 
randomisation until FLINK-25395 is fixed, to reduce the number of test 
failures.

However, the first report was from the 1.13 branch, and I cannot see the same 
deadlock there. I cannot verify the logs from that failure, because the logs 
upload failed. So most likely there is still another issue present in the code 
base (at least on the 1.13.x branch) that we have no way of analysing at the 
moment, and we will have to wait for another failure with a successful logs 
upload.


was (Author: pnowojski):
After an offline discussion with [~roman] and some further analysis, this is 
what we think is happening on the 1.15 branch.

# The test hits a {{FileNotFoundException}}, probably caused by FLINK-25395.
# The test ends up in an infinite restart loop, where each restart attempt hits 
{{FileNotFoundException}}.
# After tens of thousands of restart attempts and cancellations (for example in 
attempt #14176, as [commented in Roman's post|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462834]), 
this endless cycle of restarts and cancellations causes the FLINK-25407 
deadlock to surface.
# From then on, StreamFaultToleranceTestBase ends up in yet another infinite 
restart loop, but this time scheduling fails with "Could not acquire the 
minimum required resources."

We have extracted those two issues into independent tickets: FLINK-25395 
(affects only 1.15, after FLINK-24611 was merged a couple of days ago; it's a 
release blocker) and FLINK-25407 (affects 1.14.x and 1.15.x, but is not as 
severe). For the time being we will disable changelog state backend 
randomisation until FLINK-25395 is fixed, to reduce the number of test 
failures.

However, the first report was from the 1.13 branch, and I cannot see the same 
deadlock there. I cannot verify the logs from that failure, because the logs 
upload failed. So most likely there is still another issue present in the code 
base (at least on the 1.13.x branch) that we have no way of analysing at the 
moment, and we will have to wait for another failure with a successful logs 
upload.


[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463325#comment-17463325
 ] 

Piotr Nowojski commented on FLINK-25185:


After an offline discussion with [~roman] and some further analysis, this is 
what we think is happening on the 1.15 branch.

# The test hits a {{FileNotFoundException}}, probably caused by FLINK-25395.
# The test ends up in an infinite restart loop, where each restart attempt hits 
{{FileNotFoundException}}.
# After tens of thousands of restart attempts and cancellations (for example in 
attempt #14176, as [commented in Roman's post|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462834]), 
this endless cycle of restarts and cancellations causes the FLINK-25407 
deadlock to surface.
# From then on, StreamFaultToleranceTestBase ends up in yet another infinite 
restart loop, but this time scheduling fails with "Could not acquire the 
minimum required resources."

We have extracted those two issues into independent tickets: FLINK-25395 
(affects only 1.15, after FLINK-24611 was merged a couple of days ago; it's a 
release blocker) and FLINK-25407 (affects 1.14.x and 1.15.x, but is not as 
severe). For the time being we will disable changelog state backend 
randomisation until FLINK-25395 is fixed, to reduce the number of test 
failures.

However, the first report was from the 1.13 branch, and I cannot see the same 
deadlock there. I cannot verify the logs from that failure, because the logs 
upload failed. So most likely there is still another issue present in the code 
base (at least on the 1.13.x branch) that we have no way of analysing at the 
moment, and we will have to wait for another failure with a successful logs 
upload.


[jira] [Updated] (FLINK-25407) Network stack deadlock when cancellation happens during initialisation

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25407:
---
Description: 
This issue was extracted from and initially reported in FLINK-25185. It is most 
likely caused by FLINK-24035.

{noformat}
Java stack information for the threads listed above:
===
"Canceler for Source: Custom Source -> Filter (7/12)#14176 
(0fbb8a89616ca7a40e473adad51f236f).":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).":
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585)
   - waiting to lock <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424)
   - locked <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Map -> Sink: Unnamed (7/12)#14176":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
   - locked <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276)
   at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105)
   at 
org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965)
   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
   at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
{noformat}
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=19003
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=19832
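
The dump above shows a classic lock-order inversion: the canceler threads take the NetworkBufferPool's global lock (`<0x82937f28>`) before each LocalBufferPool's segment queue (`<0x97108898>`), while a task in setup acquires them in the opposite order. A minimal, hypothetical Java sketch of that inversion (class and field names invented for illustration; only the locking pattern mirrors the report):

```java
import java.util.ArrayDeque;

// Hypothetical sketch of the lock-order inversion visible in the dump.
public class LockInversionSketch {
    // Plays the role of the pool-wide lock <0x82937f28>.
    static final Object factoryLock = new Object();
    // Plays the role of the per-pool segment queue locked as <0x97108898>.
    static final ArrayDeque<byte[]> availableSegments = new ArrayDeque<>();

    // Canceler path (destroyBufferPool -> redistributeBuffers -> setNumBuffers):
    // global lock first, then the local queue.
    static void destroyPath() {
        synchronized (factoryLock) {
            synchronized (availableSegments) {
                availableSegments.clear();
            }
        }
    }

    // Task-setup path (reserveSegments -> recycleMemorySegments):
    // local queue first, then the global lock -- the reverse order.
    // Two threads running destroyPath() and setupPath() concurrently can
    // each grab their first lock and then block forever on the second,
    // which is exactly the deadlock the JVM reports above.
    static void setupPath() {
        synchronized (availableSegments) {
            synchronized (factoryLock) {
                availableSegments.add(new byte[1024]);
            }
        }
    }
}
```

Run single-threaded, both paths complete; the deadlock only appears when the two paths interleave across threads.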

CC [~kevin.cyj]

  was:
{noformat}
Java stack information for the threads listed above:
===
"Canceler for Source: Custom Source -> Filter (7/12)#14176 
(0fbb8a89616ca7a40e473adad51f236f).":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$Ta

[jira] [Created] (FLINK-25407) Network stack deadlock when cancellation happens during initialisation

2021-12-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25407:
--

 Summary: Network stack deadlock when cancellation happens during 
initialisation
 Key: FLINK-25407
 URL: https://issues.apache.org/jira/browse/FLINK-25407
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0, 1.15.0
Reporter: Piotr Nowojski


{noformat}
Java stack information for the threads listed above:
===
"Canceler for Source: Custom Source -> Filter (7/12)#14176 
(0fbb8a89616ca7a40e473adad51f236f).":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).":
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585)
   - waiting to lock <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424)
   - locked <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Map -> Sink: Unnamed (7/12)#14176":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
   - locked <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276)
   at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105)
   at 
org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965)
   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
   at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
{noformat}
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=19003
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=19832



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Affects Version/s: 1.13.3

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907396Z 0x7f21f8004000, 0x7f2304012800, 
> 0x7f230001b000, 0x7f223c011000,
> 2021-12-06T04:24:49.1908080Z 0x7f24e40c1800, 0x7f2454001000, 
> 0x7f24e40c3000, 0x7f2454003000

[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Affects Version/s: (was: 1.13.3)

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907396Z 0x7f21f8004000, 0x7f2304012800, 
> 0x7f230001b000, 0x7f223c011000,
> 2021-12-06T04:24:49.1908080Z 0x7f24e40c1800, 0x7f2454001000, 
> 0x7f24e40c3000, 0x7f2454003

[jira] [Updated] (FLINK-25395) FileNotFoundException during recovery caused by Incremental shared state being discarded by TM

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25395:
---
Summary: FileNotFoundException during recovery caused by Incremental shared 
state being discarded by TM  (was: Incremental shared state might be discarded 
by TM)

> FileNotFoundException during recovery caused by Incremental shared state 
> being discarded by TM
> --
>
> Key: FLINK-25395
> URL: https://issues.apache.org/jira/browse/FLINK-25395
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.15.0
>Reporter: Roman Khachatryan
>Priority: Critical
> Fix For: 1.15.0
>
>
> Extracting from [FLINK-25185 
> discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]
> On checkpoint abortion or any failure in AsyncCheckpointRunnable,
> it discards the state, in particular shared (incremental) state.
> Since FLINK-24611, this creates a problem because shared state can be re-used 
> for future checkpoints. 
> Needs confirmation.
> Likely symptom of this failure is the following exception during recovery:
> {noformat}
> Caused by: java.io.FileNotFoundException: 
> /tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
> at java.io.FileInputStream.open(FileInputStream.java:195) 
> ~[?:1.8.0_292]
> at java.io.FileInputStream.<init>(FileInputStream.java:138) 
> ~[?:1.8.0_292]
> at 
> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>  ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
> ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
>  ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
>  ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92)
>  ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
>  ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> {noformat}
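
The mechanism described above, an aborted checkpoint discarding files that a later incremental checkpoint still references, can be reproduced in miniature. A hypothetical sketch, using plain `java.nio` files in place of Flink's shared state handles (names invented for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical miniature of the failure mode: checkpoint N's cleanup
// deletes a shared file that checkpoint N+1 still references, so
// "recovery" from N+1 fails with a missing-file exception.
public class SharedStateDiscardSketch {
    static boolean recoveryFails() throws IOException {
        Path shared = Files.createTempFile("shared-state", ".bin");
        Files.write(shared, new byte[]{1, 2, 3});

        // Checkpoint N+1 re-uses (references) the shared segment.
        Path referencedByN1 = shared;

        // Checkpoint N is aborted and, incorrectly, discards all of its
        // state -- including the segment shared with N+1.
        Files.deleteIfExists(shared);

        // Recovering from N+1 now tries to read the discarded file.
        try {
            Files.readAllBytes(referencedByN1);
            return false; // only reached if the file still existed
        } catch (NoSuchFileException e) {
            return true;  // the FileNotFoundException-style failure
        }
    }
}
```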





[jira] [Updated] (FLINK-25395) Incremental shared state might be discarded by TM

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25395:
---
Description: 
Extracting from [FLINK-25185 
discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]

On checkpoint abortion or any failure in AsyncCheckpointRunnable,
it discards the state, in particular shared (incremental) state.

Since FLINK-24611, this creates a problem because shared state can be re-used 
for future checkpoints. 

Needs confirmation.

Likely symptom of this failure is the following exception during recovery:

{noformat}
Caused by: java.io.FileNotFoundException: 
/tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
 (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_292]
at java.io.FileInputStream.<init>(FileInputStream.java:138) 
~[?:1.8.0_292]
at 
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92) 
~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
{noformat}

  was:
Extracting from [FLINK-25185 
discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]

On checkpoint abortion or any failure in AsyncCheckpointRunnable,
it discards the state, in particular shared (incremental) state.

Since FLINK-24611, this creates a problem because shared state can be re-used 
for future checkpoints. 

Needs confirmation.

Likely symptom of this failure is the following exception during recovery:
{preformat}
Caused by: java.io.FileNotFoundException: 
/tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
 (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_292]
at java.io.FileInputStream.<init>(FileInputStream.java:138) 
~[?:1.8.0_292]
at 
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92) 
~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
{preformat}


> Incremental shared state might be discarded by TM
> -
>
> Key: FLINK-25395
> URL: https://issues.apache.org/jira/browse/FLINK-25395
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.15.0
>Reporter: Roman Khachatryan
>Priority: Critical
> Fix For: 1.15.0
>
>
> Extracting from [FLINK-25185 
> discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]
> On checkpoint abortion or any failure in AsyncCheckpointRunnable,
> it discards the state, in particular shared (incremental) state.
> Since FLINK-24611, this creates a problem because shared state can be re-used 
> for future checkpoints. 
> Needs confirmation.
> Likely symptom of this failure is 

[jira] [Updated] (FLINK-25395) Incremental shared state might be discarded by TM

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25395:
---
Description: 
Extracting from [FLINK-25185 
discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]

On checkpoint abortion or any failure in AsyncCheckpointRunnable,
it discards the state, in particular shared (incremental) state.

Since FLINK-24611, this creates a problem because shared state can be re-used 
for future checkpoints. 

Needs confirmation.

Likely symptom of this failure is the following exception during recovery:
{preformat}
Caused by: java.io.FileNotFoundException: 
/tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
 (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_292]
at java.io.FileInputStream.<init>(FileInputStream.java:138) 
~[?:1.8.0_292]
at 
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92) 
~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
{preformat}

  was:
Extracting from [FLINK-25185 
discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]

On checkpoint abortion or any failure in AsyncCheckpointRunnable,
it discards the state, in particular shared (incremental) state.

Since FLINK-24611, this creates a problem because shared state can be re-used 
for future checkpoints. 

Needs confirmation.


> Incremental shared state might be discarded by TM
> -
>
> Key: FLINK-25395
> URL: https://issues.apache.org/jira/browse/FLINK-25395
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.15.0
>Reporter: Roman Khachatryan
>Priority: Critical
> Fix For: 1.15.0
>
>
> Extracting from [FLINK-25185 
> discussion|https://issues.apache.org/jira/browse/FLINK-25185?focusedCommentId=17462554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17462554]
> On checkpoint abortion or any failure in AsyncCheckpointRunnable,
> it discards the state, in particular shared (incremental) state.
> Since FLINK-24611, this creates a problem because shared state can be re-used 
> for future checkpoints. 
> Needs confirmation.
> Likely symptom of this failure is the following exception during recovery:
> {preformat}
> Caused by: java.io.FileNotFoundException: 
> /tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
> at java.io.FileInputStream.open(FileInputStream.java:195) 
> ~[?:1.8.0_292]
> at java.io.FileInputStream.<init>(FileInputStream.java:138) 
> ~[?:1.8.0_292]
> at 
> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>  ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
> ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
>  ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
>  ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92)
>  ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
> at 
> org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
>  ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-S

[jira] [Commented] (FLINK-25399) AZP fails with exit code 137 when running checkpointing test cases

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463269#comment-17463269
 ] 

Piotr Nowojski commented on FLINK-25399:


The {{FileNotFoundException}} indicates this might be a duplicated issue of 
FLINK-25395

> AZP fails with exit code 137 when running checkpointing test cases
> --
>
> Key: FLINK-25399
> URL: https://issues.apache.org/jira/browse/FLINK-25399
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The AZP build for fine grained resource management failed with exit code 137, 
> when running an extensive list of checkpointing tests:
> {code}
> 2021-12-21T06:06:08.8728404Z Dec 21 06:06:08 [INFO] Running 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase
> 2021-12-21T06:06:37.6584668Z Dec 21 06:06:37 Starting 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase#shouldRescaleUnalignedCheckpoint[upscale
>  union from 3 to 7, buffersPerChannel = 0].
> 2021-12-21T06:06:37.6585685Z Dec 21 06:06:37 Starting 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase#shouldRescaleUnalignedCheckpoint[upscale
>  union from 3 to 7, buffersPerChannel = 0].
> 2021-12-21T06:06:37.6593448Z Dec 21 06:06:37 Finished 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase#shouldRescaleUnalignedCheckpoint[upscale
>  union from 3 to 7, buffersPerChannel = 0].
> 2021-12-21T06:06:41.3044200Z Dec 21 06:06:41 Starting 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testSlidingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:06:41.3045146Z Dec 21 06:06:41 Finished 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testSlidingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:06:49.7482529Z Dec 21 06:06:49 Starting 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testTumblingTimeWindowWithKVStateMinMaxParallelism[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:06:49.7483922Z Dec 21 06:06:49 Finished 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testTumblingTimeWindowWithKVStateMinMaxParallelism[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:06:56.7462828Z Dec 21 06:06:56 Starting 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testTumblingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:06:56.7463831Z Dec 21 06:06:56 Finished 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testTumblingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:07:06.7225398Z Dec 21 06:07:06 Starting 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testTumblingTimeWindowWithKVStateMaxMaxParallelism[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:07:06.7226580Z Dec 21 06:07:06 Finished 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testTumblingTimeWindowWithKVStateMaxMaxParallelism[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:07:12.1987555Z Dec 21 06:07:12 Starting 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase#shouldRescaleUnalignedCheckpoint[upscale
>  union from 3 to 7, buffersPerChannel = 1].
> 2021-12-21T06:07:12.1992168Z Dec 21 06:07:12 Starting 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase#shouldRescaleUnalignedCheckpoint[upscale
>  union from 3 to 7, buffersPerChannel = 1].
> 2021-12-21T06:07:12.1993591Z Dec 21 06:07:12 Finished 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase#shouldRescaleUnalignedCheckpoint[upscale
>  union from 3 to 7, buffersPerChannel = 1].
> 2021-12-21T06:07:16.3825669Z Dec 21 06:07:15 Starting 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testPreAggregatedTumblingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:07:16.3826827Z Dec 21 06:07:15 Finished 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testPreAggregatedTumblingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:07:23.4489701Z Dec 21 06:07:23 Starting 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testPreAggregatedSlidingTimeWindow[statebackend
>  type =MEM, buffersPerChannel = 0].
> 2021-12-21T06:07:23.4495250Z Dec 21 06:07:23 Finished 
> org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase#testPreAggregatedSlidingTimeWind

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463093#comment-17463093
 ] 

Piotr Nowojski commented on FLINK-25185:


{quote}
I don't think so: the decision whether to re-use some state or not is made by 
the State backend, not runtime (not 
AsyncCheckpointRunnable/SubtaskCheckpointCoordinatorImpl).
(...){quote}
Ok. I thought that the {{lastUploadedSstFiles.putAll(sstFiles);}} in 
{{uploadSstFiles()}} happens in the sync part of the checkpoint process. Now I 
see it's in the async phase, and it only happens once the files have actually 
been uploaded.
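
For context, here is a hypothetical, much-simplified Java model of the behaviour described above; the method names, the in-memory "DFS", and the class itself are illustrative assumptions, not Flink's actual RocksDB incremental snapshot code:

```java
import java.util.*;

// Hypothetical, simplified model of the behaviour discussed above: the map of
// SST files known to be on the DFS must only be updated in the async phase,
// after the upload has actually finished. All names here are illustrative
// assumptions, not Flink's actual RocksDB incremental snapshot code.
class IncrementalSnapshotModel {
    // files confirmed to exist remotely (think: lastUploadedSstFiles)
    final Map<String, String> lastUploadedSstFiles = new HashMap<>();

    // Sync phase: decide what still needs uploading; must NOT mutate the map.
    Set<String> syncPhase(Set<String> localSstFiles) {
        Set<String> toUpload = new HashSet<>(localSstFiles);
        toUpload.removeAll(lastUploadedSstFiles.keySet());
        return toUpload;
    }

    // Async phase: perform the upload, and only then record the files.
    void asyncPhase(Set<String> toUpload) {
        Map<String, String> uploaded = new HashMap<>();
        for (String file : toUpload) {
            uploaded.put(file, "dfs://" + file); // stand-in for the real upload
        }
        lastUploadedSstFiles.putAll(uploaded); // happens after upload completes
    }
}
```

The point of the split is that a second snapshot started before the async phase finishes still sees the files as not-yet-uploaded and will not assume they exist remotely.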

Let's chat offline about what exactly is happening here and what your proposal 
to fix it is.

Regarding the deadlock that you posted, is it the primary issue causing those 
test failures? It looks like the deadlock might have been introduced in 
FLINK-24035. CC [~kevin.cyj]

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x00

[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Priority: Blocker  (was: Critical)

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Assigned] (FLINK-21186) RecordWriterOutput swallows interrupt state when interrupted.

2021-12-21 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-21186:
--

Assignee: Piotr Nowojski

> RecordWriterOutput swallows interrupt state when interrupted.
> -
>
> Key: FLINK-21186
> URL: https://issues.apache.org/jira/browse/FLINK-21186
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Piotr Nowojski
>Priority: Minor
>  Labels: auto-deprioritized-major, stale-minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-21186) RecordWriterOutput swallows interrupt state when interrupted.

2021-12-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463075#comment-17463075
 ] 

Piotr Nowojski commented on FLINK-21186:


I still think there is no issue in this code. It's maybe not the prettiest 
solution, but it works as far as I can tell. Under this ticket we can do a 
little bit of cleanup in `RecordWriterOutput`, to explicitly limit what types 
of exceptions (only `IOException`) can be wrapped in `RuntimeException`, and 
instead of using `RuntimeException` directly we can then use 
`UncheckedIOException`.
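
As a rough illustration of that cleanup, a hedged sketch; the `Writer` interface and `collect()` method below are assumptions for the example, not the real `RecordWriterOutput` API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the cleanup proposed above: wrap only IOException,
// and use UncheckedIOException instead of a bare RuntimeException. The Writer
// interface and collect() method are illustrative assumptions, not the real
// RecordWriterOutput API.
class RecordWriterOutputSketch {
    interface Writer { void emit(String record) throws IOException; }

    private final Writer writer;

    RecordWriterOutputSketch(Writer writer) { this.writer = writer; }

    void collect(String record) {
        try {
            writer.emit(record);
        } catch (IOException e) {
            // Only IOException is wrapped; other exceptions propagate as-is.
            throw new UncheckedIOException(e);
        }
    }
}
```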

> RecordWriterOutput swallows interrupt state when interrupted.
> -
>
> Key: FLINK-21186
> URL: https://issues.apache.org/jira/browse/FLINK-21186
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Priority: Minor
>  Labels: auto-deprioritized-major, stale-minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (FLINK-25194) Implement an API for duplicating artefacts

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-25194:
--

Assignee: Piotr Nowojski  (was: Dawid Wysakowicz)

> Implement an API for duplicating artefacts
> --
>
> Key: FLINK-25194
> URL: https://issues.apache.org/jira/browse/FLINK-25194
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / FileSystem, Runtime / Checkpointing
>Reporter: Dawid Wysakowicz
>Assignee: Piotr Nowojski
>Priority: Major
> Fix For: 1.15.0
>
>
> We should implement methods that let us duplicate artefacts in a DFS. We can 
> later on use it for cheaply duplicating shared snapshots artefacts instead of 
> reuploading them.
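
A hedged sketch of what such an API could look like; the interface name and both methods below are assumptions for illustration, not the actual Flink API:

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of a "duplicate artefacts" API as described above: a
// file system that can cheaply duplicate an existing artefact (e.g. via a
// server-side copy) instead of re-uploading its bytes. The interface name and
// both methods are assumptions for illustration, not the actual Flink API.
interface ArtefactDuplicatingFileSystem {
    // Whether the source can be duplicated cheaply on the remote system.
    boolean canFastDuplicate(Path source);

    // Duplicate source to destination without streaming it through the client.
    void duplicate(Path source, Path destination) throws IOException;
}

// Minimal local-filesystem stand-in for testing the contract.
class LocalArtefactDuplicatingFileSystem implements ArtefactDuplicatingFileSystem {
    public boolean canFastDuplicate(Path source) {
        return Files.exists(source);
    }

    public void duplicate(Path source, Path destination) throws IOException {
        Files.copy(source, destination, StandardCopyOption.REPLACE_EXISTING);
    }
}
```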



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462658#comment-17462658
 ] 

Piotr Nowojski commented on FLINK-25185:


{quote}
When a checkpoint is aborted, the TM will try to discard in-progress uploads.
{quote}
[~roman], do you mean 
{{SubtaskCheckpointCoordinatorImpl#cancelAsyncCheckpointRunnable}} being 
invoked and the uploads being cancelled? 

Doesn't it point to a larger problem? Namely, that future checkpoints can in 
general be deemed completed even if previous async phases are still uploading 
some of the files that those future checkpoints reference?
{quote}
This state can't be re-used for future checkpoints.
{quote}
It's probably not only about "future" in the sense of not-yet-triggered 
checkpoints, but about any subsequent checkpoints, some of which might already 
be in progress.

It seems like neither of those problems will be easy to fix?
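
The race can be made concrete with a hypothetical toy model; all names below are illustrative assumptions, not Flink's actual classes:

```java
import java.util.*;

// Hypothetical toy model of the race described above: checkpoint 2 references
// a file uploaded for checkpoint 1; aborting checkpoint 1 discards the file,
// so checkpoint 2 can no longer be restored (the FileNotFoundException case
// seen in the logs). All names are illustrative, not Flink's actual classes.
class SharedUploadModel {
    final Set<String> dfs = new HashSet<>();                 // files on the DFS
    final Map<Integer, Set<String>> refs = new HashMap<>();  // checkpoint -> files

    void finishUpload(int checkpoint, String file) {
        dfs.add(file);
        refs.computeIfAbsent(checkpoint, k -> new HashSet<>()).add(file);
    }

    // Re-use an already uploaded file instead of uploading it again.
    void reference(int checkpoint, String file) {
        refs.computeIfAbsent(checkpoint, k -> new HashSet<>()).add(file);
    }

    // Aborting discards this checkpoint's uploads, ignoring other references --
    // exactly the unsafe behaviour under discussion.
    void abort(int checkpoint) {
        dfs.removeAll(refs.getOrDefault(checkpoint, Collections.emptySet()));
        refs.remove(checkpoint);
    }

    boolean canRestore(int checkpoint) {
        return dfs.containsAll(refs.getOrDefault(checkpoint, Collections.emptySet()));
    }
}
```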

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Component/s: Runtime / State Backends
 (was: Runtime / Coordination)

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Commented] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462554#comment-17462554
 ] 

Piotr Nowojski commented on FLINK-25185:


It looks like those tests were stuck in an endless loop, unable to allocate 
enough slots to run the job:

{noformat}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not acquire the minimum required resources.
06:42:22,189 [flink-akka.actor.default-dispatcher-7] WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Could not fulfill resource requirements of job 
5a5ac441318e8085606c78b40c3a2f25.
06:42:22,189 [flink-akka.actor.default-dispatcher-7] WARN  
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - 
Could not acquire the minimum required resources, failing slot requests. 
Acquired: 
[ResourceRequirement{resourceProfile=ResourceProfile{taskHeapMemory=256.000gb 
(274877906944 bytes), taskOffHeapMemory=256.000gb (274877906944 bytes), 
managedMemory=20.000mb (20971520 bytes), networkMemory=16.000mb (16777216 
bytes)}, numberOfRequiredSlots=8}]. Current slot pool status: Registered TMs: 
2, registered slots: 8 free slots: 0
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not acquire the minimum required resources.
06:42:22,259 [flink-akka.actor.default-dispatcher-9] WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Could not fulfill resource requirements of job 
5a5ac441318e8085606c78b40c3a2f25.
06:42:22,259 [flink-akka.actor.default-dispatcher-9] WARN  
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - 
Could not acquire the minimum required resources, failing slot requests. 
Acquired: 
[ResourceRequirement{resourceProfile=ResourceProfile{taskHeapMemory=256.000gb 
(274877906944 bytes), taskOffHeapMemory=256.000gb (274877906944 bytes), 
managedMemory=20.000mb (20971520 bytes), networkMemory=16.000mb (16777216 
bytes)}, numberOfRequiredSlots=8}]. Current slot pool status: Registered TMs: 
2, registered slots: 8 free slots: 0
org.apache.flink.runtime.j
{noformat}

It's very hard to say, but it looks like this was (one of?) the first 
failures:

{noformat}
04:06:26,659 [Map -> Sink: Unnamed (9/12)#1] WARN  
org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - 
Exception while restoring keyed state backend for 
StreamMap_dc2290bb6f8f5cd2bd425368843494fe_(9/12) from alternative (1/1), will 
retry while more alternatives are available.
java.lang.RuntimeException: java.io.FileNotFoundException: 
/tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da
 (No such file or directory)
at 
org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:319) 
~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:87)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.hasNext(StateChangelogHandleStreamHandleReader.java:69)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.readBackendHandle(ChangelogBackendRestoreOperation.java:92)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:74)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:221)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.state.changelog.ChangelogStateBackend.createKeyedStateBackend(ChangelogStateBackend.java:145)
 ~[flink-statebackend-changelog-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitiali

[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Component/s: Runtime / Coordination
 (was: Runtime / Checkpointing)

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.

[jira] [Closed] (FLINK-25382) Failure in "Upload Logs" task

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-25382.
--
Resolution: Duplicate

> Failure in "Upload Logs" task
> -
>
> Key: FLINK-25382
> URL: https://issues.apache.org/jira/browse/FLINK-25382
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.15.0
>Reporter: Piotr Nowojski
>Priority: Critical
>
> I don't see any error message, but it seems like uploading the logs has 
> failed:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=bb16d35c-fdfe-5139-f244-9492cbd2050b
> for the following build:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-22090) Upload logs fails

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-22090:
---
Priority: Critical  (was: Not a Priority)

> Upload logs fails
> -
>
> Key: FLINK-22090
> URL: https://issues.apache.org/jira/browse/FLINK-22090
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Matthias Pohl
>Priority: Critical
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> test-stability
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=382&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=599dab09-ab33-58b6-4804-349ab7dc2f73]
>  failed just because an {{upload logs}} step failed. It looks like this is an 
> AzureCI problem. Is this a known issue?
> The artifacts seem to have been uploaded based on the logs. But [the download 
> link|https://dev.azure.com/mapohl/flink/_build/results?buildId=382&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267]
>  does not show up.
> Another build that had the same issue: 
> [test_ci_blinkplanner|https://dev.azure.com/mapohl/flink/_build/results?buildId=383&view=logs&j=d1352042-8a7d-50b6-3946-a85d176b7981&t=7b7009bb-e6bf-5426-3d4b-20b25eada636&l=75]
>  and 
> [test_ci_build_core|https://dev.azure.com/mapohl/flink/_build/results?buildId=383&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=599dab09-ab33-58b6-4804-349ab7dc2f73&l=44]





[jira] [Commented] (FLINK-22090) Upload logs fails

2021-12-20 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462503#comment-17462503
 ] 

Piotr Nowojski commented on FLINK-22090:


More instances:

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=bb16d35c-fdfe-5139-f244-9492cbd2050b

for the following build:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc

The logs are missing; they were not uploaded at all.


For other builds it looks like the upload task failed, but the artifacts were 
uploaded in the end?
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=d9365e9b-b0cd-5e35-8489-cee9ea412dd2

> Upload logs fails
> -
>
> Key: FLINK-22090
> URL: https://issues.apache.org/jira/browse/FLINK-22090
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Matthias Pohl
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> test-stability
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=382&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=599dab09-ab33-58b6-4804-349ab7dc2f73]
>  failed just because an {{upload logs}} step failed. It looks like this is an 
> AzureCI problem. Is this a known issue?
> The artifacts seem to have been uploaded based on the logs. But [the download 
> link|https://dev.azure.com/mapohl/flink/_build/results?buildId=382&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267]
>  does not show up.
> Another build that had the same issue: 
> [test_ci_blinkplanner|https://dev.azure.com/mapohl/flink/_build/results?buildId=383&view=logs&j=d1352042-8a7d-50b6-3946-a85d176b7981&t=7b7009bb-e6bf-5426-3d4b-20b25eada636&l=75]
>  and 
> [test_ci_build_core|https://dev.azure.com/mapohl/flink/_build/results?buildId=383&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=599dab09-ab33-58b6-4804-349ab7dc2f73&l=44]





[jira] [Created] (FLINK-25382) Failure in "Upload Logs" task

2021-12-20 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25382:
--

 Summary: Failure in "Upload Logs" task
 Key: FLINK-25382
 URL: https://issues.apache.org/jira/browse/FLINK-25382
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


I don't see any error message, but it seems like uploading the logs has failed:

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=bb16d35c-fdfe-5139-f244-9492cbd2050b

for the following build:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc





[jira] [Updated] (FLINK-25185) StreamFaultToleranceTestBase hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Summary: StreamFaultToleranceTestBase hangs on AZP  (was: 
UdfStreamOperatorCheckpointingITCase  (StreamFaultToleranceTestBase) hangs on 
AZP)

> StreamFaultToleranceTestBase hangs on AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907396Z 0x7f21f8004000, 0x7f2304012800, 
> 0x7f230001b000, 0x7f223c011

[jira] [Updated] (FLINK-25185) UdfStreamOperatorCheckpointingITCase (StreamFaultToleranceTestBase) hangs on AZP

2021-12-20 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25185:
---
Summary: UdfStreamOperatorCheckpointingITCase  
(StreamFaultToleranceTestBase) hangs on AZP  (was: StreamFaultToleranceTestBase 
hangs on AZP)

> UdfStreamOperatorCheckpointingITCase  (StreamFaultToleranceTestBase) hangs on 
> AZP
> -
>
> Key: FLINK-25185
> URL: https://issues.apache.org/jira/browse/FLINK-25185
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Test Infrastructure
>Affects Versions: 1.13.3, 1.15.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> The {{StreamFaultToleranceTestBase}} hangs on AZP.
> {code}
> 2021-12-06T04:24:48.1676089Z 
> ==
> 2021-12-06T04:24:48.1678883Z === WARNING: This task took already 95% of the 
> available time budget of 237 minutes ===
> 2021-12-06T04:24:48.1679596Z 
> ==
> 2021-12-06T04:24:48.1680326Z 
> ==
> 2021-12-06T04:24:48.1680877Z The following Java processes are running (JPS)
> 2021-12-06T04:24:48.1681467Z 
> ==
> 2021-12-06T04:24:48.6514536Z 13701 surefirebooter17740627448580534543.jar
> 2021-12-06T04:24:48.6515353Z 1622 Jps
> 2021-12-06T04:24:48.6515795Z 780 Launcher
> 2021-12-06T04:24:48.6825889Z 
> ==
> 2021-12-06T04:24:48.6826565Z Printing stack trace of Java process 13701
> 2021-12-06T04:24:48.6827012Z 
> ==
> 2021-12-06T04:24:49.1876086Z 2021-12-06 04:24:49
> 2021-12-06T04:24:49.1877098Z Full thread dump OpenJDK 64-Bit Server VM 
> (11.0.10+9 mixed mode):
> 2021-12-06T04:24:49.1877362Z 
> 2021-12-06T04:24:49.1877672Z Threads class SMR info:
> 2021-12-06T04:24:49.1878049Z _java_thread_list=0x7f254c007630, 
> length=365, elements={
> 2021-12-06T04:24:49.1878504Z 0x7f2598028000, 0x7f2598280800, 
> 0x7f2598284800, 0x7f2598299000,
> 2021-12-06T04:24:49.1878973Z 0x7f259829b000, 0x7f259829d800, 
> 0x7f259829f800, 0x7f25982a1800,
> 2021-12-06T04:24:49.1879680Z 0x7f2598337800, 0x7f25983e3000, 
> 0x7f2598431000, 0x7f2528016000,
> 2021-12-06T04:24:49.1896613Z 0x7f2599003000, 0x7f259972e000, 
> 0x7f2599833800, 0x7f259984c000,
> 2021-12-06T04:24:49.1897558Z 0x7f259984f000, 0x7f2599851000, 
> 0x7f2599892000, 0x7f2599894800,
> 2021-12-06T04:24:49.1898075Z 0x7f2499a16000, 0x7f2485acd800, 
> 0x7f2485ace000, 0x7f24876bb800,
> 2021-12-06T04:24:49.1898562Z 0x7f2461e59000, 0x7f2499a0e800, 
> 0x7f2461e5e800, 0x7f2461e81000,
> 2021-12-06T04:24:49.1899037Z 0x7f24dc015000, 0x7f2461e86800, 
> 0x7f2448002000, 0x7f24dc01c000,
> 2021-12-06T04:24:49.1899522Z 0x7f2438001000, 0x7f2438003000, 
> 0x7f2438005000, 0x7f2438006800,
> 2021-12-06T04:24:49.1899982Z 0x7f2438008800, 0x7f2434017800, 
> 0x7f243401a800, 0x7f2414008800,
> 2021-12-06T04:24:49.1900495Z 0x7f24e8089800, 0x7f24e809, 
> 0x7f23e4005800, 0x7f24e8092800,
> 2021-12-06T04:24:49.1901163Z 0x7f24e8099000, 0x7f2414015800, 
> 0x7f24dc04c000, 0x7f2414018800,
> 2021-12-06T04:24:49.1901680Z 0x7f241402, 0x7f24dc058000, 
> 0x7f24dc05b000, 0x7f2414022000,
> 2021-12-06T04:24:49.1902283Z 0x7f24d400f000, 0x7f241402e800, 
> 0x7f2414031800, 0x7f2414033800,
> 2021-12-06T04:24:49.1902880Z 0x7f2414035000, 0x7f2414037000, 
> 0x7f2414038800, 0x7f241403a800,
> 2021-12-06T04:24:49.1903354Z 0x7f241403c000, 0x7f241403e000, 
> 0x7f241403f800, 0x7f2414041800,
> 2021-12-06T04:24:49.1903812Z 0x7f2414043000, 0x7f2414045000, 
> 0x7f24dc064800, 0x7f2414047000,
> 2021-12-06T04:24:49.1904284Z 0x7f2414048800, 0x7f241404a800, 
> 0x7f241404c800, 0x7f241404e000,
> 2021-12-06T04:24:49.1904800Z 0x7f241405, 0x7f2414051800, 
> 0x7f2414053800, 0x7f2414055000,
> 2021-12-06T04:24:49.1905455Z 0x7f2414057000, 0x7f2414059000, 
> 0x7f241405a800, 0x7f241405c800,
> 2021-12-06T04:24:49.1906098Z 0x7f241405e000, 0x7f241406, 
> 0x7f2414062000, 0x7f2414063800,
> 2021-12-06T04:24:49.1906728Z 0x7f22e400c800, 0x7f2328008000, 
> 0x7f2284007000, 0x7f22cc019800,
> 2021-12-06T04:24:49.1907

[jira] [Assigned] (FLINK-25199) fromValues does not emit final MAX watermark

2021-12-17 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-25199:
--

Assignee: Marios Trivyzas

> fromValues does not emit final MAX watermark
> 
>
> Key: FLINK-25199
> URL: https://issues.apache.org/jira/browse/FLINK-25199
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Reporter: Timo Walther
>Assignee: Marios Trivyzas
>Priority: Critical
> Fix For: 1.15.0, 1.14.3
>
>
> It seems that {{fromValues}} generating multiple rows does not emit any 
> watermarks:
> {code}
> StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
> Table inputTable =
> tEnv.fromValues(
> DataTypes.ROW(
> DataTypes.FIELD("weight", DataTypes.DOUBLE()),
> DataTypes.FIELD("f0", DataTypes.STRING()),
> DataTypes.FIELD("f1", DataTypes.DOUBLE()),
> DataTypes.FIELD("f2", DataTypes.DOUBLE()),
> DataTypes.FIELD("f3", DataTypes.DOUBLE()),
> DataTypes.FIELD("f4", DataTypes.INT()),
> DataTypes.FIELD("label", DataTypes.STRING())),
> Row.of(1., "a", 1., 1., 1., 2, "l1"),
> Row.of(1., "a", 1., 1., 1., 2, "l1"));
> DataStream input = tEnv.toDataStream(inputTable);
> {code}
> {{fromValues(1, 2, 3)}} or {{fromValues}} with only 1 row works correctly.
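The missing final watermark matters because downstream event-time logic only fires once it arrives. As a rough illustration (plain Java, not Flink's actual source implementation; the class and method names here are hypothetical), a bounded source is expected to emit all of its records and then a final Long.MAX_VALUE watermark:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (not Flink code) of the expected contract: a bounded
// source emits its records and then a final Long.MAX_VALUE watermark so
// that downstream event-time timers can fire. The bug report says this
// final watermark is missing when fromValues produces multiple rows.
public class FinalWatermarkSketch {

    // Emits each element, then the final watermark, into the output list.
    static List<Object> runBoundedSource(List<String> rows) {
        List<Object> output = new ArrayList<>();
        for (String row : rows) {
            output.add(row);               // the stream records themselves
        }
        output.add(Long.MAX_VALUE);        // final watermark on end-of-input
        return output;
    }

    public static void main(String[] args) {
        List<Object> out = runBoundedSource(List.of("row1", "row2"));
        // The last emitted element must be the MAX watermark.
        System.out.println(out.get(out.size() - 1).equals(Long.MAX_VALUE));
    }
}
```

Per the description, the single-row case and {{fromValues(1, 2, 3)}} do emit this final watermark; the multi-row {{fromValues}} path is what skips it.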





[jira] [Closed] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-17 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-24846.
--
Resolution: Fixed

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.13.6, 1.14.3
>
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
> - I couldn't reproduce the exception using the same jar, so I might not be 
> able to provide DEBUG messages
> {quote}
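The stack trace above boils down to a state-machine rule in the task mailbox: once stop-with-savepoint starts draining, the mailbox is quiesced and any further put operation is rejected. A minimal sketch of that rule (a simplified illustration with hypothetical names, not Flink's actual TaskMailboxImpl):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified mailbox sketch: OPEN accepts new mails, QUIESCED rejects
// put() but still lets already-enqueued mails be drained. This mirrors
// the "Mailbox is in state QUIESCED, but is required to be in state
// OPEN for put operations" error in the stack trace above.
public class MailboxSketch {
    enum State { OPEN, QUIESCED, CLOSED }

    private State state = State.OPEN;
    private final Queue<Runnable> mails = new ArrayDeque<>();

    void put(Runnable mail) {
        if (state != State.OPEN) {
            throw new IllegalStateException(
                "Mailbox is in state " + state
                    + ", but is required to be in state OPEN for put operations.");
        }
        mails.add(mail);
    }

    // Called when the task starts shutting down: no new mails are
    // accepted, but pending mails can still be drained.
    void quiesce() { state = State.QUIESCED; }

    Runnable tryTake() { return mails.poll(); }
}
```

The failure mode in this ticket is an async result completing late and calling put() after the quiesce step, which then surfaces as the MailboxClosedException above.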





[jira] [Comment Edited] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-17 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459916#comment-17459916
 ] 

Piotr Nowojski edited comment on FLINK-24846 at 12/17/21, 11:37 AM:


merged commit 4065bfb + b54c413febc^ and b54c413febc into apache:master
merged as e7df5ec81fe and 8d5d7d46463 into release-1.14
merged commit 2bdc194 into apache:release-1.13


was (Author: pnowojski):
merged commit 4065bfb into apache:master
merged as e7df5ec81fe and 8d5d7d46463 into release-1.14
merged commit 2bdc194 into apache:release-1.13

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.13.6, 1.14.3
>
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> P

[jira] [Commented] (FLINK-25318) Improvement of scheduler and execution for Flink OLAP

2021-12-16 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460615#comment-17460615
 ] 

Piotr Nowojski commented on FLINK-25318:


Hi all. Thanks for taking up this interesting initiative. So far we, the Flink 
developers, have not paid much attention to short-lived jobs/queries, and that 
has often influenced our decisions in the past. It would be interesting to see 
how much demand there is for such use cases in Flink and how much we can 
improve Flink in this regard.

I would like to point out two things:

# If we care about something, it should be tested, otherwise the feature might 
be lost or accidentally removed. Here the feature is performance, so ideally 
(whenever it's feasible) every change you make should be backed up by a visible 
benchmark improvement.
# For quite some time we have been maintaining [various micro benchmarks 
|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847]. 
Since they are micro benchmarks, some of them submit a job with a small bounded 
input and simply measure how long it takes to process that input; those jobs 
are expected to finish in under 1s. This accidentally tested OLAP use cases. 
That was not our purpose, just a side effect of trying to make the benchmarks 
run quickly. When we detected a performance regression caused by slower 
startup/initialisation (for example FLINK-23593), we most often simply ignored 
it or extended the length of the test. If OLAP support is something we want to 
tackle seriously, it would be great to have more support from the OLAP devs in 
investigating and policing these kinds of issues in the future. Help with that 
would be very much welcome by the community.

> Improvement of scheduler and execution for Flink OLAP
> -
>
> Key: FLINK-25318
> URL: https://issues.apache.org/jira/browse/FLINK-25318
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Network
>Affects Versions: 1.14.0, 1.12.5, 1.13.3
>Reporter: Shammon
>Priority: Major
>  Labels: Umbrella
> Fix For: 1.15.0
>
>
> We use Flink to perform OLAP queries. We launch a Flink session cluster, submit 
> batch jobs to the cluster as OLAP queries, and fetch the jobs' results. OLAP 
> jobs are generally small queries which finish in seconds or milliseconds, and 
> users often submit multiple jobs to the session cluster concurrently. We found 
> that the QPS and latency of jobs are greatly affected when tens of jobs are 
> running, even when there is little data in each query. We will share benchmark 
> results for the latest version later.
> After discussing with [~xtsong], and thanks for his advice, we created this 
> issue to track and manage Flink OLAP related improvements. More users and 
> developers are welcome to create Flink OLAP related subtasks here. Thanks!





[jira] [Comment Edited] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-16 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459916#comment-17459916
 ] 

Piotr Nowojski edited comment on FLINK-24846 at 12/16/21, 8:08 AM:
---

merged commit 4065bfb into apache:master
merged as e7df5ec81fe and 8d5d7d46463 into release-1.14
merged commit 2bdc194 into apache:release-1.13


was (Author: pnowojski):
merged commit 4065bfb into apache:master
merged as e7df5ec81fe and 8d5d7d46463 into release-1.14

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.13.6, 1.14.3
>
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the la

[jira] [Comment Edited] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-15 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459916#comment-17459916
 ] 

Piotr Nowojski edited comment on FLINK-24846 at 12/15/21, 1:57 PM:
---

merged commit 4065bfb into apache:master
merged as e7df5ec81fe and 8d5d7d46463 into release-1.14


was (Author: pnowojski):
merged commit 4065bfb into apache:master
merged as e7df5ec81fe into release-1.14

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.14.3
>
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --ta

[jira] [Updated] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-15 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-24846:
---
Fix Version/s: 1.13.6

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.13.6, 1.14.3
>
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
> - I couldn't reproduce the exception using the same jar, so I might not be 
> able to provide DEBUG messages
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-15 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459916#comment-17459916
 ] 

Piotr Nowojski commented on FLINK-24846:


merged commit 4065bfb into apache:master
merged as e7df5ec81fe into release-1.14

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
> - I couldn't reproduce the exception using the same jar, so I might not be 
> able to provide DEBUG messages
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-12-15 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-24846:
---
Fix Version/s: 1.15.0
   1.14.3

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Assignee: Anton Kalashnikov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.14.3
>
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
> - I couldn't reproduce the exception using the same jar, so I might not be 
> able to provide DEBUG messages
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (FLINK-25199) fromValues does not emit final MAX watermark

2021-12-15 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-25199:
--

Assignee: (was: Dawid Wysakowicz)

> fromValues does not emit final MAX watermark
> 
>
> Key: FLINK-25199
> URL: https://issues.apache.org/jira/browse/FLINK-25199
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Reporter: Timo Walther
>Priority: Critical
> Fix For: 1.15.0, 1.14.3
>
>
> It seems that {{fromValues}} generating multiple rows does not emit any 
> watermarks:
> {code}
> StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
> Table inputTable =
> tEnv.fromValues(
> DataTypes.ROW(
> DataTypes.FIELD("weight", DataTypes.DOUBLE()),
> DataTypes.FIELD("f0", DataTypes.STRING()),
> DataTypes.FIELD("f1", DataTypes.DOUBLE()),
> DataTypes.FIELD("f2", DataTypes.DOUBLE()),
> DataTypes.FIELD("f3", DataTypes.DOUBLE()),
> DataTypes.FIELD("f4", DataTypes.INT()),
> DataTypes.FIELD("label", DataTypes.STRING())),
> Row.of(1., "a", 1., 1., 1., 2, "l1"),
> Row.of(1., "a", 1., 1., 1., 2, "l1"));
> DataStream input = tEnv.toDataStream(inputTable);
> {code}
> {{fromValues(1, 2, 3)}} or {{fromValues}} with only 1 row works correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (FLINK-18808) Task-level numRecordsOut metric may be underestimated

2021-12-15 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-18808:
--

Assignee: Lijie Wang

> Task-level numRecordsOut metric may be underestimated
> -
>
> Key: FLINK-18808
> URL: https://issues.apache.org/jira/browse/FLINK-18808
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.11.1
>Reporter: ming li
>Assignee: Lijie Wang
>Priority: Not a Priority
>  Labels: pull-request-available, usability
> Attachments: image-2020-08-04-11-28-13-800.png, 
> image-2020-08-04-11-32-20-678.png, image-2020-08-13-18-36-13-282.png
>
>
> At present, we only register the task-level numRecordsOut metric by reusing 
> the operator output record counter at the end of the OperatorChain.
> {code:java}
> if (config.isChainEnd()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }
> {code}
> If we only send data out through the last operator of the OperatorChain, 
> there is no problem with this statistic. But consider the following scenario:
> !image-2020-08-04-11-28-13-800.png|width=507,height=174!
> In this JobGraph, we send data not only from the last operator but also from 
> the middle operator of the OperatorChain (the map operator just returns the 
> original value directly). Below is one of our test topologies; we can see 
> that the statistic actually shows only half of the total data received by 
> the downstream.
> !image-2020-08-04-11-32-20-678.png|width=648,height=251!
> I think the data sent out by the intermediate operator should also be counted 
> in the numRecordsOut of the Task. But currently we are not reusing operator 
> output record counters in intermediate operators, which leads to the 
> task-level numRecordsOut metric being underestimated (although this has no 
> effect on the actual operation of the job, it may affect our monitoring).
> A simple idea of mine is to modify the condition for reusing the operator 
> output record counter:
> {code:java}
> if (!config.getNonChainedOutputs(getUserCodeClassloader()).isEmpty()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }{code}
> In addition, I have another question: if a record is broadcast to all 
> downstream channels, should the numRecordsOut counter increase by one, or by 
> the number of downstream channels? It seems that currently we add one when 
> calculating the numRecordsOut metric.
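The undercount described above can be made concrete with a hedged toy model (this is not Flink code; the {{Op}} class and the record counts are invented for illustration). It contrasts the current chain-end-only condition with the proposed any-non-chained-output condition:

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of an operator chain where a middle operator also has a
 *  non-chained (network) output. Names and numbers are illustrative only. */
public class NumRecordsOutSketch {
    static class Op {
        final boolean chainEnd;
        final int networkRecordsOut; // records this operator sends over the network
        Op(boolean chainEnd, int networkRecordsOut) {
            this.chainEnd = chainEnd;
            this.networkRecordsOut = networkRecordsOut;
        }
    }

    /** Current behaviour: only the chain-end operator's counter is reused. */
    static int currentTaskNumRecordsOut(List<Op> chain) {
        return chain.stream().filter(op -> op.chainEnd)
                .mapToInt(op -> op.networkRecordsOut).sum();
    }

    /** Proposed behaviour: count every operator with a non-chained output. */
    static int proposedTaskNumRecordsOut(List<Op> chain) {
        return chain.stream().filter(op -> op.networkRecordsOut > 0)
                .mapToInt(op -> op.networkRecordsOut).sum();
    }

    public static void main(String[] args) {
        // Source -> map (also writes 100 records to the network) -> chain end.
        List<Op> chain = Arrays.asList(
                new Op(false, 0),    // source, chained output only
                new Op(false, 100),  // middle operator with a non-chained output
                new Op(true, 100));  // chain end
        System.out.println("current  = " + currentTaskNumRecordsOut(chain)); // 100
        System.out.println("proposed = " + proposedTaskNumRecordsOut(chain)); // 200
    }
}
```

Under the current condition only the chain end's 100 records are reported, even though 200 records actually leave the task over the network, matching the roughly halved statistic in the screenshots.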



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18808) Task-level numRecordsOut metric may be underestimated

2021-12-14 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459274#comment-17459274
 ] 

Piotr Nowojski commented on FLINK-18808:


That would be great :) would you like me to assign the ticket to you?

> Task-level numRecordsOut metric may be underestimated
> -
>
> Key: FLINK-18808
> URL: https://issues.apache.org/jira/browse/FLINK-18808
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.11.1
>Reporter: ming li
>Priority: Not a Priority
>  Labels: pull-request-available, usability
> Attachments: image-2020-08-04-11-28-13-800.png, 
> image-2020-08-04-11-32-20-678.png, image-2020-08-13-18-36-13-282.png
>
>
> At present, we only register the task-level numRecordsOut metric by reusing 
> the operator output record counter at the end of the OperatorChain.
> {code:java}
> if (config.isChainEnd()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }
> {code}
> If we only send data out through the last operator of the OperatorChain, 
> there is no problem with this statistic. But consider the following scenario:
> !image-2020-08-04-11-28-13-800.png|width=507,height=174!
> In this JobGraph, we send data not only from the last operator but also from 
> the middle operator of the OperatorChain (the map operator just returns the 
> original value directly). Below is one of our test topologies; we can see 
> that the statistic actually shows only half of the total data received by 
> the downstream.
> !image-2020-08-04-11-32-20-678.png|width=648,height=251!
> I think the data sent out by the intermediate operator should also be counted 
> in the numRecordsOut of the Task. But currently we are not reusing operator 
> output record counters in intermediate operators, which leads to the 
> task-level numRecordsOut metric being underestimated (although this has no 
> effect on the actual operation of the job, it may affect our monitoring).
> A simple idea of mine is to modify the condition for reusing the operator 
> output record counter:
> {code:java}
> if (!config.getNonChainedOutputs(getUserCodeClassloader()).isEmpty()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }{code}
> In addition, I have another question: if a record is broadcast to all 
> downstream channels, should the numRecordsOut counter increase by one, or by 
> the number of downstream channels? It seems that currently we add one when 
> calculating the numRecordsOut metric.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-6755) Allow triggering Checkpoints through command line client

2021-12-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458438#comment-17458438
 ] 

Piotr Nowojski commented on FLINK-6755:
---

The motivation behind this feature request will be covered by FLINK-25276.

As mentioned above by Aljoscha, there might still be value in exposing a manual 
checkpoint-triggering REST API hook, so I'm keeping this ticket open. However, 
it doesn't look like such a feature is well motivated. Implementing it should 
be quite straightforward, since Flink internally already supports this 
(FLINK-24280); it's just not exposed in any way to the user.

> Allow triggering Checkpoints through command line client
> 
>
> Key: FLINK-6755
> URL: https://issues.apache.org/jira/browse/FLINK-6755
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Affects Versions: 1.3.0
>Reporter: Gyula Fora
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-unassigned
>
> The command line client currently only allows triggering (and canceling with) 
> Savepoints. 
> While this is good if we want to fork or modify the pipelines in a 
> non-checkpoint-compatible way, with incremental checkpoints this now becomes 
> wasteful for simple job restarts/pipeline updates. 
> I suggest we add a new command: 
> ./bin/flink checkpoint  [checkpointDirectory]
> and a new flag -c for the cancel command to indicate we want to trigger a 
> checkpoint:
> ./bin/flink cancel -c [targetDirectory] 
> Otherwise this can work similarly to the current savepoint-taking logic; we 
> could probably even piggyback on the current messages by adding a boolean 
> flag indicating whether it should be a savepoint or a checkpoint.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

2021-12-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-12619.
--
Resolution: Won't Do

FLINK-25276 should address the motivation behind this feature, while the core 
idea of stop-with-checkpoint is inconsistent with the FLIP-193 
savepoints-vs-checkpoints semantics.

> Support TERMINATE/SUSPEND Job with Checkpoint
> -
>
> Key: FLINK-12619
> URL: https://issues.apache.org/jira/browse/FLINK-12619
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / State Backends
>Reporter: Congxian Qiu
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-unassigned, 
> pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminating or 
> suspending a job with a checkpoint. This improvement cooperates with the 
> incremental and external checkpoint features: if checkpoints are retained 
> and this feature is configured, we will trigger a checkpoint before the job 
> stops. It could accelerate job recovery a lot since:
> 1. No source rewinding is required any more.
> 2. It's much faster than taking a savepoint when incremental checkpointing 
> is enabled.
> Please note that conceptually savepoints differ from checkpoints in a 
> similar way that backups differ from recovery logs in traditional database 
> systems. So we suggest using this feature only for job recovery, while 
> sticking with FLINK-11458 for the 
> upgrading/cross-cluster-job-migration/state-backend-switch cases.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25276) Support native and incremental savepoints

2021-12-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25276:
--

 Summary: Support native and incremental savepoints
 Key: FLINK-25276
 URL: https://issues.apache.org/jira/browse/FLINK-25276
 Project: Flink
  Issue Type: New Feature
Reporter: Piotr Nowojski


Motivation: currently, with non-incremental canonical-format savepoints and 
very large state, both taking a savepoint and recovering from it can take a 
very long time. Providing options to take native-format and incremental 
savepoints would alleviate this problem.

In the past, the main challenge lay in the ownership semantics and file 
clean-up of such incremental savepoints. However, with FLINK-25154 
implemented, some of those concerns can be solved. Incremental savepoints 
could leverage the "force full snapshot" mode provided by FLINK-25192 to 
duplicate/copy all of the savepoint files out of Flink's ownership scope.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-11748) Optimize savepoint: Remove the task states and KeyedState into increments.

2021-12-13 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-11748.
--
Resolution: Abandoned

The PR has been closed. It looks like this ticket should have been closed as 
well.

> Optimize savepoint: Remove the task states and KeyedState into increments.
> --
>
> Key: FLINK-11748
> URL: https://issues.apache.org/jira/browse/FLINK-11748
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.6.3, 1.6.4, 1.7.2
>Reporter: Mr.Nineteen
>Priority: Not a Priority
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The optimized savepoint (v3) comes from the blink branch. After 
> verification, I want to merge it into the main branch of Flink. For large 
> jobs, the storage space for savepoints is significantly reduced. This 
> optimization would be great in the next Flink release.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25275) Weighted KeyGroup assignment

2021-12-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25275:
--

 Summary: Weighted KeyGroup assignment
 Key: FLINK-25275
 URL: https://issues.apache.org/jira/browse/FLINK-25275
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Network
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


Currently, key groups are split into key group ranges naively: equally sized 
continuous ranges (number of ranges = parallelism = number of key groups / 
size of a single key group range). Flink could avoid data skew between key 
groups by assigning them to tasks based on their "weight", where "weight" 
could be defined as the access frequency of the given key group.

Arbitrary, non-continuous key group assignment (for example, TM1 processing 
kg1 and kg3 while TM2 processes only kg2) would require extensive changes, to 
the state backends for example. However, the data skew could be mitigated to 
some extent by creating key group ranges in a more clever way while keeping 
key group range continuity: for example, TM1 processes the range [kg1, kg9], 
while TM2 processes just [kg10, kg11].

[This branch shows a PoC of such 
approach.|https://github.com/pnowojski/flink/commits/antiskew]
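The "more clever" contiguous splitting could be sketched as follows (a minimal greedy heuristic, not the PoC's actual code; the WeightedRanges class and the weight values are invented for illustration): accumulate per-key-group weights into a range until it reaches the per-task target, while keeping every range contiguous.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split key groups into contiguous ranges by weight. */
public class WeightedRanges {
    /** Returns [start, end] (inclusive) ranges, one per task, keeping
     *  key-group order but balancing the summed weight per range. */
    static List<int[]> assign(double[] weights, int parallelism) {
        double total = 0;
        for (double w : weights) total += w;
        double target = total / parallelism;
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        double acc = 0;
        for (int kg = 0; kg < weights.length; kg++) {
            acc += weights[kg];
            // Close the range early if the remaining key groups are only just
            // enough to give each remaining task at least one key group.
            boolean mustClose =
                    weights.length - kg - 1 == parallelism - ranges.size() - 1;
            if ((acc >= target || mustClose) && ranges.size() < parallelism - 1) {
                ranges.add(new int[] {start, kg});
                start = kg + 1;
                acc = 0;
            }
        }
        ranges.add(new int[] {start, weights.length - 1});
        return ranges;
    }

    public static void main(String[] args) {
        // A skewed weight distribution: key group 0 is hot.
        double[] weights = {8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        for (int[] r : assign(weights, 3)) {
            System.out.println(r[0] + "-" + r[1]); // prints 0-0, 1-7, 8-11
        }
    }
}
```

For this skewed distribution the heuristic yields the ranges [kg0, kg0], [kg1, kg7] and [kg8, kg11] instead of three equally sized ranges of four key groups each, so the hot key group no longer shares a task with three others.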



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25256) Savepoints do not work with ExternallyInducedSources

2021-12-10 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457073#comment-17457073
 ] 

Piotr Nowojski commented on FLINK-25256:


This is related to supporting/handling the "force full snapshot" flag in the 
{{no-claim}} recovery mode.

> Savepoints do not work with ExternallyInducedSources
> 
>
> Key: FLINK-25256
> URL: https://issues.apache.org/jira/browse/FLINK-25256
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0, 1.13.3
>Reporter: Dawid Wysakowicz
>Priority: Major
>
> It is not possible to take a proper savepoint with 
> {{ExternallyInducedSource}} or {{ExternallyInducedSourceReader}} (both legacy 
> and FLIP-27 versions). The problem is that we're hardcoding 
> {{CheckpointOptions}} in the {{triggerHook}}.
> The outcome of current state is that operators would try to take checkpoints 
> in the checkpoint location whereas the {{CheckpointCoordinator}} will write 
> metadata for those states in the savepoint location.
> Moreover the situation gets even weirder (I have not checked it entirely), if 
> we have a mixture of {{ExternallyInducedSource(s)}} and regular sources. In 
> such a case the location and format at which the state of a particular task 
> is persisted depends on the order of barriers arrival. If a barrier from a 
> regular source arrives last the task takes a savepoint, on the other hand if 
> last barrier is from an externally induced source it will take a checkpoint.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25256) Savepoints do not work with ExternallyInducedSources

2021-12-10 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25256:
---
Description: 
It is not possible to take a proper savepoint with {{ExternallyInducedSource}} 
or {{ExternallyInducedSourceReader}} (both legacy and FLIP-27 versions). The 
problem is that we're hardcoding {{CheckpointOptions}} in the {{triggerHook}}.

The outcome of current state is that operators would try to take checkpoints in 
the checkpoint location whereas the {{CheckpointCoordinator}} will write 
metadata for those states in the savepoint location.

Moreover, the situation gets even weirder (I have not checked it entirely) if 
we have a mixture of {{ExternallyInducedSource(s)}} and regular sources. In 
such a case, the location and format at which the state of a particular task is 
persisted depends on the order of barrier arrival. If a barrier from a regular 
source arrives last, the task takes a savepoint; on the other hand, if the last 
barrier is from an externally induced source, it will take a checkpoint.

  was:
It is not possible to take a proper savepoint with {{ExternallyInducedSource}} 
or {{ExternallyInducedSourceReader}}. The problem is that we're hardcoding 
{{CheckpointOptions}} in the {{triggerHook}}.

The outcome of current state is that operators would try to take checkpoints in 
the checkpoint location whereas the {{CheckpointCoordinator}} will write 
metadata for those states in the savepoint location.

Moreover the situation gets even weirder (I have not checked it entirely), if 
we have a mixture of {{ExternallyInducedSource(s)}} and regular sources. In 
such a case the location and format at which the state of a particular task is 
persisted depends on the order of barriers arrival. If a barrier from a regular 
source arrives last the task takes a savepoint, on the other hand if last 
barrier is from an externally induced source it will take a checkpoint.


> Savepoints do not work with ExternallyInducedSources
> 
>
> Key: FLINK-25256
> URL: https://issues.apache.org/jira/browse/FLINK-25256
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0, 1.13.3
>Reporter: Dawid Wysakowicz
>Priority: Major
>
> It is not possible to take a proper savepoint with 
> {{ExternallyInducedSource}} or {{ExternallyInducedSourceReader}} (both legacy 
> and FLIP-27 versions). The problem is that we're hardcoding 
> {{CheckpointOptions}} in the {{triggerHook}}.
> The outcome of current state is that operators would try to take checkpoints 
> in the checkpoint location whereas the {{CheckpointCoordinator}} will write 
> metadata for those states in the savepoint location.
> Moreover the situation gets even weirder (I have not checked it entirely), if 
> we have a mixture of {{ExternallyInducedSource(s)}} and regular sources. In 
> such a case the location and format at which the state of a particular task 
> is persisted depends on the order of barriers arrival. If a barrier from a 
> regular source arrives last the task takes a savepoint, on the other hand if 
> last barrier is from an externally induced source it will take a checkpoint.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18808) Task-level numRecordsOut metric may be underestimated

2021-12-10 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457071#comment-17457071
 ] 

Piotr Nowojski commented on FLINK-18808:


[~wanglijie95] as far as I remember, counting calls to {{emitRecord}} or 
{{broadcastRecord}} would give you incorrect results for {{numRecordsOut}}, 
for example because of the {{BroadcastingOutputCollector}}. There might be more 
than one invocation of {{emitRecord}} per single record.

Regarding {{numRecordsSent}}, please check my previous comment; I would be 
against introducing it or replacing {{numRecordsOut}} with {{numRecordsSent}}.
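The over-counting concern can be shown with a minimal sketch in plain Java (not Flink's actual {{BroadcastingOutputCollector}}; all names here are hypothetical): a broadcasting collector forwards each logical record to every downstream channel, so a counter on channel-level emit calls is incremented once per channel rather than once per record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch: counting channel-level emit calls over-counts the
 *  number of logical output records when a record is broadcast. */
class BroadcastCounting {
    interface Collector { void emitRecord(String record); }

    /** Forwards one logical record to every channel: N channels -> N calls. */
    static class BroadcastingCollector implements Collector {
        final List<Collector> channels;
        BroadcastingCollector(List<Collector> channels) { this.channels = channels; }
        @Override public void emitRecord(String record) {
            for (Collector c : channels) c.emitRecord(record);
        }
    }

    /** Emits {@code records} logical records to {@code numChannels} channels
     *  and returns how many channel-level emit calls a naive counter sees. */
    public static long countEmitCalls(int numChannels, int records) {
        final AtomicLong calls = new AtomicLong();
        List<Collector> channels = new ArrayList<>();
        for (int i = 0; i < numChannels; i++) {
            channels.add(r -> calls.incrementAndGet());
        }
        Collector out = new BroadcastingCollector(channels);
        for (int i = 0; i < records; i++) {
            out.emitRecord("record-" + i);
        }
        return calls.get();
    }

    public static void main(String[] args) {
        // 1 logical record broadcast to 3 channels -> 3 counted emit calls
        System.out.println(countEmitCalls(3, 1));
    }
}
```

This is why a fix needs to decide explicitly whether a broadcast record counts once or once per channel, rather than inferring the count from call sites.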

> Task-level numRecordsOut metric may be underestimated
> -
>
> Key: FLINK-18808
> URL: https://issues.apache.org/jira/browse/FLINK-18808
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.11.1
>Reporter: ming li
>Priority: Not a Priority
>  Labels: pull-request-available, usability
> Attachments: image-2020-08-04-11-28-13-800.png, 
> image-2020-08-04-11-32-20-678.png, image-2020-08-13-18-36-13-282.png
>
>
> At present, we only register task-level numRecordsOut metric by reusing 
> operator output record counter at the end of OperatorChain.
> {code:java}
> if (config.isChainEnd()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }
> {code}
> If we only send data out through the last operator of the OperatorChain, 
> there is no problem with these statistics. But consider the following scenario:
> !image-2020-08-04-11-28-13-800.png|width=507,height=174!
> In this JobGraph, we not only send data in the last operator, but also send 
> data in the middle operator of OperatorChain (the map operator just returns 
> the original value directly). Below is one of our test topology, we can see 
> that the statistics actually only have half of the total data received by the 
> downstream.
> !image-2020-08-04-11-32-20-678.png|width=648,height=251!
> I think the data sent out by the intermediate operator should also be counted 
> into the numRecordsOut of the Task. But currently we are not reusing 
> operator output record counters in the intermediate operators, which leads 
> to our task-level numRecordsOut metric being underestimated (although this 
> has no effect on the actual operation of the job, it may affect our 
> monitoring).
> A simple idea of mine is to modify the condition for reusing the operator 
> output record counter:
> {code:java}
> if (!config.getNonChainedOutputs(getUserCodeClassloader()).isEmpty()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }{code}
> In addition, I have another question: If a record is broadcast to all 
> downstream, should the numRecordsOut counter increase by one or the 
> downstream number? It seems that currently we are adding one to calculate the 
> numRecordsOut metric.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25255) Consider/design implementing State Processor API (FC)

2021-12-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25255:
--

 Summary: Consider/design implementing State Processor API (FC)
 Key: FLINK-25255
 URL: https://issues.apache.org/jira/browse/FLINK-25255
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18808) Task-level numRecordsOut metric may be underestimated

2021-12-09 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456235#comment-17456235
 ] 

Piotr Nowojski commented on FLINK-18808:


I'm not sure. Maybe it would be actually better to pick up the old PR 
https://github.com/apache/flink/pull/13109 for fixing the number of records 
produced? Maybe it's enough to rebase and simplify it ([as stated in my last 
comment|https://github.com/apache/flink/pull/13109#issuecomment-688690309]). 

Having properly working numRecordsSent and buggy numRecordsOut would be very 
confusing. 

Dropping numRecordsOut and replacing it with numRecordsSent would require us to 
invest extra effort in figuring out what to do with backward compatibility of 
the metrics and might prove impossible. 

On the other hand having both of them (properly working) might be a little bit 
redundant? 

That's why I would suggest first re-evaluating this old PR. However, I don't 
fully remember what the status of this change was, whether there were still 
some unanswered questions, and whether it was safe from the performance 
perspective or not. 


> Task-level numRecordsOut metric may be underestimated
> -
>
> Key: FLINK-18808
> URL: https://issues.apache.org/jira/browse/FLINK-18808
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.11.1
>Reporter: ming li
>Priority: Not a Priority
>  Labels: pull-request-available, usability
> Attachments: image-2020-08-04-11-28-13-800.png, 
> image-2020-08-04-11-32-20-678.png, image-2020-08-13-18-36-13-282.png
>
>
> At present, we only register task-level numRecordsOut metric by reusing 
> operator output record counter at the end of OperatorChain.
> {code:java}
> if (config.isChainEnd()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }
> {code}
> If we only send data out through the last operator of the OperatorChain, 
> there is no problem with these statistics. But consider the following scenario:
> !image-2020-08-04-11-28-13-800.png|width=507,height=174!
> In this JobGraph, we not only send data in the last operator, but also send 
> data in the middle operator of OperatorChain (the map operator just returns 
> the original value directly). Below is one of our test topology, we can see 
> that the statistics actually only have half of the total data received by the 
> downstream.
> !image-2020-08-04-11-32-20-678.png|width=648,height=251!
> I think the data sent out by the intermediate operator should also be counted 
> into the numRecordsOut of the Task. But currently we are not reusing 
> operator output record counters in the intermediate operators, which leads 
> to our task-level numRecordsOut metric being underestimated (although this 
> has no effect on the actual operation of the job, it may affect our 
> monitoring).
> A simple idea of mine is to modify the condition for reusing the operator 
> output record counter:
> {code:java}
> if (!config.getNonChainedOutputs(getUserCodeClassloader()).isEmpty()) {
>operatorMetricGroup.getIOMetricGroup().reuseOutputMetricsForTask();
> }{code}
> In addition, I have another question: If a record is broadcast to all 
> downstream, should the numRecordsOut counter increase by one or the 
> downstream number? It seems that currently we are adding one to calculate the 
> numRecordsOut metric.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-18647) How to handle processing time timers with bounded input

2021-12-09 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-18647:
---
Priority: Not a Priority  (was: Minor)

> How to handle processing time timers with bounded input
> ---
>
> Key: FLINK-18647
> URL: https://issues.apache.org/jira/browse/FLINK-18647
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Affects Versions: 1.11.0
>Reporter: Piotr Nowojski
>Priority: Not a Priority
>  Labels: auto-deprioritized-critical, auto-deprioritized-major, 
> stale-minor
>
> (most of this description comes from an offline discussion between me, 
> [~AHeise], [~roman_khachatryan], [~aljoscha] and [~sunhaibotb])
> In case of end of input (for example for bounded sources), all pending 
> (untriggered) processing time timers are ignored/dropped. In some cases this 
> is desirable, but for example for {{WindowOperator}} it means that last 
> trailing window will not be triggered, causing an apparent data loss.
> There are a couple of ideas what should be considered.
> 1. Provide a way for users to decide what to do with such timers: cancel, 
> wait, trigger immediately. For example by overloading the existing methods: 
> {{ProcessingTimeService#registerTimer}} and 
> {{ProcessingTimeService#scheduleAtFixedRate}} in the following way:
> {code:java}
> ScheduledFuture<?> registerTimer(long timestamp, ProcessingTimeCallback 
> target, TimerAction timerAction);
> enum TimerAction { 
> CANCEL_ON_END_OF_INPUT, 
> TRIGGER_ON_END_OF_INPUT,
> WAIT_ON_END_OF_INPUT}
> {code}
> or maybe:
> {code}
> public interface TimerAction {
> void onEndOfInput(ScheduledFuture<?> timer);
> }
> {code}
> But this would also mean we store additional state with each timer, we need 
> to modify the serialisation format (providing some kind of state migration 
> path) and potentially increase the size footprint of the timers.
> Extra overhead could have been avoided via some kind of 
> {{Map<..., TimerAction>}}, with a missing entry meaning some default value.
> 2. 
> Also another way to solve this problem might be let the operator code decide 
> what to do with the given timer. Either ask an operator what should happen 
> with given timer (a), or let the operator iterate and cancel the timers on 
> endOfInput() (b), or just fire the timer with some endOfInput flag (c).
> I think none of the (a), (b), and (c) would require breaking API changes, no 
> state changes and no additional overheads. Just the logic what to do with the 
> timer would have to be “hardcoded” in the operator’s code (which btw might 
> even have an additional benefit of being easier to change in case of some 
> bugs, like a timer registered with a wrong/incorrect {{TimerAction}}).
> This is complicated a bit by a question, how (if at all?) options a), b) or 
> c) should be exposed to UDFs? 
> 3. 
> Maybe we need a combination of both? Pre existing operators could implement 
> some custom handling of this issue (via 2a, 2b or 2c), while UDFs could be 
> handled by 1.? 
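Option (c) above can be sketched in plain Java (all names hypothetical, not Flink API): instead of silently dropping pending processing-time timers at end of input, the timer service fires each one with an endOfInput flag and lets the operator's callback decide what to do.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of option (c): fire pending timers at end of input
 *  with a flag, instead of dropping them. Names are hypothetical. */
class EndOfInputTimers {
    interface TimerCallback { void onTimer(long timestamp, boolean endOfInput); }

    static final class Timer {
        final long timestamp;
        final TimerCallback callback;
        Timer(long timestamp, TimerCallback callback) {
            this.timestamp = timestamp;
            this.callback = callback;
        }
    }

    static class TimerService {
        final List<Timer> pending = new ArrayList<>();
        void registerTimer(long ts, TimerCallback cb) { pending.add(new Timer(ts, cb)); }
        /** Option (c): fire each pending timer with endOfInput = true. */
        void endInput() {
            for (Timer t : pending) t.callback.onTimer(t.timestamp, true);
            pending.clear();
        }
    }

    /** Returns the timestamps whose callbacks chose to fire at end of input. */
    public static List<Long> firedAtEndOfInput() {
        List<Long> fired = new ArrayList<>();
        TimerService timers = new TimerService();
        // e.g. a window operator could flush its trailing window here,
        // avoiding the apparent data loss described above
        timers.registerTimer(1000L, (ts, eoi) -> { if (eoi) fired.add(ts); });
        timers.registerTimer(2000L, (ts, eoi) -> { if (eoi) fired.add(ts); });
        timers.endInput();
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(firedAtEndOfInput());
    }
}
```

Since the decision lives in the operator's callback code, no per-timer state or serialisation format change is needed, matching the claim above.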



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25167) Support user-defined `StreamOperatorFactory` in `ConnectedStreams`#transform

2021-12-09 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456224#comment-17456224
 ] 

Piotr Nowojski commented on FLINK-25167:


Until we overhaul {{ProcessFunction}} I think exposing 
{{StreamOperatorFactory}} wouldn't be a bad idea. 

However, as [~arvid] noted, {{OperatorCoordinator}} is an internal interface 
and as of now we do not provide any guarantees for its stability or existence, 
even between minor bug fix releases. Whether it should be exposed in the future 
or not, that's a good question.

> Support user-defined `StreamOperatorFactory` in `ConnectedStreams`#transform
> 
>
> Key: FLINK-25167
> URL: https://issues.apache.org/jira/browse/FLINK-25167
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Reporter: Lsw_aka_laplace
>Priority: Minor
>
>   From my side, it is necessary to set my custom `StreamOperatorFactory` when 
> I ’m calling  `ConnectedStreams`#transform so that I can set up my own 
> `OperatorCoordinator`. 
>  Well, currently, `ConnectedStreams` does not seem to give that access; the 
> default behavior is using `SimpleOperatorFactory`.  After checking the code, 
> I think it is a trivial change to support that. If no one is working on it, 
> I'm willing to do that.  : )



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24149) Make checkpoint self-contained and relocatable

2021-12-08 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455237#comment-17455237
 ] 

Piotr Nowojski commented on FLINK-24149:


[~Feifan Wang] I've noticed that you opened a PR for this feature 
([~dwysakowicz] has already written down why we think that your PR is 
incorrect). Here I would like to re-open the discussion if this issue is still 
valid or not in the context of 
[FLIP-193|https://cwiki.apache.org/confluence/display/FLINK/FLIP-193%3A+Snapshots+ownership]
 and its future follow-up (incremental native format savepoints, also planned 
for 1.15).

1. I think incremental native format savepoints will help address your 
background problem. As long as your filesystem supports cheap duplication of 
the artefacts, such savepoints will be just as quick as incremental 
checkpoints.
2. "Problem 1 : retained incremental checkpoint difficult to clean up once they 
used for recovery" is going to be addressed directly by FLIP-193, with its 
{{claim}} and {{no-claim}} recovery modes.
3. As such, "Problem 2 : checkpoint not relocatable" would have little value. 
It would only make sense in scenarios where the user is not able to complete a 
savepoint, for example during some disaster recovery, when the user has to 
manually move checkpoint files to a new cluster.

However, as we pointed out before, relocatable checkpoints are only easily 
do-able for self-contained checkpoints (non-incremental ones). With incremental 
checkpoints, it's tricky to handle relative directories. First of all (as 
visible in the test failures in your PR) we would need to re-relativize file 
paths with respect to the new {{_metadata}} file. But even then, that's quite 
fishy. How would the user know which files to relocate if they are spread among 
multiple directories? As long as your checkpoint references only previous files 
from the same job, I could imagine the user relocating the directory containing 
all of the checkpoints, but that's only a part of the story. Referenced files 
can be in a completely different root directory, or even a completely different 
file system, so in the general case it would be quite difficult to achieve.

All in all, I would strongly suggest dropping this feature request for any 
foreseeable future, since incremental savepoints should solve most of the use 
cases here.

If that's not enough, I can imagine making self-contained checkpoints 
(non-incremental) relocatable. It would be quite easy to understand and explain 
to the users, and it has some value, as I mentioned before (disaster recovery 
when the user is unable to restart a Flink job in the same cluster without 
first relocating checkpoints). But this is not as simple as your current 
proposal, and it wouldn't solve your problem (as you are using incremental 
checkpoints).

Supporting relocation of incremental checkpoints is, I think, difficult if we 
want to support all cases (including checkpoints that are stored in different 
file systems). On top of that, I don't see how the user would know how to even 
relocate such a checkpoint that is spread among many directories/filesystems. 
How should they know which files to copy? If we want to limit ourselves to 
files in the same root checkpoints directory, it becomes a bit inconsistent: 
how would the user know whether this particular checkpoint is relocatable or 
not? So I would also be against doing that. It would be a limited-value, 
dangerous feature that's not trivial to implement.

> Make checkpoint self-contained and relocatable
> --
>
> Key: FLINK-24149
> URL: https://issues.apache.org/jira/browse/FLINK-24149
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Feifan Wang
>Priority: Major
>  Labels: pull-request-available, stale-major
> Attachments: image-2021-09-08-17-06-31-560.png, 
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, 
> image-2021-09-08-18-01-03-176.png, image-2021-09-14-14-22-31-537.png
>
>
> h1. Backgroud
> We have many jobs with large state size in production environment. According 
> to the operation practice of these jobs and the analysis of some specific 
> problems, we believe that RocksDBStateBackend's incremental checkpoint has 
> many advantages over savepoint:
>  # Savepoint takes a much longer time than incremental checkpoint in jobs 
> with large state. The figure below is a job in our production environment; it 
> takes nearly 7 minutes to complete a savepoint, while checkpoint only takes a 
> few seconds. (A checkpoint taking a longer time after a savepoint is a 
> problem described in -FLINK-23949-)
>  !image-2021-09-08-17-55-46-898.png|width=723,height=161!
>  # Savepoint causes excessive cpu usage. The figure below shows the CPU usage 
> of the same job in th

[jira] [Assigned] (FLINK-24086) Do not re-register SharedStateRegistry to reduce the recovery time of the job

2021-12-08 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-24086:
--

Assignee: Roman Khachatryan  (was: ming li)

> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -
>
> Key: FLINK-24086
> URL: https://issues.apache.org/jira/browse/FLINK-24086
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Reporter: ming li
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> At present, we only recover the {{CompletedCheckpointStore}} when the 
> {{JobManager}} starts, so it seems that we do not need to re-register the 
> {{SharedStateRegistry}} when the task restarts.
> The reason for this issue is that in our production environment, we discard 
> part of the data and state to only restart the failed task, but found that it 
> may take several seconds to register the {{SharedStateRegistry}} (thousands 
> of tasks and dozens of TB states). When there are a large number of task 
> failures at the same time, this may take several minutes (number of tasks * 
> several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task 
> recovery can be reduced.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25026) UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint fails on AZP

2021-12-08 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25026:
---
Priority: Major  (was: Critical)

> UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint fails on AZP
> --
>
> Key: FLINK-25026
> URL: https://issues.apache.org/jira/browse/FLINK-25026
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: test-stability
> Fix For: 1.14.1
>
>
> {{UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint}} fails 
> on AZP with
> {code}
> 2021-11-23T00:58:03.8286352Z Nov 23 00:58:03 [ERROR] Tests run: 72, Failures: 
> 0, Errors: 1, Skipped: 0, Time elapsed: 716.362 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase
> 2021-11-23T00:58:03.8288790Z Nov 23 00:58:03 [ERROR] 
> shouldRescaleUnalignedCheckpoint[downscale union from 3 to 2, 
> buffersPerChannel = 1]  Time elapsed: 4.051 s  <<< ERROR!
> 2021-11-23T00:58:03.8289953Z Nov 23 00:58:03 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-11-23T00:58:03.8291473Z Nov 23 00:58:03  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-11-23T00:58:03.8292776Z Nov 23 00:58:03  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-11-23T00:58:03.8294520Z Nov 23 00:58:03  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:534)
> 2021-11-23T00:58:03.8295909Z Nov 23 00:58:03  at 
> jdk.internal.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
> 2021-11-23T00:58:03.8297310Z Nov 23 00:58:03  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-11-23T00:58:03.8298922Z Nov 23 00:58:03  at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> 2021-11-23T00:58:03.8300298Z Nov 23 00:58:03  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> 2021-11-23T00:58:03.8301741Z Nov 23 00:58:03  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-11-23T00:58:03.8303233Z Nov 23 00:58:03  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> 2021-11-23T00:58:03.8304514Z Nov 23 00:58:03  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-11-23T00:58:03.8305736Z Nov 23 00:58:03  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-11-23T00:58:03.8306856Z Nov 23 00:58:03  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> 2021-11-23T00:58:03.8308218Z Nov 23 00:58:03  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> 2021-11-23T00:58:03.8309532Z Nov 23 00:58:03  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-11-23T00:58:03.8310780Z Nov 23 00:58:03  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> 2021-11-23T00:58:03.8312026Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 2021-11-23T00:58:03.8313515Z Nov 23 00:58:03  at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> 2021-11-23T00:58:03.8314842Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> 2021-11-23T00:58:03.8316116Z Nov 23 00:58:03  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> 2021-11-23T00:58:03.8317538Z Nov 23 00:58:03  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> 2021-11-23T00:58:03.8320044Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 2021-11-23T00:58:03.8321044Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 2021-11-23T00:58:03.8321978Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 2021-11-23T00:58:03.8322915Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 2021-11-23T00:58:03.8323848Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 2021-11-23T00:58:03.8325330Z Nov 23 00:58:03  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 2021-11-23T00:58:03.8337747Z Nov 23 00:58:03  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-11-23T00:58:03.8339178Z Nov 23 00:58:03  at 
> org.junit.runners.Suite.runChild(Suite

[jira] [Closed] (FLINK-25081) When chaining an operator of a side output stream, the num records sent displayed on the dashboard is incorrect

2021-12-07 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-25081.
--
Resolution: Duplicate

[~reswqa], let me close this ticket so that we have all of the discussions in 
one place.

Can you respond/repost your question in the original ticket?

> When chaining an operator of a side output stream, the num records sent 
> displayed on the dashboard is incorrect
> ---
>
> Key: FLINK-25081
> URL: https://issues.apache.org/jira/browse/FLINK-25081
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Lijie Wang
>Priority: Major
> Attachments: image-2021-11-26-20-32-08-443.png
>
>
> As show in the following figure, "Map" is an operator of a side output 
> stream, the num records sent of first vertex is 0.
> !image-2021-11-26-20-32-08-443.png|width=750,height=253!
>  
> The job code is as follows:
> {code:java}
> final StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> SingleOutputStreamOperator<Long> dataStream =
> env.addSource(new 
> DataGeneratorSource<>(RandomGenerator.longGenerator(1, 1000)))
> .returns(Long.class)
> .setParallelism(10)
> .slotSharingGroup("group1");
> DataStream<Long> sideOutput = dataStream.getSideOutput(new 
> OutputTag<Long>("10") {});
> sideOutput.map(num -> num).setParallelism(10).slotSharingGroup("group1");
> dataStream.addSink(new 
> DiscardingSink<>()).setParallelism(10).slotSharingGroup("group2");
> env.execute("WordCount"); {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25081) When chaining an operator of a side output stream, the num records sent displayed on the dashboard is incorrect

2021-12-07 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25081:
---
Component/s: Runtime / Task
 (was: Runtime / Network)

> When chaining an operator of a side output stream, the num records sent 
> displayed on the dashboard is incorrect
> ---
>
> Key: FLINK-25081
> URL: https://issues.apache.org/jira/browse/FLINK-25081
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics, Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Lijie Wang
>Priority: Major
> Attachments: image-2021-11-26-20-32-08-443.png
>
>
> As show in the following figure, "Map" is an operator of a side output 
> stream, the num records sent of first vertex is 0.
> !image-2021-11-26-20-32-08-443.png|width=750,height=253!
>  
> The job code is as follows:
> {code:java}
> final StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> SingleOutputStreamOperator<Long> dataStream =
> env.addSource(new 
> DataGeneratorSource<>(RandomGenerator.longGenerator(1, 1000)))
> .returns(Long.class)
> .setParallelism(10)
> .slotSharingGroup("group1");
> DataStream<Long> sideOutput = dataStream.getSideOutput(new 
> OutputTag<Long>("10") {});
> sideOutput.map(num -> num).setParallelism(10).slotSharingGroup("group1");
> dataStream.addSink(new 
> DiscardingSink<>()).setParallelism(10).slotSharingGroup("group2");
> env.execute("WordCount"); {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25081) When chaining an operator of a side output stream, the num records sent displayed on the dashboard is incorrect

2021-12-07 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25081:
---
Component/s: Runtime / Network

> When chaining an operator of a side output stream, the num records sent 
> displayed on the dashboard is incorrect
> ---
>
> Key: FLINK-25081
> URL: https://issues.apache.org/jira/browse/FLINK-25081
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics, Runtime / Network
>Affects Versions: 1.14.0
>Reporter: Lijie Wang
>Priority: Major
> Attachments: image-2021-11-26-20-32-08-443.png
>
>
> As shown in the following figure, "Map" is an operator on a side output 
> stream; the num records sent of the first vertex is 0.
> !image-2021-11-26-20-32-08-443.png|width=750,height=253!
>  
> The job code is as follows:
> {code:java}
> final StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
> SingleOutputStreamOperator<Long> dataStream =
>         env.addSource(new DataGeneratorSource<>(RandomGenerator.longGenerator(1, 1000)))
>                 .returns(Long.class)
>                 .setParallelism(10)
>                 .slotSharingGroup("group1");
> DataStream<Long> sideOutput = dataStream.getSideOutput(new OutputTag<Long>("10") {});
> sideOutput.map(num -> num).setParallelism(10).slotSharingGroup("group1");
> dataStream.addSink(new DiscardingSink<>()).setParallelism(10).slotSharingGroup("group2");
> env.execute("WordCount"); {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25081) When chaining an operator of a side output stream, the num records sent displayed on the dashboard is incorrect

2021-12-07 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454624#comment-17454624
 ] 

Piotr Nowojski edited comment on FLINK-25081 at 12/7/21, 12:41 PM:
---

Isn't this a duplicate of FLINK-18808? There was some longer discussion about 
how to solve it, but in the end the external contributor abandoned the ticket.


was (Author: pnowojski):
Isn't this a duplicate of FLINK-18808?

> When chaining an operator of a side output stream, the num records sent 
> displayed on the dashboard is incorrect
> ---
>
> Key: FLINK-25081
> URL: https://issues.apache.org/jira/browse/FLINK-25081
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.14.0
>Reporter: Lijie Wang
>Priority: Major
> Attachments: image-2021-11-26-20-32-08-443.png
>
>
> As shown in the following figure, "Map" is an operator on a side output 
> stream; the num records sent of the first vertex is 0.
> !image-2021-11-26-20-32-08-443.png|width=750,height=253!
>  
> The job code is as follows:
> {code:java}
> final StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
> SingleOutputStreamOperator<Long> dataStream =
>         env.addSource(new DataGeneratorSource<>(RandomGenerator.longGenerator(1, 1000)))
>                 .returns(Long.class)
>                 .setParallelism(10)
>                 .slotSharingGroup("group1");
> DataStream<Long> sideOutput = dataStream.getSideOutput(new OutputTag<Long>("10") {});
> sideOutput.map(num -> num).setParallelism(10).slotSharingGroup("group1");
> dataStream.addSink(new DiscardingSink<>()).setParallelism(10).slotSharingGroup("group2");
> env.execute("WordCount"); {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25081) When chaining an operator of a side output stream, the num records sent displayed on the dashboard is incorrect

2021-12-07 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454624#comment-17454624
 ] 

Piotr Nowojski commented on FLINK-25081:


Isn't this a duplicate of FLINK-18808?

> When chaining an operator of a side output stream, the num records sent 
> displayed on the dashboard is incorrect
> ---
>
> Key: FLINK-25081
> URL: https://issues.apache.org/jira/browse/FLINK-25081
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.14.0
>Reporter: Lijie Wang
>Priority: Major
> Attachments: image-2021-11-26-20-32-08-443.png
>
>
> As shown in the following figure, "Map" is an operator on a side output 
> stream; the num records sent of the first vertex is 0.
> !image-2021-11-26-20-32-08-443.png|width=750,height=253!
>  
> The job code is as follows:
> {code:java}
> final StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
> SingleOutputStreamOperator<Long> dataStream =
>         env.addSource(new DataGeneratorSource<>(RandomGenerator.longGenerator(1, 1000)))
>                 .returns(Long.class)
>                 .setParallelism(10)
>                 .slotSharingGroup("group1");
> DataStream<Long> sideOutput = dataStream.getSideOutput(new OutputTag<Long>("10") {});
> sideOutput.map(num -> num).setParallelism(10).slotSharingGroup("group1");
> dataStream.addSink(new DiscardingSink<>()).setParallelism(10).slotSharingGroup("group2");
> env.execute("WordCount"); {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25191) Skip savepoints for recovery

2021-12-07 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25191:
---
Description: 
Intermediate savepoints should not be used for recovery. To achieve that, we 
should:
* not send {{notifyCheckpointComplete}} for intermediate savepoints
* not add them to the {{CompletedCheckpointStore}}

Important! Synchronous savepoints (stop-with-savepoint) should still commit 
side effects. We need to distinguish them from intermediate savepoints.

https://cwiki.apache.org/confluence/x/bIyqCw#FLIP193:Snapshotsownership-SkippingSavepointsforRecovery

Document the recommendation to drop (by changing the UID) the external state 
(such as transactions from two-phase commit sinks) when starting multiple jobs 
from the same intermediate savepoint.

  was:
Intermediate savepoints should not be used for recovery. In order to achieve 
that we should:
* do not send {{notifyCheckpointComplete}} for intermediate savepoints
* do not add them to {{CompletedCheckpointStore}}

Important! Synchronous savepoints (stop-with-savepoint) should still commit 
side-effects. We need to distinguish them from the intermediate savepoints.

https://cwiki.apache.org/confluence/x/bIyqCw#FLIP193:Snapshotsownership-SkippingSavepointsforRecovery

Document recommendation to drop the sink's state if starting multiple jobs from 
the same intermediate savepoint.


> Skip savepoints for recovery
> 
>
> Key: FLINK-25191
> URL: https://issues.apache.org/jira/browse/FLINK-25191
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Dawid Wysakowicz
>Priority: Major
> Fix For: 1.15.0
>
>
> Intermediate savepoints should not be used for recovery. To achieve that, we 
> should:
> * not send {{notifyCheckpointComplete}} for intermediate savepoints
> * not add them to the {{CompletedCheckpointStore}}
> Important! Synchronous savepoints (stop-with-savepoint) should still commit 
> side effects. We need to distinguish them from intermediate savepoints.
> https://cwiki.apache.org/confluence/x/bIyqCw#FLIP193:Snapshotsownership-SkippingSavepointsforRecovery
> Document the recommendation to drop (by changing the UID) the external state 
> (such as transactions from two-phase commit sinks) when starting multiple 
> jobs from the same intermediate savepoint.
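The two rules in the ticket can be sketched as a classification over snapshot types. This is a hypothetical illustration only; the enum and method names are invented for clarity and do not mirror Flink's internal API.

```java
public class SnapshotPolicy {
    // Illustrative classification of snapshots; names are assumptions,
    // not Flink internals.
    enum SnapshotType { CHECKPOINT, INTERMEDIATE_SAVEPOINT, SYNC_SAVEPOINT }

    // Intermediate savepoints must not commit side effects;
    // stop-with-savepoint (SYNC_SAVEPOINT) still must.
    static boolean sendNotifyCheckpointComplete(SnapshotType t) {
        return t != SnapshotType.INTERMEDIATE_SAVEPOINT;
    }

    // Only checkpoints are registered in the CompletedCheckpointStore
    // and therefore usable for recovery.
    static boolean addToCompletedCheckpointStore(SnapshotType t) {
        return t == SnapshotType.CHECKPOINT;
    }

    public static void main(String[] args) {
        // prints true: stop-with-savepoint still commits side effects
        System.out.println(sendNotifyCheckpointComplete(SnapshotType.SYNC_SAVEPOINT));
        // prints false: intermediate savepoints are skipped for recovery
        System.out.println(addToCompletedCheckpointStore(SnapshotType.INTERMEDIATE_SAVEPOINT));
    }
}
```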



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-14884) User defined function for checkpoint trigger time and interval behavior

2021-12-06 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-14884:
---
Priority: Not a Priority  (was: Minor)

> User defined function for checkpoint trigger time and interval behavior
> ---
>
> Key: FLINK-14884
> URL: https://issues.apache.org/jira/browse/FLINK-14884
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream, Runtime / Checkpointing
>Reporter: Shuwen Zhou
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> Hi,
> I would like to control the checkpoint trigger time and interval behavior 
> through a user-defined function.
> In some cases I would like to skip checkpoints during certain hours or 
> minutes, or vice versa.
> In other cases I would like to increase the checkpoint interval during 
> certain hours.
> In some cases a crontab-style configuration could simply be used.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-14011) Make some fields final and initialize them during construction in AsyncWaitOperator

2021-12-06 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-14011:
---
Priority: Not a Priority  (was: Minor)

> Make some fields final and initialize them during construction in 
> AsyncWaitOperator
> ---
>
> Key: FLINK-14011
> URL: https://issues.apache.org/jira/browse/FLINK-14011
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Reporter: Alex
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> This is a small follow up ticket after the FLINK-12958.
> With the changes introduced there, the {{AsyncWaitOperator}} is created by 
> {{AsyncWaitOperatorFactory}}, so some fields that initialized in the 
> {{setup}} method can be setup in the constructor instead.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25103) KeyedBroadcastProcessFunction run set 6, parallelism ValueState variables A

2021-12-06 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453848#comment-17453848
 ] 

Piotr Nowojski commented on FLINK-25103:


Hey [~wangbaohua]. A note for the future: it's much better to ask such 
questions on [the user mailing 
list|https://flink.apache.org/community.html#mailing-lists]. You would get 
help and responses there much more quickly than by filing a ticket.

Getting back to your question, what do you mean by "ValueState variables A" in 
the context of the shared code snippet? It would also be helpful (both for us 
and for you) if you could create a minimal example, with some artificial data 
source, that we can run and that would show the problem.

> KeyedBroadcastProcessFunction run set 6, parallelism ValueState variables A
> ---
>
> Key: FLINK-25103
> URL: https://issues.apache.org/jira/browse/FLINK-25103
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.14.0
>Reporter: wangbaohua
>Priority: Major
>
> A KeyedBroadcastProcessFunction runs with parallelism set to 6 and uses a 
> ValueState variable A. How is A stored across the six tasks? While running, 
> I observed that some tasks fetched variable A as null, while others had 
> values. The code follows:
> 
> setParallelism(9);
> ..
> public class dealStreamProcessFunction extends 
> KeyedBroadcastProcessFunction<String, StandardEvent, List<String>, StandardEvent> {
> private static final Logger logger = 
> LoggerFactory.getLogger(dealStreamProcessFunction.class);
> private transient ValueState<List<StandardEvent>> listState;
> private transient ValueState<Boolean> runingFlagState;
> private transient ValueState<InferenceEngine> engineState;
> MapStateDescriptor<String, List<String>> ruleStateDescriptor = new 
> MapStateDescriptor<>(ContextInfo.RULE_SBROAD_CAST_STATE
> , BasicTypeInfo.STRING_TYPE_INFO
> , new ListTypeInfo<>(String.class));
> InferenceEngine engine;
> /**
>  * The open method is executed only once,
>  * so initialization can be implemented here.
>  *
>  * @param parameters
>  * @throws Exception
>  */
> @Override
> public void open(Configuration parameters) throws Exception {
> super.open(parameters);
> ValueStateDescriptor<List<StandardEvent>> recentOperatorsDescriptor = 
> new ValueStateDescriptor<>(
> "recent-operator",
> TypeInformation.of(new TypeHint<List<StandardEvent>>() {
> }));
> ValueStateDescriptor<Boolean> runingFlagDescriptor = new 
> ValueStateDescriptor<>(
> "runingFlag",
> Boolean.class);
> ValueStateDescriptor<InferenceEngine> engineDescriptor = new 
> ValueStateDescriptor<>(
> "runingFlag1",
> InferenceEngine.class);
> engineState = getRuntimeContext().getState(engineDescriptor);
> listState = getRuntimeContext().getState(recentOperatorsDescriptor);
> runingFlagState = getRuntimeContext().getState(runingFlagDescriptor);
> logger.info("KeyedBroadcastProcessFunction open");
> }
> @Override
> public void processElement(StandardEvent standardEvent, ReadOnlyContext 
> readOnlyContext, Collector<StandardEvent> collector) throws Exception {
> if (standardEvent == null) {
> return;
> }
> List<String> list = null;
> list = 
> readOnlyContext.getBroadcastState(ruleStateDescriptor).get(ContextInfo.RULE_SBROAD_CAST_STATE);
> if (list == null) {
> logger.info("RulesBroadcastState is null..");
> List<StandardEvent> lst = listState.value();
> if (lst == null) {
> lst = new ArrayList<>();
> }
> lst.add(standardEvent);
> listState.update(lst);
> return;
> }
> // first entry
> if (runingFlagState.value() == null) {
> logger.info("runingFlagState.value() == null");
> runingFlagState.update(true);
> }
> if (((runingFlagState.value() && list.get(0).equals("1")) || 
> list.get(0).equals("0"))) {
> logger.info("action update.:" + list.size() + ":" + 
> runingFlagState.value() + ":" + list.get(0));
> String flag = list.get(0);
> list.remove(0);
> InferenceEngine engine1 = 
> InferenceEngine.compile(RuleReader.parseRules(list));
> engineState.update(engine1);
> if (runingFlagState.value() && flag.equals("1")) {
> runingFlagState.update(false);
> }
> }
> if (engineState.value() != null) {
> List<StandardEvent> listTmp = listState.value();
> if (listTmp != null) {
> for (StandardEvent standardEventTmp : listTmp) {
> logger.info("listState.:" + standardEventTmp);
> match(standardEventTmp, col

[jira] [Updated] (FLINK-25103) KeyedBroadcastProcessFunction run set 6, parallelism ValueState variables A

2021-12-06 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25103:
---
Priority: Major  (was: Blocker)

> KeyedBroadcastProcessFunction run set 6, parallelism ValueState variables A
> ---
>
> Key: FLINK-25103
> URL: https://issues.apache.org/jira/browse/FLINK-25103
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.14.0
>Reporter: wangbaohua
>Priority: Major
>
> A KeyedBroadcastProcessFunction runs with parallelism set to 6 and uses a 
> ValueState variable A. How is A stored across the six tasks? While running, 
> I observed that some tasks fetched variable A as null, while others had 
> values. The code follows:
> 
> setParallelism(9);
> ..
> public class dealStreamProcessFunction extends 
> KeyedBroadcastProcessFunction<String, StandardEvent, List<String>, StandardEvent> {
> private static final Logger logger = 
> LoggerFactory.getLogger(dealStreamProcessFunction.class);
> private transient ValueState<List<StandardEvent>> listState;
> private transient ValueState<Boolean> runingFlagState;
> private transient ValueState<InferenceEngine> engineState;
> MapStateDescriptor<String, List<String>> ruleStateDescriptor = new 
> MapStateDescriptor<>(ContextInfo.RULE_SBROAD_CAST_STATE
> , BasicTypeInfo.STRING_TYPE_INFO
> , new ListTypeInfo<>(String.class));
> InferenceEngine engine;
> /**
>  * The open method is executed only once,
>  * so initialization can be implemented here.
>  *
>  * @param parameters
>  * @throws Exception
>  */
> @Override
> public void open(Configuration parameters) throws Exception {
> super.open(parameters);
> ValueStateDescriptor<List<StandardEvent>> recentOperatorsDescriptor = 
> new ValueStateDescriptor<>(
> "recent-operator",
> TypeInformation.of(new TypeHint<List<StandardEvent>>() {
> }));
> ValueStateDescriptor<Boolean> runingFlagDescriptor = new 
> ValueStateDescriptor<>(
> "runingFlag",
> Boolean.class);
> ValueStateDescriptor<InferenceEngine> engineDescriptor = new 
> ValueStateDescriptor<>(
> "runingFlag1",
> InferenceEngine.class);
> engineState = getRuntimeContext().getState(engineDescriptor);
> listState = getRuntimeContext().getState(recentOperatorsDescriptor);
> runingFlagState = getRuntimeContext().getState(runingFlagDescriptor);
> logger.info("KeyedBroadcastProcessFunction open");
> }
> @Override
> public void processElement(StandardEvent standardEvent, ReadOnlyContext 
> readOnlyContext, Collector<StandardEvent> collector) throws Exception {
> if (standardEvent == null) {
> return;
> }
> List<String> list = null;
> list = 
> readOnlyContext.getBroadcastState(ruleStateDescriptor).get(ContextInfo.RULE_SBROAD_CAST_STATE);
> if (list == null) {
> logger.info("RulesBroadcastState is null..");
> List<StandardEvent> lst = listState.value();
> if (lst == null) {
> lst = new ArrayList<>();
> }
> lst.add(standardEvent);
> listState.update(lst);
> return;
> }
> // first entry
> if (runingFlagState.value() == null) {
> logger.info("runingFlagState.value() == null");
> runingFlagState.update(true);
> }
> if (((runingFlagState.value() && list.get(0).equals("1")) || 
> list.get(0).equals("0"))) {
> logger.info("action update.:" + list.size() + ":" + 
> runingFlagState.value() + ":" + list.get(0));
> String flag = list.get(0);
> list.remove(0);
> InferenceEngine engine1 = 
> InferenceEngine.compile(RuleReader.parseRules(list));
> engineState.update(engine1);
> if (runingFlagState.value() && flag.equals("1")) {
> runingFlagState.update(false);
> }
> }
> if (engineState.value() != null) {
> List<StandardEvent> listTmp = listState.value();
> if (listTmp != null) {
> for (StandardEvent standardEventTmp : listTmp) {
> logger.info("listState.:" + standardEventTmp);
> match(standardEventTmp, collector);
> }
> listState.clear();
> }
> match(standardEvent, collector);
> } else {
> logger.info("processElement engine is null.:");
> }
> }
> private void match(StandardEvent standardEvent, Collector<StandardEvent> 
> collector) throws IOException {
> PatternMatcher matcher = engineState.value().matcher(standardEvent);
> if (matcher.find()) {
> List<Action> actions = matcher.getActions();
> for (Action action : actions) {
> if (sta

[jira] [Updated] (FLINK-25028) java.lang.OutOfMemoryError: Java heap space

2021-12-05 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-25028:
---
Priority: Major  (was: Critical)

> java.lang.OutOfMemoryError: Java heap space
> ---
>
> Key: FLINK-25028
> URL: https://issues.apache.org/jira/browse/FLINK-25028
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.14.0
>Reporter: wangbaohua
>Priority: Major
> Attachments: error.txt
>
>
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.HashMap.resize(HashMap.java:703) ~[?:1.8.0_131]
>   at java.util.HashMap.putVal(HashMap.java:628) ~[?:1.8.0_131]
>   at java.util.HashMap.put(HashMap.java:611) ~[?:1.8.0_131]
>   at java.util.HashSet.add(HashSet.java:219) ~[?:1.8.0_131]
>   at 
> java.io.ObjectStreamClass$FieldReflector.<init>(ObjectStreamClass.java:1945) 
> ~[?:1.8.0_131]
>   at java.io.ObjectStreamClass.getReflector(ObjectStreamClass.java:2193) 
> ~[?:1.8.0_131]
>   at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:521) 
> ~[?:1.8.0_131]
>   at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369) 
> ~[?:1.8.0_131]
>   at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468) 
> ~[?:1.8.0_131]
>   at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369) 
> ~[?:1.8.0_131]
>   at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134) 
> ~[?:1.8.0_131]
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) 
> ~[?:1.8.0_131]
>   at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) 
> ~[?:1.8.0_131]
>   at java.io.ObjectOutputStream.access$300(ObjectOutputStream.java:162) 
> ~[?:1.8.0_131]
>   at 
> java.io.ObjectOutputStream$PutFieldImpl.writeFields(ObjectOutputStream.java:1707)
>  ~[?:1.8.0_131]
>   at java.io.ObjectOutputStream.writeFields(ObjectOutputStream.java:482) 
> ~[?:1.8.0_131]
>   at 
> java.util.concurrent.ConcurrentHashMap.writeObject(ConcurrentHashMap.java:1406)
>  ~[?:1.8.0_131]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_131]
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_131]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_131]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) 
> ~[?:1.8.0_131]
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) 
> ~[?:1.8.0_131]
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> ~[?:1.8.0_131]
>   at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
> ~[?:1.8.0_131]
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) 
> ~[?:1.8.0_131]
>   at 
> org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:632)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
>   at 
> org.apache.flink.util.SerializedValue.<init>(SerializedValue.java:62) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
>   at 
> org.apache.flink.runtime.accumulators.AccumulatorSnapshot.<init>(AccumulatorSnapshot.java:51)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
>   at 
> org.apache.flink.runtime.accumulators.AccumulatorRegistry.getSnapshot(AccumulatorRegistry.java:54)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$retrievePayload$3(TaskExecutor.java:2425)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
>   at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener$$Lambda$1020/78782846.apply(Unknown
>  Source) ~[?:?]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24696) Translate how to configure unaligned checkpoints into Chinese

2021-11-23 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448094#comment-17448094
 ] 

Piotr Nowojski commented on FLINK-24696:


Sure, thanks!

> Translate how to configure unaligned checkpoints into Chinese
> -
>
> Key: FLINK-24696
> URL: https://issues.apache.org/jira/browse/FLINK-24696
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.15.0, 1.14.1
>Reporter: Piotr Nowojski
>Assignee: ZhuoYu Chen
>Priority: Major
> Fix For: 1.15.0
>
>
> As part of FLINK-24695 
> {{docs/content/docs/ops/state/checkpointing_under_backpressure.md}} and 
> {{docs/content/docs/dev/datastream/fault-tolerance/checkpointing.md}} were 
> modified. Those modifications should be translated into Chinese.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (FLINK-24696) Translate how to configure unaligned checkpoints into Chinese

2021-11-23 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-24696:
--

Assignee: ZhuoYu Chen

> Translate how to configure unaligned checkpoints into Chinese
> -
>
> Key: FLINK-24696
> URL: https://issues.apache.org/jira/browse/FLINK-24696
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.15.0, 1.14.1
>Reporter: Piotr Nowojski
>Assignee: ZhuoYu Chen
>Priority: Major
> Fix For: 1.15.0
>
>
> As part of FLINK-24695 
> {{docs/content/docs/ops/state/checkpointing_under_backpressure.md}} and 
> {{docs/content/docs/dev/datastream/fault-tolerance/checkpointing.md}} were 
> modified. Those modifications should be translated into Chinese.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart

2021-11-23 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-16931:
---
Priority: Not a Priority  (was: Minor)

> Large _metadata file lead to JobManager not responding when restart
> ---
>
> Key: FLINK-16931
> URL: https://issues.apache.org/jira/browse/FLINK-16931
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0, 1.11.0, 1.12.0
>Reporter: Lu Niu
>Priority: Not a Priority
>  Labels: auto-unassigned, stale-minor
>
> When the _metadata file is big, the JobManager can never recover from the 
> checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> 
> restart. Here is the related log: 
> {code:java}
>  2020-04-01 17:08:25,689 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Recovering checkpoints from ZooKeeper.
>  2020-04-01 17:08:25,698 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 
> 3 checkpoints in ZooKeeper.
>  2020-04-01 17:08:25,698 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Trying to fetch 3 checkpoints from storage.
>  2020-04-01 17:08:25,698 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Trying to retrieve checkpoint 50.
>  2020-04-01 17:08:48,589 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Trying to retrieve checkpoint 51.
>  2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The 
> heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
> {code}
> Digging into the code, it looks like ExecutionGraph::restart runs in the 
> JobMaster main thread and finally calls 
> ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which 
> downloads the file from DFS. The main thread is essentially blocked for a 
> while because of this. One possible solution is to make the download 
> asynchronous. More aspects might need consideration, as the original change 
> tries to make it single-threaded. [https://github.com/apache/flink/pull/7568]
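The async approach suggested above can be sketched with a `CompletableFuture` that offloads the slow download to an I/O executor, keeping the main thread free. This is a minimal stand-alone sketch; `downloadMetadata` is a placeholder for the real DFS download, not Flink's actual API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCheckpointRetrieval {
    // Placeholder for the blocking DFS download performed by
    // ZooKeeperCompletedCheckpointStore#retrieveCompletedCheckpoint.
    static String downloadMetadata(long checkpointId) {
        return "checkpoint-" + checkpointId;
    }

    // Offload the download to a dedicated I/O executor so the JobMaster
    // main thread is not blocked while a large _metadata file is fetched.
    static CompletableFuture<String> retrieveAsync(long checkpointId,
                                                   ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(
                () -> downloadMetadata(checkpointId), ioExecutor);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newFixedThreadPool(2);
        // prints checkpoint-50
        System.out.println(retrieveAsync(50, io).get());
        io.shutdown();
    }
}
```

The callback on the returned future can then be scheduled back onto the main-thread executor, preserving the single-threaded model the original change aimed for.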



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-11038) Rewrite Kafka at-least-once IT cases

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-11038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-11038:
---
Priority: Not a Priority  (was: Minor)

> Rewrite Kafka at-least-once IT cases
> 
>
> Key: FLINK-11038
> URL: https://issues.apache.org/jira/browse/FLINK-11038
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.7.0
>Reporter: Piotr Nowojski
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> Currently they are using {{NetworkFailuresProxy}}, which is unstable both for 
> Kafka 0.11 in exactly-once mode (50% of the tests live-lock) and for Kafka 
> 2.0 (because of that, the {{testOneToOneAtLeastOnceRegularSink}} and 
> {{testOneToOneAtLeastOnceCustomOperator}} tests are currently disabled).
> Those tests should be rewritten to SIGKILL the Flink process doing the 
> writing: either as an ITCase SIGKILL-ing a task manager, or as a test harness 
> SIGKILL-ing/exiting the test harness process.
> We can not simply use a test harness and not close it to simulate a failure, 
> because we want to make sure that the records have been flushed during the 
> checkpoint. If we do not SIGKILL the process, the background Kafka client 
> threads can just send those records for us.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24967) Make the IO pattern configureable in state benchmarks

2021-11-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447515#comment-17447515
 ] 

Piotr Nowojski commented on FLINK-24967:


[~aitozi], about this TODO that I've left: I didn't mean it to be fully 
parametrisable. The fields should, however, be moved to a context class, like 
for example 
{{DataSkewStreamNetworkThroughputBenchmarkExecutor.MultiEnvironment}}. 
Note that contexts can have different scopes.

With the current setup there are a couple of issues.
# Generally speaking, it's difficult to work with static fields, and 
especially with the not-so-pretty static initialisation segments.
# The JVM could theoretically pick up such a static field as a constant and 
optimise the code into something far away from our intention, via loop 
unrolling, constant folding, or dead code elimination. For example, if the JVM 
notices that the result of a method is not used anywhere and that the code 
doesn't have side effects, it can remove all of that code as "dead code". As 
far as I remember, JMH has a mechanism to prevent that from happening, but for 
it to work, all of the input parameters to the benchmark should be passed via 
those context classes or via {{@Param}}.
# Partially related to my TODO comments in {{StateBenchmarkBase}}: some of 
those "static" variables and the 
{{org.apache.flink.state.benchmark.StateBenchmarkBase.KeyValue#kvSetup}} 
initialisation code are only used in some benchmarks - not in all of them. As 
such, there is no point in executing them for every benchmark. For example, 
the initialisation of 
{{org.apache.flink.state.benchmark.StateBenchmarkBase.KeyValue#listValue}} is 
quite costly, but it is only used in two benchmarks out of ~20. With context 
classes it would be easy/easier to initialise/set up only those things that we 
really need.
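The context-class idea above can be sketched in plain Java: the benchmark body reads its inputs from an explicitly constructed context object instead of static fields, which is the pattern JMH's `@State`/`@Param` mechanisms build on. This is an illustrative sketch only; the class and field names are invented and do not come from the Flink benchmark suite.

```java
public class BenchmarkContextExample {
    // Hypothetical context class scoping benchmark inputs, in the spirit of
    // DataSkewStreamNetworkThroughputBenchmarkExecutor.MultiEnvironment.
    static final class StateBenchmarkContext {
        final int keyCount;       // illustrative input parameter
        final int listValueSize;  // only needed by list-state benchmarks

        StateBenchmarkContext(int keyCount, int listValueSize) {
            this.keyCount = keyCount;
            this.listValueSize = listValueSize;
        }
    }

    // The body reads its inputs from the context rather than static fields,
    // so the JIT cannot treat them as compile-time constants and fold the
    // loop away, and each benchmark sets up only the state it actually uses.
    static long benchmarkBody(StateBenchmarkContext ctx) {
        long acc = 0;
        for (int i = 0; i < ctx.keyCount; i++) {
            acc += i % ctx.listValueSize;
        }
        return acc;
    }

    public static void main(String[] args) {
        StateBenchmarkContext ctx = new StateBenchmarkContext(10, 3);
        System.out.println(benchmarkBody(ctx)); // prints 9
    }
}
```

In real JMH code the same effect is achieved by annotating the context class with `@State` and its configurable fields with `@Param`, and returning (or sinking into a `Blackhole`) the benchmark result so it cannot be eliminated as dead code.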

> Make the IO pattern configureable in state benchmarks
> -
>
> Key: FLINK-24967
> URL: https://issues.apache.org/jira/browse/FLINK-24967
> Project: Flink
>  Issue Type: Improvement
>  Components: Benchmarks, Runtime / State Backends
>Reporter: Aitozi
>Priority: Minor
>
> Currently, state benchmarks IO size are controlled by 
> {{StateBenchmarkConstants}}, which are not flexible to change. It's not easy 
> to test the performance under different IO size/pattern and different disk 
> (which can be solved by 
> [FLINK-24918|https://issues.apache.org/jira/browse/FLINK-24918]). I purpose 
> to make the state benchmark more configurable .



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24693) Update Chinese version of "Checkpoints" and "Checkpointing" page

2021-11-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447507#comment-17447507
 ] 

Piotr Nowojski commented on FLINK-24693:


Sure! Thanks for taking care of this.

> Update Chinese version of "Checkpoints" and "Checkpointing" page
> 
>
> Key: FLINK-24693
> URL: https://issues.apache.org/jira/browse/FLINK-24693
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.15.0, 1.14.1
>Reporter: Piotr Nowojski
>Assignee: wulei0302
>Priority: Critical
>
> Pages {{content.zh/docs/ops/state/checkpoints.md}} and 
> {{content.zh/docs/dev/datastream/fault-tolerance/checkpointing.md}} are very 
> much out of date and out of sync with their English counterparts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (FLINK-24693) Update Chinese version of "Checkpoints" and "Checkpointing" page

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski reassigned FLINK-24693:
--

Assignee: wulei0302

> Update Chinese version of "Checkpoints" and "Checkpointing" page
> 
>
> Key: FLINK-24693
> URL: https://issues.apache.org/jira/browse/FLINK-24693
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.15.0, 1.14.1
>Reporter: Piotr Nowojski
>Assignee: wulei0302
>Priority: Critical
>
> Pages {{content.zh/docs/ops/state/checkpoints.md}} and 
> {{content.zh/docs/dev/datastream/fault-tolerance/checkpointing.md}} are very 
> much out of date and out of sync with their English counterparts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-20724) Create a http handler for aggregating metrics from whole job

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-20724:
---
Priority: Not a Priority  (was: Minor)

> Create a http handler for aggregating metrics from whole job
> 
>
> Key: FLINK-20724
> URL: https://issues.apache.org/jira/browse/FLINK-20724
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.13.0, 1.12.3
>Reporter: Piotr Nowojski
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> This is an optimisation idea.
> Create an http handler similar to {{AggregatingSubtasksMetricsHandler}}, but 
> one that would aggregate metrics per task, from all of the job vertices. The 
> new handler would take only {{JobID}} as a parameter, so that the Web UI can 
> obtain {{max(isBackPressureRatio)}} / {{max(isCausingBackPressureRatio)}} for 
> each task in the job graph in a single RPC.
> This is related to FLINK-14712, where we are invoking additional REST calls 
> to get the statistics for each task/node (in order to color the nodes based 
> on back pressure and busy times). With this new handler, the WebUI could make 
> a single REST call to get all the metrics it needs, instead of one REST call 
> per task.
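
The per-task reduction such a handler would perform can be sketched as follows; this is an illustrative one-pass max over metric samples grouped by task, with invented names, not Flink's actual REST handler API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the aggregation the proposed handler would perform:
// given per-subtask metric samples grouped by task, compute the max for each
// task in one pass, so the WebUI needs only a single REST call per job.
// All names here are assumptions for illustration.
public class PerTaskMaxAggregator {
    public static Map<String, Double> maxPerTask(Map<String, double[]> subtaskSamples) {
        Map<String, Double> result = new HashMap<>();
        subtaskSamples.forEach((task, samples) -> {
            double max = Double.NEGATIVE_INFINITY;
            for (double v : samples) {
                max = Math.max(max, v); // aggregate across all subtasks of the task
            }
            result.put(task, max);
        });
        return result;
    }
}
```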



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-20724) Create a http handler for aggregating metrics from whole job

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-20724.
--
Resolution: Abandoned

> Create a http handler for aggregating metrics from whole job
> 
>
> Key: FLINK-20724
> URL: https://issues.apache.org/jira/browse/FLINK-20724
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.13.0, 1.12.3
>Reporter: Piotr Nowojski
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> This is an optimisation idea.
> Create an http handler similar to {{AggregatingSubtasksMetricsHandler}}, but 
> one that would aggregate metrics per task, from all of the job vertices. The 
> new handler would take only {{JobID}} as a parameter, so that the Web UI can 
> obtain {{max(isBackPressureRatio)}} / {{max(isCausingBackPressureRatio)}} for 
> each task in the job graph in a single RPC.
> This is related to FLINK-14712, where we are invoking additional REST calls 
> to get the statistics for each task/node (in order to color the nodes based 
> on back pressure and busy times). With this new handler, the WebUI could make 
> a single REST call to get all the metrics it needs, instead of one REST call 
> per task.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-20743) Print ContainerId For RemoteTransportException

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-20743:
---
Priority: Not a Priority  (was: Minor)

> Print ContainerId For RemoteTransportException
> --
>
> Key: FLINK-20743
> URL: https://issues.apache.org/jira/browse/FLINK-20743
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.10.0, 1.11.1, 1.12.1
>Reporter: yang gang
>Priority: Not a Priority
>  Labels: auto-unassigned, stale-minor
> Attachments: image-2020-12-23-15-13-21-226.png
>
>
> !image-2020-12-23-15-13-21-226.png|width=970,height=291!
>  {{RemoteTransportException}} tells the user which service has a problem 
> only by means of its IP/port.
>  When troubleshooting, that information is not precise enough. Usually we 
> need to look at the running log of the problematic container, but by the time 
> we see this exception the container has already died, so the IP/port alone 
> can no longer quickly locate the specific container.
>  So when such an exception occurs, I would like the containerId to be 
> printed as well.
> E.g.:
>  Connection unexpectedly closed by remote task manager 
> 'hostName/ip:port/containerId'. This might indicate that the remote task 
> manager was lost.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-21467) Document possible recommended usage of Bounded{One/Multi}Input.endInput and emphasize that they could be called multiple times

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-21467:
---
Priority: Not a Priority  (was: Minor)

> Document possible recommended usage of Bounded{One/Multi}Input.endInput and 
> emphasize that they could be called multiple times
> --
>
> Key: FLINK-21467
> URL: https://issues.apache.org/jira/browse/FLINK-21467
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Affects Versions: 1.13.0
>Reporter: Kezhu Wang
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> It is too tempting to use these APIs, especially 
> {{BoundedOneInput.endInput}}, to commit the final result before FLIP-147 is 
> delivered. And this will cause a re-commit after failover, as [~gaoyunhaii] 
> has pointed out in FLINK-21132.
> I have 
> [pointed|https://github.com/apache/iceberg/issues/2033#issuecomment-784153620]
>  this out in 
> [apache/iceberg#2033|https://github.com/apache/iceberg/issues/2033], please 
> correct me if I was wrong.
> cc [~aljoscha] [~pnowojski] [~roman_khachatryan]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-24815) Reduce the cpu cost of calculating stateSize during state allocation

2021-11-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447466#comment-17447466
 ] 

Piotr Nowojski edited comment on FLINK-24815 at 11/22/21, 3:13 PM:
---

I think you are right. It looks like the {{getStateSize()}} is used only for 
pretty printing or during snapshotting for metrics/logging/webUI.

Anyway, I don't like the idea of passing an invalid number. An easy change 
would be to calculate the state size lazily, on demand only, but that would 
also not be very nice - it would be misleading for the {{getStateSize()}} 
method to actually be doing some intensive computations.

[~yunta]/[~roman], do you have any thoughts on this one?

[~Ming Li], how important is this optimisation in the use case that you have in 
mind? How long does it take to calculate the state size during recovery?


was (Author: pnowojski):
I think you are right. It looks like the {{getStateSize()}} is used only for 
pretty printing or during snapshotting for metrics/logging/webUI.

Anyway, I don't like the idea of passing an invalid number. An easy change 
would be to calculate the state size lazily, on demand only, but that would 
also not be very nice - it would be misleading for the {{getStateSize()}} 
method to actually be doing some intensive computations.

[~Ming Li], how important is this optimisation in the use case that you have in 
mind? How long does it take to calculate the state size during recovery?

[~yunta]/[~roman], do you have any thoughts on this one?

> Reduce the cpu cost of calculating stateSize during state allocation
> 
>
> Key: FLINK-24815
> URL: https://issues.apache.org/jira/browse/FLINK-24815
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.14.0
>Reporter: ming li
>Priority: Major
>
> When a task fails over, we reassign the state for each subtask and 
> create a new {{OperatorSubtaskState}} object. At this point, the {{stateSize}} 
> field in the {{OperatorSubtaskState}} is recalculated. When using 
> incremental checkpoints, computing this field requires traversing all shared 
> states and accumulating their sizes.
> Taking a job with parallelism 2000 and 100 shared states per task as an 
> example, it needs to traverse 2000 * 100 = 200,000 entries. At this point, 
> the CPU of the JM scheduling thread becomes saturated.
> I think we can try to provide a constructor that accepts {{stateSize}} for 
> {{OperatorSubtaskState}}, or delay the calculation of {{stateSize}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-24815) Reduce the cpu cost of calculating stateSize during state allocation

2021-11-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447466#comment-17447466
 ] 

Piotr Nowojski edited comment on FLINK-24815 at 11/22/21, 3:13 PM:
---

I think you are right. It looks like the {{getStateSize()}} is used only for 
pretty printing or during snapshotting for metrics/logging/webUI.

Anyway, I don't like the idea of passing an invalid number. An easy change 
would be to calculate the state size lazily, on demand only, but that would 
also not be very nice - it would be misleading for the {{getStateSize()}} 
method to actually be doing some intensive computations.

[~Ming Li], how important is this optimisation in the use case that you have in 
mind? How long does it take to calculate the state size during recovery?

[~yunta]/[~roman], do you have any thoughts on this one?


was (Author: pnowojski):
I think you are right. It looks like the {{getStateSize()}} is used only for 
pretty printing or during snapshotting for metrics/logging/webUI.

Anyway, I don't like the idea of passing an invalid number. An easy change 
would be to calculate the state size lazily, on demand only, but that would 
also not be very nice - it would be misleading for the {{getStateSize()}} 
method to actually be doing some intensive computations.

[~yunta]/[~roman], do you have any thoughts on this one?

> Reduce the cpu cost of calculating stateSize during state allocation
> 
>
> Key: FLINK-24815
> URL: https://issues.apache.org/jira/browse/FLINK-24815
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.14.0
>Reporter: ming li
>Priority: Major
>
> When a task fails over, we reassign the state for each subtask and 
> create a new {{OperatorSubtaskState}} object. At this point, the {{stateSize}} 
> field in the {{OperatorSubtaskState}} is recalculated. When using 
> incremental checkpoints, computing this field requires traversing all shared 
> states and accumulating their sizes.
> Taking a job with parallelism 2000 and 100 shared states per task as an 
> example, it needs to traverse 2000 * 100 = 200,000 entries. At this point, 
> the CPU of the JM scheduling thread becomes saturated.
> I think we can try to provide a constructor that accepts {{stateSize}} for 
> {{OperatorSubtaskState}}, or delay the calculation of {{stateSize}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-24815) Reduce the cpu cost of calculating stateSize during state allocation

2021-11-22 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-24815:
---
Component/s: Runtime / State Backends

> Reduce the cpu cost of calculating stateSize during state allocation
> 
>
> Key: FLINK-24815
> URL: https://issues.apache.org/jira/browse/FLINK-24815
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.14.0
>Reporter: ming li
>Priority: Major
>
> When a task fails over, we reassign the state for each subtask and 
> create a new {{OperatorSubtaskState}} object. At this point, the {{stateSize}} 
> field in the {{OperatorSubtaskState}} is recalculated. When using 
> incremental checkpoints, computing this field requires traversing all shared 
> states and accumulating their sizes.
> Taking a job with parallelism 2000 and 100 shared states per task as an 
> example, it needs to traverse 2000 * 100 = 200,000 entries. At this point, 
> the CPU of the JM scheduling thread becomes saturated.
> I think we can try to provide a constructor that accepts {{stateSize}} for 
> {{OperatorSubtaskState}}, or delay the calculation of {{stateSize}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24815) Reduce the cpu cost of calculating stateSize during state allocation

2021-11-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447466#comment-17447466
 ] 

Piotr Nowojski commented on FLINK-24815:


I think you are right. It looks like the {{getStateSize()}} is used only for 
pretty printing or during snapshotting for metrics/logging/webUI.

Anyway, I don't like the idea of passing an invalid number. An easy change 
would be to calculate the state size lazily, on demand only, but that would 
also not be very nice - it would be misleading for the {{getStateSize()}} 
method to actually be doing some intensive computations.

[~yunta]/[~roman], do you have any thoughts on this one?
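
For what it's worth, the lazy-on-demand variant could look roughly like this; a minimal sketch with invented names, not the actual {{OperatorSubtaskState}} code:

```java
import java.util.List;

// Minimal sketch of the "calculate the state size lazily on demand" idea:
// the sum over shared state handles is computed on first access and memoized,
// so reassigning state during failover does not pay the O(handles) cost up
// front. Class and field names are invented for illustration.
public class LazySubtaskState {
    private final List<Long> handleSizes;
    private long cachedStateSize = -1L; // -1 means "not computed yet"

    public LazySubtaskState(List<Long> handleSizes) {
        this.handleSizes = handleSizes;
    }

    public long getStateSize() {
        if (cachedStateSize < 0) {
            long sum = 0L;
            for (long size : handleSizes) {
                sum += size; // the traversal happens at most once
            }
            cachedStateSize = sum;
        }
        return cachedStateSize;
    }
}
```

The trade-off discussed above remains: the first `getStateSize()` call still pays the full traversal cost, it is merely moved off the scheduling path.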

> Reduce the cpu cost of calculating stateSize during state allocation
> 
>
> Key: FLINK-24815
> URL: https://issues.apache.org/jira/browse/FLINK-24815
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.14.0
>Reporter: ming li
>Priority: Major
>
> When a task fails over, we reassign the state for each subtask and 
> create a new {{OperatorSubtaskState}} object. At this point, the {{stateSize}} 
> field in the {{OperatorSubtaskState}} is recalculated. When using 
> incremental checkpoints, computing this field requires traversing all shared 
> states and accumulating their sizes.
> Taking a job with parallelism 2000 and 100 shared states per task as an 
> example, it needs to traverse 2000 * 100 = 200,000 entries. At this point, 
> the CPU of the JM scheduling thread becomes saturated.
> I think we can try to provide a constructor that accepts {{stateSize}} for 
> {{OperatorSubtaskState}}, or delay the calculation of {{stateSize}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-12674) BoundedBlockingSubpartition#unsynchronizedGetNumberOfQueuedBuffers is not implemented

2021-11-16 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-12674:
---
Priority: Not a Priority  (was: Minor)

> BoundedBlockingSubpartition#unsynchronizedGetNumberOfQueuedBuffers is not 
> implemented
> -
>
> Key: FLINK-12674
> URL: https://issues.apache.org/jira/browse/FLINK-12674
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics, Runtime / Network
>Affects Versions: 1.9.0, 1.10.2, 1.11.2
>Reporter: Piotr Nowojski
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> Currently 
> {{BoundedBlockingSubpartition#unsynchronizedGetNumberOfQueuedBuffers}} 
> always returns 0, which affects metrics and is a regression compared to the 
> old {{SpillableSubpartition}}. Is there a way to implement it?
>  
> Note that before taking this ticket, check the current status of the issue 
> reported at 
> https://issues.apache.org/jira/browse/FLINK-12070?focusedCommentId=16849615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16849615
> which might affect the final implementation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-12872) WindowOperator may fail with UnsupportedOperationException when merging windows

2021-11-16 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-12872:
---
Priority: Not a Priority  (was: Minor)

> WindowOperator may fail with UnsupportedOperationException when merging 
> windows
> ---
>
> Key: FLINK-12872
> URL: https://issues.apache.org/jira/browse/FLINK-12872
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.6.4, 1.7.2, 1.8.0
>Reporter: Piotr Nowojski
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, stale-minor
>
> [Reported 
> |http://mail-archives.apache.org/mod_mbox/flink-user/201906.mbox/%3CCALDWsfhbP6D9+pnTzYuGaP0V4nReKJ4s9VsG_Xe1hZJq4O=z...@mail.gmail.com%3E]
>  by a user.
> {noformat}
> I have a job that uses processing time session window with inactivity gap of 
> 60ms where I intermittently run into the following exception. I'm trying to 
> figure out what happened here. Haven't been able to reproduce this scenario. 
> Any thoughts?
> java.lang.UnsupportedOperationException: The end timestamp of a 
> processing-time window cannot become earlier than the current processing time 
> by merging. Current processing time: 1560493731808 window: 
> TimeWindow{start=1560493731654, end=1560493731778}
>   at 
> org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$2.merge(WindowOperator.java:325)
>   at 
> org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$2.merge(WindowOperator.java:311)
>   at 
> org.apache.flink.streaming.runtime.operators.windowing.MergingWindowSet.addWindow(MergingWindowSet.java:212)
>   at 
> org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:311)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This is probably happening because {{System.currentTimeMillis()}} is not a 
> monotonic function and {{WindowOperator}} accesses it at least twice: once 
> when it creates a window and a second time when performing the above 
> mentioned check (which failed). However, I would guess there are more places 
> like this, not only in {{WindowOperator}}.
> The fix could be to make sure that processing time is monotonic, to access 
> it only once per operator per record, or to drop processing time in 
> favour of ingestion time.
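
The "make processing time monotonic" option could be as simple as clamping the clock. A hypothetical sketch, with the time source injected so the clamping is testable; in production it would wrap {{System.currentTimeMillis()}}. This is illustrative only, not Flink's actual clock code:

```java
import java.util.function.LongSupplier;

// Sketch of a monotonic processing-time clock: wrap the raw clock so repeated
// reads never go backwards, which would prevent a merged window from ending
// "earlier than the current processing time". Illustrative names only.
public class MonotonicClock {
    private final LongSupplier source;
    private long lastSeen = Long.MIN_VALUE;

    public MonotonicClock(LongSupplier source) {
        this.source = source;
    }

    public synchronized long currentTimeMillis() {
        // Clamp: never report a time earlier than one we already handed out.
        lastSeen = Math.max(source.getAsLong(), lastSeen);
        return lastSeen;
    }
}
```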



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24815) Reduce the cpu cost of calculating stateSize during state allocation

2021-11-16 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444375#comment-17444375
 ] 

Piotr Nowojski commented on FLINK-24815:


I don't know much about this part of the code, so sorry if this is a basic 
question, but how would you know the actual state size value to pass to that 
builder? Wouldn't you have to iterate over all state handles and ultimately do 
the same thing that {{OperatorSubtaskState}} is already doing?

> Reduce the cpu cost of calculating stateSize during state allocation
> 
>
> Key: FLINK-24815
> URL: https://issues.apache.org/jira/browse/FLINK-24815
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0
>Reporter: ming li
>Priority: Major
>
> When a task fails over, we reassign the state for each subtask and 
> create a new {{OperatorSubtaskState}} object. At this point, the {{stateSize}} 
> field in the {{OperatorSubtaskState}} is recalculated. When using 
> incremental checkpoints, computing this field requires traversing all shared 
> states and accumulating their sizes.
> Taking a job with parallelism 2000 and 100 shared states per task as an 
> example, it needs to traverse 2000 * 100 = 200,000 entries. At this point, 
> the CPU of the JM scheduling thread becomes saturated.
> I think we can try to provide a constructor that accepts {{stateSize}} for 
> {{OperatorSubtaskState}}, or delay the calculation of {{stateSize}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-23466) UnalignedCheckpointITCase hangs on Azure

2021-11-16 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-23466.
--
Resolution: Fixed

I've extracted the newly reported issue to FLINK-24919

> UnalignedCheckpointITCase hangs on Azure
> 
>
> Key: FLINK-23466
> URL: https://issues.apache.org/jira/browse/FLINK-23466
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.1
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20813&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=16016
> The problem is the buffer listener will be removed from the listener queue 
> when notified and then it will be added to the listener queue again if it 
> needs more buffers. However, if some buffers are recycled meanwhile, the 
> buffer listener will not be notified of the available buffers. For example:
> 1. Thread 1 calls LocalBufferPool#recycle().
> 2. Thread 1 reaches LocalBufferPool#fireBufferAvailableNotification() and 
> listener.notifyBufferAvailable() is invoked, but Thread 1 sleeps before 
> acquiring the lock to registeredListeners.add(listener).
> 3. Thread 2 is woken up as a result of the notifyBufferAvailable() 
> call. It takes the buffer, but it needs more buffers.
> 4. Other threads return all buffers, including the one that has been 
> recycled. None are taken; all are in the LocalBufferPool.
> 5. Thread 1 wakes up, and continues fireBufferAvailableNotification() 
> invocation.
> 6. Thread 1 re-adds listener that's waiting for more buffer 
> registeredListeners.add(listener).
> 7. Thread 1 exits the loop inside LocalBufferPool#recycle(MemorySegment, 
> int), as the original memory segment has been used.
> At the end we have a state where all buffers are in the LocalBufferPool, so 
> no new recycle() calls will happen, but there is still one listener waiting 
> for a buffer (despite buffers being available).
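
One way to close this kind of window, sketched on a toy pool (not the real {{LocalBufferPool}}): a waiter that re-registers must first re-check availability under the same lock that recycle() uses, so a buffer recycled while the waiter was off the listener queue cannot be missed. All names here are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy illustration of the listener/recycle race above and one way to avoid it.
// requestOrRegister() checks for buffers and registers the listener under the
// same lock that recycle() uses, making "check then register" atomic, so there
// is no window in which a recycled buffer goes unnoticed.
public class TinyBufferPool {
    private final Object lock = new Object();
    private final Queue<Object> availableBuffers = new ArrayDeque<>();
    private final Queue<Runnable> listeners = new ArrayDeque<>();

    public void recycle(Object buffer) {
        Runnable listener;
        synchronized (lock) {
            listener = listeners.poll();
            if (listener == null) {
                availableBuffers.add(buffer); // nobody waiting, keep the buffer
                return;
            }
        }
        listener.run(); // notify the waiter outside the lock
    }

    /** Returns a buffer if one is available, otherwise registers the listener. */
    public Object requestOrRegister(Runnable listener) {
        synchronized (lock) {
            Object buffer = availableBuffers.poll();
            if (buffer != null) {
                return buffer; // atomic check-then-register: no missed wakeup
            }
            listeners.add(listener);
            return null;
        }
    }
}
```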



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24919) UnalignedCheckpointITCase hangs on Azure

2021-11-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24919:
--

 Summary: UnalignedCheckpointITCase hangs on Azure
 Key: FLINK-24919
 URL: https://issues.apache.org/jira/browse/FLINK-24919
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


Extracted from FLINK-23466

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26304&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=13067

Nov 10 16:13:03 Starting 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline 
with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1].

From the log, we can see this case hangs. This seems to be a new issue, 
different from the one reported in this ticket. From the stack, it seems 
there is something wrong with the checkpoint coordinator; the following thread 
locked 0x87db4fb8:
{code:java}
2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 
daemon prio=5 os_prio=0 tid=0x7f12e000b800 nid=0x3fb6 runnable 
[0x7f0fcd6d4000]
2021-11-10T17:14:21.0899924Z Nov 10 17:14:21java.lang.Thread.State: RUNNABLE
2021-11-10T17:14:21.0900300Z Nov 10 17:14:21at 
java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338)
2021-11-10T17:14:21.0900745Z Nov 10 17:14:21at 
java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112)
2021-11-10T17:14:21.0901146Z Nov 10 17:14:21at 
java.util.HashMap.removeNode(HashMap.java:840)
2021-11-10T17:14:21.0901577Z Nov 10 17:14:21at 
java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301)
2021-11-10T17:14:21.0902002Z Nov 10 17:14:21at 
java.util.HashMap.putVal(HashMap.java:664)
2021-11-10T17:14:21.0902531Z Nov 10 17:14:21at 
java.util.HashMap.putMapEntries(HashMap.java:515)
2021-11-10T17:14:21.0902931Z Nov 10 17:14:21at 
java.util.HashMap.putAll(HashMap.java:785)
2021-11-10T17:14:21.0903429Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60)
2021-11-10T17:14:21.0904060Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867)
2021-11-10T17:14:21.0904686Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152)
2021-11-10T17:14:21.0905372Z Nov 10 17:14:21- locked <0x87db4fb8> 
(a java.lang.Object)
2021-11-10T17:14:21.0905895Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
2021-11-10T17:14:21.0906493Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown
 Source)
2021-11-10T17:14:21.0907086Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
2021-11-10T17:14:21.0907698Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown
 Source)
2021-11-10T17:14:21.0908210Z Nov 10 17:14:21at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-11-10T17:14:21.0908735Z Nov 10 17:14:21at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-11-10T17:14:21.0909333Z Nov 10 17:14:21at 
java.lang.Thread.run(Thread.java:748) {code}
But other thread is waiting for the lock. I am not familiar with these logics 
and not sure if this is in the right state. Could anyone who is familiar with 
these logics take a look?

 

BTW, concurrent access to a HashMap may cause an infinite loop. I see in the 
stack that multiple threads are accessing a HashMap, though I am not sure if 
they are the same instance.
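
As a general remedy for that concern, a map mutated from multiple jobmanager-io threads should be a ConcurrentHashMap. A generic sketch of the pattern, not the actual ExecutionAttemptMappingProvider code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic sketch of the usual fix for unsynchronized HashMap sharing: use
// ConcurrentHashMap, whose computeIfAbsent is atomic per key, so concurrent
// readers never observe a partially built table. Names are illustrative only.
public class SharedAttemptCache {
    private final Map<String, String> attemptToVertex = new ConcurrentHashMap<>();

    public String vertexFor(String attemptId) {
        // Computed at most once per key, even under concurrent access.
        return attemptToVertex.computeIfAbsent(attemptId, id -> "vertex-of-" + id);
    }
}
```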



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-24800) BufferTimeoutITCase.testDisablingBufferTimeout failed on Azure

2021-11-11 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442272#comment-17442272
 ] 

Piotr Nowojski commented on FLINK-24800:


I agree with your assessment [~akalashnikov]. Let's keep the existing 
production code behaviour and let's just try to fix this in the tests.

> BufferTimeoutITCase.testDisablingBufferTimeout failed on Azure
> --
>
> Key: FLINK-24800
> URL: https://issues.apache.org/jira/browse/FLINK-24800
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.15.0
>Reporter: Yun Gao
>Assignee: Anton Kalashnikov
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.15.0
>
>
> {code:java}
> 2021-11-05T12:18:50.5272055Z Nov 05 12:18:50 [INFO] Results:
> 2021-11-05T12:18:50.5273369Z Nov 05 12:18:50 [INFO] 
> 2021-11-05T12:18:50.5274011Z Nov 05 12:18:50 [ERROR] Failures: 
> 2021-11-05T12:18:50.5274518Z Nov 05 12:18:50 [ERROR]   
> BufferTimeoutITCase.testDisablingBufferTimeout:85 
> 2021-11-05T12:18:50.5274871Z Nov 05 12:18:50 Expected: <0>
> 2021-11-05T12:18:50.5275150Z Nov 05 12:18:50  but: was <1>
> 2021-11-05T12:18:50.5276136Z Nov 05 12:18:50 [INFO] 
> 2021-11-05T12:18:50.5276667Z Nov 05 12:18:50 [ERROR] Tests run: 1849, 
> Failures: 1, Errors: 0, Skipped: 58
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26018&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=10850



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-11-09 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-24846:
---
Priority: Critical  (was: Major)

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Reporter: Piotr Nowojski
>Priority: Critical
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
> - I couldn't reproduce the exception using the same jar, so I might not be 
> able to provide DEBUG messages
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-11-09 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-24846:
---
Affects Version/s: 1.14.0

> AsyncWaitOperator fails during stop-with-savepoint
> --
>
> Key: FLINK-24846
> URL: https://issues.apache.org/jira/browse/FLINK-24846
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.14.0
>Reporter: Piotr Nowojski
>Priority: Critical
> Attachments: log-jm.txt
>
>
> {noformat}
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
>  Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
>  ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
> ~[flink-dist_2.11-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:829) ~[?:?]
> {noformat}
> As reported by a user on [the mailing 
> list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
> {quote}
> I failed to stop a job with savepoint with the following message:
> Inconsistent execution state after stopping with savepoint. At least one 
> execution is still in one of the following states: FAILED, CANCELED. A global 
> fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.
> The job manager said
>  A savepoint was created at 
> hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
> but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
> successfully.
> while complaining about
> Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
> operations.
> Is it okay to ignore this kind of error?
> Please see the attached files for the detailed context.
> FYI, 
> - I used the latest 1.14.0
> - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
> - I couldn't reproduce the exception using the same jar, so I might not be 
> able to provide DEBUG messages
> {quote}





[jira] [Created] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-11-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24846:
--

 Summary: AsyncWaitOperator fails during stop-with-savepoint
 Key: FLINK-24846
 URL: https://issues.apache.org/jira/browse/FLINK-24846
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Reporter: Piotr Nowojski
 Attachments: log-jm.txt

{noformat}
Caused by: 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
 Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
operations.
at 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at java.lang.Thread.run(Thread.java:829) ~[?:?]

{noformat}

As reported by a user on [the mailing 
list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
{quote}
I failed to stop a job with savepoint with the following message:
Inconsistent execution state after stopping with savepoint. At least one 
execution is still in one of the following states: FAILED, CANCELED. A global 
fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.

The job manager said
 A savepoint was created at 
hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
successfully.
while complaining about
Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
operations.

Is it okay to ignore this kind of error?

Please see the attached files for the detailed context.

FYI, 
- I used the latest 1.14.0
- I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
- I couldn't reproduce the exception using the same jar, so I might not be able 
to provide DEBUG messages
{quote}





[jira] [Closed] (FLINK-23665) Flaky test: BlockingShuffleITCase.testBoundedBlockingShuffle

2021-11-09 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski closed FLINK-23665.
--
Resolution: Duplicate

> Flaky test: BlockingShuffleITCase.testBoundedBlockingShuffle
> 
>
> Key: FLINK-23665
> URL: https://issues.apache.org/jira/browse/FLINK-23665
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task, Tests
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The test failed with the following output:
> {code:java}
> Aug 06 09:46:03 java.lang.AssertionError: 
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> Aug 06 09:46:03   at 
> org.apache.flink.test.runtime.JobGraphRunningUtil.execute(JobGraphRunningUtil.java:60)
> Aug 06 09:46:03   at 
> org.apache.flink.test.runtime.BlockingShuffleITCase.testBoundedBlockingShuffle(BlockingShuffleITCase.java:51)
> Aug 06 09:46:03   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Aug 06 09:46:03   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Aug 06 09:46:03   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Aug 06 09:46:03   at java.lang.reflect.Method.invoke(Method.java:498)
> Aug 06 09:46:03   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Aug 06 09:46:03   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Aug 06 09:46:03   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Aug 06 09:46:03   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Aug 06 09:46:03   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Aug 06 09:46:03   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Aug 06 09:46:03   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Aug 06 09:46:03 Caused by: org.apache.flink.runtime.JobException: Recovery is 
> suppressed by NoRestartBackoffTimeStrategy
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.Sc

[jira] [Updated] (FLINK-23665) Flaky test: BlockingShuffleITCase.testBoundedBlockingShuffle

2021-11-09 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-23665:
---
Component/s: Runtime / Task

> Flaky test: BlockingShuffleITCase.testBoundedBlockingShuffle
> 
>
> Key: FLINK-23665
> URL: https://issues.apache.org/jira/browse/FLINK-23665
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task, Tests
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The test failed with the following output:
> {code:java}
> Aug 06 09:46:03 java.lang.AssertionError: 
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> Aug 06 09:46:03   at 
> org.apache.flink.test.runtime.JobGraphRunningUtil.execute(JobGraphRunningUtil.java:60)
> Aug 06 09:46:03   at 
> org.apache.flink.test.runtime.BlockingShuffleITCase.testBoundedBlockingShuffle(BlockingShuffleITCase.java:51)
> Aug 06 09:46:03   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Aug 06 09:46:03   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Aug 06 09:46:03   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Aug 06 09:46:03   at java.lang.reflect.Method.invoke(Method.java:498)
> Aug 06 09:46:03   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Aug 06 09:46:03   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Aug 06 09:46:03   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Aug 06 09:46:03   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Aug 06 09:46:03   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Aug 06 09:46:03   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Aug 06 09:46:03   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Aug 06 09:46:03   at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> Aug 06 09:46:03   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Aug 06 09:46:03 Caused by: org.apache.flink.runtime.JobException: Recovery is 
> suppressed by NoRestartBackoffTimeStrategy
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
> Aug 06 09:46:03   at 
> org.apache.flink.runtime.sche

[jira] [Updated] (FLINK-24800) BufferTimeoutITCase.testDisablingBufferTimeout failed on Azure

2021-11-09 Thread Piotr Nowojski (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-24800:
---
Priority: Blocker  (was: Critical)

> BufferTimeoutITCase.testDisablingBufferTimeout failed on Azure
> --
>
> Key: FLINK-24800
> URL: https://issues.apache.org/jira/browse/FLINK-24800
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.15.0
>Reporter: Yun Gao
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.15.0
>
>
> {code:java}
> 2021-11-05T12:18:50.5272055Z Nov 05 12:18:50 [INFO] Results:
> 2021-11-05T12:18:50.5273369Z Nov 05 12:18:50 [INFO] 
> 2021-11-05T12:18:50.5274011Z Nov 05 12:18:50 [ERROR] Failures: 
> 2021-11-05T12:18:50.5274518Z Nov 05 12:18:50 [ERROR]   
> BufferTimeoutITCase.testDisablingBufferTimeout:85 
> 2021-11-05T12:18:50.5274871Z Nov 05 12:18:50 Expected: <0>
> 2021-11-05T12:18:50.5275150Z Nov 05 12:18:50  but: was <1>
> 2021-11-05T12:18:50.5276136Z Nov 05 12:18:50 [INFO] 
> 2021-11-05T12:18:50.5276667Z Nov 05 12:18:50 [ERROR] Tests run: 1849, 
> Failures: 1, Errors: 0, Skipped: 58
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26018&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=10850





[jira] [Comment Edited] (FLINK-24690) Clarification of buffer size threshold calculation in BufferDebloater

2021-11-09 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440943#comment-17440943
 ] 

Piotr Nowojski edited comment on FLINK-24690 at 11/9/21, 9:14 AM:
--

I would be in favour of simplifying this so that no documentation is needed, 
i.e. always calculating the threshold based on the current value - if the 
current buffer size is 16KB with a 50% threshold, the dead zone should be 
{{{}(8KB, 24KB){}}} (currently on master the dead zone is {{(8KB, 32KB)}}). 


was (Author: pnowojski):
I would be in favour of simplifying this so that documentation is not needed, 
so always calculating threshold based on the current value - if current buffer 
size is 16KB with 50% threshold, a dead zone should be {{{}(8KB, 24KB){}}}. 

> Clarification of buffer size threshold calculation in BufferDebloater 
> --
>
> Key: FLINK-24690
> URL: https://issues.apache.org/jira/browse/FLINK-24690
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.14.0
>Reporter: Anton Kalashnikov
>Priority: Major
>
> It looks like the variable `skipUpdate` in 
> BufferDebloater#recalculateBufferSize is calculated in a non-obvious way.
> For example, if 
> `taskmanager.network.memory.buffer-debloat.threshold-percentages` is set to 
> 50 (meaning 50%), then transitions behave like:
>  * 32000 -> 16000 (possible)
>  * 32000 -> 17000 (not possible)
>  * 16000 -> 24000 (not possible) - but 16000 + 50% = 24000
>  * 16000 -> 32000 (only this is possible)
> This happens because the algorithm takes into account only the larger of the 
> two values. So in the `16000 -> 24000` example it calculates 50% of 24000, 
> meaning only the transition 12000 -> 24000 would be possible. 
> In general, this approach is not so bad, especially for small values (instead 
> of 256 -> 374, the minimum possible transition is 256 -> 512). But we should 
> clarify it somewhere, with a test, a javadoc, or both. Also, we can discuss 
> changing this algorithm to a more natural implementation.
>  
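The dead-zone behaviour described above can be sketched as follows. This is a hypothetical illustration with made-up method names, not the actual BufferDebloater code; it contrasts the current calculation (threshold taken as a percentage of the larger of the two sizes) with the proposed one (threshold always relative to the current buffer size):

```java
// Hypothetical sketch of the BufferDebloater "dead zone" discussed above.
// Method and class names are illustrative; they do not match the real Flink code.
public class DebloatThresholdSketch {

    // Current behaviour (as described in the issue): the threshold is a
    // percentage of the LARGER of the two buffer sizes, so 16000 -> 24000 is
    // skipped because 50% of 24000 is 12000 and |24000 - 16000| = 8000 < 12000.
    static boolean skipUpdateCurrent(int currentSize, int newSize, int thresholdPercent) {
        int base = Math.max(currentSize, newSize);
        int threshold = base * thresholdPercent / 100;
        return Math.abs(newSize - currentSize) < threshold;
    }

    // Proposed behaviour: compute the threshold from the current size only,
    // giving a symmetric dead zone, e.g. (8KB, 24KB) around 16KB at 50%.
    static boolean skipUpdateProposed(int currentSize, int newSize, int thresholdPercent) {
        int threshold = currentSize * thresholdPercent / 100;
        return Math.abs(newSize - currentSize) < threshold;
    }

    public static void main(String[] args) {
        // The examples from the issue description, with a 50% threshold:
        System.out.println(skipUpdateCurrent(32000, 16000, 50));  // false: update allowed
        System.out.println(skipUpdateCurrent(32000, 17000, 50));  // true: skipped
        System.out.println(skipUpdateCurrent(16000, 24000, 50));  // true: skipped today
        System.out.println(skipUpdateProposed(16000, 24000, 50)); // false: allowed by the proposal
    }
}
```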





[jira] [Commented] (FLINK-24815) Reduce the cpu cost of calculating stateSize during state allocation

2021-11-09 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440963#comment-17440963
 ] 

Piotr Nowojski commented on FLINK-24815:


{quote}
I think we can try to provide a construction method with stateSize for 
OperatorSubtaskState
{quote}
[~Ming Li], could you elaborate on this idea? What changes in what places do 
you have in mind?

> Reduce the cpu cost of calculating stateSize during state allocation
> 
>
> Key: FLINK-24815
> URL: https://issues.apache.org/jira/browse/FLINK-24815
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0
>Reporter: ming li
>Priority: Major
>
> When a task fails over, we reassign the state for each subtask and 
> create a new {{OperatorSubtaskState}} object. At this point, the {{stateSize}} 
> field in the {{OperatorSubtaskState}} is recalculated. When using 
> incremental {{{}Checkpoint{}}}s, computing this field requires traversing all 
> shared states and accumulating their sizes.
> Taking a job with 2000 parallelism and 100 shared states per task as an 
> example, this requires 2000 * 100 = 200,000 traversals. At that point, the 
> CPU of the JM scheduling thread is fully occupied.
> I think we can try to provide a constructor taking {{stateSize}} for 
> {{OperatorSubtaskState}}, or delay the calculation of {{{}stateSize{}}}.
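The two mitigations suggested above (passing a precomputed size into the constructor, and computing the size lazily on first access) can be sketched roughly as follows. Class and method names here are illustrative only, not the real OperatorSubtaskState API:

```java
import java.util.List;

// Hypothetical sketch of the two options mentioned above for avoiding the
// eager traversal of shared states; names are made up for illustration.
class SubtaskStateSketch {
    private final List<Long> sharedStateSizes;
    private long cachedStateSize = -1; // -1 means "not yet computed"

    SubtaskStateSketch(List<Long> sharedStateSizes) {
        this.sharedStateSizes = sharedStateSizes;
    }

    // Option 1: the caller already knows the total (e.g. from the previous
    // OperatorSubtaskState), so no traversal is needed at reassignment time.
    SubtaskStateSketch(List<Long> sharedStateSizes, long precomputedSize) {
        this.sharedStateSizes = sharedStateSizes;
        this.cachedStateSize = precomputedSize;
    }

    // Option 2: compute on demand and memoize, so state reassignment during
    // failover does not pay the traversal cost on the JM scheduling thread.
    long getStateSize() {
        if (cachedStateSize < 0) {
            cachedStateSize = sharedStateSizes.stream().mapToLong(Long::longValue).sum();
        }
        return cachedStateSize;
    }
}
```

Either way, the 2000 * 100 traversal work moves off the failover hot path: option 1 skips it entirely, option 2 defers it until the size is actually requested.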




