[jira] [Commented] (FLINK-23230) Cannot compile Flink on MacOS with M1 chip

2021-11-30 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451310#comment-17451310
 ] 

Robert Metzger commented on FLINK-23230:


Yes, I used the Zulu JDK 8 build for Apple Silicon. I was also hoping for a 
bit more of a wonder. The real wonder is probably that I could not hear any 
fans spinning (which is definitely not the case on the Intel machine), and 
that the compile time on battery is probably also 11 minutes ;) 

> Cannot compile Flink on MacOS with M1 chip
> --
>
> Key: FLINK-23230
> URL: https://issues.apache.org/jira/browse/FLINK-23230
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.13.1
>Reporter: Osama Neiroukh
>Priority: Minor
>  Labels: pull-request-available
>
> Flink doesn't currently compile on MacOS with M1 silicon.
> This is true for all recent versions (1.13.X) as well as master.
> Some of the problems have potentially easy fixes, such as installing node 
> separately or updating the relevant pom.xml to use a newer version of node. I 
> am getting some errors about deprecated features being used which are not 
> supported by newer node, but on the surface they seem easy to resolve. 
> I've had less success with complex dependencies such as protobuf.
> My long term objective is to use and contribute to Flink. If I can get some 
> help with the above issues, I am willing to make the modifications, submit 
> the changes as a pull request, and shepherd them to release. If compilation 
> on MacOS/M1 is not a priority, I can look for a virtual machine solution 
> instead. Feedback appreciated. 
>  
> Thanks
>  
> Osama



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-23230) Cannot compile Flink on MacOS with M1 chip

2021-11-27 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449830#comment-17449830
 ] 

Robert Metzger commented on FLINK-23230:


I've managed to get Flink compiling on my M1 MBP, with a few small pom changes 
to some protoc-related stuff.
I'm not sure if the changes are acceptable, but I'll open a PR soon to discuss.

mvn clean install time was ~11 minutes (2 minutes for the frontend), which is 
quite fast compared to the 19 minutes I needed on my 8-core Intel i9 MBP from 
2019.

> Cannot compile Flink on MacOS with M1 chip
> --
>
> Key: FLINK-23230
> URL: https://issues.apache.org/jira/browse/FLINK-23230
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.13.1
>Reporter: Osama Neiroukh
>Priority: Minor
>  Labels: pull-request-available
>
> Flink doesn't currently compile on MacOS with M1 silicon.
> This is true for all recent versions (1.13.X) as well as master.
> Some of the problems have potentially easy fixes, such as installing node 
> separately or updating the relevant pom.xml to use a newer version of node. I 
> am getting some errors about deprecated features being used which are not 
> supported by newer node, but on the surface they seem easy to resolve. 
> I've had less success with complex dependencies such as protobuf.
> My long term objective is to use and contribute to Flink. If I can get some 
> help with the above issues, I am willing to make the modifications, submit 
> the changes as a pull request, and shepherd them to release. If compilation 
> on MacOS/M1 is not a priority, I can look for a virtual machine solution 
> instead. Feedback appreciated. 
>  
> Thanks
>  
> Osama





[jira] [Commented] (FLINK-13598) frocksdb doesn't have arm release

2021-11-16 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444516#comment-17444516
 ] 

Robert Metzger commented on FLINK-13598:


It seems that frocksdb 6.20.3 only adds ARM support for Linux, not for macOS.
On an M1 Mac, I get this error:

{code:java}
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:255)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:268)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:109)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:711)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:687)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
 ~[flink-runtime-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) 
~[flink-runtime-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
~[flink-runtime-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
~[flink-runtime-1.14.0.jar:1.14.0]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state 
backend for StreamFlatMap_c21234bcbf1e8eb4c61f1927190efebd_(1/1) from any of 
the 1 provided restore options.
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
... 11 more
Caused by: java.io.IOException: Could not load the native RocksDB library
at 
org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.ensureRocksDBIsLoaded(EmbeddedRocksDBStateBackend.java:882)
 ~[flink-statebackend-rocksdb_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:402)
 ~[flink-statebackend-rocksdb_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:345)
 ~[flink-statebackend-rocksdb_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:87)
 ~[flink-statebackend-rocksdb_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
 ~[flink-streaming-java_2.11-1.14.0.jar:1.14.0]
... 11 more
Caused by: java.lang.UnsatisfiedLinkError: 
/private/var/folders/js/yfk_y2450q7559kygttykwk0gn/T/rocksdb-lib-5783c058ce68d31d371327abc9b51cac/librocksdbjni-osx.jnilib:
 
dlopen(/private/var/folders/js/yfk_y2450q7559kygttykwk0gn/T/rocksdb-lib-5783c058ce68d
{code}

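For context, the dlopen failure above is an architecture mismatch: the jar only bundles an x86_64 macOS jnilib, so the JVM on an arm64 Mac has nothing it can load. A minimal sketch of the platform check involved (the library file names here are illustrative assumptions, not the exact frocksdb artifact names):

```java
// Illustrative sketch: the native library a RocksDB-style loader looks for
// is derived from os.name/os.arch. File names are assumptions, not the
// exact frocksdb artifact names.
public class NativeLibCheck {

    static String expectedJniLibName(String osName, String osArch) {
        boolean mac = osName.toLowerCase().contains("mac");
        String os = mac ? "osx" : "linux";
        // frocksdb 6.20.3 ships a Linux aarch64 build, but no macOS aarch64
        // build, so the osx-aarch64 name below has no matching file in the jar.
        String arch = "aarch64".equals(osArch) ? "-aarch64" : "";
        return "librocksdbjni-" + os + arch + (mac ? ".jnilib" : ".so");
    }

    public static void main(String[] args) {
        System.out.println(expectedJniLibName(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

On an M1 Mac this resolves to an osx/aarch64 name for which the jar ships no binary, hence the UnsatisfiedLinkError.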
[jira] [Commented] (FLINK-24433) "No space left on device" in Azure e2e tests

2021-10-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423572#comment-17423572
 ] 

Robert Metzger commented on FLINK-24433:


After each test run, we log "Environment Information" (search the logs for 
this string); it contains the disk space allocation.
When this error happened, 5.6GB of space was available before the test 
started; before the first test, 27GB was available.
I suggest analyzing the disk space consumption per test. Maybe there is some 
low-hanging fruit (like deleting temp files or pruning Docker images).
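The kind of per-test disk accounting suggested above can be sketched with plain JDK calls (where exactly the probe would be wired into the e2e harness is left open):

```java
import java.io.File;

// Sketch: measure usable disk space before and after each e2e test so the
// per-test consumption can be compared across the build.
public class DiskSpaceProbe {

    static long usableGiB(File path) {
        return path.getUsableSpace() / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Usable space on /: " + usableGiB(new File("/")) + " GiB");
    }
}
```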

> "No space left on device" in Azure e2e tests
> 
>
> Key: FLINK-24433
> URL: https://issues.apache.org/jira/browse/FLINK-24433
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.15.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=24668&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=070ff179-953e-5bda-71fa-d6599415701c&l=19772
> {code}
> Sep 30 17:08:42 Job has been submitted with JobID 
> 5594c18e128a328ede39cfa59cb3cb07
> Sep 30 17:08:56 2021-09-30 17:08:56,809 main ERROR Recovering from 
> StringBuilderEncoder.encode('2021-09-30 17:08:56,807 WARN  
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher [] - An 
> exception occurred when fetching query results
> Sep 30 17:08:56 java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.rest.util.RestClientException: [Internal server 
> error.,  Sep 30 17:08:56 org.apache.flink.runtime.messages.FlinkJobNotFoundException: 
> Could not find Flink job (5594c18e128a328ede39cfa59cb3cb07)
> Sep 30 17:08:56   at 
> org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGateway(Dispatcher.java:923)
> Sep 30 17:08:56   at 
> org.apache.flink.runtime.dispatcher.Dispatcher.performOperationOnJobMasterGateway(Dispatcher.java:937)
> Sep 30 17:08:56   at 
> org.apache.flink.runtime.dispatcher.Dispatcher.deliverCoordinationRequestToCoordina2021-09-30T17:08:57.1584224Z
>  ##[error]No space left on device
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-30 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422565#comment-17422565
 ] 

Robert Metzger commented on FLINK-15816:


Yes, I'm using standalone K8s HA.

What's the best way of solving this? Shall we enforce a limit on the cluster 
id, or truncate the label length at 63 characters?
I guess a limit on the cluster id is better.

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.14.1
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: the 
> internal Service and the rest Service. We set the internal Service name to 
> the value of the _kubernetes.cluster-id_ configuration option, and the rest 
> Service name to _${kubernetes.cluster-id}_ with the suffix *-rest* appended. 
> For example, if we set _kubernetes.cluster-id_ to *session-test*, the 
> internal Service name will be *session-test* and the rest Service name 
> *session-test-rest*. Kubernetes requires that a Service name be no more than 
> 63 characters, so under the current naming convention the value of 
> _kubernetes.cluster-id_ must not exceed 58 characters. Otherwise, there are 
> scenarios where the internal Service is created successfully but a 
> ClusterDeploymentException is thrown when trying to create the rest Service.
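The 58-character budget follows directly from the naming scheme. A minimal sketch of the proposed validation (class and method names are assumptions, not Flink's actual code):

```java
// Sketch of the proposed check: a Kubernetes Service name is capped at 63
// characters, and Flink derives "<cluster-id>-rest" from the cluster id, so
// the cluster id itself must stay within 63 - "-rest".length() = 58 characters.
public class ClusterIdValidator {

    static final int K8S_SERVICE_NAME_LIMIT = 63;
    static final String REST_SUFFIX = "-rest";

    static boolean isValidClusterId(String clusterId) {
        return clusterId.length() + REST_SUFFIX.length() <= K8S_SERVICE_NAME_LIMIT;
    }
}
```

Failing fast on this check at configuration time would avoid the half-deployed state where only the internal Service exists.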





[jira] [Updated] (FLINK-24392) Upgrade presto s3 fs implementation to Trino >= 348

2021-09-28 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24392:
---
Description: 
The Presto s3 filesystem implementation currently shipped with Flink doesn't 
support streaming uploads. All data needs to be materialized to a single file 
on disk, before it can be uploaded.
This can lead to situations where TaskManagers are running out of disk when 
creating a savepoint.

The Hadoop filesystem implementation supports streaming uploads (by using 
multipart uploads of smaller (say 100mb) files locally), but it does more API 
calls, leading to other issues.

Trino version >= 348 supports streaming uploads.

During experiments, I also noticed that the current presto s3 fs implementation 
seems to allocate a lot of memory outside the heap (when shipping large data, 
for example when creating a savepoint). On a K8s pod with a memory limit of 
4000Mi, I was not able to run Flink with a "taskmanager.memory.flink.size" 
above 3000m. This means that an additional 1gb of memory needs to be allocated 
just for the peaks in memory allocation when presto s3 is taking a savepoint. 
It would be good to confirm this behavior, and then either adjust the default 
memory configuration or the documentation.

As part of this upgrade, we also need to make sure that the new presto / Trino 
version is not doing substantially more S3 API calls than the current version. 
After switching away from the presto s3 to hadoop s3, I noticed that disposing 
an old checkpoint (~100gb) can take up to 15 minutes. The upgraded presto s3 fs 
should still be able to quickly dispose state.
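The difference between the two upload strategies can be sketched as follows (illustrative only; this is not Flink's, Presto's, or Trino's actual API):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.function.Consumer;

// Illustrative sketch of a streaming upload: buffer data up to a part size
// and hand each full part to an uploader, instead of materializing the whole
// object on local disk first (which is what the current presto s3 fs does).
public class MultipartUploadStream extends OutputStream {

    private final int partSize;
    private final Consumer<byte[]> partUploader; // e.g. one S3 "UploadPart" call
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    MultipartUploadStream(int partSize, Consumer<byte[]> partUploader) {
        this.partSize = partSize;
        this.partUploader = partUploader;
    }

    @Override
    public void write(int b) {
        buffer.write(b);
        if (buffer.size() >= partSize) {
            flushPart();
        }
    }

    private void flushPart() {
        partUploader.accept(buffer.toByteArray());
        buffer = new ByteArrayOutputStream();
    }

    @Override
    public void close() {
        if (buffer.size() > 0) {
            flushPart(); // final, possibly smaller, part
        }
    }
}
```

Each flushed part maps to one API request, which is why the streaming approach trades local disk usage for a higher S3 call count.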

  was:
The Presto s3 filesystem implementation currently shipped with Flink doesn't 
support streaming uploads. All data needs to be materialized to a single file 
on disk, before it can be uploaded.
This can lead to situations where TaskManagers are running out of disk when 
creating a savepoint.

The Hadoop filesystem implementation supports streaming uploads (by using 
multipart uploads of smaller (say 100mb) files locally), but it does more API 
calls, leading to other issues.

Trino version >= 348 supports streaming uploads.

During experiments, I also noticed that the current presto s3 fs implementation 
seems to allocate a lot of memory outside the heap (when shipping large data, 
for example when creating a savepoint). On a K8s pod with a memory limit of 
4000Mi, I was not able to run Flink with a "taskmanager.memory.flink.size" 
above 3000m. This means that an additional 1gb of memory needs to be allocated 
just for the peaks in memory allocation when presto s3 is taking a savepoint.
As part of this upgrade, we also need to make sure that the new presto / Trino 
version is not doing substantially more S3 API calls than the current version. 
After switching away from the presto s3 to hadoop s3, I noticed that disposing 
an old checkpoint (~100gb) can take up to 15 minutes. The upgraded presto s3 fs 
should still be able to quickly dispose state.


> Upgrade presto s3 fs implementation to Trino >= 348
> ---
>
> Key: FLINK-24392
> URL: https://issues.apache.org/jira/browse/FLINK-24392
> Project: Flink
>  Issue Type: Improvement
>  Components: FileSystems
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.15.0
>
>
> The Presto s3 filesystem implementation currently shipped with Flink doesn't 
> support streaming uploads. All data needs to be materialized to a single file 
> on disk, before it can be uploaded.
> This can lead to situations where TaskManagers are running out of disk when 
> creating a savepoint.
> The Hadoop filesystem implementation supports streaming uploads (by using 
> multipart uploads of smaller (say 100mb) files locally), but it does more API 
> calls, leading to other issues.
> Trino version >= 348 supports streaming uploads.
> During experiments, I also noticed that the current presto s3 fs 
> implementation seems to allocate a lot of memory outside the heap (when 
> shipping large data, for example when creating a savepoint). On a K8s pod 
> with a memory limit of 4000Mi, I was not able to run Flink with a 
> "taskmanager.memory.flink.size" above 3000m. This means that an additional 
> 1gb of memory needs to be allocated just for the peaks in memory allocation 
> when presto s3 is taking a savepoint. It would be good to confirm this 
> behavior, and then either adjust the default memory configuration or the 
> documentation.
> As part of this upgrade, we also need to make sure that the new presto / 
> Trino version is not doing substantially more S3 API calls than the current 
> version. After switching away from the presto s3 to hadoop s3, I noticed that 
> disposing an old checkpoint (~100gb) can take up to 15 minutes.

[jira] [Updated] (FLINK-24392) Upgrade presto s3 fs implementation to Trino >= 348

2021-09-28 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24392:
---
Description: 
The Presto s3 filesystem implementation currently shipped with Flink doesn't 
support streaming uploads. All data needs to be materialized to a single file 
on disk, before it can be uploaded.
This can lead to situations where TaskManagers are running out of disk when 
creating a savepoint.

The Hadoop filesystem implementation supports streaming uploads (by using 
multipart uploads of smaller (say 100mb) files locally), but it does more API 
calls, leading to other issues.

Trino version >= 348 supports streaming uploads.

During experiments, I also noticed that the current presto s3 fs implementation 
seems to allocate a lot of memory outside the heap (when shipping large data, 
for example when creating a savepoint). On a K8s pod with a memory limit of 
4000Mi, I was not able to run Flink with a "taskmanager.memory.flink.size" 
above 3000m. This means that an additional 1gb of memory needs to be allocated 
just for the peaks in memory allocation when presto s3 is taking a savepoint.
As part of this upgrade, we also need to make sure that the new presto / Trino 
version is not doing substantially more S3 API calls than the current version. 
After switching away from the presto s3 to hadoop s3, I noticed that disposing 
an old checkpoint (~100gb) can take up to 15 minutes. The upgraded presto s3 fs 
should still be able to quickly dispose state.

  was:
The Presto s3 filesystem implementation currently shipped with Flink doesn't 
support streaming uploads. All data needs to be materialized to a single file 
on disk, before it can be uploaded.
This can lead to situations where TaskManagers are running out of disk when 
creating a savepoint.

The Hadoop filesystem implementation supports streaming uploads (by using 
multipart uploads of smaller (say 100mb) files locally), but it does more API 
calls, leading to other issues.

Trino 348 supports streaming uploads.


> Upgrade presto s3 fs implementation to Trino >= 348
> ---
>
> Key: FLINK-24392
> URL: https://issues.apache.org/jira/browse/FLINK-24392
> Project: Flink
>  Issue Type: Improvement
>  Components: FileSystems
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.15.0
>
>
> The Presto s3 filesystem implementation currently shipped with Flink doesn't 
> support streaming uploads. All data needs to be materialized to a single file 
> on disk, before it can be uploaded.
> This can lead to situations where TaskManagers are running out of disk when 
> creating a savepoint.
> The Hadoop filesystem implementation supports streaming uploads (by using 
> multipart uploads of smaller (say 100mb) files locally), but it does more API 
> calls, leading to other issues.
> Trino version >= 348 supports streaming uploads.
> During experiments, I also noticed that the current presto s3 fs 
> implementation seems to allocate a lot of memory outside the heap (when 
> shipping large data, for example when creating a savepoint). On a K8s pod 
> with a memory limit of 4000Mi, I was not able to run Flink with a 
> "taskmanager.memory.flink.size" above 3000m. This means that an additional 
> 1gb of memory needs to be allocated just for the peaks in memory allocation 
> when presto s3 is taking a savepoint.
> As part of this upgrade, we also need to make sure that the new presto / 
> Trino version is not doing substantially more S3 API calls than the current 
> version. After switching away from the presto s3 to hadoop s3, I noticed that 
> disposing an old checkpoint (~100gb) can take up to 15 minutes. The upgraded 
> presto s3 fs should still be able to quickly dispose state.





[jira] [Updated] (FLINK-24395) Checkpoint trigger time difference between log statement and web frontend

2021-09-28 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24395:
---
Description: 
Consider this checkpoint (68)

{code}
2021-09-28 10:14:43,644 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job 
.
2021-09-28 10:16:41,428 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed 
checkpoint 68 for job  (128940015376 bytes, 
checkpointDuration=540908 ms, finalizationTime=369 ms).
{code}

And what is shown in the UI about it:

 !image-2021-09-28-12-20-34-332.png! 

The trigger time is off by ~7 minutes (the difference in the hours is 
timezone-related). It seems that the trigger message is logged too late.
(note that this has happened in a system where savepoint disposal is very slow)

  was:
Consider this checkpoint (68)

{code}
2021-09-28 10:14:43,644 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job 
.
2021-09-28 10:16:41,428 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed 
checkpoint 68 for job  (128940015376 bytes, 
checkpointDuration=540908 ms, finalizationTime=369 ms).
{code}

And what is shown in the UI about it:

 !image-2021-09-28-12-20-34-332.png! 

The trigger time is off by ~7 minutes. It seems that the trigger message is 
logged too late.
(note that this has happened in a system where savepoint disposal is very slow)


> Checkpoint trigger time difference between log statement and web frontend
> -
>
> Key: FLINK-24395
> URL: https://issues.apache.org/jira/browse/FLINK-24395
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Priority: Major
> Attachments: image-2021-09-28-12-20-34-332.png
>
>
> Consider this checkpoint (68)
> {code}
> 2021-09-28 10:14:43,644 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
> checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job 
> .
> 2021-09-28 10:16:41,428 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed 
> checkpoint 68 for job  (128940015376 bytes, 
> checkpointDuration=540908 ms, finalizationTime=369 ms).
> {code}
> And what is shown in the UI about it:
>  !image-2021-09-28-12-20-34-332.png! 
> The trigger time is off by ~7 minutes (the difference in the hours is 
> timezone-related). It seems that the trigger message is logged too late.
> (note that this has happened in a system where savepoint disposal is very 
> slow)





[jira] [Created] (FLINK-24395) Checkpoint trigger time difference between log statement and web frontend

2021-09-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24395:
--

 Summary: Checkpoint trigger time difference between log statement 
and web frontend
 Key: FLINK-24395
 URL: https://issues.apache.org/jira/browse/FLINK-24395
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Robert Metzger
 Attachments: image-2021-09-28-12-20-34-332.png

Consider this checkpoint (68)

{code}
2021-09-28 10:14:43,644 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job 
.
2021-09-28 10:16:41,428 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed 
checkpoint 68 for job  (128940015376 bytes, 
checkpointDuration=540908 ms, finalizationTime=369 ms).
{code}

And what is shown in the UI about it:

 !image-2021-09-28-12-20-34-332.png! 

The trigger time is off by ~7 minutes. It seems that the trigger message is 
logged too late.
(note that this has happened in a system where savepoint disposal is very slow)
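The gap is visible directly in the log line above: decoding the epoch-millisecond trigger timestamp gives a time roughly seven minutes before the log statement's own timestamp.

```java
import java.time.Instant;

// The "@ 1632823660151" in the Triggering log line is the checkpoint's trigger
// timestamp in epoch milliseconds. Decoding it yields 10:07:40 UTC, while the
// log statement itself carries 10:14:43 -- about seven minutes later.
public class TriggerTimeCheck {

    static Instant triggerInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        System.out.println(triggerInstant(1632823660151L)); // 2021-09-28T10:07:40.151Z
    }
}
```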





[jira] [Updated] (FLINK-24395) Checkpoint trigger time difference between log statement and web frontend

2021-09-28 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24395:
---
Issue Type: Bug  (was: Improvement)

> Checkpoint trigger time difference between log statement and web frontend
> -
>
> Key: FLINK-24395
> URL: https://issues.apache.org/jira/browse/FLINK-24395
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Priority: Major
> Attachments: image-2021-09-28-12-20-34-332.png
>
>
> Consider this checkpoint (68)
> {code}
> 2021-09-28 10:14:43,644 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
> checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job 
> .
> 2021-09-28 10:16:41,428 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed 
> checkpoint 68 for job  (128940015376 bytes, 
> checkpointDuration=540908 ms, finalizationTime=369 ms).
> {code}
> And what is shown in the UI about it:
>  !image-2021-09-28-12-20-34-332.png! 
> The trigger time is off by ~7 minutes. It seems that the trigger message is 
> logged too late.
> (note that this has happened in a system where savepoint disposal is very 
> slow)





[jira] [Created] (FLINK-24392) Upgrade presto s3 fs implementation to Trino >= 348

2021-09-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24392:
--

 Summary: Upgrade presto s3 fs implementation to Trino >= 348
 Key: FLINK-24392
 URL: https://issues.apache.org/jira/browse/FLINK-24392
 Project: Flink
  Issue Type: Improvement
  Components: FileSystems
Affects Versions: 1.14.0
Reporter: Robert Metzger
 Fix For: 1.15.0


The Presto s3 filesystem implementation currently shipped with Flink doesn't 
support streaming uploads. All data needs to be materialized to a single file 
on disk, before it can be uploaded.
This can lead to situations where TaskManagers are running out of disk when 
creating a savepoint.

The Hadoop filesystem implementation supports streaming uploads (by using 
multipart uploads of smaller (say 100mb) files locally), but it does more API 
calls, leading to other issues.

Trino 348 supports streaming uploads.





[jira] [Commented] (FLINK-24320) Show in the Job / Checkpoints / Configuration if checkpoints are incremental

2021-09-23 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419607#comment-17419607
 ] 

Robert Metzger commented on FLINK-24320:


Sorry, I clarified the description. I meant the information under `Job / 
Checkpoints / Configuration`.

The value is not yet in the config API response, so this change also requires 
extending the REST endpoint.
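A hypothetical sketch of the REST-side change (field and class names are assumptions, not Flink's actual checkpoint-config response type):

```java
// Hypothetical sketch: the checkpoint configuration response would carry an
// extra boolean, which the web UI can then render on the Configuration page.
// Names are assumptions, not Flink's actual REST types.
public class CheckpointConfigResponse {

    public final String mode;         // e.g. "exactly_once"
    public final long intervalMs;
    public final boolean incremental; // the new field proposed in this ticket

    public CheckpointConfigResponse(String mode, long intervalMs, boolean incremental) {
        this.mode = mode;
        this.intervalMs = intervalMs;
        this.incremental = incremental;
    }
}
```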

> Show in the Job / Checkpoints / Configuration if checkpoints are incremental
> 
>
> Key: FLINK-24320
> URL: https://issues.apache.org/jira/browse/FLINK-24320
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Web Frontend
>Affects Versions: 1.13.2
>Reporter: Robert Metzger
>Priority: Major
>  Labels: beginner-friendly
> Attachments: image-2021-09-17-13-31-02-148.png, 
> image-2021-09-24-10-49-53-657.png
>
>
> It would be nice if the Configuration page would also show if incremental 
> checkpoints are enabled.





[jira] [Updated] (FLINK-24320) Show in the Job / Checkpoints / Configuration if checkpoints are incremental

2021-09-23 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24320:
---
Description: It would be nice if the Configuration page would also show if 
incremental checkpoints are enabled.  (was: It would be nice if the overview 
would also show if incremental checkpoints are enabled.)

> Show in the Job / Checkpoints / Configuration if checkpoints are incremental
> 
>
> Key: FLINK-24320
> URL: https://issues.apache.org/jira/browse/FLINK-24320
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Web Frontend
>Affects Versions: 1.13.2
>Reporter: Robert Metzger
>Priority: Major
>  Labels: beginner-friendly
> Attachments: image-2021-09-17-13-31-02-148.png, 
> image-2021-09-24-10-49-53-657.png
>
>
> It would be nice if the Configuration page would also show if incremental 
> checkpoints are enabled.





[jira] [Assigned] (FLINK-16504) Add a AWS DynamoDB sink

2021-09-18 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-16504:
--

Assignee: Yuri Gusev

> Add a AWS DynamoDB sink
> ---
>
> Key: FLINK-16504
> URL: https://issues.apache.org/jira/browse/FLINK-16504
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Common
>Reporter: Robert Metzger
>Assignee: Yuri Gusev
>Priority: Minor
>  Labels: pull-request-available
>
> I'm adding this ticket to track the amount of demand for this connector.
> Please comment on this ticket, if you are looking for a DynamoDB connector.





[jira] [Updated] (FLINK-16504) Add a AWS DynamoDB sink

2021-09-18 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-16504:
---
Labels: pull-request-available  (was: auto-deprioritized-major 
auto-unassigned pull-request-available)

> Add a AWS DynamoDB sink
> ---
>
> Key: FLINK-16504
> URL: https://issues.apache.org/jira/browse/FLINK-16504
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Common
>Reporter: Robert Metzger
>Priority: Minor
>  Labels: pull-request-available
>
> I'm adding this ticket to track the amount of demand for this connector.
> Please comment on this ticket if you are looking for a DynamoDB connector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-18 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417129#comment-17417129
 ] 

Robert Metzger commented on FLINK-15816:


Oh, you are right!
I misread the error message. This is the original error I got:
{code}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT 
at: 
https://172.20.0.1/api/v1/namespaces/robert/configmaps/1f68c84c-7c13-47fa-aa33-c799512f358c-troubleshoot-throughput0-fork0-resourcemanager-leader.
 Message: ConfigMap 
"1f68c84c-7c13-47fa-aa33-c799512f358c-troubleshoot-throughput0-fork0-resourcemanager-leader"
 is invalid: metadata.labels: Invalid value: 
"1f68c84c-7c13-47fa-aa33-c799512f358c-troubleshoot-throughput0-fork0": must be 
no more than 63 characters.
{code}
The problem is not the ConfigMap name, but the label Flink is setting on it.
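For illustration, a minimal sketch (hypothetical names, not Flink's actual code) of the 63-character label-value limit that Kubernetes enforced in the error above:

```python
# Hypothetical sketch: Kubernetes rejects label values longer than 63
# characters, which is what the PUT request above failed on.
K8S_MAX_LABEL_LENGTH = 63

def is_valid_label_value(value: str) -> bool:
    """Return True if the value fits Kubernetes' label-value length limit."""
    return len(value) <= K8S_MAX_LABEL_LENGTH

label = "1f68c84c-7c13-47fa-aa33-c799512f358c-troubleshoot-throughput0-fork0"
print(len(label))                   # 67 -> over the 63-character limit
print(is_valid_label_value(label))  # False
```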

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.14.1
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: an 
> internal Service and a rest Service. We set the internal Service name to the 
> value of the _kubernetes.cluster-id_ configuration option, and the rest 
> Service name to _${kubernetes.cluster-id}_ with the suffix *-rest* appended. 
> For example, if _kubernetes.cluster-id_ is set to *session-test*, the 
> internal Service name will be *session-test* and the rest Service name 
> *session-test-rest*. Kubernetes requires that a Service name be no more than 
> 63 characters, so under the current naming convention the value of 
> _kubernetes.cluster-id_ must not exceed 58 characters; otherwise the 
> internal Service may be created successfully, and a 
> ClusterDeploymentException is then thrown when creating the rest Service.
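The naming constraint described above can be sketched as follows (illustrative Python, not Flink code):

```python
# Sketch of the constraint: the rest Service name is "<cluster-id>-rest" and
# Kubernetes caps Service names at 63 characters, so the cluster-id itself
# must not exceed 63 - len("-rest") = 58 characters.
K8S_MAX_SERVICE_NAME = 63
REST_SUFFIX = "-rest"
MAX_CLUSTER_ID_LENGTH = K8S_MAX_SERVICE_NAME - len(REST_SUFFIX)

def rest_service_name(cluster_id: str) -> str:
    """Build the rest Service name, rejecting cluster-ids that would
    produce a Service name over the Kubernetes limit."""
    if len(cluster_id) > MAX_CLUSTER_ID_LENGTH:
        raise ValueError(
            f"kubernetes.cluster-id must be at most "
            f"{MAX_CLUSTER_ID_LENGTH} characters"
        )
    return cluster_id + REST_SUFFIX

print(MAX_CLUSTER_ID_LENGTH)              # 58
print(rest_service_name("session-test"))  # session-test-rest
```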



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-24129) TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure

2021-09-17 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-24129.
--
Resolution: Fixed

> TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure
> 
>
> Key: FLINK-24129
> URL: https://issues.apache.org/jira/browse/FLINK-24129
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Pulsar
>Affects Versions: 1.14.0
>Reporter: Till Rohrmann
>Assignee: David Morávek
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0, 1.15.0
>
>
> The test {{TopicRangeTest.rangeCreationHaveALimitedScope}} fails on Azure with
> {code}
> [ERROR] Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 0.041 s <<< FAILURE! - in 
> org.apache.flink.connector.pulsar.source.enumerator.topic.TopicRangeTest
> 2021-09-02T12:41:55.9478399Z Sep 02 12:41:55 [ERROR] 
> rangeCreationHaveALimitedScope[4]  Time elapsed: 0.025 s  <<< FAILURE!
> 2021-09-02T12:41:55.9478983Z Sep 02 12:41:55 
> org.opentest4j.AssertionFailedError: Expected 
> java.lang.IllegalArgumentException to be thrown, but nothing was thrown.
> 2021-09-02T12:41:55.9479519Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:71)
> 2021-09-02T12:41:55.9479983Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:37)
> 2021-09-02T12:41:55.9480449Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3007)
> 2021-09-02T12:41:55.9481013Z Sep 02 12:41:55  at 
> org.apache.flink.connector.pulsar.source.enumerator.topic.TopicRangeTest.rangeCreationHaveALimitedScope(TopicRangeTest.java:44)
> 2021-09-02T12:41:55.9482349Z Sep 02 12:41:55  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-09-02T12:41:55.9483361Z Sep 02 12:41:55  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-09-02T12:41:55.9483969Z Sep 02 12:41:55  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-09-02T12:41:55.9484594Z Sep 02 12:41:55  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-09-02T12:41:55.9485051Z Sep 02 12:41:55  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2021-09-02T12:41:55.9485595Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2021-09-02T12:41:55.9486194Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2021-09-02T12:41:55.9486952Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2021-09-02T12:41:55.9487837Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2021-09-02T12:41:55.9488774Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2021-09-02T12:41:55.9489775Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2021-09-02T12:41:55.9490737Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2021-09-02T12:41:55.9491693Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2021-09-02T12:41:55.9492584Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2021-09-02T12:41:55.9493353Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2021-09-02T12:41:55.9493957Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2021-09-02T12:41:55.9494608Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2021-09-02T12:41:55.9495132Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
> 2021-09-02T12:41:55.9495735Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
> 2021-09-02T12:41:55.9496357Z Sep 02 12:41:55  at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> 2021-09-02T12:41:55.9497188Z Sep 02 12:41:55  

[jira] [Commented] (FLINK-24129) TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416834#comment-17416834
 ] 

Robert Metzger commented on FLINK-24129:


merged to release-1.14 in 
https://github.com/apache/flink/commit/64902ea420a7785383258b0d7fa7922f7cec2c85

> TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure
> 
>
> Key: FLINK-24129
> URL: https://issues.apache.org/jira/browse/FLINK-24129
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Pulsar
>Affects Versions: 1.14.0
>Reporter: Till Rohrmann
>Assignee: David Morávek
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0, 1.15.0
>
>
> The test {{TopicRangeTest.rangeCreationHaveALimitedScope}} fails on Azure with
> {code}
> [ERROR] Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 0.041 s <<< FAILURE! - in 
> org.apache.flink.connector.pulsar.source.enumerator.topic.TopicRangeTest
> 2021-09-02T12:41:55.9478399Z Sep 02 12:41:55 [ERROR] 
> rangeCreationHaveALimitedScope[4]  Time elapsed: 0.025 s  <<< FAILURE!
> 2021-09-02T12:41:55.9478983Z Sep 02 12:41:55 
> org.opentest4j.AssertionFailedError: Expected 
> java.lang.IllegalArgumentException to be thrown, but nothing was thrown.
> 2021-09-02T12:41:55.9479519Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:71)
> 2021-09-02T12:41:55.9479983Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:37)
> 2021-09-02T12:41:55.9480449Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3007)
> 2021-09-02T12:41:55.9481013Z Sep 02 12:41:55  at 
> org.apache.flink.connector.pulsar.source.enumerator.topic.TopicRangeTest.rangeCreationHaveALimitedScope(TopicRangeTest.java:44)
> 2021-09-02T12:41:55.9482349Z Sep 02 12:41:55  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-09-02T12:41:55.9483361Z Sep 02 12:41:55  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-09-02T12:41:55.9483969Z Sep 02 12:41:55  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-09-02T12:41:55.9484594Z Sep 02 12:41:55  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-09-02T12:41:55.9485051Z Sep 02 12:41:55  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2021-09-02T12:41:55.9485595Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2021-09-02T12:41:55.9486194Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2021-09-02T12:41:55.9486952Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2021-09-02T12:41:55.9487837Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2021-09-02T12:41:55.9488774Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2021-09-02T12:41:55.9489775Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2021-09-02T12:41:55.9490737Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2021-09-02T12:41:55.9491693Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2021-09-02T12:41:55.9492584Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2021-09-02T12:41:55.9493353Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2021-09-02T12:41:55.9493957Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2021-09-02T12:41:55.9494608Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2021-09-02T12:41:55.9495132Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
> 2021-09-02T12:41:55.9495735Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
> 2021-09-02T12:41:55.9496357Z Sep 02 12:41:55  at 
> org.j

[jira] [Commented] (FLINK-24129) TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416733#comment-17416733
 ] 

Robert Metzger commented on FLINK-24129:


Merged to master in 
https://github.com/apache/flink/commit/e18d2731a637b6f6d7f984221e95f02fb68b4e20

> TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure
> 
>
> Key: FLINK-24129
> URL: https://issues.apache.org/jira/browse/FLINK-24129
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Pulsar
>Affects Versions: 1.14.0
>Reporter: Till Rohrmann
>Assignee: David Morávek
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0, 1.15.0
>
>
> The test {{TopicRangeTest.rangeCreationHaveALimitedScope}} fails on Azure with
> {code}
> [ERROR] Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 0.041 s <<< FAILURE! - in 
> org.apache.flink.connector.pulsar.source.enumerator.topic.TopicRangeTest
> 2021-09-02T12:41:55.9478399Z Sep 02 12:41:55 [ERROR] 
> rangeCreationHaveALimitedScope[4]  Time elapsed: 0.025 s  <<< FAILURE!
> 2021-09-02T12:41:55.9478983Z Sep 02 12:41:55 
> org.opentest4j.AssertionFailedError: Expected 
> java.lang.IllegalArgumentException to be thrown, but nothing was thrown.
> 2021-09-02T12:41:55.9479519Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:71)
> 2021-09-02T12:41:55.9479983Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:37)
> 2021-09-02T12:41:55.9480449Z Sep 02 12:41:55  at 
> org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3007)
> 2021-09-02T12:41:55.9481013Z Sep 02 12:41:55  at 
> org.apache.flink.connector.pulsar.source.enumerator.topic.TopicRangeTest.rangeCreationHaveALimitedScope(TopicRangeTest.java:44)
> 2021-09-02T12:41:55.9482349Z Sep 02 12:41:55  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-09-02T12:41:55.9483361Z Sep 02 12:41:55  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-09-02T12:41:55.9483969Z Sep 02 12:41:55  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-09-02T12:41:55.9484594Z Sep 02 12:41:55  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-09-02T12:41:55.9485051Z Sep 02 12:41:55  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2021-09-02T12:41:55.9485595Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2021-09-02T12:41:55.9486194Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2021-09-02T12:41:55.9486952Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2021-09-02T12:41:55.9487837Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2021-09-02T12:41:55.9488774Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2021-09-02T12:41:55.9489775Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2021-09-02T12:41:55.9490737Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2021-09-02T12:41:55.9491693Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2021-09-02T12:41:55.9492584Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2021-09-02T12:41:55.9493353Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2021-09-02T12:41:55.9493957Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2021-09-02T12:41:55.9494608Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2021-09-02T12:41:55.9495132Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
> 2021-09-02T12:41:55.9495735Z Sep 02 12:41:55  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
> 2021-09-02T12:41:55.9496357Z Sep 02 12:41:55  at 
> org.junit.p

[jira] [Closed] (FLINK-23969) Test Pulsar source end 2 end

2021-09-17 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-23969.
--
Fix Version/s: 1.15.0
   Resolution: Fixed

> Test Pulsar source end 2 end
> 
>
> Key: FLINK-23969
> URL: https://issues.apache.org/jira/browse/FLINK-23969
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Pulsar
>Reporter: Arvid Heise
>Assignee: Liu
>Priority: Critical
>  Labels: pull-request-available, release-testing
> Fix For: 1.14.0, 1.15.0
>
>
> Write a test application using the Pulsar Source and execute it in a 
> distributed fashion. Check fault tolerance by crashing and restarting a TM.
> Ideally, we test different subscription modes and sticky keys in particular.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-23969) Test Pulsar source end 2 end

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416729#comment-17416729
 ] 

Robert Metzger commented on FLINK-23969:


Merged to release-1.14: 
https://github.com/apache/flink/commit/a2b612af84ac358592db6e52cf14bcd718a16fbb

> Test Pulsar source end 2 end
> 
>
> Key: FLINK-23969
> URL: https://issues.apache.org/jira/browse/FLINK-23969
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Pulsar
>Reporter: Arvid Heise
>Assignee: Liu
>Priority: Critical
>  Labels: pull-request-available, release-testing
> Fix For: 1.14.0
>
>
> Write a test application using the Pulsar Source and execute it in a 
> distributed fashion. Check fault tolerance by crashing and restarting a TM.
> Ideally, we test different subscription modes and sticky keys in particular.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24248) flink-clients dependency missing in Gradle Example

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416728#comment-17416728
 ] 

Robert Metzger commented on FLINK-24248:


Merged to master in 
https://github.com/apache/flink/commit/0bbc91a2a960cac7e9849eba7ff3e6d8085812be

> flink-clients dependency missing in Gradle Example
> --
>
> Key: FLINK-24248
> URL: https://issues.apache.org/jira/browse/FLINK-24248
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.13.2
>Reporter: Konstantin Knauf
>Assignee: Daisy Tsang
>Priority: Critical
>  Labels: pull-request-available
>
> The Gradle example on the "Project Configuration" page is missing 
> ```
>   compile 
> "org.apache.flink:flink-clients_${scalaBinaryVersion}:${flinkVersion}"
> ```
> which is required to run the program locally. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-24248) flink-clients dependency missing in Gradle Example

2021-09-17 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-24248.
--
Fix Version/s: 1.15.0
   Resolution: Fixed

> flink-clients dependency missing in Gradle Example
> --
>
> Key: FLINK-24248
> URL: https://issues.apache.org/jira/browse/FLINK-24248
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.13.2
>Reporter: Konstantin Knauf
>Assignee: Daisy Tsang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> The Gradle example on the "Project Configuration" page is missing 
> ```
>   compile 
> "org.apache.flink:flink-clients_${scalaBinaryVersion}:${flinkVersion}"
> ```
> which is required to run the program locally. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-24292) Update Flink's Kafka examples to use KafkaSink

2021-09-17 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-24292.
--
Fix Version/s: 1.15.0
   1.14.0
   Resolution: Fixed

> Update Flink's Kafka examples to use KafkaSink
> --
>
> Key: FLINK-24292
> URL: https://issues.apache.org/jira/browse/FLINK-24292
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.14.0, 1.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24292) Update Flink's Kafka examples to use KafkaSink

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416684#comment-17416684
 ] 

Robert Metzger commented on FLINK-24292:


release-1.14: 
https://github.com/apache/flink/commit/06d4828423e2d4e29fe5ddf5710ca651805e5d7a

> Update Flink's Kafka examples to use KafkaSink
> --
>
> Key: FLINK-24292
> URL: https://issues.apache.org/jira/browse/FLINK-24292
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24292) Update Flink's Kafka examples to use KafkaSink

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416678#comment-17416678
 ] 

Robert Metzger commented on FLINK-24292:


Merged to master / 1.15 in 
https://github.com/apache/flink/commit/a83f8f41a5247112304370e1ce4a5fcd5ef019dd

> Update Flink's Kafka examples to use KafkaSink
> --
>
> Key: FLINK-24292
> URL: https://issues.apache.org/jira/browse/FLINK-24292
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-24281) Migrate all existing tests to new Kafka Sink

2021-09-17 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-24281.
--
Fix Version/s: 1.14.1
   1.15.0
   Resolution: Fixed

> Migrate all existing tests to new Kafka Sink
> 
>
> Key: FLINK-24281
> URL: https://issues.apache.org/jira/browse/FLINK-24281
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.14.1
>
>
> The FlinkKafkaProducer has been deprecated since 1.14, but a lot of existing 
> tests are still using it.
> We should replace it with the KafkaSink, which completely subsumes it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24281) Migrate all existing tests to new Kafka Sink

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416675#comment-17416675
 ] 

Robert Metzger commented on FLINK-24281:


Merged to release-1.14 in 5dd99eddef34e0f90ed9a1bc8648735bd464c4b4 
83fd46fbdbb864984e2d0134fa7151e1e5c13f77

> Migrate all existing tests to new Kafka Sink
> 
>
> Key: FLINK-24281
> URL: https://issues.apache.org/jira/browse/FLINK-24281
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Major
>  Labels: pull-request-available
>
> The FlinkKafkaProducer has been deprecated since 1.14, but a lot of existing 
> tests are still using it.
> We should replace it with the KafkaSink, which completely subsumes it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24281) Migrate all existing tests to new Kafka Sink

2021-09-17 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416673#comment-17416673
 ] 

Robert Metzger commented on FLINK-24281:


Merged to 1.15 / master in: 4a895520f4af59dc5d9d30d155c93623de2fe819 
f82b2a9d7ea745c85bdcb6f13394a7cd8f9d7379

> Migrate all existing tests to new Kafka Sink
> 
>
> Key: FLINK-24281
> URL: https://issues.apache.org/jira/browse/FLINK-24281
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Major
>  Labels: pull-request-available
>
> The FlinkKafkaProducer has been deprecated since 1.14, but a lot of existing 
> tests are still using it.
> We should replace it with the KafkaSink, which completely subsumes it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-24320) Show in the Job / Checkpoints / Configuration if checkpoints are incremental

2021-09-17 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24320:
---
Attachment: (was: image-2021-09-17-13-31-32-311.png)

> Show in the Job / Checkpoints / Configuration if checkpoints are incremental
> 
>
> Key: FLINK-24320
> URL: https://issues.apache.org/jira/browse/FLINK-24320
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Web Frontend
>Affects Versions: 1.13.2
>Reporter: Robert Metzger
>Priority: Major
> Attachments: image-2021-09-17-13-31-02-148.png
>
>
> It would be nice if the overview would also show if incremental checkpoints 
> are enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24320) Show in the Job / Checkpoints / Configuration if checkpoints are incremental

2021-09-17 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24320:
--

 Summary: Show in the Job / Checkpoints / Configuration if 
checkpoints are incremental
 Key: FLINK-24320
 URL: https://issues.apache.org/jira/browse/FLINK-24320
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / Web Frontend
Affects Versions: 1.13.2
Reporter: Robert Metzger
 Attachments: image-2021-09-17-13-31-02-148.png, 
image-2021-09-17-13-31-32-311.png

It would be nice if the overview would also show if incremental checkpoints are 
enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-16 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416020#comment-17416020
 ] 

Robert Metzger commented on FLINK-15816:


I reopened the issue, because I believe the "MAXIMUM_CHARACTERS_OF_CLUSTER_ID" 
is wrong. 

The longest string Flink appends is "-resourcemanager-leader" for the K8s HA 
config maps, which is 24 characters long.
63 - 24 = 39

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.14.1
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: an 
> internal Service and a rest Service. We set the internal Service name to the 
> value of the _kubernetes.cluster-id_ configuration option, and the rest 
> Service name to _${kubernetes.cluster-id}_ with the suffix *-rest* appended. 
> For example, if _kubernetes.cluster-id_ is set to *session-test*, the 
> internal Service name will be *session-test* and the rest Service name 
> *session-test-rest*. Kubernetes requires that a Service name be no more than 
> 63 characters, so under the current naming convention the value of 
> _kubernetes.cluster-id_ must not exceed 58 characters; otherwise the 
> internal Service may be created successfully, and a 
> ClusterDeploymentException is then thrown when creating the rest Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-16 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416020#comment-17416020
 ] 

Robert Metzger edited comment on FLINK-15816 at 9/16/21, 9:56 AM:
--

I reopened the issue, because I believe the "MAXIMUM_CHARACTERS_OF_CLUSTER_ID" 
is wrong for the K8s HA.

The longest string Flink appends is "-resourcemanager-leader" for the K8s HA 
config maps, which is 24 characters long.
63 - 24 = 39


was (Author: rmetzger):
I reopened the issue, because I believe the "MAXIMUM_CHARACTERS_OF_CLUSTER_ID" 
is wrong. 

The longest string Flink appends is "-resourcemanager-leader" for the K8s HA 
config maps, which is 24 characters long.
63 - 24 = 39

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.14.1
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: 
> one is the internal Service and the other is the rest Service. We set the 
> internal Service name to the value of the _kubernetes.cluster-id_ 
> configuration option and the rest Service name to _${kubernetes.cluster-id}_ 
> with a *-rest* suffix appended. For example, if we set 
> _kubernetes.cluster-id_ to *session-test*, the internal Service name will be 
> *session-test* and the rest Service name will be *session-test-rest*. 
> Kubernetes constrains Service names to at most 63 characters, so under the 
> current naming convention the value of _kubernetes.cluster-id_ must not 
> exceed 58 characters; otherwise the internal Service may be created 
> successfully and then a ClusterDeploymentException is thrown when creating 
> the rest Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-16 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reopened FLINK-15816:

  Assignee: (was: Canbin Zheng)

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0
>Reporter: Canbin Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: 
> one is the internal Service and the other is the rest Service. We set the 
> internal Service name to the value of the _kubernetes.cluster-id_ 
> configuration option and the rest Service name to _${kubernetes.cluster-id}_ 
> with a *-rest* suffix appended. For example, if we set 
> _kubernetes.cluster-id_ to *session-test*, the internal Service name will be 
> *session-test* and the rest Service name will be *session-test-rest*. 
> Kubernetes constrains Service names to at most 63 characters, so under the 
> current naming convention the value of _kubernetes.cluster-id_ must not 
> exceed 58 characters; otherwise the internal Service may be created 
> successfully and then a ClusterDeploymentException is thrown when creating 
> the rest Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-16 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-15816:
---
Priority: Major  (was: Minor)

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.14.1
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: 
> one is the internal Service and the other is the rest Service. We set the 
> internal Service name to the value of the _kubernetes.cluster-id_ 
> configuration option and the rest Service name to _${kubernetes.cluster-id}_ 
> with a *-rest* suffix appended. For example, if we set 
> _kubernetes.cluster-id_ to *session-test*, the internal Service name will be 
> *session-test* and the rest Service name will be *session-test-rest*. 
> Kubernetes constrains Service names to at most 63 characters, so under the 
> current naming convention the value of _kubernetes.cluster-id_ must not 
> exceed 58 characters; otherwise the internal Service may be created 
> successfully and then a ClusterDeploymentException is thrown when creating 
> the rest Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2021-09-16 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-15816:
---
Affects Version/s: 1.14.1

> Limit the maximum length of the value of kubernetes.cluster-id configuration 
> option
> ---
>
> Key: FLINK-15816
> URL: https://issues.apache.org/jira/browse/FLINK-15816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.14.1
>Reporter: Canbin Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2020-01-31-20-54-33-340.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two Kubernetes Services are created when a session cluster is deployed: 
> one is the internal Service and the other is the rest Service. We set the 
> internal Service name to the value of the _kubernetes.cluster-id_ 
> configuration option and the rest Service name to _${kubernetes.cluster-id}_ 
> with a *-rest* suffix appended. For example, if we set 
> _kubernetes.cluster-id_ to *session-test*, the internal Service name will be 
> *session-test* and the rest Service name will be *session-test-rest*. 
> Kubernetes constrains Service names to at most 63 characters, so under the 
> current naming convention the value of _kubernetes.cluster-id_ must not 
> exceed 58 characters; otherwise the internal Service may be created 
> successfully and then a ClusterDeploymentException is thrown when creating 
> the rest Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24208) Allow idempotent savepoint triggering

2021-09-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24208:
--

 Summary: Allow idempotent savepoint triggering
 Key: FLINK-24208
 URL: https://issues.apache.org/jira/browse/FLINK-24208
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Reporter: Robert Metzger


As a user of Flink, I want to be able to trigger a savepoint from an external 
system in a way that I can detect if I have requested this savepoint already.

By passing a custom ID with the savepoint request, I can check (in case the 
original request or the external system fails) whether the request has already 
been made.
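The deduplication this issue asks for can be sketched in a few lines. Everything here is hypothetical (the class, the client-supplied ID, and the savepoint-path placeholder); it only illustrates the idempotency contract, not Flink's API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of idempotent savepoint triggering: the caller supplies its own
// trigger ID, and a repeated request with the same ID returns the result of
// the original request instead of starting a second savepoint.
public class IdempotentSavepointTrigger {
    private final Map<String, String> triggeredSavepoints = new ConcurrentHashMap<>();

    /** Returns the savepoint path, triggering only on the first call per ID. */
    public String trigger(String clientSuppliedId) {
        return triggeredSavepoints.computeIfAbsent(
                clientSuppliedId,
                id -> "s3://savepoints/" + id); // placeholder for the real trigger
    }
}
```

With this contract, an external system that crashes after sending the request can safely retry with the same ID.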




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-24113) Introduce option in Application Mode to disable shutdown

2021-09-08 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24113:
---
Summary: Introduce option in Application Mode to disable shutdown  (was: 
Introduce option in Application Mode to request cluster shutdown)

> Introduce option in Application Mode to disable shutdown
> 
>
> Key: FLINK-24113
> URL: https://issues.apache.org/jira/browse/FLINK-24113
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Robert Metzger
>Priority: Major
>
> Currently a Flink JobManager started in Application Mode will shut down once 
> the job has completed.
> When doing a "stop with savepoint" operation, we want to keep the JobManager 
> alive after the job has stopped to retrieve and persist the final savepoint 
> location.
> Currently, Flink waits up to 5 minutes and then shuts down.
> I'm proposing to introduce a new configuration flag "application mode 
> shutdown behavior": "keepalive" (naming things is hard ;) ) which will keep 
> the JobManager in ApplicationMode running until a REST call confirms that it 
> can shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-24113) Introduce option in Application Mode to request cluster shutdown

2021-09-08 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408654#comment-17408654
 ] 

Robert Metzger edited comment on FLINK-24113 at 9/8/21, 7:35 AM:
-

[~chesnay] I didn't know this DELETE /cluster REST call.
Given this new information, yes, we just need a flag to disable the shutdown in 
Application Mode.

[~wangyang0918] Just waiting for the result retrieval is not enough. We would 
need an additional REST call confirming that the result has been retrieved and 
persisted in an external system. However, such an option + the additional REST 
calls seem out of scope of the "savepoint" operation.


was (Author: rmetzger):
[~chesnay] I didn't know this DELETE /cluster REST call.
Given this new information, yes, we just need a flag do disable the shutdown in 
Application Mode.

[~wangyang0918] Just waiting for the result retrieval is not enough. We would 
need an additional REST call confirming that the result has been retrieved and 
persisted in an external system. However, such an option + the additional REST 
calls seem out of scope of the "savepoint" operation.

> Introduce option in Application Mode to request cluster shutdown
> 
>
> Key: FLINK-24113
> URL: https://issues.apache.org/jira/browse/FLINK-24113
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Robert Metzger
>Priority: Major
>
> Currently a Flink JobManager started in Application Mode will shut down once 
> the job has completed.
> When doing a "stop with savepoint" operation, we want to keep the JobManager 
> alive after the job has stopped to retrieve and persist the final savepoint 
> location.
> Currently, Flink waits up to 5 minutes and then shuts down.
> I'm proposing to introduce a new configuration flag "application mode 
> shutdown behavior": "keepalive" (naming things is hard ;) ) which will keep 
> the JobManager in ApplicationMode running until a REST call confirms that it 
> can shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24114) Make CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS configurable (at least for savepoint trigger operations)

2021-09-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408749#comment-17408749
 ] 

Robert Metzger commented on FLINK-24114:


True, I could query it from the checkpoints endpoint. However, this 
indicates that the /savepoint endpoints are semantically broken: why do I 
trigger & poll against the /savepoint endpoint, and then retrieve the final 
result from the /checkpoints endpoint (and there, I guess, from a list of 
checkpoints)?


> Make 
> CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS 
> configurable (at least for savepoint trigger operations)
> ---
>
> Key: FLINK-24114
> URL: https://issues.apache.org/jira/browse/FLINK-24114
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Robert Metzger
>Priority: Major
>
> Currently, external services that trigger savepoints may fail to persist 
> the savepoint location returned by the savepoint handler, because the 
> operation cache evicts entries (which have been accessed at least once) 
> after a hardcoded 5 minutes.
> To avoid scenarios where the savepoint location has been accessed but the 
> external system failed to persist it, I propose to make this eviction 
> timeout configurable (so that, as a user, I can configure a cache eviction 
> value of 24 hours).
> (This is related to FLINK-24113)
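The proposal above amounts to replacing a hardcoded retention constant with a configurable one. The following is a sketch only; Flink's actual CompletedOperationCache differs, and all names here are illustrative:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a result cache whose retention window is configurable
// (e.g. 24 hours) instead of a hardcoded 5 minutes.
public class CompletedOperationCache<V> {
    record Entry<V>(V value, Instant storedAt) {}

    private final Duration retention;  // configurable eviction timeout
    private final Map<String, Entry<V>> cache = new ConcurrentHashMap<>();

    CompletedOperationCache(Duration retention) {
        this.retention = retention;
    }

    void put(String operationId, V value) {
        cache.put(operationId, new Entry<>(value, Instant.now()));
    }

    /** Returns the cached result, or null once the retention window elapsed. */
    V get(String operationId, Instant now) {
        Entry<V> e = cache.get(operationId);
        if (e == null || now.isAfter(e.storedAt().plus(retention))) {
            return null;
        }
        return e.value();
    }
}
```

A long retention window gives a slow external system many chances to read and persist the savepoint location before the entry disappears.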



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-21383) Docker image does not play well together with ConfigMap based flink-conf.yamls

2021-09-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408691#comment-17408691
 ] 

Robert Metzger commented on FLINK-21383:


+1 for fixing this issue, as it might confuse users

> Docker image does not play well together with ConfigMap based flink-conf.yamls
> --
>
> Key: FLINK-21383
> URL: https://issues.apache.org/jira/browse/FLINK-21383
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, flink-docker
>Affects Versions: 1.11.3, 1.12.1, 1.13.0
>Reporter: Till Rohrmann
>Priority: Minor
>  Labels: auto-deprioritized-major, usability
>
> Flink's Docker image does not play well together with ConfigMap based 
> flink-conf.yamls. The {{docker-entrypoint.sh}} script offers a few env 
> variables to overwrite configuration values (e.g. {{FLINK_PROPERTIES}}, 
> {{JOB_MANAGER_RPC_ADDRESS}}, etc.). The problem is that the entrypoint script 
> assumes that it can modify the existing {{flink-conf.yaml}}. This is not the 
> case if the {{flink-conf.yaml}} is based on a {{ConfigMap}}.
> Making things worse, failures updating the {{flink-conf.yaml}} are not 
> reported. Moreover, the called {{jobmanager.sh}} and {{taskmanager.sh}} 
> scripts don't support passing dynamic configuration properties into the 
> processes.
> I think the problem is that our assumption that we can modify the 
> {{flink-conf.yaml}} does not always hold true. If we updated the final 
> configuration from within the Flink process (dynamic properties and env 
> variables), then this problem could be avoided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24113) Introduce option in Application Mode to request cluster shutdown

2021-09-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408654#comment-17408654
 ] 

Robert Metzger commented on FLINK-24113:


[~chesnay] I didn't know this DELETE /cluster REST call.
Given this new information, yes, we just need a flag do disable the shutdown in 
Application Mode.

[~wangyang0918] Just waiting for the result retrieval is not enough. We would 
need an additional REST call confirming that the result has been retrieved and 
persisted in an external system. However, such an option + the additional REST 
calls seem out of scope of the "savepoint" operation.

> Introduce option in Application Mode to request cluster shutdown
> 
>
> Key: FLINK-24113
> URL: https://issues.apache.org/jira/browse/FLINK-24113
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Robert Metzger
>Priority: Major
>
> Currently a Flink JobManager started in Application Mode will shut down once 
> the job has completed.
> When doing a "stop with savepoint" operation, we want to keep the JobManager 
> alive after the job has stopped to retrieve and persist the final savepoint 
> location.
> Currently, Flink waits up to 5 minutes and then shuts down.
> I'm proposing to introduce a new configuration flag "application mode 
> shutdown behavior": "keepalive" (naming things is hard ;) ) which will keep 
> the JobManager in ApplicationMode running until a REST call confirms that it 
> can shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22262) Flink on Kubernetes ConfigMaps are created without OwnerReference

2021-09-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408643#comment-17408643
 ] 

Robert Metzger commented on FLINK-22262:


Thanks a lot for your response. I hadn't considered the deletion of the HA 
storage; thanks for mentioning it.
Given that, I'll consider changing my operator to implement cancellation 
through a proper cancel REST call.

So for now, I won't need the feature of setting owner references.

> Flink on Kubernetes ConfigMaps are created without OwnerReference
> -
>
> Key: FLINK-22262
> URL: https://issues.apache.org/jira/browse/FLINK-22262
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.0
>Reporter: Andrea Peruffo
>Priority: Minor
>  Labels: auto-deprioritized-major
> Attachments: jm.log
>
>
> According to the documentation:
> [https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#manual-resource-cleanup]
> The ConfigMaps created along with the Flink deployment are supposed to have an 
> OwnerReference pointing to the Deployment itself. Unfortunately, this doesn't 
> happen, which causes all sorts of issues when the classpath and the jars of the 
> job are updated.
> i.e.:
> Without manually removing the ConfigMap of the Job I cannot update the Jars 
> of the Job.
> Can you please give guidance if there are additional caveats on manually 
> removing the ConfigMap? Any other workaround that can be used?
> Thanks in advance.
> Example ConfigMap:
> {{apiVersion: v1}}
> {{data:}}
> {{ address: akka.tcp://flink@10.0.2.13:6123/user/rpc/jobmanager_2}}
> {{ checkpointID-049: 
> rO0ABXNyADtvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuUmV0cmlldmFibGVTdHJlYW1TdGF0ZUhhbmRsZQABHhjxVZcrAgABTAAYd3JhcHBlZFN0cmVhbVN0YXRlSGFuZGxldAAyTG9yZy9hcGFjaGUvZmxpbmsvcnVudGltZS9zdGF0ZS9TdHJlYW1TdGF0ZUhhbmRsZTt4cHNyADlvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuZmlsZXN5c3RlbS5GaWxlU3RhdGVIYW5kbGUE3HXYYr0bswIAAkoACXN0YXRlU2l6ZUwACGZpbGVQYXRodAAfTG9yZy9hcGFjaGUvZmxpbmsvY29yZS9mcy9QYXRoO3hwAAABOEtzcgAdb3JnLmFwYWNoZS5mbGluay5jb3JlLmZzLlBhdGgAAQIAAUwAA3VyaXQADkxqYXZhL25ldC9VUkk7eHBzcgAMamF2YS5uZXQuVVJJrAF4LkOeSasDAAFMAAZzdHJpbmd0ABJMamF2YS9sYW5nL1N0cmluZzt4cHQAUC9tbnQvZmxpbmsvc3RvcmFnZS9rc2hhL3RheGktcmlkZS1mYXJlLXByb2Nlc3Nvci9jb21wbGV0ZWRDaGVja3BvaW50MDQ0YTc2OWRkNDgxeA==}}
> {{ counter: "50"}}
> {{ sessionId: 0c2b69ee-6b41-48d3-b7fd-1bf2eda94f0f}}
> {{kind: ConfigMap}}
> {{metadata:}}
> {{ annotations:}}
> {{ control-plane.alpha.kubernetes.io/leader: 
> '\{"holderIdentity":"0f25a2cc-e212-46b0-8ba9-faac0732a316","leaseDuration":15.0,"acquireTime":"2021-04-13T14:30:51.439000Z","renewTime":"2021-04-13T14:39:32.011000Z","leaderTransitions":105}'}}
> {{ creationTimestamp: "2021-04-13T14:30:51Z"}}
> {{ labels:}}
> {{ app: taxi-ride-fare-processor}}
> {{ configmap-type: high-availability}}
> {{ type: flink-native-kubernetes}}
> {{ name: 
> taxi-ride-fare-processor--jobmanager-leader}}
> {{ namespace: taxi-ride-fare}}
> {{ resourceVersion: "64100"}}
> {{ selfLink: 
> /api/v1/namespaces/taxi-ride-fare/configmaps/taxi-ride-fare-processor--jobmanager-leader}}
> {{ uid: 9f912495-382a-45de-a789-fd5ad2a2459d}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24114) Make CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS configurable (at least for savepoint trigger operations)

2021-09-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24114:
--

 Summary: Make 
CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS 
configurable (at least for savepoint trigger operations)
 Key: FLINK-24114
 URL: https://issues.apache.org/jira/browse/FLINK-24114
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Robert Metzger


Currently, external services that trigger savepoints may fail to persist the 
savepoint location returned by the savepoint handler, because the operation 
cache evicts entries (which have been accessed at least once) after a 
hardcoded 5 minutes.
To avoid scenarios where the savepoint location has been accessed but the 
external system failed to persist it, I propose to make this eviction timeout 
configurable (so that, as a user, I can configure a cache eviction value of 
24 hours).

(This is related to FLINK-24113)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24113) Introduce option in Application Mode to request cluster shutdown

2021-09-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24113:
--

 Summary: Introduce option in Application Mode to request cluster 
shutdown
 Key: FLINK-24113
 URL: https://issues.apache.org/jira/browse/FLINK-24113
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Robert Metzger


Currently a Flink JobManager started in Application Mode will shut down once 
the job has completed.

When doing a "stop with savepoint" operation, we want to keep the JobManager 
alive after the job has stopped to retrieve and persist the final savepoint 
location.
Currently, Flink waits up to 5 minutes and then shuts down.

I'm proposing to introduce a new configuration flag "application mode shutdown 
behavior": "keepalive" (naming things is hard ;) ) which will keep the 
JobManager in ApplicationMode running until a REST call confirms that it can 
shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22262) Flink on Kubernetes ConfigMaps are created without OwnerReference

2021-09-01 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408210#comment-17408210
 ] 

Robert Metzger commented on FLINK-22262:


I understand that the lifecycle for HA-configmaps is well-defined in the 
current implementation, and that deletion of the configMaps should not happen 
under normal circumstances.

However, I wonder if we could add an optional parameter to the K8s HA mode to 
set an owner reference to the created config maps.
My use case is the following: I have a K8s operator which, based on an input 
"FlinkCluster" custom resource, creates a Flink cluster with Kubernetes HA 
enabled.
Cancellation (and cleanup in general) is implemented by simply deleting the 
"FlinkCluster" custom resource instance, which, through owner references, also 
deletes the pods running the Flink cluster components... but this leaves 
behind the HA ConfigMaps, because when the JobManager gets killed, the job 
does not shut down properly.
In this case, it would be great if I could configure Flink to set owner 
references on the ConfigMaps, so that when the job gets erased, the ConfigMaps 
disappear as well.

What do you think about this case, [~wangyang0918]?

> Flink on Kubernetes ConfigMaps are created without OwnerReference
> -
>
> Key: FLINK-22262
> URL: https://issues.apache.org/jira/browse/FLINK-22262
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.0
>Reporter: Andrea Peruffo
>Priority: Minor
>  Labels: auto-deprioritized-major
> Attachments: jm.log
>
>
> According to the documentation:
> [https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#manual-resource-cleanup]
> The ConfigMaps created along with the Flink deployment are supposed to have an 
> OwnerReference pointing to the Deployment itself. Unfortunately, this doesn't 
> happen, which causes all sorts of issues when the classpath and the jars of the 
> job are updated.
> i.e.:
> Without manually removing the ConfigMap of the Job I cannot update the Jars 
> of the Job.
> Can you please give guidance if there are additional caveats on manually 
> removing the ConfigMap? Any other workaround that can be used?
> Thanks in advance.
> Example ConfigMap:
> {{apiVersion: v1}}
> {{data:}}
> {{ address: akka.tcp://flink@10.0.2.13:6123/user/rpc/jobmanager_2}}
> {{ checkpointID-049: 
> rO0ABXNyADtvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuUmV0cmlldmFibGVTdHJlYW1TdGF0ZUhhbmRsZQABHhjxVZcrAgABTAAYd3JhcHBlZFN0cmVhbVN0YXRlSGFuZGxldAAyTG9yZy9hcGFjaGUvZmxpbmsvcnVudGltZS9zdGF0ZS9TdHJlYW1TdGF0ZUhhbmRsZTt4cHNyADlvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuZmlsZXN5c3RlbS5GaWxlU3RhdGVIYW5kbGUE3HXYYr0bswIAAkoACXN0YXRlU2l6ZUwACGZpbGVQYXRodAAfTG9yZy9hcGFjaGUvZmxpbmsvY29yZS9mcy9QYXRoO3hwAAABOEtzcgAdb3JnLmFwYWNoZS5mbGluay5jb3JlLmZzLlBhdGgAAQIAAUwAA3VyaXQADkxqYXZhL25ldC9VUkk7eHBzcgAMamF2YS5uZXQuVVJJrAF4LkOeSasDAAFMAAZzdHJpbmd0ABJMamF2YS9sYW5nL1N0cmluZzt4cHQAUC9tbnQvZmxpbmsvc3RvcmFnZS9rc2hhL3RheGktcmlkZS1mYXJlLXByb2Nlc3Nvci9jb21wbGV0ZWRDaGVja3BvaW50MDQ0YTc2OWRkNDgxeA==}}
> {{ counter: "50"}}
> {{ sessionId: 0c2b69ee-6b41-48d3-b7fd-1bf2eda94f0f}}
> {{kind: ConfigMap}}
> {{metadata:}}
> {{ annotations:}}
> {{ control-plane.alpha.kubernetes.io/leader: 
> '\{"holderIdentity":"0f25a2cc-e212-46b0-8ba9-faac0732a316","leaseDuration":15.0,"acquireTime":"2021-04-13T14:30:51.439000Z","renewTime":"2021-04-13T14:39:32.011000Z","leaderTransitions":105}'}}
> {{ creationTimestamp: "2021-04-13T14:30:51Z"}}
> {{ labels:}}
> {{ app: taxi-ride-fare-processor}}
> {{ configmap-type: high-availability}}
> {{ type: flink-native-kubernetes}}
> {{ name: 
> taxi-ride-fare-processor--jobmanager-leader}}
> {{ namespace: taxi-ride-fare}}
> {{ resourceVersion: "64100"}}
> {{ selfLink: 
> /api/v1/namespaces/taxi-ride-fare/configmaps/taxi-ride-fare-processor--jobmanager-leader}}
> {{ uid: 9f912495-382a-45de-a789-fd5ad2a2459d}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-21510) ExecutionGraph metrics collide on restart

2021-09-01 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407938#comment-17407938
 ] 

Robert Metzger commented on FLINK-21510:


How much effort would it be to fix this issue?

These are important metrics that are missing when using the Adaptive 
Scheduler.

> ExecutionGraph metrics collide on restart
> -
>
> Key: FLINK-21510
> URL: https://issues.apache.org/jira/browse/FLINK-21510
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Chesnay Schepler
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-unassigned, reactive
>
> The ExecutionGraphBuilder registers several metrics directly on the 
> JobManagerJobMetricGroup, which are never cleaned up.
> These include upTime/DownTime/restartingTime as well as various checkpointing 
> metrics (see the CheckpointStatsTracker for details; examples are number of 
> checkpoints, checkpoint sizes etc).
> When the AdaptiveScheduler re-creates the EG these will collide with metrics 
> of prior attempts.
> Essentially we either need to create a separate metric group that we pass to 
> the EG or refactor the metrics to be based on some mutable EG reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS

2021-08-29 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406398#comment-17406398
 ] 

Robert Metzger commented on FLINK-24037:


> Supporting patterns bears the risk of accidentally enabling more plugins after 
> an upgrade.

We could fail the entrypoint script if a pattern matches more than one plugin; 
there should be only one plugin jar file per plugin directory anyway.

> Considering that for docker we ship all of flink-dist anyway, do we even need 
> optional plugins for that use-case in the first place?

You mean we do with the filesystems what we already do with the metric 
reporters, and set them up as plugins in Docker?
This could backfire if people have the hadoop s3 plugin and then get a new 
Flink Docker image with both the hadoop and presto s3 filesystem 
implementations, because s3 will then default to the presto implementation.
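The fail-fast safeguard suggested above (reject a wildcard that resolves to more than one plugin) can be sketched as follows. This is illustrative only, not the actual entrypoint logic; the plugin jar names are made up:

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: accept a plugin wildcard only if it matches exactly one of the
// available plugin jars, otherwise fail loudly instead of silently enabling
// more plugins than intended.
public class PluginMatcher {
    static String resolveExactlyOne(String wildcard, List<String> availablePlugins) {
        // Translate the shell-style wildcard into a regex: escape dots, expand '*'.
        Pattern p = Pattern.compile(wildcard.replace(".", "\\.").replace("*", ".*"));
        List<String> matches = availablePlugins.stream()
                .filter(name -> p.matcher(name).matches())
                .toList();
        if (matches.size() != 1) {
            throw new IllegalArgumentException(
                    "Pattern '" + wildcard + "' matched " + matches.size()
                            + " plugins, expected exactly 1");
        }
        return matches.get(0);
    }
}
```

So `flink-s3-fs-presto-*` would resolve uniquely, while an over-broad `flink-s3-fs-*` would fail the script rather than enable both S3 implementations.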

> Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> --
>
> Key: FLINK-24037
> URL: https://issues.apache.org/jira/browse/FLINK-24037
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-docker
>Reporter: Robert Metzger
>Priority: Major
>
> As a user of Flink, I would like to be able to specify a certain default 
> plugin (such as the S3 presto FS) without having to specify the Flink 
> version again.
> The Flink version is already specified by the Docker container I'm using.
> If one is using generic deployment scripts, I don't want to put the Flink 
> version in two locations.
> Suggested solutions:
> a) Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> b) remove the version string from the jars in the distribution



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS

2021-08-28 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406268#comment-17406268
 ] 

Robert Metzger commented on FLINK-24037:


True, that would work as well. However, it increases the risk of accidentally 
mixing Flink versions, in particular when manually assembling the contents of 
the classpath. In a Docker image-based deployment, this wouldn't be much of an 
issue, because the version is (usually) determined by the image, but for our 
binary distribution as a tarball it could cause issues.

> Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> --
>
> Key: FLINK-24037
> URL: https://issues.apache.org/jira/browse/FLINK-24037
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-docker
>Reporter: Robert Metzger
>Priority: Major
>
> As a user of Flink, I would like to be able to specify a certain default 
> plugin (such as the S3 presto FS) without having to specify the Flink 
> version again.
> The Flink version is already specified by the Docker container I'm using.
> If one is using generic deployment scripts, I don't want to put the Flink 
> version in two locations.
> Suggested solutions:
> a) Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> b) remove the version string from the jars in the distribution



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS

2021-08-28 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24037:
---
Description: 
As a user of Flink, I would like to be able to specify a certain default 
plugin (such as the S3 presto FS) without having to specify the Flink version 
again.
The Flink version is already specified by the Docker container I'm using.

If one is using generic deployment scripts, I don't want to put the Flink 
version in two locations.

Suggested solutions:
a) Allow wildcards in ENABLE_BUILT_IN_PLUGINS
b) remove the version string from the jars in the distribution

  was:
As a user of Flink, I would like to be able to specify a certain default 
plugin (such as the S3 presto FS) without having to specify the Flink version 
again.
The Flink version is already specified by the Docker container I'm using.

If one is using generic deployment scripts, I don't want to put the Flink 
version in two locations.

Suggested solutions:
a) Allow wildcards in ENABLE


> Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> --
>
> Key: FLINK-24037
> URL: https://issues.apache.org/jira/browse/FLINK-24037
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-docker
>Reporter: Robert Metzger
>Priority: Major
>
> As a user of Flink, I would like to be able to specify a certain default 
> plugin (such as the S3 Presto FS) without having to specify the Flink 
> version again.
> The Flink version is already specified by the Docker container I'm using.
> When using generic deployment scripts, I don't want to put the Flink 
> version in two locations.
> Suggested solutions:
> a) Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> b) remove the version string from the jars in the distribution





[jira] [Updated] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS

2021-08-28 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-24037:
---
Description: 
As a user of Flink, I would like to be able to specify a certain default 
plugin (such as the S3 Presto FS) without having to specify the Flink version 
again.
The Flink version is already specified by the Docker container I'm using.

When using generic deployment scripts, I don't want to put the Flink version 
in two locations.

Suggested solutions:
a) Allow wildcards in ENABLE

  was:
As a user of Flink, I would like to be able to specify a certain default 
plugin (such as the S3 Presto FS) without having to specify the Flink version 
again.
The Flink version is already specified by the Docker container I'm using.

When using generic deployment scripts, I don't want to put the Flink version 
in two locations.


> Allow wildcards in ENABLE_BUILT_IN_PLUGINS
> --
>
> Key: FLINK-24037
> URL: https://issues.apache.org/jira/browse/FLINK-24037
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-docker
>Reporter: Robert Metzger
>Priority: Major
>
> As a user of Flink, I would like to be able to specify a certain default 
> plugin (such as the S3 Presto FS) without having to specify the Flink 
> version again.
> The Flink version is already specified by the Docker container I'm using.
> When using generic deployment scripts, I don't want to put the Flink 
> version in two locations.
> Suggested solutions:
> a) Allow wildcards in ENABLE





[jira] [Created] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS

2021-08-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24037:
--

 Summary: Allow wildcards in ENABLE_BUILT_IN_PLUGINS
 Key: FLINK-24037
 URL: https://issues.apache.org/jira/browse/FLINK-24037
 Project: Flink
  Issue Type: Improvement
  Components: flink-docker
Reporter: Robert Metzger


As a user of Flink, I would like to be able to specify a certain default 
plugin (such as the S3 Presto FS) without having to specify the Flink version 
again.
The Flink version is already specified by the Docker container I'm using.

When using generic deployment scripts, I don't want to put the Flink version 
in two locations.





[jira] [Commented] (FLINK-23925) HistoryServer: Archiving job with more than one attempt fails

2021-08-24 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403746#comment-17403746
 ] 

Robert Metzger commented on FLINK-23925:


It seems that "Runtime / Coordination" and "Runtime / Web Frontend" are the two 
main components where tickets containing the string "history server" are 
located: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20text%20~%20%22history%20server%22%20ORDER%20BY%20component%20ASC
 (with "Runtime / Coordination" actually being the more popular one).
I would personally not create a new component, because this is a small 
sub-component, closely connected to the rest of the web frontend 
infrastructure. But I'm happy to create a new component if you have a different 
opinion.

> HistoryServer: Archiving job with more than one attempt fails
> -
>
> Key: FLINK-23925
> URL: https://issues.apache.org/jira/browse/FLINK-23925
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0, 1.13.2
>Reporter: Robert Metzger
>Priority: Major
>
> Error:
> {code}
> 2021-08-23 16:26:01,953 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Disconnect job manager 
> 0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
>  for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager.
> 2021-08-23 16:26:02,137 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not 
> archive completed job 
> CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history 
> server.
> java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: 
> attempt does not exist
>   at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_252]
>   at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  [?:1.8.0_252]
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
>  [?:1.8.0_252]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_252]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_252]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
> Caused by: java.lang.IllegalArgumentException: attempt does not exist
>   at 
> org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:31)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rest.handler.job.SubtaskExecutionAttemptDetailsHandler.archiveJsonWithPath(SubtaskExecutionAttemptDetailsHandler.java:140)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.webmonitor.history.OnlyExecutionGraphJsonArchivist.archiveJsonWithPath(OnlyExecutionGraphJsonArchivist.java:51)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.archiveJsonWithPath(WebMonitorEndpoint.java:1031)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:61)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>  ~[?:1.8.0_252]
>   ... 3 more
> {code}
> Steps to reproduce:
> - start a Flink reactive mode job manager:
> {code}
> mkdir usrlib
> cp ./examples/streaming/TopSpeedWindowing.jar usrlib/
> # Submit Job in Reactive Mode
> ./bin/standalone-job.sh start -Dscheduler-mode=reactive 
> -Dexecution.checkpointing.interval="10s" -j 
> org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
> # Start first TaskManager
> ./bin/taskmanager.sh start
> {code}
> - Add another taskmanager to trigger a restart
> - Cancel the job
> See the failure in the jobmanager logs.





[jira] [Updated] (FLINK-23925) HistoryServer: Archiving job with more than one attempt fails

2021-08-23 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-23925:
---
Affects Version/s: 1.14.0

> HistoryServer: Archiving job with more than one attempt fails
> -
>
> Key: FLINK-23925
> URL: https://issues.apache.org/jira/browse/FLINK-23925
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0, 1.13.2
>Reporter: Robert Metzger
>Priority: Major
>
> Error:
> {code}
> 2021-08-23 16:26:01,953 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Disconnect job manager 
> 0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
>  for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager.
> 2021-08-23 16:26:02,137 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not 
> archive completed job 
> CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history 
> server.
> java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: 
> attempt does not exist
>   at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_252]
>   at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  [?:1.8.0_252]
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
>  [?:1.8.0_252]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_252]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_252]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
> Caused by: java.lang.IllegalArgumentException: attempt does not exist
>   at 
> org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:31)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rest.handler.job.SubtaskExecutionAttemptDetailsHandler.archiveJsonWithPath(SubtaskExecutionAttemptDetailsHandler.java:140)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.webmonitor.history.OnlyExecutionGraphJsonArchivist.archiveJsonWithPath(OnlyExecutionGraphJsonArchivist.java:51)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.archiveJsonWithPath(WebMonitorEndpoint.java:1031)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:61)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
>  ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>  ~[?:1.8.0_252]
>   ... 3 more
> {code}
> Steps to reproduce:
> - start a Flink reactive mode job manager:
> {code}
> mkdir usrlib
> cp ./examples/streaming/TopSpeedWindowing.jar usrlib/
> # Submit Job in Reactive Mode
> ./bin/standalone-job.sh start -Dscheduler-mode=reactive 
> -Dexecution.checkpointing.interval="10s" -j 
> org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
> # Start first TaskManager
> ./bin/taskmanager.sh start
> {code}
> - Add another taskmanager to trigger a restart
> - Cancel the job
> See the failure in the jobmanager logs.





[jira] [Created] (FLINK-23925) HistoryServer: Archiving job with more than one attempt fails

2021-08-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23925:
--

 Summary: HistoryServer: Archiving job with more than one attempt 
fails
 Key: FLINK-23925
 URL: https://issues.apache.org/jira/browse/FLINK-23925
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.2
Reporter: Robert Metzger


Error:
{code}
2021-08-23 16:26:01,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Disconnect job manager 
0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
 for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager.
2021-08-23 16:26:02,137 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not 
archive completed job 
CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history 
server.
java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: 
attempt does not exist
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 [?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
 [?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.IllegalArgumentException: attempt does not exist
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:31)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.rest.handler.job.SubtaskExecutionAttemptDetailsHandler.archiveJsonWithPath(SubtaskExecutionAttemptDetailsHandler.java:140)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.webmonitor.history.OnlyExecutionGraphJsonArchivist.archiveJsonWithPath(OnlyExecutionGraphJsonArchivist.java:51)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.archiveJsonWithPath(WebMonitorEndpoint.java:1031)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:61)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
 ~[?:1.8.0_252]
... 3 more
{code}

Steps to reproduce:
- start a Flink reactive mode job manager:
mkdir usrlib
cp ./examples/streaming/TopSpeedWindowing.jar usrlib/
# Submit Job in Reactive Mode
./bin/standalone-job.sh start -Dscheduler-mode=reactive 
-Dexecution.checkpointing.interval="10s" -j 
org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
# Start first TaskManager
./bin/taskmanager.sh start

- Add another taskmanager to trigger a restart
- Cancel the job

See the failure in the jobmanager logs.







[jira] [Updated] (FLINK-23925) HistoryServer: Archiving job with more than one attempt fails

2021-08-23 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-23925:
---
Description: 
Error:
{code}
2021-08-23 16:26:01,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Disconnect job manager 
0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
 for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager.
2021-08-23 16:26:02,137 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not 
archive completed job 
CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history 
server.
java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: 
attempt does not exist
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 [?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
 [?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.IllegalArgumentException: attempt does not exist
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:31)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.rest.handler.job.SubtaskExecutionAttemptDetailsHandler.archiveJsonWithPath(SubtaskExecutionAttemptDetailsHandler.java:140)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.webmonitor.history.OnlyExecutionGraphJsonArchivist.archiveJsonWithPath(OnlyExecutionGraphJsonArchivist.java:51)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.archiveJsonWithPath(WebMonitorEndpoint.java:1031)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:61)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
 ~[?:1.8.0_252]
... 3 more
{code}

Steps to reproduce:
- start a Flink reactive mode job manager:
{code}
mkdir usrlib
cp ./examples/streaming/TopSpeedWindowing.jar usrlib/
# Submit Job in Reactive Mode
./bin/standalone-job.sh start -Dscheduler-mode=reactive 
-Dexecution.checkpointing.interval="10s" -j 
org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
# Start first TaskManager
./bin/taskmanager.sh start
{code}
- Add another taskmanager to trigger a restart
- Cancel the job

See the failure in the jobmanager logs.



  was:
Error:
{code}
2021-08-23 16:26:01,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Disconnect job manager 
0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
 for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager.
2021-08-23 16:26:02,137 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not 
archive completed job 
CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history 
server.
java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: 
attempt does not exist
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 [?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
 [?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.IllegalArgumentException: attempt does not exist
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime

[jira] [Comment Edited] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-08-23 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403027#comment-17403027
 ] 

Robert Metzger edited comment on FLINK-23525 at 8/23/21, 8:27 AM:
--

The last 8 failures were all on Azure-hosted VMs, not our builders sponsored by 
Alibaba. And, as Chesnay said, they were all about the UnalignedCheckpointITCase.

I'll file a separate blocker for this failure: FLINK-23913. Please report 
cases of this specific issue in the new ticket, so that we can see whether 
there are other kinds of "exit code 137" failures as well.


was (Author: rmetzger):
The last 8 failures were all on Azure-hosted VMs, not our builders sponsored by 
Alibaba. And, as Chesnay said, they were all about the UnalignedCheckpointITCase.

I'll file a separate blocker for this failure.

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: auto-deprioritized-blocker, test-stability
> Fix For: 1.14.0
>
> Attachments: screenshot-1.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}





[jira] [Created] (FLINK-23913) UnalignedCheckpointITCase fails with exit code 137 (kernel oom) on Azure VMs

2021-08-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23913:
--

 Summary: UnalignedCheckpointITCase fails with exit code 137 
(kernel oom) on Azure VMs
 Key: FLINK-23913
 URL: https://issues.apache.org/jira/browse/FLINK-23913
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0
 Environment: UnalignedCheckpointITCase
Reporter: Robert Metzger
 Fix For: 1.14.0


Cases reported in FLINK-23525:
- 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22618&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10338
- 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22618&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=4743
- 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22605&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=4743
- ... there are a lot more cases.

The problem seems to have started occurring around August 20.





[jira] [Updated] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-08-23 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-23525:
---
Priority: Critical  (was: Blocker)

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: auto-deprioritized-blocker, test-stability
> Fix For: 1.14.0
>
> Attachments: screenshot-1.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}





[jira] [Commented] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-08-23 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403027#comment-17403027
 ] 

Robert Metzger commented on FLINK-23525:


The last 8 failures were all on Azure-hosted VMs, not our builders sponsored by 
Alibaba. And, as Chesnay said, they were all about the UnalignedCheckpointITCase.

I'll file a separate blocker for this failure.

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Priority: Blocker
>  Labels: auto-deprioritized-blocker, test-stability
> Fix For: 1.14.0
>
> Attachments: screenshot-1.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}





[jira] [Closed] (FLINK-19379) Submitting job to running YARN session fails

2021-08-10 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-19379.
--
Fix Version/s: (was: 1.14.0)
   Resolution: Won't Fix

I agree, thanks for the reminder. Closing this one.

> Submitting job to running YARN session fails
> 
>
> Key: FLINK-19379
> URL: https://issues.apache.org/jira/browse/FLINK-19379
> Project: Flink
>  Issue Type: Bug
>  Components: Command Line Client, Deployment / YARN, Documentation
>Affects Versions: 1.11.2
>Reporter: Robert Metzger
>Priority: Major
>  Labels: auto-unassigned, usability
>
> Steps to reproduce:
> 1. start a YARN session
> 2. submit a job using: ./bin/flink run -t yarn-session -yid 
> application_1600852002161_0003, where application_1600852002161_0003 is the 
> id of the session started in 1.
> Expected behavior: submit job to running session.
> Actual behavior: Fails with this unhelpful exception:
> {code}
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: null
> at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)
> at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
> at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)
> at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:699)
> at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:232)
> at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916)
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
> at 
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
> Caused by: java.lang.IllegalStateException
> at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
> at 
> org.apache.flink.client.deployment.executors.AbstractSessionClusterExecutor.execute(AbstractSessionClusterExecutor.java:61)
> at 
> org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:973)
> at 
> org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:124)
> at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:72)
> at com.ververica.TPCHQuery3.main(TPCHQuery3.java:184)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
> ... 11 more
> {code}





[jira] [Commented] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-08-04 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393364#comment-17393364
 ] 

Robert Metzger commented on FLINK-23525:


Machine looks "normal", I can see the OOM killer killing a bunch of "java" 
processes.
I'll reduce the number of concurrent builds by one on each machine.
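
(For reference, exit code 137 is 128 + 9, i.e. the process was killed with 
SIGKILL, which on a memory-starved VM usually means the kernel OOM killer. 
Commands like the following can confirm that on the build machine; they 
generally need root, and the exact log wording varies by distro and kernel 
version:)

```shell
# Exit code 137 = 128 + SIGKILL(9); on Linux this is typically the
# kernel OOM killer terminating the process.
echo $((128 + 9))        # prints 137

# Look for OOM-killer activity in the kernel log (usually requires root;
# message format varies by kernel version):
dmesg -T 2>/dev/null | grep -iE 'out of memory|killed process' | tail -n 5
# or, where systemd-journald is available:
journalctl -k --no-pager 2>/dev/null | grep -i 'oom' | tail -n 5
```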

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Priority: Blocker
>  Labels: test-stability
> Attachments: screenshot-1.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}





[jira] [Commented] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-08-04 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393363#comment-17393363
 ] 

Robert Metzger commented on FLINK-23525:


The CI system is at peak utilization this afternoon: 
!screenshot-1.png! 

It looks like we are running too many parallel builders. Most of the failures are 
from machine 7. I'll SSH into that machine.

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Priority: Blocker
>  Labels: test-stability
> Attachments: screenshot-1.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}





[jira] [Updated] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-08-04 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-23525:
---
Attachment: screenshot-1.png

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Priority: Blocker
>  Labels: test-stability
> Attachments: screenshot-1.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}





[jira] [Assigned] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-04 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-23557:
--

Assignee: (was: Robert Metzger)

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Commented] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-03 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392334#comment-17392334
 ] 

Robert Metzger commented on FLINK-23557:


The issue most likely started occurring with commit 
https://github.com/apache/flink/commit/8367dbde5d65ded7bbd612dccf1558966905aae9.
 
I don't really understand how that commit is related.
[~dwysakowicz], any ideas?

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Commented] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)

2021-08-03 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392253#comment-17392253
 ] 

Robert Metzger commented on FLINK-23562:


Thanks a lot!

Merged to master in 
https://github.com/apache/flink/commit/4d19a9f09e58ae5726901b1c6c473b655d908440

> Update CI docker image to latest java version (1.8.0_292)
> -
>
> Key: FLINK-23562
> URL: https://issues.apache.org/jira/browse/FLINK-23562
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / Azure Pipelines
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>
> The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). 
> The latest java version has TLSv1 disabled, which makes the 
> KubernetesClusterDescriptorTest fail.
> This will be fixed by FLINK-22802.
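In 8u292, TLSv1/TLSv1.1 were added to the `jdk.tls.disabledAlgorithms` security property. One way to check what a given JDK 8 disables (a sketch; the `JAVA_HOME` fallback path below is a placeholder, adjust for your machine):

```shell
# Inspect the disabled TLS algorithms in the JDK 8 security config.
# The fallback path is a placeholder; set JAVA_HOME to your actual JDK.
SEC_FILE="${JAVA_HOME:-/usr/lib/jvm/java-8-openjdk-amd64}/jre/lib/security/java.security"
if [ -f "$SEC_FILE" ]; then
  grep '^jdk.tls.disabledAlgorithms' "$SEC_FILE" || echo "property not set"
else
  echo "java.security not found at $SEC_FILE"
fi
```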





[jira] [Closed] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)

2021-08-03 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-23562.
--
Resolution: Fixed

> Update CI docker image to latest java version (1.8.0_292)
> -
>
> Key: FLINK-23562
> URL: https://issues.apache.org/jira/browse/FLINK-23562
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / Azure Pipelines
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>
> The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). 
> The latest java version has TLSv1 disabled, which makes the 
> KubernetesClusterDescriptorTest fail.
> This will be fixed by FLINK-22802.





[jira] [Commented] (FLINK-23589) Support Avro Microsecond precision

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391634#comment-17391634
 ] 

Robert Metzger commented on FLINK-23589:


[~jark] [~libenchao] What's your opinion on this ticket?

> Support Avro Microsecond precision
> --
>
> Key: FLINK-23589
> URL: https://issues.apache.org/jira/browse/FLINK-23589
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.14.0
>
>
> This was raised by a user: 
> https://lists.apache.org/thread.html/r463f748358202d207e4bf9c7fdcb77e609f35bbd670dbc5278dd7615%40%3Cuser.flink.apache.org%3E
> Here's the Avro spec: 
> https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28microsecond+precision%29





[jira] [Created] (FLINK-23589) Support Avro Microsecond precision

2021-08-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23589:
--

 Summary: Support Avro Microsecond precision
 Key: FLINK-23589
 URL: https://issues.apache.org/jira/browse/FLINK-23589
 Project: Flink
  Issue Type: Improvement
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Robert Metzger
 Fix For: 1.14.0


This was raised by a user: 
https://lists.apache.org/thread.html/r463f748358202d207e4bf9c7fdcb77e609f35bbd670dbc5278dd7615%40%3Cuser.flink.apache.org%3E

Here's the Avro spec: 
https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28microsecond+precision%29
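For reference, a minimal Avro schema using the `timestamp-micros` logical type from the linked spec might look like this (an illustrative record, not taken from Flink's codebase):

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-micros"}}
  ]
}
```

Here `ts` is a long counting microseconds since the Unix epoch, as opposed to the millisecond-precision `timestamp-millis` type.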





[jira] [Commented] (FLINK-21004) test_process_mode_boot.py test hangs

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391624#comment-17391624
 ] 

Robert Metzger commented on FLINK-21004:


Possibly related: 
https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9160&view=logs&j=fba17979-6d2e-591d-72f1-97cf42797c11&t=727942b6-6137-54f7-1ef9-e66e706ea068

> test_process_mode_boot.py test hangs
> 
>
> Key: FLINK-21004
> URL: https://issues.apache.org/jira/browse/FLINK-21004
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python
>Affects Versions: 1.13.0
>Reporter: Huang Xingbo
>Priority: Minor
>  Labels: auto-deprioritized-major, test-stability
> Fix For: 1.14.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12159&view=logs&j=821b528f-1eed-5598-a3b4-7f748b13f261&t=4fad9527-b9a5-5015-1b70-8356e5c91490]
> {code:java}
> 2021-01-18T00:55:48.3027307Z 
> pyflink/fn_execution/tests/test_process_mode_boot.py 
> ==
> 2021-01-18T00:55:48.3028351Z Process produced no output for 900 seconds.
> 2021-01-18T00:55:48.3033084Z 
> ==
> 2021-01-18T00:55:48.3033633Z 
> ==
> 2021-01-18T00:55:48.3034073Z The following Java processes are running (JPS)
> 2021-01-18T00:55:48.3037261Z 
> ==
> 2021-01-18T00:55:48.3180991Z Picked up JAVA_TOOL_OPTIONS: 
> -XX:+HeapDumpOnOutOfMemoryError
> 2021-01-18T00:55:48.4930672Z 18493 Jps
> 2021-01-18T00:55:48.4931189Z 12477 PythonGatewayServer
> 2021-01-18T00:55:48.4979543Z 
> ==
> 2021-01-18T00:55:48.4984759Z Printing stack trace of Java process 18493
> 2021-01-18T00:55:48.4987182Z 
> ==
> 2021-01-18T00:55:48.5025804Z Picked up JAVA_TOOL_OPTIONS: 
> -XX:+HeapDumpOnOutOfMemoryError
> 2021-01-18T00:55:48.5943552Z 18493: No such process
> 2021-01-18T00:55:48.6089460Z 
> ==
> 2021-01-18T00:55:48.6089977Z Printing stack trace of Java process 12477
> 2021-01-18T00:55:48.6094322Z 
> ==
> 2021-01-18T00:55:48.6140780Z Picked up JAVA_TOOL_OPTIONS: 
> -XX:+HeapDumpOnOutOfMemoryError
> 2021-01-18T00:55:48.9394259Z 2021-01-18 00:55:48
> 2021-01-18T00:55:48.9401959Z Full thread dump OpenJDK 64-Bit Server VM 
> (25.275-b01 mixed mode):
> 2021-01-18T00:55:48.9402608Z 
> 2021-01-18T00:55:48.9403205Z "Attach Listener" #3327 daemon prio=9 os_prio=0 
> tid=0x7f4d9c02c800 nid=0x4864 waiting on condition [0x]
> 2021-01-18T00:55:48.9403817Zjava.lang.Thread.State: RUNNABLE
> 2021-01-18T00:55:48.9404137Z 
> 2021-01-18T00:55:48.9404634Z "process reaper" #2273 daemon prio=10 os_prio=0 
> tid=0x7f4db809c000 nid=0x3ff9 runnable [0x7f4d7b07]
> 2021-01-18T00:55:48.9405191Zjava.lang.Thread.State: RUNNABLE
> 2021-01-18T00:55:48.9405785Z  at 
> java.lang.UNIXProcess.waitForProcessExit(Native Method)
> 2021-01-18T00:55:48.9425226Z  at 
> java.lang.UNIXProcess.lambda$initStreams$3(UNIXProcess.java:289)
> 2021-01-18T00:55:48.9431982Z  at 
> java.lang.UNIXProcess$$Lambda$1358/33063055.run(Unknown Source)
> 2021-01-18T00:55:48.9432532Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2021-01-18T00:55:48.9439963Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2021-01-18T00:55:48.9444803Z  at java.lang.Thread.run(Thread.java:748)
> 2021-01-18T00:55:48.9445051Z 
> 2021-01-18T00:55:48.9452857Z "process reaper" #2033 daemon prio=10 os_prio=0 
> tid=0x7f4db8086800 nid=0x3e2a runnable [0x7f4d7b0e2000]
> 2021-01-18T00:55:48.9453377Zjava.lang.Thread.State: RUNNABLE
> 2021-01-18T00:55:48.9453743Z  at 
> java.lang.UNIXProcess.waitForProcessExit(Native Method)
> 2021-01-18T00:55:48.9454188Z  at 
> java.lang.UNIXProcess.lambda$initStreams$3(UNIXProcess.java:289)
> 2021-01-18T00:55:48.9454645Z  at 
> java.lang.UNIXProcess$$Lambda$1358/33063055.run(Unknown Source)
> 2021-01-18T00:55:48.9455136Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2021-01-18T00:55:48.9455669Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2021-01-18T00:55:48.9456119Z  at java.lang.Thread.run(Thread.java:748)
> 2021-01-18T00:55:48.9456355Z 
> 2021-01-18T00:55:48.9456708Z "process reaper" #1923 daemon prio=10 os_pr

[jira] [Commented] (FLINK-22889) JdbcExactlyOnceSinkE2eTest hangs on azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391623#comment-17391623
 ] 

Robert Metzger commented on FLINK-22889:


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21270&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=ed165f3f-d0f6-524b-5279-86f8ee7d0e2d

> JdbcExactlyOnceSinkE2eTest hangs on azure
> -
>
> Key: FLINK-22889
> URL: https://issues.apache.org/jira/browse/FLINK-22889
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / JDBC
>Affects Versions: 1.14.0, 1.13.1
>Reporter: Dawid Wysakowicz
>Assignee: Roman Khachatryan
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18690&view=logs&j=ba53eb01-1462-56a3-8e98-0dd97fbcaab5&t=bfbc6239-57a0-5db0-63f3-41551b4f7d51&l=16658





[jira] [Commented] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391621#comment-17391621
 ] 

Robert Metzger commented on FLINK-23557:


This is most likely caused by 
https://issues.apache.org/jira/browse/FLINK-23460. I haven't figured out why 
yet (I'm only debugging this on the side at the moment).

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Comment Edited] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391538#comment-17391538
 ] 

Robert Metzger edited comment on FLINK-23557 at 8/2/21, 1:02 PM:
-

The linux version is not causing the issue. The problem is most likely a change 
between 3a9e414265d987fbdc14be84d768c63941f04d58 and 
c3088af32543c807a69eced7129f89284204c2af. I'll look closer at the commits in 
between and bisect.
https://github.com/apache/flink/compare/3a9e414265d987fbdc14be84d768c63941f04d58...c3088af32543c807a69eced7129f89284204c2af

Update:
Between 3a9e414265d987fbdc14be84d768c63941f04d58 and 
8e63767302c9c954e972e53322358055e10b5d12 must be the offending commit.


was (Author: rmetzger):
The linux version is not causing the issue. The problem is most likely a change 
between 3a9e414265d987fbdc14be84d768c63941f04d58 and 
c3088af32543c807a69eced7129f89284204c2af. I'll look closer at the commits in 
between and bisect.
https://github.com/apache/flink/compare/3a9e414265d987fbdc14be84d768c63941f04d58...c3088af32543c807a69eced7129f89284204c2af

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Comment Edited] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391538#comment-17391538
 ] 

Robert Metzger edited comment on FLINK-23557 at 8/2/21, 12:11 PM:
--

The linux version is not causing the issue. The problem is most likely a change 
between 3a9e414265d987fbdc14be84d768c63941f04d58 and 
c3088af32543c807a69eced7129f89284204c2af. I'll look closer at the commits in 
between and bisect.
https://github.com/apache/flink/compare/3a9e414265d987fbdc14be84d768c63941f04d58...c3088af32543c807a69eced7129f89284204c2af


was (Author: rmetzger):
The linux version is not causing the issue. The problem is most likely a change 
between 3a9e414265d987fbdc14be84d768c63941f04d58 and 
c3088af32543c807a69eced7129f89284204c2af. I'll look closer at the commits in 
between and bisect.

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Commented] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391538#comment-17391538
 ] 

Robert Metzger commented on FLINK-23557:


The linux version is not causing the issue. The problem is most likely a change 
between 3a9e414265d987fbdc14be84d768c63941f04d58 and 
c3088af32543c807a69eced7129f89284204c2af. I'll look closer at the commits in 
between and bisect.
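A bisect over that suspect range can be sketched as a standard `git bisect` session (a dry-run sketch: `run` only echoes the commands, and the test script name is a placeholder, not an actual Flink script):

```shell
# Known-good and known-bad endpoints of the suspect commit range.
GOOD=3a9e414265d987fbdc14be84d768c63941f04d58
BAD=c3088af32543c807a69eced7129f89284204c2af

run() { echo "+ $*"; }   # dry-run; replace 'echo' with real execution in a Flink checkout

run git bisect start
run git bisect bad  "$BAD"
run git bisect good "$GOOD"
# git now checks out a midpoint commit; re-run the failing e2e test at each
# step and mark it 'git bisect good'/'git bisect bad' until the culprit is found.
run ./run-failing-e2e-test.sh   # placeholder for the actual failing test command
run git bisect reset
```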

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Commented] (FLINK-23391) KafkaSourceReaderTest.testKafkaSourceMetrics fails on azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391508#comment-17391508
 ] 

Robert Metzger commented on FLINK-23391:


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21270&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=15a22db7-8faa-5b34-3920-d33c9f0ca23c

> KafkaSourceReaderTest.testKafkaSourceMetrics fails on azure
> ---
>
> Key: FLINK-23391
> URL: https://issues.apache.org/jira/browse/FLINK-23391
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.13.1
>Reporter: Xintong Song
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.3
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20456&view=logs&j=c5612577-f1f7-5977-6ff6-7432788526f7&t=53f6305f-55e6-561c-8f1e-3a1dde2c77df&l=6783
> {code}
> Jul 14 23:00:26 [ERROR] Tests run: 10, Failures: 0, Errors: 1, Skipped: 0, 
> Time elapsed: 99.93 s <<< FAILURE! - in 
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest
> Jul 14 23:00:26 [ERROR] 
> testKafkaSourceMetrics(org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest)
>   Time elapsed: 60.225 s  <<< ERROR!
> Jul 14 23:00:26 java.util.concurrent.TimeoutException: Offsets are not 
> committed successfully. Dangling offsets: 
> {15213={KafkaSourceReaderTest-0=OffsetAndMetadata{offset=10, 
> leaderEpoch=null, metadata=''}}}
> Jul 14 23:00:26   at 
> org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:210)
> Jul 14 23:00:26   at 
> org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.testKafkaSourceMetrics(KafkaSourceReaderTest.java:275)
> Jul 14 23:00:26   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Jul 14 23:00:26   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jul 14 23:00:26   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jul 14 23:00:26   at java.lang.reflect.Method.invoke(Method.java:498)
> Jul 14 23:00:26   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Jul 14 23:00:26   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Jul 14 23:00:26   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Jul 14 23:00:26   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Jul 14 23:00:26   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> Jul 14 23:00:26   at 
> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
> Jul 14 23:00:26   at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Jul 14 23:00:26   at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> Jul 14 23:00:26   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> Jul 14 23:00:26   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> Jul 14 23:00:26   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> Jul 14 23:00:26   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> Jul 14 23:00:26   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> Jul 14 23:00:26   at org.junit.runners.Suite.runChild(Suite.java:128)
> Jul 14 23:00:26   at org.junit.runners.Suite.runChild(Suite.java:27)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> Jul 14 23:00:26   at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.jav

[jira] [Commented] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391469#comment-17391469
 ] 

Robert Metzger commented on FLINK-23557:


The error started occurring on Friday (30/7). I reverted some commits that 
bumped some http/netty-related versions (but probably only in connectors) -- no 
success.

Green Jdk11 e2e run from Tuesday (27/7) last week: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1
 
{code}
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
OS name: "linux", version: "5.8.0-1036-azure", arch: "amd64", family: "unix"
{code}

Red:  
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21255&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=821
{code}
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
OS name: "linux", version: "5.8.0-1039-azure", arch: "amd64", family: "unix"
{code}

Maybe the issue was caused by Azure's operating system version bump? I'll rerun 
the previously green CI build to rule that out.

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Commented] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391407#comment-17391407
 ] 

Robert Metzger commented on FLINK-23557:


The problem only occurs on the JDK11 build profiles. The exception on the 
server side is:
{code}
2021-07-28 20:36:15,032 WARN  org.apache.flink.runtime.rest.FileUploadHandler   
   [] - File upload failed.
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostRequestDecoder$ErrorDataDecoderException:
 java.io.IOException: Out of size: 3325693 > 3325692
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.loadDataMultipartOptimized(HttpPostMultipartRequestDecoder.java:1190)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.getFileUpload(HttpPostMultipartRequestDecoder.java:926)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.decodeMultipart(HttpPostMultipartRequestDecoder.java:572)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.parseBodyMultipart(HttpPostMultipartRequestDecoder.java:463)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.parseBody(HttpPostMultipartRequestDecoder.java:432)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.offer(HttpPostMultipartRequestDecoder.java:347)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.offer(HttpPostMultipartRequestDecoder.java:54)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostRequestDecoder.offer(HttpPostRequestDecoder.java:223)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.rest.FileUploadHandler.channelRead0(FileUploadHandler.java:146)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.rest.FileUploadHandler.channelRead0(FileUploadHandler.java:69)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 [flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 [flink-dist_2.11-1.14-SNAPS

[jira] [Assigned] (FLINK-23557) 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) end-to-end test' fails on Azure

2021-08-02 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-23557:
--

Assignee: Robert Metzger

> 'Resuming Externalized Checkpoint (hashmap, sync, no parallelism change) 
> end-to-end test' fails on Azure
> 
>
> Key: FLINK-23557
> URL: https://issues.apache.org/jira/browse/FLINK-23557
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21129&view=logs&j=6caf31d6-847a-526e-9624-468e053467d6&t=1fdd9d50-31f7-5383-5578-49e27385b5f1&l=785
> {code}
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$9(RestClusterClient.java:405)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>   at 
> org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
>   at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
>   at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [File 
> upload failed.]
>   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
>   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> {code}





[jira] [Closed] (FLINK-23546) stop-cluster.sh produces warning on macOS 11.4

2021-07-30 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-23546.
--
Fix Version/s: 1.14.0
   Resolution: Fixed

Merged to master in 
https://github.com/apache/flink/commit/3b115544b04572831e162288097105c63ca5e048
merged to release-1.13 in 
https://github.com/apache/flink/commit/d5bf26448780d2bfc3ec4db28c8f8c91b1435487

> stop-cluster.sh produces warning on macOS 11.4
> --
>
> Key: FLINK-23546
> URL: https://issues.apache.org/jira/browse/FLINK-23546
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Scripts
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>
> Since FLINK-17470, we are stopping daemons with a timeout, to SIGKILL them if 
> they are not gracefully stopping.
> I noticed that this mechanism causes warnings on macOS:
> {code}
> ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/start-cluster.sh
> Starting cluster.
> Starting standalonesession daemon on host MacBook-Pro-2.localdomain.
> Starting taskexecutor daemon on host MacBook-Pro-2.localdomain.
> ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/stop-cluster.sh
> Stopping taskexecutor daemon (pid: 50044) on host MacBook-Pro-2.localdomain.
> tail: illegal option -- -
> usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
> Stopping standalonesession daemon (pid: 49812) on host 
> MacBook-Pro-2.localdomain.
> tail: illegal option -- -
> usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
> {code}





[jira] [Commented] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)

2021-07-30 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390505#comment-17390505
 ] 

Robert Metzger commented on FLINK-23562:


This is supposed to fail: 
https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9157&view=results
Once the blocking ticket is resolved, I'll rebase the PR.

> Update CI docker image to latest java version (1.8.0_292)
> -
>
> Key: FLINK-23562
> URL: https://issues.apache.org/jira/browse/FLINK-23562
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / Azure Pipelines
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
> Fix For: 1.14.0
>
>
> The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). 
> The latest java version has TLSv1 disabled, which makes the 
> KubernetesClusterDescriptorTest fail.
> This will be fixed by FLINK-22802.





[jira] [Assigned] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)

2021-07-30 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-23562:
--

Assignee: Robert Metzger

> Update CI docker image to latest java version (1.8.0_292)
> -
>
> Key: FLINK-23562
> URL: https://issues.apache.org/jira/browse/FLINK-23562
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / Azure Pipelines
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
> Fix For: 1.14.0
>
>
> The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). 
> The latest java version has TLSv1 disabled, which makes the 
> KubernetesClusterDescriptorTest fail.
> This will be fixed by FLINK-22802.





[jira] [Created] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)

2021-07-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23562:
--

 Summary: Update CI docker image to latest java version (1.8.0_292)
 Key: FLINK-23562
 URL: https://issues.apache.org/jira/browse/FLINK-23562
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger
 Fix For: 1.14.0


The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). 
The latest java version has TLSv1 disabled, which makes the 
KubernetesClusterDescriptorTest fail.

This will be fixed by FLINK-22802.





[jira] [Assigned] (FLINK-23546) stop-cluster.sh produces warning on macOS 11.4

2021-07-29 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-23546:
--

Assignee: Robert Metzger

> stop-cluster.sh produces warning on macOS 11.4
> --
>
> Key: FLINK-23546
> URL: https://issues.apache.org/jira/browse/FLINK-23546
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Scripts
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Minor
>
> Since FLINK-17470, we are stopping daemons with a timeout, to SIGKILL them if 
> they are not gracefully stopping.
> I noticed that this mechanism causes warnings on macOS:
> {code}
> ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/start-cluster.sh
> Starting cluster.
> Starting standalonesession daemon on host MacBook-Pro-2.localdomain.
> Starting taskexecutor daemon on host MacBook-Pro-2.localdomain.
> ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/stop-cluster.sh
> Stopping taskexecutor daemon (pid: 50044) on host MacBook-Pro-2.localdomain.
> tail: illegal option -- -
> usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
> Stopping standalonesession daemon (pid: 49812) on host 
> MacBook-Pro-2.localdomain.
> tail: illegal option -- -
> usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
> {code}





[jira] [Commented] (FLINK-23546) stop-cluster.sh produces warning on macOS 11.4

2021-07-29 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389945#comment-17389945
 ] 

Robert Metzger commented on FLINK-23546:


The error is probably coming from here: 
https://github.com/apache/flink/blame/master/flink-dist/src/main/flink-bin/bin/flink-daemon.sh#L100

> stop-cluster.sh produces warning on macOS 11.4
> --
>
> Key: FLINK-23546
> URL: https://issues.apache.org/jira/browse/FLINK-23546
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Scripts
>Affects Versions: 1.14.0
>Reporter: Robert Metzger
>Priority: Minor
>
> Since FLINK-17470, we are stopping daemons with a timeout, to SIGKILL them if 
> they are not gracefully stopping.
> I noticed that this mechanism causes warnings on macOS:
> {code}
> ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/start-cluster.sh
> Starting cluster.
> Starting standalonesession daemon on host MacBook-Pro-2.localdomain.
> Starting taskexecutor daemon on host MacBook-Pro-2.localdomain.
> ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/stop-cluster.sh
> Stopping taskexecutor daemon (pid: 50044) on host MacBook-Pro-2.localdomain.
> tail: illegal option -- -
> usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
> Stopping standalonesession daemon (pid: 49812) on host 
> MacBook-Pro-2.localdomain.
> tail: illegal option -- -
> usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
> {code}





[jira] [Created] (FLINK-23546) stop-cluster.sh produces warning on macOS 11.4

2021-07-29 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23546:
--

 Summary: stop-cluster.sh produces warning on macOS 11.4
 Key: FLINK-23546
 URL: https://issues.apache.org/jira/browse/FLINK-23546
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Scripts
Affects Versions: 1.14.0
Reporter: Robert Metzger


Since FLINK-17470, we are stopping daemons with a timeout, to SIGKILL them if 
they are not gracefully stopping.

I noticed that this mechanism causes warnings on macOS:

{code}
❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host MacBook-Pro-2.localdomain.
Starting taskexecutor daemon on host MacBook-Pro-2.localdomain.
❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/stop-cluster.sh
Stopping taskexecutor daemon (pid: 50044) on host MacBook-Pro-2.localdomain.
tail: illegal option -- -
usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
Stopping standalonesession daemon (pid: 49812) on host 
MacBook-Pro-2.localdomain.
tail: illegal option -- -
usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
{code}
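The `tail: illegal option -- -` warning happens because BSD tail on macOS does not understand GNU-style long options such as `--pid`, which GNU coreutils tail uses to exit once a watched process dies. A portable fallback can be sketched roughly as below; this is a hypothetical sketch (`wait_for_pid` is an illustrative name), not the actual flink-daemon.sh fix:

```shell
#!/bin/sh
# Portable "block until a PID exits":
#  - GNU coreutils tail supports --pid and exits when the process dies;
#  - BSD/macOS tail rejects the long option, so poll with `kill -0` instead.
wait_for_pid() {
  pid=$1
  if tail --version >/dev/null 2>&1; then
    # GNU tail: follows /dev/null (no output) and exits when $pid terminates
    tail --pid="$pid" -f /dev/null
  else
    # BSD/macOS fallback: `kill -0` only checks process existence
    while kill -0 "$pid" 2>/dev/null; do sleep 1; done
  fi
}

sleep 1 &
wait_for_pid $!
```

Probing `tail --version` (which only GNU tail accepts) avoids hard-coding an OS check, so the same script runs unchanged on Linux and macOS.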






[jira] [Commented] (FLINK-21569) Flink SQL with CSV file input job hangs

2021-07-28 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389291#comment-17389291
 ] 

Robert Metzger commented on FLINK-21569:


Thanks a lot for looking into this. We try to avoid major version updates (in 
this case from Jackson 2.10 to 2.12) in bugfix Flink releases (say Flink 1.11.1 
to 1.11.2), because we want to avoid forcing users to change their dependency 
management, and because such an upgrade could introduce major differences 
between bugfix releases.

However, in this case Jackson is a shaded dependency, and I'd consider it a 
stable project. In my opinion, we can bump Jackson to 2.12 in Flink 1.12.

> Flink SQL with CSV file input job hangs
> ---
>
> Key: FLINK-21569
> URL: https://issues.apache.org/jira/browse/FLINK-21569
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table 
> SQL / Runtime
>Affects Versions: 1.12.1
>Reporter: Nico Kruber
>Priority: Minor
>  Labels: auto-deprioritized-major
> Attachments: airports.csv, flights-small2.csv
>
>
> In extension to FLINK-21567, I actually also got the job to be stuck on 
> cancellation by doing the following in the SQL client:
> * configure SQL client defaults to run with parallelism 2
> * execute the following statement
> {code}
> CREATE TABLE `airports` (
>   `IATA_CODE` CHAR(3),
>   `AIRPORT` STRING,
>   `CITY` STRING,
>   `STATE` CHAR(2),
>   `COUNTRY` CHAR(3),
>   `LATITUDE` DOUBLE NULL,
>   `LONGITUDE` DOUBLE NULL,
>   PRIMARY KEY (`IATA_CODE`) NOT ENFORCED
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file:///tmp/kaggle-flight-delay/airports.csv',
>   'format' = 'csv',
>   'csv.allow-comments' = 'true',
>   'csv.ignore-parse-errors' = 'true',
>   'csv.null-literal' = ''
> );
> CREATE TABLE `flights` (
>   `_YEAR` CHAR(4),
>   `_MONTH` CHAR(2),
>   `_DAY` CHAR(2),
>   `_DAY_OF_WEEK` TINYINT,
>   `AIRLINE` CHAR(2),
>   `FLIGHT_NUMBER` SMALLINT,
>   `TAIL_NUMBER` CHAR(6),
>   `ORIGIN_AIRPORT` CHAR(3),
>   `DESTINATION_AIRPORT` CHAR(3),
>   `_SCHEDULED_DEPARTURE` CHAR(4),
>   `SCHEDULED_DEPARTURE` AS TO_TIMESTAMP(`_YEAR` || '-' || `_MONTH` || '-' || 
> `_DAY` || ' ' || SUBSTRING(`_SCHEDULED_DEPARTURE` FROM 0 FOR 2) || ':' || 
> SUBSTRING(`_SCHEDULED_DEPARTURE` FROM 3) || ':00'),
>   `_DEPARTURE_TIME` CHAR(4),
>   `DEPARTURE_DELAY` SMALLINT,
>   `DEPARTURE_TIME` AS TIMESTAMPADD(MINUTE, CAST(`DEPARTURE_DELAY` AS INT), 
> TO_TIMESTAMP(`_YEAR` || '-' || `_MONTH` || '-' || `_DAY` || ' ' || 
> SUBSTRING(`_SCHEDULED_DEPARTURE` FROM 0 FOR 2) || ':' || 
> SUBSTRING(`_SCHEDULED_DEPARTURE` FROM 3) || ':00')),
>   `TAXI_OUT` SMALLINT,
>   `WHEELS_OFF` CHAR(4),
>   `SCHEDULED_TIME` SMALLINT,
>   `ELAPSED_TIME` SMALLINT,
>   `AIR_TIME` SMALLINT,
>   `DISTANCE` SMALLINT,
>   `WHEELS_ON` CHAR(4),
>   `TAXI_IN` SMALLINT,
>   `SCHEDULED_ARRIVAL` CHAR(4),
>   `ARRIVAL_TIME` CHAR(4),
>   `ARRIVAL_DELAY` SMALLINT,
>   `DIVERTED` BOOLEAN,
>   `CANCELLED` BOOLEAN,
>   `CANCELLATION_REASON` CHAR(1),
>   `AIR_SYSTEM_DELAY` SMALLINT,
>   `SECURITY_DELAY` SMALLINT,
>   `AIRLINE_DELAY` SMALLINT,
>   `LATE_AIRCRAFT_DELAY` SMALLINT,
>   `WEATHER_DELAY` SMALLINT
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file:///tmp/kaggle-flight-delay/flights-small2.csv',
>   'format' = 'csv',
>   'csv.null-literal' = ''
> );
> SELECT `ORIGIN_AIRPORT`, `AIRPORT`, `STATE`, `NUM_DELAYS`
> FROM (
>   SELECT `ORIGIN_AIRPORT`, `AIRPORT`, `STATE`, COUNT(*) AS `NUM_DELAYS`,
> ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS rownum
>   FROM flights, airports
>   WHERE `ORIGIN_AIRPORT` = `IATA_CODE` AND `DEPARTURE_DELAY` > 0
>   GROUP BY `ORIGIN_AIRPORT`, `AIRPORT`, `STATE`)
> WHERE rownum <= 10;
> {code}
> Results are shown in the CLI but after quitting the result view, the job 
> seems stuck in CANCELLING until (at least) one of the TMs shuts itself down 
> because a task wouldn't react to the cancelling signal. This appears in its 
> TM logs:
> {code}
> 2021-03-02 18:39:19,451 WARN  org.apache.flink.runtime.taskmanager.Task   
>  [] - Task 'Source: TableSourceScan(table=[[default_catalog, 
> default_database, airports, project=[IATA_CODE, AIRPORT, STATE]]], 
> fields=[IATA_CODE, AIRPORT, STATE]) (2/2)#0' did not react to cancelling 
> signal for 30 seconds, but is stuck in method:
>  sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> org.apache.flink.streaming.runtime.ta

[jira] [Commented] (FLINK-23525) Docker command fails on Azure: Exit code 137 returned from process: file name '/usr/bin/docker'

2021-07-28 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388664#comment-17388664
 ] 

Robert Metzger commented on FLINK-23525:


This was most likely caused by the OOM killer
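For context, exit code 137 is the shell convention 128 + 9, i.e. the process was terminated by SIGKILL — the signal the Linux OOM killer delivers. A quick illustration of the convention:

```shell
# Exit code 137 = 128 + signal 9 (SIGKILL).
# Shells report a signal-killed child as 128 + signal number, so a docker
# process exiting with 137 is consistent with an OOM kill of the container.
sh -c 'kill -KILL $$'
echo "exit code: $?"   # prints "exit code: 137"
```

On Linux hosts, `dmesg` typically shows a corresponding `Killed process` entry when the OOM killer was involved.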

> Docker command fails on Azure: Exit code 137 returned from process: file name 
> '/usr/bin/docker'
> ---
>
> Key: FLINK-23525
> URL: https://issues.apache.org/jira/browse/FLINK-23525
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.14.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21053&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10034
> {code}
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', 
> arguments 'exec -i -u 1001  -w /home/vsts_azpcontainer 
> 9dca235e075b70486fac576ee17cee722940edf575e5478e0a52def5b46c28b5 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> {code}
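For context: 137 is the conventional exit status for a process terminated by SIGKILL (128 + signal number 9), which is the signal the Linux OOM killer sends. A quick, hypothetical demonstration (POSIX systems only; assumes `sleep` is on the PATH):

```java
public class ExitCode137 {
    public static void main(String[] args) throws Exception {
        // Start a throwaway child process and kill it forcibly --
        // destroyForcibly() delivers SIGKILL on POSIX systems, the same
        // signal the kernel OOM killer uses.
        Process child = new ProcessBuilder("sleep", "30").start();
        child.destroyForcibly();
        child.waitFor();
        // A signal-terminated process reports 128 + the signal number.
        System.out.println("exit code: " + child.exitValue());
    }
}
```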



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (FLINK-13598) frocksdb doesn't have arm release

2021-07-20 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-13598:
---
Comment: was deleted

(was: There's a user asking for this feature: 
https://lists.apache.org/thread.html/r33161536fcc4c157aed956db4ebc88b4b816feab43d8f32ac23a25d5%40%3Cuser.flink.apache.org%3E)

> frocksdb doesn't have arm release 
> --
>
> Key: FLINK-13598
> URL: https://issues.apache.org/jira/browse/FLINK-13598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.9.0, 2.0.0
>Reporter: wangxiyuan
>Priority: Major
> Attachments: image-2020-08-20-09-22-24-021.png
>
>
> Flink now uses frocksdb, a fork of rocksdb, for the 
> *flink-statebackend-rocksdb* module. It doesn't include an ARM release.
> rocksdb supports ARM since 
> [v6.2.2|https://search.maven.org/artifact/org.rocksdb/rocksdbjni/6.2.2/jar].
> Can frocksdb release an ARM package as well?
> Or, AFAIK, since there were some bugs in rocksdb in the past, Flink didn't 
> use it directly. Have those bugs been solved in rocksdb already? Can Flink 
> re-use rocksdb again now?





[jira] [Issue Comment Deleted] (FLINK-13598) frocksdb doesn't have arm release

2021-07-20 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-13598:
---
Comment: was deleted

(was: and another one: 
https://lists.apache.org/thread.html/r9a9025b06e486233e010df158a1d5da31562415f8c7da5509212db74%40%3Cuser.flink.apache.org%3E)

> frocksdb doesn't have arm release 
> --
>
> Key: FLINK-13598
> URL: https://issues.apache.org/jira/browse/FLINK-13598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.9.0, 2.0.0
>Reporter: wangxiyuan
>Priority: Major
> Attachments: image-2020-08-20-09-22-24-021.png
>
>
> Flink now uses frocksdb, a fork of rocksdb, for the 
> *flink-statebackend-rocksdb* module. It doesn't include an ARM release.
> rocksdb supports ARM since 
> [v6.2.2|https://search.maven.org/artifact/org.rocksdb/rocksdbjni/6.2.2/jar].
> Can frocksdb release an ARM package as well?
> Or, AFAIK, since there were some bugs in rocksdb in the past, Flink didn't 
> use it directly. Have those bugs been solved in rocksdb already? Can Flink 
> re-use rocksdb again now?





[jira] [Commented] (FLINK-13598) frocksdb doesn't have arm release

2021-07-20 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384182#comment-17384182
 ] 

Robert Metzger commented on FLINK-13598:


and another one: 
https://lists.apache.org/thread.html/r9a9025b06e486233e010df158a1d5da31562415f8c7da5509212db74%40%3Cuser.flink.apache.org%3E

> frocksdb doesn't have arm release 
> --
>
> Key: FLINK-13598
> URL: https://issues.apache.org/jira/browse/FLINK-13598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.9.0, 2.0.0
>Reporter: wangxiyuan
>Priority: Major
> Attachments: image-2020-08-20-09-22-24-021.png
>
>
> Flink now uses frocksdb, a fork of rocksdb, for the 
> *flink-statebackend-rocksdb* module. It doesn't include an ARM release.
> rocksdb supports ARM since 
> [v6.2.2|https://search.maven.org/artifact/org.rocksdb/rocksdbjni/6.2.2/jar].
> Can frocksdb release an ARM package as well?
> Or, AFAIK, since there were some bugs in rocksdb in the past, Flink didn't 
> use it directly. Have those bugs been solved in rocksdb already? Can Flink 
> re-use rocksdb again now?





[jira] [Commented] (FLINK-13598) frocksdb doesn't have arm release

2021-07-20 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384178#comment-17384178
 ] 

Robert Metzger commented on FLINK-13598:


There's a user asking for this feature: 
https://lists.apache.org/thread.html/r33161536fcc4c157aed956db4ebc88b4b816feab43d8f32ac23a25d5%40%3Cuser.flink.apache.org%3E

> frocksdb doesn't have arm release 
> --
>
> Key: FLINK-13598
> URL: https://issues.apache.org/jira/browse/FLINK-13598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.9.0, 2.0.0
>Reporter: wangxiyuan
>Priority: Major
> Attachments: image-2020-08-20-09-22-24-021.png
>
>
> Flink now uses frocksdb, a fork of rocksdb, for the 
> *flink-statebackend-rocksdb* module. It doesn't include an ARM release.
> rocksdb supports ARM since 
> [v6.2.2|https://search.maven.org/artifact/org.rocksdb/rocksdbjni/6.2.2/jar].
> Can frocksdb release an ARM package as well?
> Or, AFAIK, since there were some bugs in rocksdb in the past, Flink didn't 
> use it directly. Have those bugs been solved in rocksdb already? Can Flink 
> re-use rocksdb again now?





[jira] [Commented] (FLINK-22483) Recover checkpoints when JobMaster gains leadership

2021-07-14 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381072#comment-17381072
 ] 

Robert Metzger commented on FLINK-22483:


It seems like the 
{{CheckpointStoreITCase.testRestartOnRecoveryFailure(CheckpointStoreITCase.java:93)}}
 test is hanging (if you scroll further up, you can see that the "main" thread 
is stuck in this method).
You can download the full logs of that CI run to get the output of the hanging 
test; most likely, they will show what's going wrong.
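One generic way to see where such a hang sits, without external tools, is to dump all stack traces from inside the JVM (a sketch, not part of the actual test):

```java
public class ThreadDumpSketch {
    public static void main(String[] args) {
        // Print every live thread with its current frames; for a hanging
        // test, the "main" thread's frames point at the blocking call.
        Thread.getAllStackTraces().forEach((thread, frames) -> {
            System.out.println(thread.getName() + " (" + thread.getState() + ")");
            for (StackTraceElement frame : frames) {
                System.out.println("    at " + frame);
            }
        });
    }
}
```

The same information is what `jstack <pid>` or the CI watchdog prints from outside the process.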


> Recover checkpoints when JobMaster gains leadership
> ---
>
> Key: FLINK-22483
> URL: https://issues.apache.org/jira/browse/FLINK-22483
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.13.0
>Reporter: Robert Metzger
>Assignee: Eduardo Winpenny Tejedor
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>
> Recovering checkpoints (from the CompletedCheckpointStore) is a potentially 
> long-lasting/blocking operation, for example if the file system 
> implementation is retrying to connect to a unavailable storage backend.
> Currently, we are calling the CompletedCheckpointStore.recover() method from 
> the main thread of the JobManager, making it unresponsive to any RPC call 
> while the recover method is blocked:
> {code}
> 2021-04-02 20:33:31,384 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Job XXX 
> switched from state RUNNING to RESTARTING.
> com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to 
> minio.minio.svc:9000 [minio.minio.svc/] failed: Connection refused 
> (Connection refused)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
>  ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[?:?]
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[?:?]
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5062) 
> ~[?:?]
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5008) 
> ~[?:?]
>   at 
> com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490) 
> ~[?:?]
>   at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.lambda$openStream$1(PrestoS3FileSystem.java:905)
>  ~[?:?]
>   at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138) ~[?:?]
>   at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:902)
>  ~[?:?]
>   at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:887)
>  ~[?:?]
>   at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.seekStream(PrestoS3FileSystem.java:880)
>  ~[?:?]
>   at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.lambda$read$0(PrestoS3FileSystem.java:819)
>  ~[?:?]
>   at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138) ~[?:?]
>   at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.read(PrestoS3FileSystem.java:818)
>  ~[?:?]
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) 
> ~[?:1.8.0_282]
>   at XXX.recover(KubernetesHaCheckpointStore.java:69) 
> ~[vvp-flink-ha-kubernetes-flink112-1.1.0.jar:?]
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1511)
>  ~[flink-dist_2.12-1.12.2-stream1.jar:1.12.2-stream1]
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateToAll(CheckpointCoordinator.java:1451)
>  ~[flink-dist_2.12-1.12.2-stream1.jar:1.12.2-stream1]
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.restoreState(SchedulerBase.java:421)
>  ~[flink-dist_2.12-1.12.2-stream1.jar:1.12.2-stream1]
>   at 
> org.apache.flink.runtime.scheduler.Defaul
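The fix direction described above (not running the blocking {{recover()}} on the JobManager's main thread) can be sketched roughly as follows; the names here are hypothetical stand-ins, not the actual Flink patch:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadBlockingRecovery {
    // Hypothetical stand-in for CompletedCheckpointStore.recover(),
    // which may block on an unavailable storage backend.
    static String blockingRecover() throws InterruptedException {
        Thread.sleep(500); // simulate a slow/retrying filesystem call
        return "latest-checkpoint";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();

        // Run the potentially blocking recovery on a dedicated I/O executor
        // instead of the main (RPC) thread.
        CompletableFuture<String> recovered = CompletableFuture.supplyAsync(() -> {
            try {
                return blockingRecover();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }, ioExecutor);

        // The main thread stays responsive while recovery is in flight.
        System.out.println("main thread free to serve RPCs");
        System.out.println("recovered: " + recovered.join());
        ioExecutor.shutdown();
    }
}
```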

[jira] [Commented] (FLINK-22545) JVM crashes when running OperatorEventSendingCheckpointITCase.testOperatorEventAckLost

2021-07-13 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379866#comment-17379866
 ] 

Robert Metzger commented on FLINK-22545:


Thanks a lot for looking into this!

From the exception message, it seems that the intention was rather to catch 
fatal errors (exceptions thrown out of the thread). However, since we register 
an uncaught exception handler, this additional check doesn't seem necessary?
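A minimal, generic illustration of that point -- an uncaught exception handler already observes anything thrown out of a thread (plain Java, not Flink's actual handler):

```java
public class UncaughtHandlerDemo {
    public static void main(String[] args) throws Exception {
        // Register a process-wide handler, similar in spirit to Flink's
        // FatalExitExceptionHandler (which exits the JVM on fatal errors).
        Thread.setDefaultUncaughtExceptionHandler((t, e) ->
                System.out.println("caught from " + t.getName() + ": " + e.getMessage()));

        // Any throwable escaping the thread's run() reaches the handler,
        // so a separate check for exceptions thrown out of the thread
        // would be redundant.
        Thread worker = new Thread(() -> {
            throw new RuntimeException("boom");
        }, "worker");
        worker.start();
        worker.join();
    }
}
```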


> JVM crashes when running 
> OperatorEventSendingCheckpointITCase.testOperatorEventAckLost
> -
>
> Key: FLINK-22545
> URL: https://issues.apache.org/jira/browse/FLINK-22545
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.12.3
>Reporter: Guowei Ma
>Assignee: Stephan Ewen
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.12.5
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17501&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56&l=4287





[jira] [Created] (FLINK-23191) Azure: Upload CI logs to S3 as well

2021-06-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23191:
--

 Summary: Azure: Upload CI logs to S3 as well
 Key: FLINK-23191
 URL: https://issues.apache.org/jira/browse/FLINK-23191
 Project: Flink
  Issue Type: Improvement
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger


We are currently uploading the CI logs to Azure as artifacts. The maximum 
retention we've configured is 60 days; afterwards, the logs are gone.

For rarely occurring test failures, the logs might already be lost by the time 
we start looking into them.
Therefore, we should store the CI logs somewhere permanent, such as S3 
(similar to how we stored them when we were using Travis).





[jira] [Commented] (FLINK-22545) JVM crashes when running OperatorEventSendingCheckpointITCase.testOperatorEventAckLost

2021-06-29 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371566#comment-17371566
 ] 

Robert Metzger commented on FLINK-22545:


I've unassigned myself from the ticket for now, because I'll be on vacation for 
a few days.

> JVM crashes when running 
> OperatorEventSendingCheckpointITCase.testOperatorEventAckLost
> -
>
> Key: FLINK-22545
> URL: https://issues.apache.org/jira/browse/FLINK-22545
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.12.3
>Reporter: Guowei Ma
>Priority: Major
>  Labels: auto-deprioritized-critical, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17501&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56&l=4287





[jira] [Assigned] (FLINK-22545) JVM crashes when running OperatorEventSendingCheckpointITCase.testOperatorEventAckLost

2021-06-29 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-22545:
--

Assignee: (was: Robert Metzger)

> JVM crashes when running 
> OperatorEventSendingCheckpointITCase.testOperatorEventAckLost
> -
>
> Key: FLINK-22545
> URL: https://issues.apache.org/jira/browse/FLINK-22545
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.12.3
>Reporter: Guowei Ma
>Priority: Major
>  Labels: auto-deprioritized-critical, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17501&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56&l=4287





[jira] [Commented] (FLINK-22545) JVM crashes when running OperatorEventSendingCheckpointITCase.testOperatorEventAckLost

2021-06-28 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370656#comment-17370656
 ] 

Robert Metzger commented on FLINK-22545:


Yes, I did not find anything suspicious (yet). 
I'll post again once I have more findings ;) 

> JVM crashes when running 
> OperatorEventSendingCheckpointITCase.testOperatorEventAckLost
> -
>
> Key: FLINK-22545
> URL: https://issues.apache.org/jira/browse/FLINK-22545
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.12.3
>Reporter: Guowei Ma
>Assignee: Robert Metzger
>Priority: Major
>  Labels: auto-deprioritized-critical, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17501&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56&l=4287





[jira] [Commented] (FLINK-22545) JVM crashes when running OperatorEventSendingCheckpointITCase.testOperatorEventAckLost

2021-06-28 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370617#comment-17370617
 ] 

Robert Metzger commented on FLINK-22545:


The issue has only happened in the release-1.12 branch.

The JVM is stopping because of this:
{code}
23:53:46,730 [SourceCoordinator-Source: numbers -> Map -> Sink: Data stream 
collect sink] ERROR org.apache.flink.runtime.util.FatalExitExceptionHandler 
 [] - FATAL: Thread 'SourceCoordinator-Source: numbers -> Map -> Sink: Data 
stream collect sink' produced an uncaught exception. Stopping the process...
java.lang.Error: This indicates that a fatal error has happened and caused the 
coordinator executor thread to exit. Check the earlier logsto see the root 
cause of the problem.
at 
org.apache.flink.runtime.source.coordinator.SourceCoordinatorProvider$CoordinatorExecutorThreadFactory.newThread(SourceCoordinatorProvider.java:114)
 ~[flink-runtime_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:619)
 ~[?:1.8.0_282]
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:932) 
~[?:1.8.0_282]
at 
java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1025)
 ~[?:1.8.0_282]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) 
~[?:1.8.0_282]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
{code}
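The Error above comes from the coordinator's thread factory refusing to create a replacement after its single worker thread died. A simplified, hypothetical version of that pattern:

```java
import java.util.concurrent.ThreadFactory;

public class FailFastThreadFactory implements ThreadFactory {
    private boolean created = false;

    @Override
    public synchronized Thread newThread(Runnable r) {
        // A single-thread executor asks this factory for a replacement only
        // after its worker died from an uncaught throwable. Fail fast
        // instead of silently restarting, mirroring the intent of
        // SourceCoordinatorProvider's CoordinatorExecutorThreadFactory.
        if (created) {
            throw new Error("coordinator thread died; check the earlier logs for the root cause");
        }
        created = true;
        return new Thread(r, "coordinator");
    }
}
```

Because the executor only requests a second thread after the first one terminated abnormally, the second `newThread()` call doubles as a death detector.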

> JVM crashes when running 
> OperatorEventSendingCheckpointITCase.testOperatorEventAckLost
> -
>
> Key: FLINK-22545
> URL: https://issues.apache.org/jira/browse/FLINK-22545
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.12.3
>Reporter: Guowei Ma
>Assignee: Robert Metzger
>Priority: Major
>  Labels: auto-deprioritized-critical, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17501&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56&l=4287




