[jira] [Assigned] (FLINK-22492) KinesisTableApiITCase with wrong results

2021-06-24 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22492:
---

Assignee: Arvid Heise

> KinesisTableApiITCase with wrong results
> 
>
> Key: FLINK-22492
> URL: https://issues.apache.org/jira/browse/FLINK-22492
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kinesis, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Minor
>  Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17280=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529=27178
> {code}
> Apr 27 12:26:04 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 59.289 s <<< FAILURE! - in 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase
> Apr 27 12:26:04 [ERROR] 
> testTableApiSourceAndSink(org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase)
>   Time elapsed: 59.283 s  <<< FAILURE!
> Apr 27 12:26:04 java.lang.AssertionError: expected:<3> but was:<0>
> Apr 27 12:26:04   at org.junit.Assert.fail(Assert.java:88)
> Apr 27 12:26:04   at org.junit.Assert.failNotEquals(Assert.java:834)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:645)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:631)
> Apr 27 12:26:04   at 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase.testTableApiSourceAndSink(KinesisTableApiITCase.java:121)
> Apr 27 12:26:04   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Apr 27 12:26:04   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 27 12:26:04   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 27 12:26:04   at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> Apr 27 12:26:04   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Apr 27 12:26:04   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22085) KafkaSourceLegacyITCase hangs/fails on azure

2021-07-02 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373612#comment-17373612
 ] 

Arvid Heise commented on FLINK-22085:
-

I think having `notifyDataAvailable` at the end of recovery certainly makes 
sense. Extra `notifyDataAvailable` calls are rather cheap when not on the hot 
path.

> KafkaSourceLegacyITCase hangs/fails on azure
> 
>
> Key: FLINK-22085
> URL: https://issues.apache.org/jira/browse/FLINK-22085
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.13.0, 1.14.0
>Reporter: Dawid Wysakowicz
>Assignee: Yun Gao
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0
>
>
> 1) Observations
> a) The Azure pipeline would occasionally hang without printing any test error 
> information.
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15939=logs=c5f0071e-1851-543e-9a45-9ac140befc32=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5=8219]
> b) When running the test KafkaSourceLegacyITCase::testBrokerFailure() with 
> INFO-level logging, the test would hang with the following error message 
> printed repeatedly:
> {code:java}
> 20451 [New I/O boss #50] ERROR 
> org.apache.flink.networking.NetworkFailureHandler [] - Closing communication 
> channel because of an exception
> java.net.ConnectException: Connection refused: localhost/127.0.0.1:50073
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
> ~[?:1.8.0_151]
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 
> ~[?:1.8.0_151]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
>  ~[flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
>  [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
>  [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>  [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
>  [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>  [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> org.apache.flink.shaded.testutils.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>  [flink-test-utils_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_151]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_151]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
> {code}
> *2) Root cause explanations*
> The test would hang because it enters the following loop:
>  - closeOnFlush() is called for a given channel
>  - closeOnFlush() calls channel.write(..)
>  - channel.write() triggers the exceptionCaught(...) callback
>  - closeOnFlush() is called for the same channel again.
> *3) Solution*
> Update closeOnFlush() so that, once a channel is being closed by this 
> method, subsequent calls to closeOnFlush() on the same channel do not 
> attempt to write to it again.
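The guarded closeOnFlush() can be sketched as follows. This is a hypothetical, dependency-free model of the fix described above (the real code lives in NetworkFailureHandler and uses Netty channels); the class and field names are illustrative, not Flink's.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Model of the fix: closeOnFlush() becomes a no-op for a channel that is
// already being closed, which breaks the
// write -> exceptionCaught -> closeOnFlush recursion described above.
class CloseOnFlushGuard {
    // channels currently being closed; thread-safe set, names illustrative
    private final Set<Object> closingChannels = ConcurrentHashMap.newKeySet();
    final AtomicInteger writeAttempts = new AtomicInteger();

    void closeOnFlush(Object channel) {
        // first call wins; re-entrant calls for the same channel return early
        if (!closingChannels.add(channel)) {
            return;
        }
        write(channel);
    }

    private void write(Object channel) {
        writeAttempts.incrementAndGet();
        // simulate the failing write triggering the exceptionCaught callback,
        // which in the buggy version called closeOnFlush() again
        closeOnFlush(channel);
    }
}
```

Without the `closingChannels` guard, the simulated callback would recurse forever, which mirrors the repeated "Closing communication channel" log lines in the hanging test.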





[jira] [Commented] (FLINK-20888) ContinuousFileReaderOperator should not close the output on close()

2021-06-29 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371579#comment-17371579
 ] 

Arvid Heise commented on FLINK-20888:
-

Merged into master as dfeae1b652bf268aadce6df1e98c377e98e56655.

> ContinuousFileReaderOperator should not close the output on close()
> ---
>
> Key: FLINK-20888
> URL: https://issues.apache.org/jira/browse/FLINK-20888
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.12.0
>Reporter: Yun Gao
>Assignee: Arvid Heise
>Priority: Minor
>  Labels: auto-deprioritized-major, pull-request-available
>
> Currently ContinuousFileReaderOperator closes the output on close(). If it 
> is chained with more operators, this also closes the following operators 
> before their endInput() is called (by default, we call op1.endInput(), 
> op1.close(), op2.endInput(), op2.close()... in order). This can cause 
> problems such as the one reported in 
> [https://lists.apache.org/thread.html/r50a94aaea4fe25f3927a4274ea8272e6b76ecec8f3fe48d2566689bd%40%3Cuser.flink.apache.org%3E]
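The intended shutdown order of a chain can be sketched with a minimal, hypothetical model (real Flink operators are far richer; `Op`, `op`, and `shutdown` are illustrative names, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of an operator chain's shutdown sequence: every operator
// must observe endInput() before its own close(), and an upstream close()
// must not tear down downstream operators prematurely.
class ChainOrderDemo {
    static final List<String> calls = new ArrayList<>();

    interface Op { void endInput(); void close(); }

    static Op op(String name) {
        return new Op() {
            public void endInput() { calls.add(name + ".endInput"); }
            public void close() { calls.add(name + ".close"); }
        };
    }

    // The default order described in the issue:
    // op1.endInput(), op1.close(), op2.endInput(), op2.close(), ...
    static List<String> shutdown(Op... chain) {
        calls.clear();
        for (Op o : chain) {
            o.endInput();
            o.close();
        }
        return calls;
    }
}
```

The bug is that ContinuousFileReaderOperator's close() also closed the shared output, so downstream operators were effectively closed before their own endInput() ran, violating this order.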





[jira] [Commented] (FLINK-22492) KinesisTableApiITCase with wrong results

2021-06-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369717#comment-17369717
 ] 

Arvid Heise commented on FLINK-22492:
-

Merged into master as 66120f98066c690d10bcca46c9e78767be5bec1b.

> KinesisTableApiITCase with wrong results
> 
>
> Key: FLINK-22492
> URL: https://issues.apache.org/jira/browse/FLINK-22492
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kinesis, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17280=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529=27178
> {code}
> Apr 27 12:26:04 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 59.289 s <<< FAILURE! - in 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase
> Apr 27 12:26:04 [ERROR] 
> testTableApiSourceAndSink(org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase)
>   Time elapsed: 59.283 s  <<< FAILURE!
> Apr 27 12:26:04 java.lang.AssertionError: expected:<3> but was:<0>
> Apr 27 12:26:04   at org.junit.Assert.fail(Assert.java:88)
> Apr 27 12:26:04   at org.junit.Assert.failNotEquals(Assert.java:834)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:645)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:631)
> Apr 27 12:26:04   at 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase.testTableApiSourceAndSink(KinesisTableApiITCase.java:121)
> Apr 27 12:26:04   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Apr 27 12:26:04   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 27 12:26:04   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 27 12:26:04   at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> Apr 27 12:26:04   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Apr 27 12:26:04   at java.lang.Thread.run(Thread.java:748)
> {code}





[jira] [Commented] (FLINK-20888) ContinuousFileReaderOperator should not close the output on close()

2021-07-01 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372568#comment-17372568
 ] 

Arvid Heise commented on FLINK-20888:
-

Sorry yes. I have updated the status.

> ContinuousFileReaderOperator should not close the output on close()
> ---
>
> Key: FLINK-20888
> URL: https://issues.apache.org/jira/browse/FLINK-20888
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.12.0
>Reporter: Yun Gao
>Assignee: Arvid Heise
>Priority: Minor
>  Labels: auto-deprioritized-major, pull-request-available
> Fix For: 1.14.0, 1.12.5, 1.13.3
>
>
> Currently ContinuousFileReaderOperator closes the output on close(). If it 
> is chained with more operators, this also closes the following operators 
> before their endInput() is called (by default, we call op1.endInput(), 
> op1.close(), op2.endInput(), op2.close()... in order). This can cause 
> problems such as the one reported in 
> [https://lists.apache.org/thread.html/r50a94aaea4fe25f3927a4274ea8272e6b76ecec8f3fe48d2566689bd%40%3Cuser.flink.apache.org%3E]





[jira] [Resolved] (FLINK-20888) ContinuousFileReaderOperator should not close the output on close()

2021-07-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-20888.
-
Fix Version/s: 1.12.5
   1.14.0
   Resolution: Fixed

> ContinuousFileReaderOperator should not close the output on close()
> ---
>
> Key: FLINK-20888
> URL: https://issues.apache.org/jira/browse/FLINK-20888
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.12.0
>Reporter: Yun Gao
>Assignee: Arvid Heise
>Priority: Minor
>  Labels: auto-deprioritized-major, pull-request-available
> Fix For: 1.14.0, 1.12.5, 1.13.3
>
>
> Currently ContinuousFileReaderOperator closes the output on close(). If it 
> is chained with more operators, this also closes the following operators 
> before their endInput() is called (by default, we call op1.endInput(), 
> op1.close(), op2.endInput(), op2.close()... in order). This can cause 
> problems such as the one reported in 
> [https://lists.apache.org/thread.html/r50a94aaea4fe25f3927a4274ea8272e6b76ecec8f3fe48d2566689bd%40%3Cuser.flink.apache.org%3E]





[jira] [Resolved] (FLINK-22964) Connector-base exposes dependency to flink-core.

2021-06-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22964.
-
Release Note: Connectors no longer transitively hold a reference to 
`flink-core`. With this fix, a fat jar containing a connector does not 
include `flink-core`.
  Resolution: Fixed

> Connector-base exposes dependency to flink-core.
> 
>
> Key: FLINK-22964
> URL: https://issues.apache.org/jira/browse/FLINK-22964
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: classloading, pull-request-available
> Fix For: 1.14.0, 1.12.5, 1.13.2
>
>
> Connectors get shaded into the user jar and as such should contain no 
> unnecessary dependencies on Flink. However, connector-base exposes 
> `flink-core`, which then by default gets shaded into the user jar. Besides 
> 6 MB of extra size, the dependency also causes class-loading issues when 
> `classloader.parent-first-patterns` does not include `o.a.f`.
> The fix is to make `flink-core` a provided dependency in `connector-base`.
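The scope change can be sketched as a Maven dependency declaration. This is an illustrative fragment of what the `flink-connector-base` pom.xml change would look like, not the actual diff; `provided` keeps `flink-core` on the compile classpath but out of shaded/fat user jars.

```xml
<!-- Sketch: mark flink-core as provided in flink-connector-base so it is
     no longer pulled transitively into shaded user jars. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-core</artifactId>
  <version>${project.version}</version>
  <scope>provided</scope>
</dependency>
```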





[jira] [Commented] (FLINK-22964) Connector-base exposes dependency to flink-core.

2021-06-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371863#comment-17371863
 ] 

Arvid Heise commented on FLINK-22964:
-

Merged into master as 96dba32541d0f756858a5bcdda730e6817b68600, into 1.13 
as fdc09154d5f4bb4c618da0e0f224ded4a807916f, and into 1.12 as 
ca999b28ccaff2cb82638ada8a4157444c9a8efb.

> Connector-base exposes dependency to flink-core.
> 
>
> Key: FLINK-22964
> URL: https://issues.apache.org/jira/browse/FLINK-22964
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: classloading, pull-request-available
> Fix For: 1.14.0, 1.12.5, 1.13.2
>
>
> Connectors get shaded into the user jar and as such should contain no 
> unnecessary dependencies on Flink. However, connector-base exposes 
> `flink-core`, which then by default gets shaded into the user jar. Besides 
> 6 MB of extra size, the dependency also causes class-loading issues when 
> `classloader.parent-first-patterns` does not include `o.a.f`.
> The fix is to make `flink-core` a provided dependency in `connector-base`.





[jira] [Commented] (FLINK-20888) ContinuousFileReaderOperator should not close the output on close()

2021-06-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371866#comment-17371866
 ] 

Arvid Heise commented on FLINK-20888:
-

Merged into 1.12 as 83a26d519261f3d7607a2e25ce77e5f2c10b11f2 and into 1.13 as 
8636a4e08b9eaaef838ac3b2a4ae83afadd5a202.

> ContinuousFileReaderOperator should not close the output on close()
> ---
>
> Key: FLINK-20888
> URL: https://issues.apache.org/jira/browse/FLINK-20888
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem
>Affects Versions: 1.12.0
>Reporter: Yun Gao
>Assignee: Arvid Heise
>Priority: Minor
>  Labels: auto-deprioritized-major, pull-request-available
>
> Currently ContinuousFileReaderOperator closes the output on close(). If it 
> is chained with more operators, this also closes the following operators 
> before their endInput() is called (by default, we call op1.endInput(), 
> op1.close(), op2.endInput(), op2.close()... in order). This can cause 
> problems such as the one reported in 
> [https://lists.apache.org/thread.html/r50a94aaea4fe25f3927a4274ea8272e6b76ecec8f3fe48d2566689bd%40%3Cuser.flink.apache.org%3E]





[jira] [Resolved] (FLINK-22492) KinesisTableApiITCase with wrong results

2021-06-26 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22492.
-
Fix Version/s: 1.13.2
   1.14.0
   Resolution: Fixed

> KinesisTableApiITCase with wrong results
> 
>
> Key: FLINK-22492
> URL: https://issues.apache.org/jira/browse/FLINK-22492
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kinesis, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0, 1.13.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17280=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529=27178
> {code}
> Apr 27 12:26:04 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 59.289 s <<< FAILURE! - in 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase
> Apr 27 12:26:04 [ERROR] 
> testTableApiSourceAndSink(org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase)
>   Time elapsed: 59.283 s  <<< FAILURE!
> Apr 27 12:26:04 java.lang.AssertionError: expected:<3> but was:<0>
> Apr 27 12:26:04   at org.junit.Assert.fail(Assert.java:88)
> Apr 27 12:26:04   at org.junit.Assert.failNotEquals(Assert.java:834)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:645)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:631)
> Apr 27 12:26:04   at 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase.testTableApiSourceAndSink(KinesisTableApiITCase.java:121)
> Apr 27 12:26:04   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Apr 27 12:26:04   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 27 12:26:04   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 27 12:26:04   at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> Apr 27 12:26:04   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Apr 27 12:26:04   at java.lang.Thread.run(Thread.java:748)
> {code}





[jira] [Commented] (FLINK-22492) KinesisTableApiITCase with wrong results

2021-06-26 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369816#comment-17369816
 ] 

Arvid Heise commented on FLINK-22492:
-

See FLINK-23009 for a different issue with the same test.

> KinesisTableApiITCase with wrong results
> 
>
> Key: FLINK-22492
> URL: https://issues.apache.org/jira/browse/FLINK-22492
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kinesis, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17280=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529=27178
> {code}
> Apr 27 12:26:04 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 59.289 s <<< FAILURE! - in 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase
> Apr 27 12:26:04 [ERROR] 
> testTableApiSourceAndSink(org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase)
>   Time elapsed: 59.283 s  <<< FAILURE!
> Apr 27 12:26:04 java.lang.AssertionError: expected:<3> but was:<0>
> Apr 27 12:26:04   at org.junit.Assert.fail(Assert.java:88)
> Apr 27 12:26:04   at org.junit.Assert.failNotEquals(Assert.java:834)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:645)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:631)
> Apr 27 12:26:04   at 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase.testTableApiSourceAndSink(KinesisTableApiITCase.java:121)
> Apr 27 12:26:04   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Apr 27 12:26:04   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 27 12:26:04   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 27 12:26:04   at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> Apr 27 12:26:04   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Apr 27 12:26:04   at java.lang.Thread.run(Thread.java:748)
> {code}





[jira] [Commented] (FLINK-22492) KinesisTableApiITCase with wrong results

2021-06-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369815#comment-17369815
 ] 

Arvid Heise commented on FLINK-22492:
-

Merged into 1.13 as b06862333119359afba6a8f43ed08a55a7c7e57b.

> KinesisTableApiITCase with wrong results
> 
>
> Key: FLINK-22492
> URL: https://issues.apache.org/jira/browse/FLINK-22492
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kinesis, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17280=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529=27178
> {code}
> Apr 27 12:26:04 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 59.289 s <<< FAILURE! - in 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase
> Apr 27 12:26:04 [ERROR] 
> testTableApiSourceAndSink(org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase)
>   Time elapsed: 59.283 s  <<< FAILURE!
> Apr 27 12:26:04 java.lang.AssertionError: expected:<3> but was:<0>
> Apr 27 12:26:04   at org.junit.Assert.fail(Assert.java:88)
> Apr 27 12:26:04   at org.junit.Assert.failNotEquals(Assert.java:834)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:645)
> Apr 27 12:26:04   at org.junit.Assert.assertEquals(Assert.java:631)
> Apr 27 12:26:04   at 
> org.apache.flink.streaming.kinesis.test.KinesisTableApiITCase.testTableApiSourceAndSink(KinesisTableApiITCase.java:121)
> Apr 27 12:26:04   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Apr 27 12:26:04   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 27 12:26:04   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 27 12:26:04   at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 27 12:26:04   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> Apr 27 12:26:04   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> Apr 27 12:26:04   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Apr 27 12:26:04   at java.lang.Thread.run(Thread.java:748)
> {code}





[jira] [Commented] (FLINK-23147) ThreadPools can be poisoned by context class loaders

2021-06-28 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370506#comment-17370506
 ] 

Arvid Heise commented on FLINK-23147:
-

My main question is: why are our plugins different from other plugins? And 
what does our/we even mean in this context? 
If we/our refers to plugins inside the main repo, then what happens if we 
really externalize connectors, metric reporters, filesystems, etc. to 
flink-packages? Is that still we? Or does we/our then mean everything under 
the Apache Flink org umbrella?

If we need Preconditions to implement a plugin sufficiently, why shouldn't we 
give externals the same power? Keep in mind that most users copy our code and 
adjust it. Now they would suddenly need to change all preconditions because 
it's internal.

We can be pragmatic here, defer the discussion until that time comes, and 
keep {{PluginLoader}} as is. My main argument is that plugins and submodules 
are two different things, and we shouldn't weaken the class separation of 
plugins to enable submodules.

> ThreadPools can be poisoned by context class loaders
> 
>
> Key: FLINK-23147
> URL: https://issues.apache.org/jira/browse/FLINK-23147
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.14.0
>
>
> Newly created threads in a thread pool inherit the context class loader (CCL) 
> of the currently running thread.
> For thread pools this is very problematic because the CCL is unlikely to be 
> reset at any point; not only can this leak another CL by accident, it can 
> also cause class loading issues, for example when using a {{ServiceLoader}} 
> because it relies on the CCL.
> With the scala-free runtime this for example means that if an actor system 
> thread schedules something into the future thread pool of the JM, then a new 
> thread is created which uses a plugin loader as its CCL. The plugin loaders 
> are quite restrictive and prevent the loading of 3rd-party dependencies; so 
> if the JM schedules something into the future thread pool which requires one 
> of these dependencies to be accessible, then we're gambling as to whether 
> this dependency can actually be loaded in the end.
> Because it is difficult to ensure that we set the CCL correctly on all 
> transitions from Akka to Flink land, I suggest adding a safeguard to the 
> ExecutorThreadFactory to enforce that newly created threads are always 
> initialized with the CL that loaded Flink.
> /cc [~arvid] [~sewen]
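The suggested safeguard can be sketched as a thread factory that overwrites the inherited context class loader. This is a hypothetical, stdlib-only sketch, not Flink's actual ExecutorThreadFactory; the class name is illustrative.

```java
import java.util.concurrent.ThreadFactory;

// Sketch of the safeguard: force every pooled thread to start with the
// class loader that loaded Flink, instead of inheriting the creator's
// context class loader (which might be a restrictive plugin loader).
class CclSafeThreadFactory implements ThreadFactory {
    private static final ClassLoader FLINK_CL =
            CclSafeThreadFactory.class.getClassLoader();

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        // overwrite the inherited CCL so that e.g. ServiceLoader lookups
        // inside pooled tasks behave predictably
        t.setContextClassLoader(FLINK_CL);
        return t;
    }
}
```

With such a factory, it no longer matters which thread (and with which CCL) happens to submit the task that triggers pool growth.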





[jira] [Commented] (FLINK-16852) Add metrics to the source coordinator.

2021-07-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377917#comment-17377917
 ] 

Arvid Heise commented on FLINK-16852:
-

Hi [~becket_qin], could you describe which metrics you envision for the 
coordinator?

> Add metrics to the source coordinator.
> --
>
> Key: FLINK-16852
> URL: https://issues.apache.org/jira/browse/FLINK-16852
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Common
>Reporter: Jiangjie Qin
>Priority: Major
>
> Add metrics to the SourceCoordinator.





[jira] [Assigned] (FLINK-16851) Add common metrics to the SourceReader base implementation.

2021-07-09 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-16851:
---

Assignee: Arvid Heise

> Add common metrics to the SourceReader base implementation.
> ---
>
> Key: FLINK-16851
> URL: https://issues.apache.org/jira/browse/FLINK-16851
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Common
>Reporter: Jiangjie Qin
>Assignee: Arvid Heise
>Priority: Major
>
> Add the metrics to the base SourceReader implementation. This is relevant to 
> [FLIP-33|https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics?src=contextnavpagetreemode].





[jira] [Commented] (FLINK-23182) Connection leak in RMQSource

2021-07-06 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375844#comment-17375844
 ] 

Arvid Heise commented on FLINK-23182:
-

Merged into master as b662ed19406c8f552c916b6840c8d62fad64b77c, into 1.13 as 
c830bde767c8080509f24fc688350cff0eb9f3ed, and into 1.12 as 
4d017d3b361e066ad9114ebcd7e4011ef5fb6752.

Thank you very much for the contribution!

> Connection leak in RMQSource 
> -
>
> Key: FLINK-23182
> URL: https://issues.apache.org/jira/browse/FLINK-23182
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / RabbitMQ
>Affects Versions: 1.13.1, 1.12.4
>Reporter: Michał Ciesielczyk
>Assignee: Michał Ciesielczyk
>Priority: Major
>  Labels: pull-request-available
>
> The RabbitMQ connection is not closed properly in the RMQSource connector in 
> case of failures. This leads to a connection leak (we lose the handles to 
> still-open connections) that lasts until the Flink TaskManager is either 
> stopped or crashes.
> The issue is caused by improper resource release in the open and close 
> methods of RMQSource:
>  - 
> [https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-rabbitmq/src/main/java/org/apache/flink/streaming/connectors/rabbitmq/RMQSource.java#L260]
>  - here the connection is opened, but not closed in case of failure (e.g. 
> caused by an invalid queue configuration)
>  - 
> [https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-rabbitmq/src/main/java/org/apache/flink/streaming/connectors/rabbitmq/RMQSource.java#L282]
>  - here the connection might not be closed properly if stopping the consumer 
> causes a failure first
> In both cases, the solution is relatively simple - make sure that 
> connection#close is always called when it should be (failing to close one 
> resource should not prevent other close methods from being called). In open, 
> we can probably close allocated resources silently (as the process did not 
> succeed anyway). In close, we should either throw the first caught exception 
> or the last one, and log all the others as warnings.
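The close-ordering described above (always attempt every close, rethrow the first failure, log the rest as warnings) can be sketched independently of RabbitMQ. The `closeAll` helper below is illustrative, not the actual RMQSource code:

```java
import java.util.List;

// Illustrative helper, not the actual RMQSource code: attempt every close,
// rethrow the first failure, and report the rest as warnings.
class SafeClose {
    static void closeAll(List<AutoCloseable> resources) throws Exception {
        Exception first = null;
        for (AutoCloseable resource : resources) {
            if (resource == null) {
                continue;
            }
            try {
                resource.close();
            } catch (Exception e) {
                if (first == null) {
                    first = e; // remember the first failure to rethrow later
                } else {
                    // in a real connector this would go to a WARN logger
                    System.err.println("Suppressed close failure: " + e);
                }
            }
        }
        if (first != null) {
            throw first; // one failed close must not hide the others
        }
    }
}
```

The key property is that a failing `close()` on one resource (here, the channel) no longer prevents the connection behind it from being closed.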





[jira] [Resolved] (FLINK-23182) Connection leak in RMQSource

2021-07-06 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-23182.
-
Fix Version/s: 1.13.3
   1.12.6
   1.14.0
   Resolution: Fixed

> Connection leak in RMQSource 
> -
>
> Key: FLINK-23182
> URL: https://issues.apache.org/jira/browse/FLINK-23182
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / RabbitMQ
>Affects Versions: 1.13.1, 1.12.4
>Reporter: Michał Ciesielczyk
>Assignee: Michał Ciesielczyk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.14.0, 1.12.6, 1.13.3
>
>
> The RabbitMQ connection is not closed properly in the RMQSource connector in 
> case of failures. This leads to a connection leak (we lose the handles to 
> still-open connections) that lasts until the Flink TaskManager is either 
> stopped or crashes.
> The issue is caused by improper resource release in the open and close 
> methods of RMQSource:
>  - 
> [https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-rabbitmq/src/main/java/org/apache/flink/streaming/connectors/rabbitmq/RMQSource.java#L260]
>  - here the connection is opened, but not closed in case of failure (e.g. 
> caused by an invalid queue configuration)
>  - 
> [https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-rabbitmq/src/main/java/org/apache/flink/streaming/connectors/rabbitmq/RMQSource.java#L282]
>  - here the connection might not be closed properly if stopping the consumer 
> causes a failure first
> In both cases, the solution is relatively simple - make sure that 
> connection#close is always called when it should be (failing to close one 
> resource should not prevent other close methods from being called). In open, 
> we can probably close allocated resources silently (as the process did not 
> succeed anyway). In close, we should either throw the first caught exception 
> or the last one, and log all the others as warnings.





[jira] [Comment Edited] (FLINK-22416) UpsertKafkaTableITCase hangs when collecting results

2021-06-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372111#comment-17372111
 ] 

Arvid Heise edited comment on FLINK-22416 at 6/30/21, 4:40 PM:
---

Looking at the stack traces: 
- all subtasks are waiting for data. 
- Legacy thread is waiting for data to arrive. 
- Kafka consumer thread is waiting for partition assignments. 

So it seems as if there is just no data after recovery... I guess we need to 
trace why {{hasAssignedPartitions == false}}.


{noformat}
"Kafka Fetcher for Source: TableSourceScan(table=[[default_catalog, 
default_database, upsert_kafka]], fields=[k_user_id, name, k_event_id, user_id, 
payload, timestamp]) (4/4)#0" #1434 daemon prio=5 os_prio=0 
tid=0x7fd8c0003000 nid=0x4681 waiting on condition [0x7fd8187c6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xf8c34308> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
org.apache.flink.streaming.connectors.kafka.internals.ClosableBlockingQueue.getBatchBlocking(ClosableBlockingQueue.java:396)
at 
org.apache.flink.streaming.connectors.kafka.internals.KafkaConsumerThread.run(KafkaConsumerThread.java:240)
{noformat}



was (Author: arvid):
Looking at the stack traces: 
- all subtasks are waiting for data. 
- Legacy thread is waiting for data to arrive. 
- Kafka consumer thread is waiting for partition assignments. 

So it seems as if there is just no data after recovery... I guess we need to 
trace why {{hasAssignedPartitions == false}}.

> UpsertKafkaTableITCase hangs when collecting results
> 
>
> Key: FLINK-22416
> URL: https://issues.apache.org/jira/browse/FLINK-22416
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Qingsheng Ren
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.14.0
>
> Attachments: idea-test.png, threads_report.txt
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17037=logs=c5f0071e-1851-543e-9a45-9ac140befc32=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5=7002
> {code}
> 2021-04-22T11:16:35.6812919Z Apr 22 11:16:35 [ERROR] 
> testSourceSinkWithKeyAndPartialValue[format = 
> csv](org.apache.flink.streaming.connectors.kafka.table.UpsertKafkaTableITCase)
>   Time elapsed: 30.01 s  <<< ERROR!
> 2021-04-22T11:16:35.6814151Z Apr 22 11:16:35 
> org.junit.runners.model.TestTimedOutException: test timed out after 30 seconds
> 2021-04-22T11:16:35.6814781Z Apr 22 11:16:35  at 
> java.lang.Thread.sleep(Native Method)
> 2021-04-22T11:16:35.6815444Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.sleepBeforeRetry(CollectResultFetcher.java:237)
> 2021-04-22T11:16:35.6816250Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:113)
> 2021-04-22T11:16:35.6817033Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
> 2021-04-22T11:16:35.6817719Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
> 2021-04-22T11:16:35.6818351Z Apr 22 11:16:35  at 
> org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:370)
> 2021-04-22T11:16:35.6818980Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.connectors.kafka.table.KafkaTableTestUtils.collectRows(KafkaTableTestUtils.java:52)
> 2021-04-22T11:16:35.6819978Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.connectors.kafka.table.UpsertKafkaTableITCase.testSourceSinkWithKeyAndPartialValue(UpsertKafkaTableITCase.java:147)
> 2021-04-22T11:16:35.6820803Z Apr 22 11:16:35  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-22T11:16:35.6821365Z Apr 22 11:16:35  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-22T11:16:35.6822072Z Apr 22 11:16:35  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-22T11:16:35.6822656Z Apr 22 11:16:35  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-22T11:16:35.6823124Z Apr 22 11:16:35  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 

[jira] [Commented] (FLINK-22416) UpsertKafkaTableITCase hangs when collecting results

2021-06-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372111#comment-17372111
 ] 

Arvid Heise commented on FLINK-22416:
-

Looking at the stack traces: 
- all subtasks are waiting for data. 
- Legacy thread is waiting for data to arrive. 
- Kafka consumer thread is waiting for partition assignments. 

So it seems as if there is just no data after recovery... I guess we need to 
trace why {{hasAssignedPartitions == false}}.

> UpsertKafkaTableITCase hangs when collecting results
> 
>
> Key: FLINK-22416
> URL: https://issues.apache.org/jira/browse/FLINK-22416
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Qingsheng Ren
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.14.0
>
> Attachments: idea-test.png, threads_report.txt
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17037=logs=c5f0071e-1851-543e-9a45-9ac140befc32=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5=7002
> {code}
> 2021-04-22T11:16:35.6812919Z Apr 22 11:16:35 [ERROR] 
> testSourceSinkWithKeyAndPartialValue[format = 
> csv](org.apache.flink.streaming.connectors.kafka.table.UpsertKafkaTableITCase)
>   Time elapsed: 30.01 s  <<< ERROR!
> 2021-04-22T11:16:35.6814151Z Apr 22 11:16:35 
> org.junit.runners.model.TestTimedOutException: test timed out after 30 seconds
> 2021-04-22T11:16:35.6814781Z Apr 22 11:16:35  at 
> java.lang.Thread.sleep(Native Method)
> 2021-04-22T11:16:35.6815444Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.sleepBeforeRetry(CollectResultFetcher.java:237)
> 2021-04-22T11:16:35.6816250Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:113)
> 2021-04-22T11:16:35.6817033Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
> 2021-04-22T11:16:35.6817719Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
> 2021-04-22T11:16:35.6818351Z Apr 22 11:16:35  at 
> org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:370)
> 2021-04-22T11:16:35.6818980Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.connectors.kafka.table.KafkaTableTestUtils.collectRows(KafkaTableTestUtils.java:52)
> 2021-04-22T11:16:35.6819978Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.connectors.kafka.table.UpsertKafkaTableITCase.testSourceSinkWithKeyAndPartialValue(UpsertKafkaTableITCase.java:147)
> 2021-04-22T11:16:35.6820803Z Apr 22 11:16:35  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-22T11:16:35.6821365Z Apr 22 11:16:35  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-22T11:16:35.6822072Z Apr 22 11:16:35  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-22T11:16:35.6822656Z Apr 22 11:16:35  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-22T11:16:35.6823124Z Apr 22 11:16:35  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-22T11:16:35.6823672Z Apr 22 11:16:35  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-22T11:16:35.6824202Z Apr 22 11:16:35  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-22T11:16:35.6824709Z Apr 22 11:16:35  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-22T11:16:35.6825230Z Apr 22 11:16:35  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2021-04-22T11:16:35.6825716Z Apr 22 11:16:35  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2021-04-22T11:16:35.6826204Z Apr 22 11:16:35  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-22T11:16:35.6826807Z Apr 22 11:16:35  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2021-04-22T11:16:35.6827378Z Apr 22 11:16:35  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2021-04-22T11:16:35.6827926Z Apr 22 11:16:35  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2021-04-22T11:16:35.6828331Z Apr 22 11:16:35  at 
> java.lang.Thread.run(Thread.java:748)
> 2021-04-22T11:16:35.6828600Z Apr 22 11:16:35 
> 2021-04-22T11:16:35.6829073Z Apr 22 

[jira] [Comment Edited] (FLINK-22416) UpsertKafkaTableITCase hangs when collecting results

2021-06-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372111#comment-17372111
 ] 

Arvid Heise edited comment on FLINK-22416 at 6/30/21, 4:42 PM:
---

Looking at the stack traces: 
- all subtasks are waiting for data. 
- Legacy thread is waiting for data to arrive. 
- Kafka consumer thread is waiting for partition assignments. 

So it seems as if there is just no data after recovery... I guess we need to 
trace why {{hasAssignedPartitions == false}}.


{noformat}
"Kafka Fetcher for Source: TableSourceScan(table=[[default_catalog, 
default_database, upsert_kafka]], fields=[k_user_id, name, k_event_id, user_id, 
payload, timestamp]) (4/4)#0" #1434 daemon prio=5 os_prio=0 
tid=0x7fd8c0003000 nid=0x4681 waiting on condition [0x7fd8187c6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xf8c34308> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
org.apache.flink.streaming.connectors.kafka.internals.ClosableBlockingQueue.getBatchBlocking(ClosableBlockingQueue.java:396)
at 
org.apache.flink.streaming.connectors.kafka.internals.KafkaConsumerThread.run(KafkaConsumerThread.java:240)
{noformat}

It may be similar to FLINK-21996, but AFAIK that ticket was for new sources 
only, whereas here we have a legacy source. [~sewen], could the same issue 
occur with old sources?


was (Author: arvid):
Looking at the stack traces: 
- all subtasks are waiting for data. 
- Legacy thread is waiting for data to arrive. 
- Kafka consumer thread is waiting for partition assignments. 

So it seems as if there is just no data after recovery... I guess we need to 
trace why {{hasAssignedPartitions == false}}.


{noformat}
"Kafka Fetcher for Source: TableSourceScan(table=[[default_catalog, 
default_database, upsert_kafka]], fields=[k_user_id, name, k_event_id, user_id, 
payload, timestamp]) (4/4)#0" #1434 daemon prio=5 os_prio=0 
tid=0x7fd8c0003000 nid=0x4681 waiting on condition [0x7fd8187c6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xf8c34308> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
org.apache.flink.streaming.connectors.kafka.internals.ClosableBlockingQueue.getBatchBlocking(ClosableBlockingQueue.java:396)
at 
org.apache.flink.streaming.connectors.kafka.internals.KafkaConsumerThread.run(KafkaConsumerThread.java:240)
{noformat}


> UpsertKafkaTableITCase hangs when collecting results
> 
>
> Key: FLINK-22416
> URL: https://issues.apache.org/jira/browse/FLINK-22416
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Table SQL / Ecosystem
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Qingsheng Ren
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.14.0
>
> Attachments: idea-test.png, threads_report.txt
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17037=logs=c5f0071e-1851-543e-9a45-9ac140befc32=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5=7002
> {code}
> 2021-04-22T11:16:35.6812919Z Apr 22 11:16:35 [ERROR] 
> testSourceSinkWithKeyAndPartialValue[format = 
> csv](org.apache.flink.streaming.connectors.kafka.table.UpsertKafkaTableITCase)
>   Time elapsed: 30.01 s  <<< ERROR!
> 2021-04-22T11:16:35.6814151Z Apr 22 11:16:35 
> org.junit.runners.model.TestTimedOutException: test timed out after 30 seconds
> 2021-04-22T11:16:35.6814781Z Apr 22 11:16:35  at 
> java.lang.Thread.sleep(Native Method)
> 2021-04-22T11:16:35.6815444Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.sleepBeforeRetry(CollectResultFetcher.java:237)
> 2021-04-22T11:16:35.6816250Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:113)
> 2021-04-22T11:16:35.6817033Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
> 2021-04-22T11:16:35.6817719Z Apr 22 11:16:35  at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
> 

[jira] [Assigned] (FLINK-22964) Connector-base exposes dependency to flink-core.

2021-06-29 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22964:
---

Assignee: Arvid Heise

> Connector-base exposes dependency to flink-core.
> 
>
> Key: FLINK-22964
> URL: https://issues.apache.org/jira/browse/FLINK-22964
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: classloading
> Fix For: 1.14.0, 1.12.5, 1.13.2
>
>
> Connectors get shaded into the user jar and as such should contain no 
> unnecessary dependencies on Flink. However, connector-base exposes 
> `flink-core`, which then gets shaded into the user jar by default. Besides 
> 6 MB of extra size, the dependency also causes class-loading issues when 
> `classloader.parent-first-patterns` does not include `o.a.f`.
> The fix is to make `flink-core` a provided dependency in `connector-base`.
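The fix amounts to a dependency-scope change in the connector-base pom. The snippet below is only a sketch of what that entry could look like (the version expression is illustrative, not copied from the actual pom):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-core</artifactId>
  <version>${project.version}</version>
  <!-- provided: supplied by the Flink runtime at execution time,
       so it is no longer shaded into the user jar -->
  <scope>provided</scope>
</dependency>
```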





[jira] [Resolved] (FLINK-23248) SinkWriter is not closed when failing

2021-07-06 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-23248.
-
Resolution: Fixed

> SinkWriter is not closed when failing
> -
>
> Key: FLINK-23248
> URL: https://issues.apache.org/jira/browse/FLINK-23248
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.13.1, 1.12.4
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.6, 1.13.3
>
>
> Currently the SinkWriter is only closed when the operator finishes in 
> `AbstractSinkWriterOperator#close()`, but we must also close the SinkWriter in 
> `AbstractSinkWriterOperator#dispose()` to release possibly acquired resources 
> when failing.
>  
>  
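The fix can be sketched with a plain-Java stand-in for the operator lifecycle (the names below are illustrative, not Flink's actual classes): both the regular `close()` and the failure-path `dispose()` funnel into one idempotent writer shutdown.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative stand-in for the sink operator lifecycle; not Flink's actual class.
class SinkWriterLifecycle {
    interface Writer extends AutoCloseable {} // stand-in for SinkWriter

    private final Writer writer;
    private final AtomicBoolean writerClosed = new AtomicBoolean(false);

    SinkWriterLifecycle(Writer writer) {
        this.writer = writer;
    }

    /** Regular termination path. */
    void close() throws Exception {
        closeWriterOnce();
    }

    /** Failure path: must also release the writer's resources. */
    void dispose() throws Exception {
        closeWriterOnce();
    }

    private void closeWriterOnce() throws Exception {
        // compareAndSet makes the shutdown idempotent: whichever of
        // close()/dispose() runs first releases the writer; the other is a no-op.
        if (writerClosed.compareAndSet(false, true)) {
            writer.close();
        }
    }
}
```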





[jira] [Updated] (FLINK-23248) SinkWriter is not closed when failing

2021-07-06 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-23248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-23248:

Fix Version/s: 1.13.3
   1.12.6

> SinkWriter is not closed when failing
> -
>
> Key: FLINK-23248
> URL: https://issues.apache.org/jira/browse/FLINK-23248
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.13.1, 1.12.4
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.6, 1.13.3
>
>
> Currently the SinkWriter is only closed when the operator finishes in 
> `AbstractSinkWriterOperator#close()`, but we must also close the SinkWriter in 
> `AbstractSinkWriterOperator#dispose()` to release possibly acquired resources 
> when failing.
>  
>  





[jira] [Commented] (FLINK-23248) SinkWriter is not closed when failing

2021-07-06 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375344#comment-17375344
 ] 

Arvid Heise commented on FLINK-23248:
-

Merged into 1.13 as 9d77656381f55d05a3eb74c6c6c6de873de2b7c7, into 1.12 as 
701d20e896112b64d3a2dad9bb7e2e976bc88cb1.

> SinkWriter is not closed when failing
> -
>
> Key: FLINK-23248
> URL: https://issues.apache.org/jira/browse/FLINK-23248
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.13.1, 1.12.4
>Reporter: Fabian Paul
>Assignee: Fabian Paul
>Priority: Critical
>  Labels: pull-request-available
>
> Currently the SinkWriter is only closed when the operator finishes in 
> `AbstractSinkWriterOperator#close()`, but we must also close the SinkWriter in 
> `AbstractSinkWriterOperator#dispose()` to release possibly acquired resources 
> when failing.
>  
>  





[jira] [Commented] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure

2021-04-26 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331824#comment-17331824
 ] 

Arvid Heise commented on FLINK-22019:
-

Probably a duplicate of FLINK-22368.

> UnalignedCheckpointRescaleITCase hangs on azure
> ---
>
> Key: FLINK-22019
> URL: https://issues.apache.org/jira/browse/FLINK-22019
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9347





[jira] [Comment Edited] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure

2021-04-26 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331824#comment-17331824
 ] 

Arvid Heise edited comment on FLINK-22019 at 4/26/21, 6:59 AM:
---

{noformat}
12:04:37,326 [flink-akka.actor.default-dispatcher-5] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: sink 
(16/20) (e233bba6efcad6dbde2753c42d85edb0) switched from RUNNING to CANCELED.
...
14:46:28,861 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 4cbfe7eee09cd5b18122bb5497ab9649 since some tasks of 
job 4cbfe7eee09cd5b18122bb5497ab9649 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.{noformat}


Probably a duplicate of FLINK-22368.


was (Author: arvid):
Probably a duplicate of FLINK-22368.

> UnalignedCheckpointRescaleITCase hangs on azure
> ---
>
> Key: FLINK-22019
> URL: https://issues.apache.org/jira/browse/FLINK-22019
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9347





[jira] [Commented] (FLINK-22136) Device application for unaligned checkpoint test on cluster

2021-04-27 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333024#comment-17333024
 ] 

Arvid Heise commented on FLINK-22136:
-

Merged into master as 
7973bbc6e748a0ca36741bfa574291503ebdb8ed^..7973bbc6e748a0ca36741bfa574291503ebdb8ed.

> Device application for unaligned checkpoint test on cluster
> ---
>
> Key: FLINK-22136
> URL: https://issues.apache.org/jira/browse/FLINK-22136
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Arvid Heise
>Assignee: Anton Kalashnikov
>Priority: Major
>  Labels: pull-request-available
>
> To test unaligned checkpoints, we should use a few different applications 
> that use different features:
> * Mixing forward/rescale channels with keyby or other shuffle operations
> * Unions
> * 2 or n-ary operators
> * Associated state ((keyed) process function)
> * Correctness verifications
> The sinks should not be mocked but rather should be able to induce a fair 
> amount of backpressure into the system. Quite possibly, it would be a good 
> idea to have a way to add more backpressure to the sink by running the 
> respective system on the cluster and be able to add/remove parallel instances.
> Things to check in the application:
> * Inflight data is restored to the correct keygroups -> can be checked with 
> keyed state in a process function
> * Correctness: Completeness (no lost records) + no duplicates
> * Ordering of data for keyed exchanges (we guarantee that records with the 
> same key retain their order across keyed operators)
> * (To detect errors early, we can also use magic headers)
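The completeness and no-duplicates checks above can be sketched independently of Flink. The checker below is an illustrative helper over dense sequence numbers, not part of the actual test application:

```java
import java.util.BitSet;

// Illustrative correctness checker: detects lost and duplicated records
// given that each record carries a dense sequence number starting at 0.
class CorrectnessChecker {
    private final BitSet seen = new BitSet();
    private long duplicates = 0;

    void record(int seqNo) {
        if (seen.get(seqNo)) {
            duplicates++; // the same record arrived twice
        } else {
            seen.set(seqNo);
        }
    }

    /** True iff every sequence number in [0, expectedCount) arrived exactly once. */
    boolean isComplete(int expectedCount) {
        return duplicates == 0 && seen.nextClearBit(0) >= expectedCount;
    }
}
```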





[jira] [Resolved] (FLINK-22136) Device application for unaligned checkpoint test on cluster

2021-04-27 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22136.
-
Fix Version/s: 1.14.0
   Resolution: Fixed

> Device application for unaligned checkpoint test on cluster
> ---
>
> Key: FLINK-22136
> URL: https://issues.apache.org/jira/browse/FLINK-22136
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Arvid Heise
>Assignee: Anton Kalashnikov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>
> To test unaligned checkpoints, we should use a few different applications 
> that use different features:
> * Mixing forward/rescale channels with keyby or other shuffle operations
> * Unions
> * 2 or n-ary operators
> * Associated state ((keyed) process function)
> * Correctness verifications
> The sinks should not be mocked but rather should be able to induce a fair 
> amount of backpressure into the system. Quite possibly, it would be a good 
> idea to have a way to add more backpressure to the sink by running the 
> respective system on the cluster and be able to add/remove parallel instances.
> Things to check in the application:
> * Inflight data is restored to the correct keygroups -> can be checked with 
> keyed state in a process function
> * Correctness: Completeness (no lost records) + no duplicates
> * Ordering of data for keyed exchanges (we guarantee that records with the 
> same key retain their order across keyed operators)
> * (To detect errors early, we can also use magic headers)





[jira] [Commented] (FLINK-17170) Cannot stop streaming job with savepoint which uses kinesis consumer

2021-04-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337132#comment-17337132
 ] 

Arvid Heise commented on FLINK-17170:
-

Hi [~dannycranmer], why do we need this {{fetcher.awaitTermination()}} in 
cancel/close in the first place? Wouldn't it suffice to just rely on the 
await at the end of {{FlinkKinesisConsumer#run}}? As far as I can see, 
{{cancel}} shuts down the fetcher, so it should return gracefully in 
{{run}}. Going further, {{runFetcher}} already invokes {{awaitTermination}} 
on its own, so we could clean this up even further.

> Cannot stop streaming job with savepoint which uses kinesis consumer
> 
>
> Key: FLINK-17170
> URL: https://issues.apache.org/jira/browse/FLINK-17170
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kinesis
>Affects Versions: 1.10.0, 1.11.3, 1.12.2
>Reporter: Vasii Cosmin Radu
>Priority: Critical
>  Labels: stale-critical, usability
> Attachments: Screenshot 2020-04-15 at 18.16.26.png, threaddump_tm1
>
>
> I am encountering a very strange situation where I can't stop a streaming 
> job with a savepoint.
> The job reads from Kinesis and sinks to S3 - a very simple job, no mapping 
> function, no watermarks, just source->sink. 
> The source is the flink-kinesis-consumer, the sink is a StreamingFileSink. 
> Everything works fine, except stopping the job with savepoints.
> The behaviour happens only when multiple task managers are involved, with 
> sub-tasks of the job spread across multiple task manager instances. When a 
> single task manager has all the sub-tasks, this issue never occurred.
> Using the latest Flink 1.10.0 version, deployment done in HA mode (2 job 
> managers), in EC2, with savepoints and checkpoints written to S3.
> When trying to stop, the savepoint is created correctly and appears on S3, 
> but not all sub-tasks are stopped. Some of them finish, but some just 
> remain hung. Sometimes, on the same task manager, some of the sub-tasks are 
> finished and some aren't.
> The logs don't show any errors. For the ones that succeed, the standard 
> messages appear, with "Source: <> switched from RUNNING to FINISHED".
> For the hung sub-tasks, the last message is 
> "org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher - 
> Shutting down the shard consumer threads of subtask 0 ..." and that's it.
>  
> I tried using the cli (flink stop )
> Timeout Message:
> {code:java}
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25Suspending job 
> "cf43cecd9339e8f02a12333e52966a25" with a savepoint. 
>  The program 
> finished with the following exception: org.apache.flink.util.FlinkException: 
> Could not stop with a savepoint job "cf43cecd9339e8f02a12333e52966a25". at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:462) 
> at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:454) at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:907) 
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) 
> at java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)Caused 
> by: java.util.concurrent.TimeoutException at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:460) 
> ... 9 more{code}
>  
> Using the monitoring API, when querying by the savepoint trigger ID I keep 
> getting the response that the status is still "IN_PROGRESS".
>  
> When performing a cancel instead of a stop, it works. But cancel is deprecated, 
> so I am a bit concerned that this might also fail; maybe I was just lucky.
>  
> I attached a screenshot of what the UI shows when this happens.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17170) Cannot stop streaming job with savepoint which uses kinesis consumer

2021-05-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-17170:
---

Assignee: Arvid Heise

> Cannot stop streaming job with savepoint which uses kinesis consumer
> 
>
> Key: FLINK-17170
> URL: https://issues.apache.org/jira/browse/FLINK-17170
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kinesis
>Affects Versions: 1.10.0, 1.11.3, 1.12.2
>Reporter: Vasii Cosmin Radu
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: stale-critical, usability
> Attachments: Screenshot 2020-04-15 at 18.16.26.png, threaddump_tm1
>
>
> I am encountering a very strange situation where I can't stop a streaming job 
> with a savepoint.
> The job reads from kinesis and sinks to S3, very simple job, no mapping 
> function, no watermarks, just source->sink. 
> Source is using flink-kinesis-consumer, sink is using StreamingFileSink. 
> Everything works fine, except stopping the job with savepoints.
> The behaviour happens only when multiple task managers are involved, with 
> sub-tasks of the job spread across multiple task manager instances. When a 
> single task manager has all the sub-tasks, this issue never occurred.
> Using latest Flink 1.10.0 version, deployment done in HA mode (2 job 
> managers), in EC2, savepoints and checkpoints written on S3.
> When trying to stop, the savepoint is created correctly and appears on S3, 
> but not all sub-tasks are stopped. Some of them finished, but some just 
> remain hung. Sometimes, on the same task manager, some of the sub-tasks are 
> finished and some aren't.
> The logs don't show any errors. For the ones that succeed, the standard 
> messages appear, with "Source: <> switched from RUNNING to FINISHED".
> For the hung sub-tasks, the last message is 
> "org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher - 
> Shutting down the shard consumer threads of subtask 0 ..." and that's it.
>  
> I tried using the CLI (flink stop )
> Timeout Message:
> {code:java}
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25
> Suspending job "cf43cecd9339e8f02a12333e52966a25" with a savepoint. 
>  The program 
> finished with the following exception: org.apache.flink.util.FlinkException: 
> Could not stop with a savepoint job "cf43cecd9339e8f02a12333e52966a25". at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:462) 
> at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:454) at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:907) 
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) 
> at java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)Caused 
> by: java.util.concurrent.TimeoutException at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:460) 
> ... 9 more{code}
>  
> Using the monitoring API, when querying by the savepoint trigger ID I keep 
> getting the response that the status is still "IN_PROGRESS".
>  
> When performing a cancel instead of a stop, it works. But cancel is deprecated, 
> so I am a bit concerned that this might also fail; maybe I was just lucky.
>  
> I attached a screenshot of what the UI shows when this happens.
>  
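Editor's note: the savepoint status the reporter polled can be checked against Flink's REST API. A minimal sketch, assuming a JobManager on localhost:8081; the trigger ID below is a placeholder, not a value from this report:

```python
# Hedged sketch: build the URL for Flink's savepoint-status REST endpoint.
# The path /jobs/<jobid>/savepoints/<triggerid> follows the Flink REST API;
# host, port, and the trigger ID are placeholder assumptions.
base = "http://localhost:8081"
job_id = "cf43cecd9339e8f02a12333e52966a25"
trigger_id = "example-trigger-id"  # hypothetical; returned by the stop request

url = f"{base}/jobs/{job_id}/savepoints/{trigger_id}"
print(url)

# A GET on this URL returns a JSON body whose status id stays "IN_PROGRESS"
# until the savepoint completes; the reporter observed it never leaving that
# state while sub-tasks were hung.
```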



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-22282) Move creation of SplitEnumerator to the SourceCoordinator thread

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22282:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Move creation of SplitEnumerator to the SourceCoordinator thread
> 
>
> Key: FLINK-22282
> URL: https://issues.apache.org/jira/browse/FLINK-22282
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Common
>Affects Versions: 1.12.2
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 1.13.0, 1.12.4
>
>
> Currently the SplitEnumerator is created in the JM main thread. If the 
> SplitEnumerator instantiation takes long, the job execution will time out. The 
> fix is to move the SplitEnumerator creation to the coordinator thread.
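Editor's note: the fix described above — performing the expensive construction on the coordinator thread instead of the JM main thread — can be illustrated with a minimal, language-agnostic sketch. This is not Flink's actual code; all names are invented for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A single-threaded executor stands in for the SourceCoordinator thread.
coordinator_executor = ThreadPoolExecutor(max_workers=1)

def create_split_enumerator():
    """Stands in for a slow SplitEnumerator instantiation."""
    time.sleep(0.05)
    return {"state": "ready"}

# Instead of constructing the enumerator inline (which would block the "main"
# thread past its timeout), submit the construction and return immediately.
future = coordinator_executor.submit(create_split_enumerator)

# The coordinator later consumes the result on its own thread.
enumerator = future.result()
print(enumerator["state"])
coordinator_executor.shutdown()
```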



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19743) Add Source metrics definitions

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-19743:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Add Source metrics definitions
> --
>
> Key: FLINK-19743
> URL: https://issues.apache.org/jira/browse/FLINK-19743
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Common
>Affects Versions: 1.11.2
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>Priority: Major
>  Labels: pull-request-available, stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> Add the metrics defined in 
> [FLIP-33|https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics]
>  to {{OperatorMetricsGroup}} and {{SourceReaderContext}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20553) Add end-to-end test case for new Kafka source

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20553:

Fix Version/s: (was: 1.12.3)
   1.12.4
   1.13.1

> Add end-to-end test case for new Kafka source
> -
>
> Key: FLINK-20553
> URL: https://issues.apache.org/jira/browse/FLINK-20553
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.12.0
>Reporter: Qingsheng Ren
>Priority: Major
> Fix For: 1.13.1, 1.12.4
>
>
> The new Kafka source needs an E2E test case to be run periodically on CI. 
> Currently we have one for the old Kafka source under 
> {{flink-end-to-end-tests}}; we can modify it and add a configuration 
> for running the case with the new Kafka source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20576) Flink Temporal Join Hive Dim Error

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20576:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Flink Temporal Join Hive Dim Error
> --
>
> Key: FLINK-20576
> URL: https://issues.apache.org/jira/browse/FLINK-20576
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hive, Table SQL / Ecosystem
>Affects Versions: 1.12.0
>Reporter: HideOnBush
>Assignee: Leonard Xu
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
>  
> KAFKA DDL
> {code:java}
> CREATE TABLE hive_catalog.flink_db_test.kfk_master_test (
> master Row String, action int, orderStatus int, orderKey String, actionTime bigint, 
> areaName String, paidAmount double, foodAmount double, startTime String, 
> person double, orderSubType int, checkoutTime String>,
> proctime as PROCTIME()
> ) WITH (properties ..){code}
>  
> FLINK client query sql
> {noformat}
> SELECT * FROM hive_catalog.flink_db_test.kfk_master_test AS kafk_tbl 
>   JOIN hive_catalog.gauss.dim_extend_shop_info /*+ 
> OPTIONS('streaming-source.enable'='true', 
> 'streaming-source.partition.include' = 'latest', 
>'streaming-source.monitor-interval' = '12 
> h','streaming-source.partition-order' = 'partition-name') */ FOR SYSTEM_TIME 
> AS OF kafk_tbl.proctime AS dim 
>ON kafk_tbl.groupID = dim.group_id where kafk_tbl.groupID is not 
> null;{noformat}
> When I execute the above statement, the following stack traces are returned:
> Caused by: java.lang.NullPointerException: buffer at 
> org.apache.flink.core.memory.MemorySegment.(MemorySegment.java:161) 
> ~[flink-dist_2.11-1.12.0.jar:1.12.0] at 
> org.apache.flink.core.memory.HybridMemorySegment.(HybridMemorySegment.java:86)
>  ~[flink-dist_2.11-1.12.0.jar:1.12.0] at 
> org.apache.flink.core.memory.MemorySegmentFactory.wrap(MemorySegmentFactory.java:55)
>  ~[flink-dist_2.11-1.12.0.jar:1.12.0] at 
> org.apache.flink.table.data.binary.BinaryStringData.fromBytes(BinaryStringData.java:98)
>  ~[flink-table_2.11-1.12.0.jar:1.12.0]
>  
> Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to load table 
> into cache after 3 retries at 
> org.apache.flink.table.filesystem.FileSystemLookupFunction.checkCacheReload(FileSystemLookupFunction.java:143)
>  ~[flink-table-blink_2.11-1.12.0.jar:1.12.0] at 
> org.apache.flink.table.filesystem.FileSystemLookupFunction.eval(FileSystemLookupFunction.java:103)
>  ~[flink-table-blink_2.11-1.12.0.jar:1.12.0] at 
> LookupFunction$1577.flatMap(Unknown Source) ~[?:?] at 
> org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.processElement(LookupJoinRunner.java:82)
>  ~[flink-table-blink_2.11-1.12.0.jar:1.12.0] at 
> org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.processElement(LookupJoinRunner.java:36)
>  ~[flink-table-blink_2.11-1.12.0.jar:1.12.0] at 
> org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
>  ~[flink-dist_2.11-1.12.0.jar:1.12.0]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20457) Fix the handling of timestamp in DataStream.from_collection

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20457:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Fix the handling of timestamp in DataStream.from_collection
> ---
>
> Key: FLINK-20457
> URL: https://issues.apache.org/jira/browse/FLINK-20457
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python
>Affects Versions: 1.12.0
>Reporter: Dian Fu
>Assignee: zhengyu.lou
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> Currently, DataStream.from_collection first converts date/time/datetime 
> objects to int on the Python side and then constructs the corresponding 
> Date/Time/Timestamp objects on the Java side. This loses the timezone 
> information. Pickle can handle date/time/datetime properly, so the 
> conversion could be avoided.
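Editor's note: the timezone loss and the pickle-based alternative can be demonstrated with plain Python, independent of PyFlink:

```python
import pickle
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=8))
dt = datetime(2021, 1, 1, 12, 0, tzinfo=tz)

# The int-based path: converting to an epoch value and back yields a naive
# datetime -- the original timezone is gone.
epoch_s = dt.timestamp()
restored = datetime.fromtimestamp(epoch_s)  # naive: tzinfo is None
print(restored.tzinfo)  # None

# The pickle path round-trips the tzinfo intact.
roundtripped = pickle.loads(pickle.dumps(dt))
print(roundtripped == dt and roundtripped.tzinfo == tz)  # True
```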



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20655) Add E2E tests to the new KafkaSource based on FLIP-27.

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20655:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Add E2E tests to the new KafkaSource based on FLIP-27.
> --
>
> Key: FLINK-20655
> URL: https://issues.apache.org/jira/browse/FLINK-20655
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Kafka
>Affects Versions: 1.12.0
>Reporter: Jiangjie Qin
>Assignee: Qingsheng Ren
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> Add the following e2e tests for KafkaSource based on FLIP-27.
>  # A basic read test which reads from a Kafka topic.
>  # Stop the job with savepoint and resume.
>  # Kill a TM and verify the failover works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19925) Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-19925:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> --
>
> Key: FLINK-19925
> URL: https://issues.apache.org/jira/browse/FLINK-19925
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.12.0
>Reporter: godfrey he
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0, 1.12.4
>
>
> Errors$NativeIoException sometimes occurs when we run TPC-DS based on 
> master; the full exception stack is:
> {code:java}
> Caused by: 
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> readAddress(..) failed: Connection reset by peer (connection to 'xxx')
>   at 
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:173)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at java.lang.Thread.run(Thread.java:834) ~[?:1.8.0_102]
> Caused by: 
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20656) Update docs for new KafkaSource connector.

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20656:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Update docs for new KafkaSource connector.
> --
>
> Key: FLINK-20656
> URL: https://issues.apache.org/jira/browse/FLINK-20656
> Project: Flink
>  Issue Type: Task
>  Components: Connectors / Kafka
>Affects Versions: 1.12.0
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> We need to add docs for the KafkaSource connector. Namely the following page:
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kafka.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20884) NullPointerException in create statements with computed columns which the subQuery is SqlNodeList

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20884:

Fix Version/s: (was: 1.12.3)
   1.12.4

> NullPointerException in create statements with computed columns which the 
> subQuery is SqlNodeList 
> --
>
> Key: FLINK-20884
> URL: https://issues.apache.org/jira/browse/FLINK-20884
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.12.0
>Reporter: allan.hou
>Assignee: Jark Wu
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> Create statement:
> {code:java}
> CREATE TABLE hive.flink_hive.my_kafka_student (
> id INT,
> name VARCHAR,
> score INT,
> test as case when score in (1,2,3) then 1 else 0 end
> ) WITH (
> 'connector' = 'kafka',
> ...
> );
> {code}
> Then drop the table or run the application. It always fails with:
> {code:java}
> Caused by: java.lang.NullPointerException
> at java.util.Objects.requireNonNull(Objects.java:203) ~[?:1.8.0_262]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.convertExpression(SqlToRelConverter.java:4914)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.StandardConvertletTable.convertExpressionList(StandardConvertletTable.java:839)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.StandardConvertletTable.convertCall(StandardConvertletTable.java:815)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.StandardConvertletTable.convertCall(StandardConvertletTable.java:802)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_262]
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_262]
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_262]
> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
> at 
> org.apache.calcite.sql2rel.ReflectiveConvertletTable.lambda$registerNodeTypeMethod$0(ReflectiveConvertletTable.java:83)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlNodeToRexConverterImpl.convertCall(SqlNodeToRexConverterImpl.java:62)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.visit(SqlToRelConverter.java:5098)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.visit(SqlToRelConverter.java:4374)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at org.apache.calcite.sql.SqlCall.accept(SqlCall.java:139) 
> ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.convertExpression(SqlToRelConverter.java:4961)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.StandardConvertletTable.convertCase(StandardConvertletTable.java:387)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_262]
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_262]
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_262]
> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
> at 
> org.apache.calcite.sql2rel.ReflectiveConvertletTable.lambda$registerNodeTypeMethod$0(ReflectiveConvertletTable.java:83)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlNodeToRexConverterImpl.convertCall(SqlNodeToRexConverterImpl.java:62)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.visit(SqlToRelConverter.java:5098)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.visit(SqlToRelConverter.java:4374)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at org.apache.calcite.sql.SqlCall.accept(SqlCall.java:139) 
> ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> org.apache.calcite.sql2rel.SqlToRelConverter$Blackboard.convertExpression(SqlToRelConverter.java:4961)
>  ~[flink-table_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at 
> 

[jira] [Updated] (FLINK-20849) Improve JavaDoc and logging of new KafkaSource

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20849:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Improve JavaDoc and logging of new KafkaSource
> --
>
> Key: FLINK-20849
> URL: https://issues.apache.org/jira/browse/FLINK-20849
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Reporter: Qingsheng Ren
>Priority: Major
> Fix For: 1.13.0, 1.12.4
>
>
> Some JavaDoc and logging messages of the new KafkaSource should be more 
> descriptive, to provide more information to users. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-21029) Failure of shutdown lead to restart of (connected) pipeline

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21029:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Failure of shutdown lead to restart of (connected) pipeline
> ---
>
> Key: FLINK-21029
> URL: https://issues.apache.org/jira/browse/FLINK-21029
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.11.2
>Reporter: Theo Diefenthal
>Priority: Major
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> This bug happened in combination with 
> https://issues.apache.org/jira/browse/FLINK-21028 .
> When I wanted to stop a job via the CLI ("flink stop ...") with a disjoint job 
> graph (independent pipelines in the graph), one task wasn't able to stop 
> properly (reported in the mentioned bug). This led to restarting the job. I 
> think this is wrong behavior in general and a separate bug:
> If any crash occurs while trying to stop a job, Flink shouldn't try to restart 
> but should continue stopping the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-21153) yarn-per-job deployment target ignores yarn options

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21153:

Fix Version/s: (was: 1.12.3)
   1.12.4

> yarn-per-job deployment target ignores yarn options
> ---
>
> Key: FLINK-21153
> URL: https://issues.apache.org/jira/browse/FLINK-21153
> Project: Flink
>  Issue Type: Bug
>  Components: Command Line Client, Deployment / YARN
>Affects Versions: 1.12.1, 1.13.0
>Reporter: Till Rohrmann
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Labels: stale-assigned, usability
> Fix For: 1.13.0, 1.12.4
>
>
> While looking into the problem reported in FLINK-6949, I stumbled across an 
> odd behaviour of Flink. I tried to deploy a Flink cluster on Yarn and ship 
> some files to the cluster. Only the first command successfully shipped the 
> additional files to the cluster:
> 1) {{bin/flink run -p 1 --yarnship ../flink-test-job/cluster -m yarn-cluster 
> ../flink-test-job/target/flink-test-job-1.0-SNAPSHOT.jar}}
> 2) {{bin/flink run -p 1 --yarnship ../flink-test-job/cluster -t yarn-per-job 
> ../flink-test-job/target/flink-test-job-1.0-SNAPSHOT.jar}} 
> The problem seems to be that the second command does not activate the 
> {{FlinkYarnSessionCli}} but uses the {{GenericCLI}}.
> [~kkl0u], [~aljoscha], [~tison], what is the intended behaviour in this case? 
> I always thought that {{-m yarn-cluster}} and {{-t yarn-per-job}} would be 
> equivalent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20777) Default value of property "partition.discovery.interval.ms" is not as documented in new Kafka Source

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20777:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Default value of property "partition.discovery.interval.ms" is not as 
> documented in new Kafka Source
> 
>
> Key: FLINK-20777
> URL: https://issues.apache.org/jira/browse/FLINK-20777
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Reporter: Qingsheng Ren
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.4
>
>
> The default value of the property "partition.discovery.interval.ms" is 
> documented as 30 seconds in {{KafkaSourceOptions}}, but it will be set to -1 
> in {{KafkaSourceBuilder}} if the user doesn't pass in this property explicitly. 
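Editor's note: the bug is a mismatch between a documented default and the value the builder actually falls back to. A hedged sketch of that pattern — the property name and values mirror the report, but the function itself is invented for illustration, not Flink's code:

```python
# Documented default for "partition.discovery.interval.ms" per the report.
DOCUMENTED_DEFAULT_MS = 30_000

def effective_discovery_interval_ms(user_props: dict) -> int:
    # As reported, the builder falls back to -1 (discovery disabled) rather
    # than the documented 30-second default when the property is absent.
    return int(user_props.get("partition.discovery.interval.ms", -1))

print(effective_discovery_interval_ms({}))  # -1, not DOCUMENTED_DEFAULT_MS
print(effective_discovery_interval_ms({"partition.discovery.interval.ms": "30000"}))  # 30000
```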



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-20714) Hive delegation token is not obtained when using `kinit` to submit Yarn per-job

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20714:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Hive delegation token is not obtained when using `kinit` to submit Yarn 
> per-job 
> 
>
> Key: FLINK-20714
> URL: https://issues.apache.org/jira/browse/FLINK-20714
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.11.2, 1.11.3, 1.12.0
> Environment: Flink 1.11.2 on Yarn
>Reporter: jackwangcs
>Priority: Major
>  Labels: keberos
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> Hive delegation token is not obtained when using `kinit` to submit Yarn 
> per-job. 
> In YarnClusterDescriptor, it calls org.apache.flink.yarn.Utils#setTokensFor 
> to obtain tokens for the job. But setTokensFor currently only obtains HDFS and 
> HBase tokens; since Hive integration is supported, the Hive delegation token 
> should be obtained as well. 
> Otherwise, it will throw the following error when it tries to connect to the 
> Hive metastore:
> {code:java}
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316) 
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>  at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>  at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>  at java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
>  at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:464)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:244)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:187)
>  at 
> org.apache.flink.table.catalog.hive.client.HiveShimV100.getHiveMetastoreClient(HiveShimV100.java:97)
>  at 
> org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.createMetastoreClient(HiveMetastoreClientWrapper.java:240)
>  at 
> org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.(HiveMetastoreClientWrapper.java:71)
>  at 
> org.apache.flink.table.catalog.hive.client.HiveMetastoreClientFactory.create(HiveMetastoreClientFactory.java:35)
>  at 
> org.apache.flink.connectors.hive.HiveTableMetaStoreFactory$HiveTableMetaStore.(HiveTableMetaStoreFactory.java:74)
>  at 
> org.apache.flink.connectors.hive.HiveTableMetaStoreFactory$HiveTableMetaStore.(HiveTableMetaStoreFactory.java:68)
>  at 
> org.apache.flink.connectors.hive.HiveTableMetaStoreFactory.createTableMetaStore(HiveTableMetaStoreFactory.java:65)
>  at 
> org.apache.flink.connectors.hive.HiveTableMetaStoreFactory.createTableMetaStore(HiveTableMetaStoreFactory.java:43)
>  at 
> org.apache.flink.table.filesystem.PartitionLoader.(PartitionLoader.java:61)
>  at 
> org.apache.flink.table.filesystem.FileSystemCommitter.commitUpToCheckpoint(FileSystemCommitter.java:97)
>  at 
> org.apache.flink.table.filesystem.FileSystemOutputFormat.finalizeGlobal(FileSystemOutputFormat.java:95)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-22124) The job finished without any exception if error was thrown during state access

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22124:

Parent: (was: FLINK-22123)
Issue Type: Bug  (was: Sub-task)

> The job finished without any exception if error was thrown during state access
> --
>
> Key: FLINK-22124
> URL: https://issues.apache.org/jira/browse/FLINK-22124
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python
>Affects Versions: 1.13.0
>Reporter: Dian Fu
>Assignee: Huang Xingbo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0, 1.12.3
>
>
> For the following job:
> {code}
> import logging
>
> from pyflink.common import WatermarkStrategy, Row
> from pyflink.common.serialization import Encoder
> from pyflink.common.typeinfo import Types
> from pyflink.datastream import StreamExecutionEnvironment
> from pyflink.datastream.connectors import FileSink, OutputFileConfig, NumberSequenceSource
> from pyflink.datastream.execution_mode import RuntimeExecutionMode
> from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
> from pyflink.datastream.state import MapStateDescriptor
>
> env = StreamExecutionEnvironment.get_execution_environment()
> env.set_parallelism(2)
> env.set_runtime_mode(RuntimeExecutionMode.BATCH)
>
> seq_num_source = NumberSequenceSource(1, 1000)
>
> file_sink = FileSink \
>     .for_row_format('/Users/dianfu/code/src/apache/playgrounds/examples/output/data_stream_batch_state',
>                     Encoder.simple_string_encoder()) \
>     .with_output_file_config(OutputFileConfig.builder().with_part_prefix('pre').with_part_suffix('suf').build()) \
>     .build()
>
> ds = env.from_source(
>     source=seq_num_source,
>     watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
>     source_name='file_source',
>     type_info=Types.LONG())
>
> class MyKeyedProcessFunction(KeyedProcessFunction):
>
>     def __init__(self):
>         self.state = None
>
>     def open(self, runtime_context: RuntimeContext):
>         logging.info("open")
>         state_desc = MapStateDescriptor('map', Types.LONG(), Types.LONG())
>         self.state = runtime_context.get_map_state(state_desc)
>
>     def process_element(self, value, ctx: 'KeyedProcessFunction.Context'):
>         existing = self.state.get(value[0])
>         if existing is None:
>             result = value[1]
>             self.state.put(value[0], result)
>         elif existing <= 10:
>             result = value[1] + existing
>             self.state.put(value[0], result)
>         else:
>             result = existing
>         yield result
>
> ds.map(lambda a: Row(a % 4, 1), output_type=Types.ROW([Types.LONG(), Types.LONG()])) \
>     .key_by(lambda a: a[0]) \
>     .process(MyKeyedProcessFunction(), Types.LONG()) \
>     .sink_to(file_sink)
>
> env.execute('data_stream_batch_state')
> {code}
> Because `self.state.get(value[0])` raises a KeyError when value[0] doesn't
> exist in the state, the job finishes without any error message. This issue
> should be addressed: we should make sure the error message appears in the
> log file to help users figure out what happened.
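The failure mode described above can be reproduced without a cluster. Below is a minimal pure-Python sketch: `FakeMapState` is a hypothetical stand-in for PyFlink's MapState (only the name and the raising `get` behavior mirror the real API), and `accumulate` shows a guarded variant of the state access in `process_element` that initializes the entry instead of letting the lookup raise on the first element for a key.

```python
class FakeMapState:
    """Hypothetical stand-in for pyflink MapState: get() raises KeyError
    for absent keys, which is the behavior that silently killed the job."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data[key]  # raises KeyError when key is absent

    def put(self, key, value):
        self._data[key] = value

    def contains(self, key):
        return key in self._data


def accumulate(state, key, value):
    # Guard the lookup so an absent key initializes the entry
    # instead of raising KeyError.
    existing = state.get(key) if state.contains(key) else None
    if existing is None:
        result = value
    elif existing <= 10:
        result = value + existing
    else:
        return existing
    state.put(key, result)
    return result


state = FakeMapState()
assert accumulate(state, 0, 1) == 1  # first element for key 0
assert accumulate(state, 0, 1) == 2  # accumulated with existing value
```

Whatever the eventual fix in PyFlink, the point is the same: the KeyError must either be avoided by a guard like this or be propagated loudly to the log, never swallowed.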





[jira] [Updated] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22014:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Flink JobManager failed to restart after failure in kubernetes HA setup
> ---
>
> Key: FLINK-22014
> URL: https://issues.apache.org/jira/browse/FLINK-22014
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.11.3, 1.12.2, 1.13.0
>Reporter: Mikalai Lushchytski
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
> Attachments: flink-logs.txt.zip, image-2021-04-19-11-17-58-215.png, 
> scalyr-logs (1).txt
>
>
> After the JobManager pod failed and a new one started, the new pod was not
> able to recover jobs due to the absence of recovery data in storage: the
> config map pointed at a non-existing file.
>   
>  Due to this, the JobManager pod entered the `CrashLoopBackOff` state and
> could not recover; each attempt failed with the same error, so the whole
> cluster became unrecoverable and stopped operating.
>   
>  I had to manually delete the config map and start the jobs again without
> the savepoint.
>   
>  When I tried to reproduce the failure by manually deleting the JobManager
> pod, the new pod recovered well every time and the issue was no longer
> reproducible artificially.
>   
>  Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
> 2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
> 2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
> 2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
> 2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
>at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
>at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
>at java.lang.Thread.run(Unknown Source) [?:?]
> Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
> ... 4 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>at 
> 

[jira] [Updated] (FLINK-20654) Unaligned checkpoint recovery may lead to corrupted data stream

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20654:

Release Note:   (was: Using unaligned checkpoints in Flink 1.12.0 combined 
with two/multiple inputs tasks or with union inputs for single input tasks can 
result in corrupted state. 

This can happen if a new checkpoint is triggered before recovery is fully 
completed. For state to be corrupted, a task with two or more input gates must 
receive a checkpoint barrier exactly at the same time this task finishes 
recovering spilled in-flight data. In such a case this new checkpoint can 
succeed, with corrupted/missing in-flight data, which will result in various 
deserialisation/corrupted data stream errors when someone attempts to recover 
from such corrupted checkpoint.

Using unaligned checkpoints in Flink 1.12.1, a corruption may occur in the 
checkpoint following a declined checkpoint.

A late barrier of a canceled checkpoint may lead to buffers not being written 
into the successive checkpoint, such that recovery is not possible. This 
happens, when the next checkpoint barrier arrives at a given operator before 
all previous barriers arrived, which can only happen after cancellation in 
unaligned checkpoints.  
)

> Unaligned checkpoint recovery may lead to corrupted data stream
> ---
>
> Key: FLINK-20654
> URL: https://issues.apache.org/jira/browse/FLINK-20654
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0, 1.12.1
>Reporter: Arvid Heise
>Assignee: Piotr Nowojski
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0, 1.12.3
>
>
> Fix of FLINK-20433 shows potential corruption after recovery for all 
> variations of UnalignedCheckpointITCase.
> To reproduce, run UCITCase a couple hundred times. For me, the issue showed
> up in:
> - execute [Parallel union, p = 5]
> - execute [Parallel union, p = 10]
> - execute [Parallel cogroup, p = 5]
> - execute [parallel pipeline with remote channels, p = 5]
> with decreasing frequency.
> The issue manifests as one of the following issues:
> - stream corrupted exception
> - EOF exception
> - assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
> - (for union) ArithmeticException overflow (because a number that should be
> in [0;10] has been mis-deserialized)





[jira] [Updated] (FLINK-21693) TestStreamEnvironment does not implement executeAsync

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21693:

Fix Version/s: (was: 1.12.3)
   1.12.4

> TestStreamEnvironment does not implement executeAsync
> -
>
> Key: FLINK-21693
> URL: https://issues.apache.org/jira/browse/FLINK-21693
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.11.3, 1.12.2, 1.13.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> When implementing FLINK-14392 we forgot to implement 
> {{TestStreamEnvironment.executeAsync}}. As a consequence, when using 
> {{TestStreamEnvironment}} and calling {{executeAsync}} the system will always 
> start a new local {{MiniCluster}}. The proper behaviour would be to use the 
> {{MiniCluster}} specified in the {{TestStreamEnvironment}}.





[jira] [Updated] (FLINK-20687) Missing 'yarn-application' target in CLI help message

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20687:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Missing 'yarn-application' target in CLI help message
> -
>
> Key: FLINK-20687
> URL: https://issues.apache.org/jira/browse/FLINK-20687
> Project: Flink
>  Issue Type: Improvement
>  Components: Command Line Client
>Affects Versions: 1.12.0
>Reporter: Ruguo Yu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.13.0, 1.12.4
>
> Attachments: image-2020-12-20-21-48-18-391.png, 
> image-2020-12-20-22-02-01-312.png
>
>
> The 'yarn-application' target is missing from the CLI help message when I
> enter the command 'flink run-application -h', as follows:
> !image-2020-12-20-21-48-18-391.png|width=516,height=372!
> The target name is obtained through SPI, and I checked that the SPI
> META-INF/services entry is correct.
>  
> Next, I put flink-shaded-hadoop-*-.jar into the Flink lib directory (or set
> HADOOP_CLASSPATH), and the help message then shows 'yarn-application', as follows:
> !image-2020-12-20-22-02-01-312.png|width=808,height=507!
>  However, I think it is reasonable to show 'yarn-application' without any
> such action.





[jira] [Updated] (FLINK-21666) Consistent documentation and recommendations regarding Stop vs Cancel with savepoint

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21666:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Consistent documentation and recommendations regarding Stop vs Cancel with 
> savepoint
> 
>
> Key: FLINK-21666
> URL: https://issues.apache.org/jira/browse/FLINK-21666
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission, Documentation, Runtime / REST
>Affects Versions: 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.12.0, 1.12.1, 1.12.2
>Reporter: Thomas Eckestad
>Priority: Major
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> *Background*
> Cancel with savepoint is marked as deprecated in the cli-documentation. 
> Cancel with savepoint is not marked as deprecated in the REST-API 
> documentation.
> The documentation explaining why stop should be preferred over cancel with 
> savepoint could be improved for the CLI case and is non-existent for the REST
> case.
> *Actions needed*
>  * Make it clear what the semantics of cancel with savepoint and stop with 
> savepoint are.
>  ** I think that the documentation in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/deployment/cli.html] 
> for Stop is a good start but could be further improved. Especially to mention 
> that no new data will be processed by the sources after the stop is initiated.
>  ** For Cancel, the documentation on the same page as above is really sparse. 
> In exactly what way is a job terminated ungracefully when using Cancel with 
> savepoint? 
>  * If there are no valid use cases when anyone could benefit from using 
> cancel with savepoint, document/deprecate to make it less likely that cancel 
> with savepoint will be used by mistake instead of stop.





[jira] [Updated] (FLINK-21445) Application mode does not set the configuration when building PackagedProgram

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21445:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Application mode does not set the configuration when building PackagedProgram
> -
>
> Key: FLINK-21445
> URL: https://issues.apache.org/jira/browse/FLINK-21445
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Deployment / Scripts, 
> Deployment / YARN
>Affects Versions: 1.11.3, 1.12.1, 1.13.0
>Reporter: Yang Wang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> Application mode uses {{ClassPathPackagedProgramRetriever}} to create the
> {{PackagedProgram}}. However, it does not set the configuration, which causes
> some client configurations, for example {{classloader.resolve-order}}, not to
> take effect.
> I think we simply forgot to do this, since we have done a similar thing in
> {{CliFrontend}}.





[jira] [Updated] (FLINK-21589) Document Table/SQL API limitations regarding upgrades with savepoints

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21589:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Document Table/SQL API limitations regarding upgrades with savepoints
> -
>
> Key: FLINK-21589
> URL: https://issues.apache.org/jira/browse/FLINK-21589
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Table SQL / API
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> We don't appear to mention anywhere that users cannot upgrade Table/SQL API 
> applications with savepoints (or it is well hidden).
> This should be mentioned in the API docs (presumably under streaming 
> concepts) and the application upgrading documentation.





[jira] [Updated] (FLINK-21116) DefaultDispatcherRunnerITCase hangs on azure

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21116:

Fix Version/s: (was: 1.12.3)
   1.12.4

> DefaultDispatcherRunnerITCase hangs on azure
> 
>
> Key: FLINK-21116
> URL: https://issues.apache.org/jira/browse/FLINK-21116
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.4
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12430=logs=3b6ec2fd-a816-5e75-c775-06fb87cb6670=b33fdd4f-3de5-542e-2624-5d53167bb672
> {code}
> "jobmanager-future-thread-3" #38 daemon prio=5 os_prio=0 tid=0x7f3470004000 nid=0x35f1 waiting on condition [0x7f349f8fd000]
>    java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8665d530> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
>   at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> "jobmanager-future-thread-2" #37 daemon prio=5 os_prio=0 tid=0x7f3470002000 nid=0x301c waiting on condition [0x7f349f9fe000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8665d530> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
>   at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> "jobmanager-future-thread-1" #36 daemon prio=5 os_prio=0 tid=0x7f347806c000 nid=0x71cb waiting on condition [0x7f349edf6000]
>    java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8665d530> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
>   at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> "pool-2-thread-1" #34 prio=5 os_prio=0 tid=0x7f35feab3800 nid=0x71c5 waiting on condition [0x7f349eff8000]
>   at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}





[jira] [Updated] (FLINK-22040) Maven: Entry has not been leased from this pool / fix for e2e tests

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22040:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Maven: Entry has not been leased from this pool / fix for e2e tests
> ---
>
> Key: FLINK-22040
> URL: https://issues.apache.org/jira/browse/FLINK-22040
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>






[jira] [Updated] (FLINK-21876) Handle it properly when the returned value of Python UDF doesn't match the defined result type

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21876:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Handle it properly when the returned value of Python UDF doesn't match the 
> defined result type
> --
>
> Key: FLINK-21876
> URL: https://issues.apache.org/jira/browse/FLINK-21876
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python
>Affects Versions: 1.10.0, 1.11.0, 1.12.0
>Reporter: Dian Fu
>Assignee: Dian Fu
>Priority: Major
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> Currently, when the returned value of a Python UDF doesn't match its defined
> result type, the following exception is thrown during execution:
> {code:java}
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at org.apache.flink.table.runtime.typeutils.StringDataSerializer.deserializeInternal(StringDataSerializer.java:88)
> at org.apache.flink.table.runtime.typeutils.StringDataSerializer.deserialize(StringDataSerializer.java:82)
> at org.apache.flink.table.runtime.typeutils.StringDataSerializer.deserialize(StringDataSerializer.java:34)
> at org.apache.flink.table.runtime.typeutils.serializers.python.MapDataSerializer.deserializeInternal(MapDataSerializer.java:129)
> at org.apache.flink.table.runtime.typeutils.serializers.python.MapDataSerializer.deserialize(MapDataSerializer.java:110)
> at org.apache.flink.table.runtime.typeutils.serializers.python.MapDataSerializer.deserialize(MapDataSerializer.java:46)
> at org.apache.flink.table.runtime.typeutils.serializers.python.RowDataSerializer.deserialize(RowDataSerializer.java:106)
> at org.apache.flink.table.runtime.typeutils.serializers.python.RowDataSerializer.deserialize(RowDataSerializer.java:49)
> at org.apache.flink.table.runtime.operators.python.scalar.RowDataPythonScalarFunctionOperator.emitResult(RowDataPythonScalarFunctionOperator.java:81)
> at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.emitResults(AbstractPythonFunctionOperator.java:250)
> at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.invokeFinishBundle(AbstractPythonFunctionOperator.java:273)
> at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.processWatermark(AbstractPythonFunctionOperator.java:199)
> at org.apache.flink.streaming.runtime.tasks.ChainingOutput.emitWatermark(ChainingOutput.java:123)
> at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitWatermark(SourceOperatorStreamTask.java:170)
> at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask.advanceToEndOfEventTime(SourceOperatorStreamTask.java:110)
> at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask.afterInvoke(SourceOperatorStreamTask.java:116)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:589)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> The exception isn't straightforward, and it's difficult for users to figure
> out the root cause of the issue.
> As Python is a dynamically typed language, this case should be very common,
> and it would be great if we could handle it properly.
> See 
> [https://stackoverflow.com/questions/66687797/pyflink-java-io-eofexception-at-java-io-datainputstream-readfully]
>  for more details.
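One way to surface such mismatches early is to check each value returned by the UDF against its declared result type before it reaches the serializer, so the user sees a descriptive error on the Python side instead of an opaque EOFException on the Java side. A minimal pure-Python sketch of that idea (function and parameter names are illustrative, not the actual PyFlink implementation):

```python
def check_result(value, expected_type, udf_name="my_udf"):
    """Raise a descriptive TypeError instead of letting the serializer
    fail later with an opaque EOFException on the Java side."""
    if not isinstance(value, expected_type):
        raise TypeError(
            f"UDF '{udf_name}' returned {type(value).__name__} "
            f"({value!r}), but its result type was declared as "
            f"{expected_type.__name__}")
    return value


check_result("hello", str)  # matching value passes through unchanged
# check_result(42, str)     # would raise a descriptive TypeError
```

A real implementation would map Flink's logical types (e.g. Types.STRING()) to the corresponding Python checks, but the error-reporting idea is the same.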





[jira] [Updated] (FLINK-21711) DataStreamSink doesn't allow setting maxParallelism

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21711:

Fix Version/s: (was: 1.12.3)
   1.12.4

> DataStreamSink doesn't allow setting maxParallelism
> ---
>
> Key: FLINK-21711
> URL: https://issues.apache.org/jira/browse/FLINK-21711
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream
>Affects Versions: 1.11.3, 1.12.2, 1.13.0
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.11.4, 1.13.0, 1.12.4
>
>
> It seems that we can set the max parallelism of the sink only via this
> internal API:
> {code}
> input.addSink(new 
> ParallelismTrackingSink<>()).getTransformation().setMaxParallelism(1);
> {code}





[jira] [Updated] (FLINK-22323) JobEdges Typos

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22323:

Fix Version/s: (was: 1.12.3)
   1.12.4

> JobEdges Typos
> --
>
> Key: FLINK-22323
> URL: https://issues.apache.org/jira/browse/FLINK-22323
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Reporter: lqjacklee
>Priority: Minor
> Fix For: 1.12.4
>
>






[jira] [Updated] (FLINK-20465) Fail globally when not resuming from the latest checkpoint in regional failover

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20465:

Fix Version/s: (was: 1.12.3)
   1.12.4

> Fail globally when not resuming from the latest checkpoint in regional 
> failover
> ---
>
> Key: FLINK-20465
> URL: https://issues.apache.org/jira/browse/FLINK-20465
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.13.0, 1.12.4
>
>
> As a follow-up to FLINK-20290, we should assert that we resume from the
> latest checkpoint when doing a regional failover in the
> {{SourceCoordinators}} in order to avoid losing input splits (see
> FLINK-20427). If the assumption does not hold, we should fail the job
> globally so that the master state is reset to a consistent view of the state.
> Such behaviour can act as a safety net in case Flink ever tries to
> recover from a checkpoint other than the latest available one.
> One idea to solve this is to remember the latest completed checkpoint id
> somewhere along the way to
> {{SplitAssignmentTracker.getAndRemoveUncheckpointedAssignment}} and to fail
> when the restored checkpoint id is smaller.
> cc [~sewen], [~jqin]
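The proposed safety net boils down to tracking the latest completed checkpoint id and failing globally when a restore presents an older one. A minimal sketch of that bookkeeping, in pure Python with hypothetical names (the real change would live in Java near SplitAssignmentTracker):

```python
class CheckpointGuard:
    """Remembers the latest completed checkpoint id and rejects a
    restore from anything older, forcing a global failover."""

    def __init__(self):
        self.latest_completed_id = -1

    def on_checkpoint_completed(self, checkpoint_id):
        # Completion notifications may arrive out of order; keep the max.
        self.latest_completed_id = max(self.latest_completed_id, checkpoint_id)

    def on_restore(self, restored_id):
        if restored_id < self.latest_completed_id:
            raise RuntimeError(
                f"Restored checkpoint {restored_id} is older than the "
                f"latest completed checkpoint {self.latest_completed_id}; "
                "failing globally to keep master state consistent.")


guard = CheckpointGuard()
guard.on_checkpoint_completed(41)
guard.on_checkpoint_completed(42)
guard.on_restore(42)  # restoring the latest checkpoint is fine
# guard.on_restore(41) would raise RuntimeError, i.e. fail globally
```

The guard turns a silent loss of input splits into an explicit global failover, which is exactly the behavior the ticket asks for.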





[jira] [Updated] (FLINK-20951) IllegalArgumentException when reading Hive parquet table if condition not contain all partitioned fields

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20951:

Fix Version/s: (was: 1.12.3)
   1.12.4

> IllegalArgumentException when reading Hive parquet table if condition not 
> contain all partitioned fields
> 
>
> Key: FLINK-20951
> URL: https://issues.apache.org/jira/browse/FLINK-20951
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hive
>Affects Versions: 1.12.0
> Environment: flink 1.12.0release-12
> sql-cli
>Reporter: YUJIANBO
>Assignee: Rui Li
>Priority: Critical
>  Labels: stale-assigned
> Fix For: 1.13.0, 1.12.4
>
>
> The production Hive table is partitioned by two fields: datekey and event.
> I have done the following tests via the Flink SQL CLI (in Spark SQL, all of them are OK):
> (1)First:
> SELECT vid From table_A WHERE datekey = '20210112' AND event = 'XXX' AND vid
> = 'aa';(OK)
> SELECT vid From table_A WHERE datekey = '20210112' AND vid = 'aa';
> (Error)
> (2)Second:
> SELECT vid From table_B WHERE datekey = '20210112' AND event = 'YYY' AND vid
> = 'bb';(OK)
> SELECT vid From table_B WHERE datekey = '20210112' AND vid = 'bb';
> (Error)
> The exception is:
> {code}
> java.lang.RuntimeException: One or more fetchers have encountered exception
> at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:199)
> at org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:154)
> at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:116)
> at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:273)
> at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:67)
> at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:395)
> at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:191)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:609)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: SplitFetcher thread 19 received unexpected exception while polling the records
> at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:146)
> at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:101)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ... 1 more
> Caused by: java.lang.IllegalArgumentException
> at java.nio.Buffer.position(Buffer.java:244)
> at org.apache.flink.hive.shaded.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:424)
> at org.apache.flink.hive.shaded.formats.parquet.vector.reader.BytesColumnReader.readBatchFromDictionaryIds(BytesColumnReader.java:79)
> at org.apache.flink.hive.shaded.formats.parquet.vector.reader.BytesColumnReader.readBatchFromDictionaryIds(BytesColumnReader.java:33)
> at org.apache.flink.hive.shaded.formats.parquet.vector.reader.AbstractColumnReader.readToVector(AbstractColumnReader.java:199)
> at org.apache.flink.hive.shaded.formats.parquet.ParquetVectorizedInputFormat$ParquetReader.nextBatch(ParquetVectorizedInputFormat.java:359)
> at org.apache.flink.hive.shaded.formats.parquet.ParquetVectorizedInputFormat$ParquetReader.readBatch(ParquetVectorizedInputFormat.java:328)
> at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:67)
> at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
> at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:138)
> ... 6 more
> {code}
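The root IllegalArgumentException in the trace above comes from java.nio.Buffer.position(int), which rejects any requested position beyond the buffer's limit. A minimal standalone sketch of that failure mode (plain JDK, no Flink or Parquet code involved; the class name is illustrative):

```java
import java.nio.ByteBuffer;

// Minimal sketch: java.nio.Buffer.position(int) throws a bare
// IllegalArgumentException when the requested position exceeds the buffer's
// limit -- the same exception surfacing from
// Binary$ByteBufferBackedBinary.getBytes when a stale offset is used.
public class BufferPositionDemo {
    public static boolean seekPastLimit() {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.limit(4);
        try {
            buf.position(6); // 6 > limit(4): IllegalArgumentException
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(seekPastLimit()); // prints "true"
    }
}
```

This suggests the dictionary-backed binary column reader is handing Binary.getBytes an offset/length pair that no longer matches the backing buffer's limit.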



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-20431) KafkaSourceReaderTest.testCommitOffsetsWithoutAliveFetchers:133->lambda$testCommitOffsetsWithoutAliveFetchers$3:134 expected:<10> but was:<1>

2021-04-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20431:

Fix Version/s: (was: 1.12.3)
   1.12.4

> KafkaSourceReaderTest.testCommitOffsetsWithoutAliveFetchers:133->lambda$testCommitOffsetsWithoutAliveFetchers$3:134
>  expected:<10> but was:<1>
> -
>
> Key: FLINK-20431
> URL: https://issues.apache.org/jira/browse/FLINK-20431
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Huang Xingbo
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: pull-request-available, stale-assigned, test-stability
> Fix For: 1.13.1, 1.12.4
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=10351=logs=c5f0071e-1851-543e-9a45-9ac140befc32=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5]
> [ERROR] Failures: 
> [ERROR] 
> KafkaSourceReaderTest.testCommitOffsetsWithoutAliveFetchers:133->lambda$testCommitOffsetsWithoutAliveFetchers$3:134
>  expected:<10> but was:<1>
>  





[jira] [Created] (FLINK-22573) AsyncIO can timeout elements after completion

2021-05-05 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-22573:
---

 Summary: AsyncIO can timeout elements after completion
 Key: FLINK-22573
 URL: https://issues.apache.org/jira/browse/FLINK-22573
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.12.3, 1.13.0, 1.11.3, 1.14.0
Reporter: Arvid Heise
Assignee: Arvid Heise


AsyncIO emits completed elements over the mailbox, at which point any associated 
timer is also canceled. However, if the mailbox cannot keep up (heavy 
backpressure), the timer may still trigger on an already completed element.
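A hypothetical single-threaded sketch of the race (not Flink's actual classes; names are illustrative): the completion path only enqueues the timer cancellation into the mailbox, so a backlogged mailbox lets the timeout observe a still "active" timer.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the race: completion merely *enqueues* the timer
// cancellation. If the mailbox is backlogged, the timeout check runs before
// the cancellation is processed and times out an already completed element.
public class MailboxTimerRace {

    // Returns true if the timer fired for an already completed element.
    public static boolean simulate() {
        Queue<Runnable> mailbox = new ArrayDeque<>();
        final boolean[] timerActive = {true};

        // Element completes: its timer cancellation is only enqueued.
        mailbox.add(() -> timerActive[0] = false);

        // Timer fires while the mailbox is still backlogged.
        boolean spuriousTimeout = timerActive[0];

        // Mailbox finally drains; the cancellation arrives too late.
        while (!mailbox.isEmpty()) {
            mailbox.poll().run();
        }
        return spuriousTimeout;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints "true"
    }
}
```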





[jira] [Created] (FLINK-21346) Tasks on AZP should timeout more gracefully

2021-02-09 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-21346:
---

 Summary: Tasks on AZP should timeout more gracefully
 Key: FLINK-21346
 URL: https://issues.apache.org/jira/browse/FLINK-21346
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Affects Versions: 1.13.0
Reporter: Arvid Heise
Assignee: Arvid Heise


Currently, whenever a non-e2e test times out, we are left with little 
information on what went wrong. E2e tests have a good watchdog that uploads logs 
on timeout.

 

The goal of this ticket is to unify the treatment of tests and e2e tests and 
always upload stack traces and logs on timeout.





[jira] [Commented] (FLINK-20847) Update CompletedCheckpointStore.shutdown() signature

2021-02-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282286#comment-17282286
 ] 

Arvid Heise commented on FLINK-20847:
-

Merged into master as 074b907c5651ce504268c222571024fc6e713471 .

> Update CompletedCheckpointStore.shutdown() signature
> 
>
> Key: FLINK-20847
> URL: https://issues.apache.org/jira/browse/FLINK-20847
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> # remove unused postCleanup argument
>  # add javadoc for checkpointsCleaner





[jira] [Resolved] (FLINK-19520) Add reliable test randomization for checkpointing

2021-02-10 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-19520.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

> Add reliable test randomization for checkpointing
> -
>
> Key: FLINK-19520
> URL: https://issues.apache.org/jira/browse/FLINK-19520
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Configuration
>Affects Versions: 1.12.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> With the larger refactoring of checkpoint alignment and the addition of 
> more unaligned checkpoint settings, it becomes increasingly important to 
> provide broad test coverage.
> Unfortunately, adding sufficient test cases in a test matrix appears to be 
> unrealistic: many of the encountered issues were subtle, sometimes caused by 
> race conditions or unusual test configurations and often only visible in e2e 
> tests.
> Hence, we would like to rely on all existing Flink tests to provide sufficient 
> coverage for checkpointing. However, as more and more unaligned checkpoint 
> options are going to be implemented in this and the upcoming release, 
> running all Flink tests - especially e2e - in a test matrix is prohibitively 
> expensive, even for nightly builds.
> Thus, we want to introduce test randomization for all tests that do not use a 
> specific checkpointing mode, in a similar way to how we switched tests from 
> aligned checkpoints by default to unaligned checkpoints during the last 
> release cycle.
> To not burden the developers of other components too much, we set the 
> following requirements:
>  * Randomization should be seeded in a way that both builds on Azure 
> pipelines and local builds will result in the same settings to ease debugging 
> and ensure reproducibility.
>  * Randomized options should be shown in the test log.
>  * Execution order of test cases will not influence the randomization.
>  * Randomization is hidden; no change to any test is needed.
>  * Randomization only happens during local/remote test execution. User 
> deployments are not affected.
>  * Test developers are able to avoid randomization by explicitly providing 
> checkpoint configs.
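The first requirement above (same seed on Azure pipelines and locally, independent of execution order) can be met by deriving the seed from the test itself. A sketch under those assumptions (class and method names here are hypothetical, not Flink's actual utility):

```java
import java.util.Random;

// Illustrative sketch: derive the randomization seed deterministically from
// the test class name, so CI and local runs of the same test pick the same
// settings, and the execution order of other tests cannot influence it.
public class TestRandomization {

    public static boolean pickUnalignedCheckpoints(String testClassName) {
        // Seed depends only on the test itself => reproducible everywhere.
        Random random = new Random(testClassName.hashCode());
        return random.nextBoolean();
    }

    public static void main(String[] args) {
        boolean a = pickUnalignedCheckpoints("UnalignedCheckpointITCase");
        boolean b = pickUnalignedCheckpoints("UnalignedCheckpointITCase");
        System.out.println(a == b); // same test name -> same choice
    }
}
```

Logging the chosen settings at test start would then cover the "randomized options should be shown in the test log" requirement.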





[jira] [Commented] (FLINK-19520) Add reliable test randomization for checkpointing

2021-02-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282456#comment-17282456
 ] 

Arvid Heise commented on FLINK-19520:
-

Merged into master as 4fd2cec5b9390e6f7e26b0a86b5f4886ff7e68e5.

> Add reliable test randomization for checkpointing
> -
>
> Key: FLINK-19520
> URL: https://issues.apache.org/jira/browse/FLINK-19520
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Configuration
>Affects Versions: 1.12.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
>
> With the larger refactoring of checkpoint alignment and the addition of 
> more unaligned checkpoint settings, it becomes increasingly important to 
> provide broad test coverage.
> Unfortunately, adding sufficient test cases in a test matrix appears to be 
> unrealistic: many of the encountered issues were subtle, sometimes caused by 
> race conditions or unusual test configurations and often only visible in e2e 
> tests.
> Hence, we would like to rely on all existing Flink tests to provide sufficient 
> coverage for checkpointing. However, as more and more unaligned checkpoint 
> options are going to be implemented in this and the upcoming release, 
> running all Flink tests - especially e2e - in a test matrix is prohibitively 
> expensive, even for nightly builds.
> Thus, we want to introduce test randomization for all tests that do not use a 
> specific checkpointing mode, in a similar way to how we switched tests from 
> aligned checkpoints by default to unaligned checkpoints during the last 
> release cycle.
> To not burden the developers of other components too much, we set the 
> following requirements:
>  * Randomization should be seeded in a way that both builds on Azure 
> pipelines and local builds will result in the same settings to ease debugging 
> and ensure reproducibility.
>  * Randomized options should be shown in the test log.
>  * Execution order of test cases will not influence the randomization.
>  * Randomization is hidden; no change to any test is needed.
>  * Randomization only happens during local/remote test execution. User 
> deployments are not affected.
>  * Test developers are able to avoid randomization by explicitly providing 
> checkpoint configs.





[jira] [Commented] (FLINK-21103) E2e tests time out on azure

2021-02-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282453#comment-17282453
 ] 

Arvid Heise commented on FLINK-21103:
-

Another case 
[https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/13184/logs/170]

Here I may have found a root cause:


{noformat}
2021-02-10T10:44:49.4158320Z Feb 10 10:44:49 Waiting for job to process up to 
1000 records, current progress: 943 records ...
2021-02-10T10:44:52.2562683Z 
2021-02-10T10:44:52.2566380Z 

2021-02-10T10:44:52.2568765Z  The program finished with the following exception:
2021-02-10T10:44:52.2570340Z 
2021-02-10T10:44:52.2575752Z org.apache.flink.util.FlinkException: Triggering a 
savepoint for the job ecf66fd871368d8f7fc8fc9ee25b4972 failed.
2021-02-10T10:44:52.2578429Zat 
org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
2021-02-10T10:44:52.2580763Zat 
org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
2021-02-10T10:44:52.2583490Zat 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
2021-02-10T10:44:52.2586043Zat 
org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
2021-02-10T10:44:52.2588747Zat 
org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
2021-02-10T10:44:52.2591056Zat 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
2021-02-10T10:44:52.2593583Zat 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
2021-02-10T10:44:52.2612006Zat 
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
2021-02-10T10:44:52.2613264Z Caused by: 
java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
checkpoint failed.
2021-02-10T10:44:52.2614138Zat 
org.apache.flink.runtime.scheduler.SchedulerBase.lambda$triggerSavepoint$5(SchedulerBase.java:946)
2021-02-10T10:44:52.2614819Zat 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
2021-02-10T10:44:52.2615430Zat 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
2021-02-10T10:44:52.2616068Zat 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
2021-02-10T10:44:52.2616750Zat 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:442)
2021-02-10T10:44:52.2617375Zat 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:209)
2021-02-10T10:44:52.2618037Zat 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
2021-02-10T10:44:52.2618699Zat 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:159)
2021-02-10T10:44:52.2619304Zat 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2021-02-10T10:44:52.2620094Zat 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2021-02-10T10:44:52.2620656Zat 
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2021-02-10T10:44:52.2621230Zat 
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2021-02-10T10:44:52.2625679Zat 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2021-02-10T10:44:52.2626154Zat 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2021-02-10T10:44:52.2626612Zat 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2021-02-10T10:44:52.2627051Zat 
akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2021-02-10T10:44:52.2627486Zat 
akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2021-02-10T10:44:52.2627912Zat 
akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2021-02-10T10:44:52.2628325Zat 
akka.actor.ActorCell.invoke(ActorCell.scala:561)
2021-02-10T10:44:52.2628731Zat 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2021-02-10T10:44:52.2629122Zat akka.dispatch.Mailbox.run(Mailbox.scala:225)
2021-02-10T10:44:52.2629506Zat akka.dispatch.Mailbox.exec(Mailbox.scala:235)
2021-02-10T10:44:52.2629925Zat 
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2021-02-10T10:44:52.2630402Zat 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2021-02-10T10:44:52.2630880Zat 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2021-02-10T10:44:52.2631368Zat 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-02-10T10:44:52.2631980Z Caused by: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task 
checkpoint failed.
{noformat}

[jira] [Commented] (FLINK-21349) 'Quickstarts Scala nightly end-to-end test' fails with errors in logs

2021-02-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282893#comment-17282893
 ] 

Arvid Heise commented on FLINK-21349:
-

Probably a duplicate of https://issues.apache.org/jira/browse/FLINK-21213 .

> 'Quickstarts Scala nightly end-to-end test' fails with errors in logs
> -
>
> Key: FLINK-21349
> URL: https://issues.apache.org/jira/browse/FLINK-21349
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.4
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13178=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=2b7514ee-e706-5046-657b-3430666e7bd9
> {code}
> 2021-02-09T23:15:44.6563035Z 2021-02-09 23:15:42,855 INFO  
> org.apache.flink.runtime.taskmanager.Task[] - Freeing 
> task resources for Source: Sequence Source -> Map -> Sink: Unnamed (1/1) 
> (99fb2a8c8084ff2d87fa8bc5a91f2a82).
> 2021-02-09T23:15:44.6564613Z 2021-02-09 23:15:42,855 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor   [] - 
> Un-registering task and sending final execution state FINISHED to JobManager 
> for task Source: Sequence Source -> Map -> Sink: Unnamed (1/1) 
> 99fb2a8c8084ff2d87fa8bc5a91f2a82.
> 2021-02-09T23:15:44.6565773Z 2021-02-09 23:15:42,862 WARN  
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - As task 
> is already not running, no longer decline checkpoint 1.
> 2021-02-09T23:15:44.6566892Z java.lang.Exception: Could not materialize 
> checkpoint 1 for operator Source: Sequence Source -> Map -> Sink: Unnamed 
> (1/1).
> 2021-02-09T23:15:44.6568059Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:215)
>  [flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6569266Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:150)
>  [flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6570111Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_282]
> 2021-02-09T23:15:44.6570749Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_282]
> 2021-02-09T23:15:44.6571299Z  at java.lang.Thread.run(Thread.java:748) 
> [?:1.8.0_282]
> 2021-02-09T23:15:44.6571749Z Caused by: 
> java.util.concurrent.CancellationException
> 2021-02-09T23:15:44.6572242Z  at 
> java.util.concurrent.FutureTask.report(FutureTask.java:121) ~[?:1.8.0_282]
> 2021-02-09T23:15:44.6572800Z  at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_282]
> 2021-02-09T23:15:44.6573827Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:501)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6574980Z  at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:57)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6576164Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:113)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6576754Z  ... 3 more
> 
> 2021-02-09T23:15:44.6841230Z [FAIL] 'Quickstarts Scala nightly end-to-end 
> test' failed after 0 minutes and 39 seconds! Test exited with exit code 0 but 
> the logs contained errors, exceptions or non-empty .out files
> {code}





[jira] [Comment Edited] (FLINK-21349) 'Quickstarts Scala nightly end-to-end test' fails with errors in logs

2021-02-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282893#comment-17282893
 ] 

Arvid Heise edited comment on FLINK-21349 at 2/11/21, 7:20 AM:
---

Probably a duplicate of FLINK-21213 .


was (Author: aheise):
Probably a duplicate of https://issues.apache.org/jira/browse/FLINK-21213 .

> 'Quickstarts Scala nightly end-to-end test' fails with errors in logs
> -
>
> Key: FLINK-21349
> URL: https://issues.apache.org/jira/browse/FLINK-21349
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.11.4
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13178=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=2b7514ee-e706-5046-657b-3430666e7bd9
> {code}
> 2021-02-09T23:15:44.6563035Z 2021-02-09 23:15:42,855 INFO  
> org.apache.flink.runtime.taskmanager.Task[] - Freeing 
> task resources for Source: Sequence Source -> Map -> Sink: Unnamed (1/1) 
> (99fb2a8c8084ff2d87fa8bc5a91f2a82).
> 2021-02-09T23:15:44.6564613Z 2021-02-09 23:15:42,855 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor   [] - 
> Un-registering task and sending final execution state FINISHED to JobManager 
> for task Source: Sequence Source -> Map -> Sink: Unnamed (1/1) 
> 99fb2a8c8084ff2d87fa8bc5a91f2a82.
> 2021-02-09T23:15:44.6565773Z 2021-02-09 23:15:42,862 WARN  
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - As task 
> is already not running, no longer decline checkpoint 1.
> 2021-02-09T23:15:44.6566892Z java.lang.Exception: Could not materialize 
> checkpoint 1 for operator Source: Sequence Source -> Map -> Sink: Unnamed 
> (1/1).
> 2021-02-09T23:15:44.6568059Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:215)
>  [flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6569266Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:150)
>  [flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6570111Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_282]
> 2021-02-09T23:15:44.6570749Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_282]
> 2021-02-09T23:15:44.6571299Z  at java.lang.Thread.run(Thread.java:748) 
> [?:1.8.0_282]
> 2021-02-09T23:15:44.6571749Z Caused by: 
> java.util.concurrent.CancellationException
> 2021-02-09T23:15:44.6572242Z  at 
> java.util.concurrent.FutureTask.report(FutureTask.java:121) ~[?:1.8.0_282]
> 2021-02-09T23:15:44.6572800Z  at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_282]
> 2021-02-09T23:15:44.6573827Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:501)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6574980Z  at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:57)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6576164Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:113)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> 2021-02-09T23:15:44.6576754Z  ... 3 more
> 
> 2021-02-09T23:15:44.6841230Z [FAIL] 'Quickstarts Scala nightly end-to-end 
> test' failed after 0 minutes and 39 seconds! Test exited with exit code 0 but 
> the logs contained errors, exceptions or non-empty .out files
> {code}





[jira] [Assigned] (FLINK-21936) Rescale pointwise connections for unaligned checkpoints

2021-03-24 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-21936:
---

Assignee: (was: Arvid Heise)

> Rescale pointwise connections for unaligned checkpoints
> ---
>
> Key: FLINK-21936
> URL: https://issues.apache.org/jira/browse/FLINK-21936
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Priority: Major
>
> We currently do not have any hard guarantees on pointwise connections 
> regarding data consistency. However, since data was structured implicitly in 
> the same way as any preceding source or keyby, some users relied on this 
> behavior to divide compute-intensive tasks into smaller chunks while relying 
> on ordering guarantees.
> As long as the parallelism does not change, unaligned checkpoints (UC) 
> retain these properties. With the implementation of rescaling of UC 
> (FLINK-19801), that has changed. For most exchanges, there is a meaningful 
> way to reassign state from one channel to another (even in random order). For 
> some exchanges, the mapping is ambiguous and requires post-filtering. 
> However, for point-wise connections, it's impossible while retaining these 
> properties.
> Consider, {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to 
> rescale from parallelism p = 1 to p = 2, suddenly the records inside the 
> keyby channels need to be divided into two channels according to the 
> keygroups. That is easily possible by using the keygroup ranges of the 
> operators and a way to determine the key(group) of the record (independent of 
> the actual approach). For the forward channel, we completely lack the key 
> context. No record in the forward channel has any keygroup assigned; it's 
> also not possible to calculate it as there is no guarantee that the key is 
> still present.
> The root cause for this limitation is the conceptual mismatch between what we 
> provide and what some users assume we provide (or we assume that the users 
> assume). For example, it's impossible to use (keyed) state in task2 right 
> now, because there is no key context, but we still want to guarantee 
> ordering with respect to that key context.
> For 1.13, the easiest solution is to disable channel state in pointwise 
> connections. For any non-trivial application with at least one shuffle, the 
> number of pointwise channels (linear to p) is quickly dwarfed by all-to-all 
> connections (quadratic to p). I'd add some alternative ideas to the 
> discussion.
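The keyby half of the argument above can be made concrete: with the key at hand, the owning subtask after rescaling is recomputable, while a forward-channel record carries no key at all. A simplified sketch, loosely modeled on key-group assignment (the constant and hash below are illustrative, not Flink's actual implementation):

```java
// Simplified sketch of key-group based channel assignment: given the key,
// a buffered record can be rerouted to the correct subtask after rescaling.
// A forward-channel record has no key context, so this step is impossible.
public class KeyGroupSketch {
    static final int MAX_PARALLELISM = 128; // illustrative value

    // Map a key to a key group (illustrative hash, not Flink's murmur hash).
    static int keyGroupFor(Object key) {
        return Math.abs(key.hashCode() % MAX_PARALLELISM);
    }

    // Which subtask owns a key group at parallelism p (proportional split).
    static int subtaskFor(int keyGroup, int parallelism) {
        return keyGroup * parallelism / MAX_PARALLELISM;
    }

    public static void main(String[] args) {
        int kg = keyGroupFor("some-key");
        // Rescaling p=1 -> p=2: ownership is recomputable from the key.
        System.out.println(subtaskFor(kg, 1)); // always 0 at p=1
        int owner = subtaskFor(kg, 2);         // 0 or 1 at p=2
        System.out.println(owner >= 0 && owner < 2);
    }
}
```

For the forward channel in {{task1 -> forward -> task2}}, neither keyGroupFor nor subtaskFor is computable, which is exactly the limitation the issue describes.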





[jira] [Updated] (FLINK-21936) Rescale pointwise connections for unaligned checkpoints

2021-03-24 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21936:

Summary: Rescale pointwise connections for unaligned checkpoints  (was: 
Disable checkpointing of inflight data in pointwise connections for unaligned 
checkpoints)

> Rescale pointwise connections for unaligned checkpoints
> ---
>
> Key: FLINK-21936
> URL: https://issues.apache.org/jira/browse/FLINK-21936
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> We currently do not have any hard guarantees on pointwise connections 
> regarding data consistency. However, since data was structured implicitly in 
> the same way as any preceding source or keyby, some users relied on this 
> behavior to divide compute-intensive tasks into smaller chunks while relying 
> on ordering guarantees.
> As long as the parallelism does not change, unaligned checkpoints (UC) 
> retain these properties. With the implementation of rescaling of UC 
> (FLINK-19801), that has changed. For most exchanges, there is a meaningful 
> way to reassign state from one channel to another (even in random order). For 
> some exchanges, the mapping is ambiguous and requires post-filtering. 
> However, for point-wise connections, it's impossible while retaining these 
> properties.
> Consider, {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to 
> rescale from parallelism p = 1 to p = 2, suddenly the records inside the 
> keyby channels need to be divided into two channels according to the 
> keygroups. That is easily possible by using the keygroup ranges of the 
> operators and a way to determine the key(group) of the record (independent of 
> the actual approach). For the forward channel, we completely lack the key 
> context. No record in the forward channel has any keygroup assigned; it's 
> also not possible to calculate it as there is no guarantee that the key is 
> still present.
> The root cause for this limitation is the conceptual mismatch between what we 
> provide and what some users assume we provide (or we assume that the users 
> assume). For example, it's impossible to use (keyed) state in task2 right 
> now, because there is no key context, but we still want to guarantee 
> ordering with respect to that key context.
> For 1.13, the easiest solution is to disable channel state in pointwise 
> connections. For any non-trivial application with at least one shuffle, the 
> number of pointwise channels (linear to p) is quickly dwarfed by all-to-all 
> connections (quadratic to p). I'd add some alternative ideas to the 
> discussion.





[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-24 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307637#comment-17307637
 ] 

Arvid Heise commented on FLINK-21936:
-

Note that the initial example was deliberately kept simple. {{source -> keyby 
-> task1 -> forward -> task2 -> forward -> task3}} should also retain the 
properties; any key group context needs to be preserved until the next 
all-to-all connection. 

We also need to explore how much we expect from rescale: {{source -> keyby -> 
task1 -> forward -> task2 -> rescale (2x) -> task3}} would potentially split 
keys into two, but retains the ordering within the groups.
{{source -> keyby -> task1 -> forward -> task2 -> rescale (.5x) -> task3}} 
merges key group ranges but each key is still processed by the same downstream 
subtask of task3 and order is retained.

> Disable checkpointing of inflight data in pointwise connections for unaligned 
> checkpoints
> -
>
> Key: FLINK-21936
> URL: https://issues.apache.org/jira/browse/FLINK-21936
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> We currently do not have any hard guarantees on pointwise connections 
> regarding data consistency. However, since data was structured implicitly in 
> the same way as any preceding source or keyby, some users relied on this 
> behavior to divide compute-intensive tasks into smaller chunks while relying 
> on ordering guarantees.
> As long as the parallelism does not change, unaligned checkpoints (UC) 
> retain these properties. With the implementation of rescaling of UC 
> (FLINK-19801), that has changed. For most exchanges, there is a meaningful 
> way to reassign state from one channel to another (even in random order). For 
> some exchanges, the mapping is ambiguous and requires post-filtering. 
> However, for point-wise connections, it's impossible while retaining these 
> properties.
> Consider, {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to 
> rescale from parallelism p = 1 to p = 2, suddenly the records inside the 
> keyby channels need to be divided into two channels according to the 
> keygroups. That is easily possible by using the keygroup ranges of the 
> operators and a way to determine the key(group) of the record (independent of 
> the actual approach). For the forward channel, we completely lack the key 
> context. No record in the forward channel has any keygroup assigned; it's 
> also not possible to calculate it as there is no guarantee that the key is 
> still present.
> The root cause for this limitation is the conceptual mismatch between what we 
> provide and what some users assume we provide (or we assume that the users 
> assume). For example, it's impossible to use (keyed) state in task2 right 
> now, because there is no key context, but we still want to guarantee 
> ordering with respect to that key context.
> For 1.13, the easiest solution is to disable channel state in pointwise 
> connections. For any non-trivial application with at least one shuffle, the 
> number of pointwise channels (linear to p) is quickly dwarfed by all-to-all 
> connections (quadratic to p). I'd add some alternative ideas to the 
> discussion.





[jira] [Created] (FLINK-21945) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-24 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-21945:
---

 Summary: Disable checkpointing of inflight data in pointwise 
connections for unaligned checkpoints
 Key: FLINK-21945
 URL: https://issues.apache.org/jira/browse/FLINK-21945
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Affects Versions: 1.13.0
Reporter: Arvid Heise
Assignee: Arvid Heise








[jira] [Resolved] (FLINK-20654) Unaligned checkpoint recovery may lead to corrupted data stream

2021-03-26 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-20654.
-
Resolution: Fixed

> Unaligned checkpoint recovery may lead to corrupted data stream
> ---
>
> Key: FLINK-20654
> URL: https://issues.apache.org/jira/browse/FLINK-20654
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0, 1.12.1
>Reporter: Arvid Heise
>Assignee: Piotr Nowojski
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0, 1.12.2
>
>
> Fix of FLINK-20433 shows potential corruption after recovery for all 
> variations of UnalignedCheckpointITCase.
> To reproduce, run UCITCase a couple hundred times. The issue showed up for 
> me in:
> - execute [Parallel union, p = 5]
> - execute [Parallel union, p = 10]
> - execute [Parallel cogroup, p = 5]
> - execute [parallel pipeline with remote channels, p = 5]
> with decreasing frequency.
> The issue manifests as one of the following issues:
> - stream corrupted exception
> - EOF exception
> - assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
> - (for union) ArithmeticException overflow (because a number that should be 
> in [0;10] has been mis-deserialized)





[jira] [Commented] (FLINK-20654) Unaligned checkpoint recovery may lead to corrupted data stream

2021-03-26 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17309412#comment-17309412
 ] 

Arvid Heise commented on FLINK-20654:
-

Merged another set of test tunings + improved logging into master as 
f0d5d3be89c9762fdf147077a90831e365cf6ba0..913ea8e398e8396d044c14a0911d8e134ec4377d.
 I'm closing this ticket as there have been no other failures in the past two 
weeks. If another issue occurs, the new logging should help us track it 
down.

> Unaligned checkpoint recovery may lead to corrupted data stream
> ---
>
> Key: FLINK-20654
> URL: https://issues.apache.org/jira/browse/FLINK-20654
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0, 1.12.1
>Reporter: Arvid Heise
>Assignee: Piotr Nowojski
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.12.2, 1.13.0
>
>
> Fix of FLINK-20433 shows potential corruption after recovery for all 
> variations of UnalignedCheckpointITCase.
> To reproduce, run UCITCase a couple hundred times. The issue showed up for 
> me in:
> - execute [Parallel union, p = 5]
> - execute [Parallel union, p = 10]
> - execute [Parallel cogroup, p = 5]
> - execute [parallel pipeline with remote channels, p = 5]
> with decreasing frequency.
> The issue manifests as one of the following issues:
> - stream corrupted exception
> - EOF exception
> - assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
> - (for union) ArithmeticException overflow (because a number that should be 
> in [0;10] has been mis-deserialized)





[jira] [Updated] (FLINK-21992) Investigate potential buffer leak in unaligned checkpoint

2021-03-26 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21992:

Component/s: Runtime / Checkpointing

> Investigate potential buffer leak in unaligned checkpoint
> -
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Priority: Blocker
>
> A user on mailing list reported that his job gets stuck with unaligned 
> checkpoint enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up, 
> so those were not as easy to diagnose as this time, where the initial report 
> already contains many relevant data points. 
> Beside a buffer leak, there could also be an issue with priority notification.





[jira] [Commented] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure

2021-03-29 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17310986#comment-17310986
 ] 

Arvid Heise commented on FLINK-22019:
-

Since there are no logs, there is not much I can do for now. 
I hope that one of these days I can resume working on 
https://github.com/apache/flink/pull/14834 and get a better handle on this one.
We could add a timeout to the test to let it fail more quickly, though (note 
that we would need to remove it again for the PR).

> UnalignedCheckpointRescaleITCase hangs on azure
> ---
>
> Key: FLINK-22019
> URL: https://issues.apache.org/jira/browse/FLINK-22019
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9347





[jira] [Commented] (FLINK-22001) Exceptions from JobMaster initialization are not forwarded to the user

2021-03-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311472#comment-17311472
 ] 

Arvid Heise commented on FLINK-22001:
-

Another case here 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15427=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9343

> Exceptions from JobMaster initialization are not forwarded to the user
> --
>
> Key: FLINK-22001
> URL: https://issues.apache.org/jira/browse/FLINK-22001
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Blocker
> Fix For: 1.13.0
>
>
> Steps to reproduce:
> Set up a streaming job with an invalid parallelism configuration, for example:
> {code}
> .setParallelism(15).setMaxParallelism(1);
> {code}
> This should report the following exception to the user:
> {code}
> Caused by: org.apache.flink.runtime.JobException: Vertex 
> Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, 
> PassThroughWindowFunction)'s parallelism (15) is higher than the max 
> parallelism (1). Please lower the parallelism or increase the max parallelism.
>   at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:160)
>   at 
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.attachJobGraph(DefaultExecutionGraph.java:781)
>   at 
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraphBuilder.buildGraph(DefaultExecutionGraphBuilder.java:193)
>   at 
> org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:106)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:252)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:185)
>   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:119)
>   at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132)
>   at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340)
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:317)
>   at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:94)
>   at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMasterServiceSafely(JobManagerRunnerImpl.java:363)
>   ... 13 more
> {code}
> However, what the user sees is 
> {code}
> 2021-03-28 20:32:33,935 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 
> 419f60eac551619fc1081c670ced3649 reached globally terminal state FAILED.
> ...
> 2021-03-28 20:32:33,974 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped 
> dispatcher akka://flink/user/rpc/dispatcher_2.
> 2021-03-28 20:32:33,977 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopping 
> Akka RPC service.
> Exception in thread "main" org.apache.flink.util.FlinkException: Failed to 
> execute job 'CarTopSpeedWindowingExample'.
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1975)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1853)
>   at 
> org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
>   at 
> org.apache.flink.streaming.examples.windowing.TopSpeedWindowing.main(TopSpeedWindowing.java:101)
> Caused by: java.lang.RuntimeException: Error while waiting for job to be 
> initialized
>   at 
> org.apache.flink.client.ClientUtils.waitUntilJobInitializationFinished(ClientUtils.java:160)
>   at 
> org.apache.flink.client.program.PerJobMiniClusterFactory.lambda$submitJob$2(PerJobMiniClusterFactory.java:83)
>   at 
> 

[jira] [Commented] (FLINK-21880) UnalignedCheckpointRescaleITCase fails on azure

2021-03-30 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311473#comment-17311473
 ] 

Arvid Heise commented on FLINK-21880:
-

The OP error is purely a test issue. Probably the cancellation of the first 
job happened right when chk-10 was created but left incomplete (shouldn't that 
be cleaned up?). 

The second error is swallowed by FLINK-22001; it's quite possibly the same 
error.

> UnalignedCheckpointRescaleITCase fails on azure
> ---
>
> Key: FLINK-21880
> URL: https://issues.apache.org/jira/browse/FLINK-21880
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15037=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9404
> {code}
> Caused by: java.lang.RuntimeException: 
> org.apache.flink.runtime.client.JobInitializationException: Could not start 
> the JobMaster.
>   at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
>   at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>   at 
> java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
>   at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>   at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Caused by: org.apache.flink.runtime.client.JobInitializationException: Could 
> not start the JobMaster.
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMasterServiceSafely(JobManagerRunnerImpl.java:390)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMaster(JobManagerRunnerImpl.java:351)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.verifyJobSchedulingStatusAndStartJobManager(JobManagerRunnerImpl.java:329)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.lambda$grantLeadership$1(JobManagerRunnerImpl.java:306)
>   at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
>   at 
> java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
>   at 
> java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:731)
>   at 
> java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2023)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.grantLeadership(JobManagerRunnerImpl.java:302)
>   at 
> org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService$GrantLeadershipCall.run(EmbeddedLeaderService.java:557)
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: Cannot find meta data file 
> '_metadata' in directory 
> 'file:/tmp/junit3214690414199099741/junit526862574608820717/894d38376ec2541a38da0a1d73831c16/chk-10'.
>  Please try to load the checkpoint/savepoint directly from the metadata file 
> instead of the directory.
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:290)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136)
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1626)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:381)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:308)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:218)
>   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:130)
>   at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:124)
>   at 
> 

[jira] [Updated] (FLINK-21181) Buffer pool is destroyed error when outputting data over a timer after cancellation.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21181:

Description: 
A [user 
reported|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/What-causes-a-buffer-pool-exception-How-can-I-mitigate-it-td40959.html]
 the issue and provided some taskmanager log with the following relevant lines:
{noformat}
2021-01-26 04:37:43,280 INFO  org.apache.flink.runtime.taskmanager.Task 
   [] - Attempting to cancel task forward fill -> (Sink: tag db sink, 
Sink: back fill db sink, Sink: min max step db sink) (2/2) 
(8c1f256176fb89f112c27883350a02bc).
2021-01-26 04:37:43,280 INFO  org.apache.flink.runtime.taskmanager.Task 
   [] - forward fill -> (Sink: tag db sink, Sink: back fill db sink, 
Sink: min max step db sink) (2/2) (8c1f256176fb89f112c27883350a02bc) switched 
from RUNNING to CANCELING.
2021-01-26 04:37:43,280 INFO  org.apache.flink.runtime.taskmanager.Task 
   [] - Triggering cancellation of task code forward fill -> (Sink: tag 
db sink, Sink: back fill db sink, Sink: min max step db sink) (2/2) 
(8c1f256176fb89f112c27883350a02bc).
2021-01-26 04:37:43,282 ERROR 
xx.pipeline.stream.functions.process.ForwardFillKeyedProcessFunction [] - 
Error in timer.
java.lang.RuntimeException: Buffer pool is destroyed.
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:787)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:740)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:53)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
xx.pipeline.stream.functions.process.ForwardFillKeyedProcessFunction.collect(ForwardFillKeyedProcessFunction.java:452)
 ~[develop-17e9fd0e.jar:?]
at 
xx.pipeline.stream.functions.process.ForwardFillKeyedProcessFunction.onTimer(ForwardFillKeyedProcessFunction.java:277)
 ~[develop-17e9fd0e.jar:?]
at 
org.apache.flink.streaming.api.operators.KeyedProcessOperator.invokeUserFunction(KeyedProcessOperator.java:94)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.api.operators.KeyedProcessOperator.onProcessingTime(KeyedProcessOperator.java:78)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.onProcessingTime(InternalTimerServiceImpl.java:260)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:1181)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$13(StreamTask.java:1172)
 ~[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
 [flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) 
[flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:270)
 [flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:190)
 [flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
 [flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:558)
 [flink-dist_2.12-1.11.0.jar:1.11.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:530) 
[flink-dist_2.12-1.11.0.jar:1.11.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) 
[develop-17e9fd0e.jar:?]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) 
[develop-17e9fd0e.jar:?]
at java.lang.Thread.run(Thread.java:748) 

[jira] [Assigned] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-31 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22053:
---

Assignee: Arvid Heise

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> --
>
> Key: FLINK-22053
> URL: https://issues.apache.org/jira/browse/FLINK-22053
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> If more splits are requested than there are elements in the sequence, the 
> job fails with:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>   at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
> ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.(NumberSequenceSource.java:148)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
> ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> ~[scala-library-2.11.12.jar:?]
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> ~[scala-library-2.11.12.jar:?]
>   ... 12 more
> {noformat}
> To reproduce
> {noformat}
> @Test
> public void testLessSplitsThanParallelism() throws Exception {
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> env.setParallelism(12);
> env.fromSequence(0, 10);
> env.execute();
> }
> {noformat}
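The failure mode can be reproduced without Flink by naive even-chunking arithmetic. The following is a hypothetical sketch (the method name {{divide}} and the exact chunking formula are illustrative, not the actual NumberSequenceSource implementation): splitting the 11-element range [0, 10] into 12 splits necessarily produces a split whose start exceeds its end, which is what trips the precondition in the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of even range chunking; the name "divide" and the
// formula are illustrative, not Flink's actual NumberSequenceSource code.
public class SplitSketch {

    static List<long[]> divide(long from, long to, int numSplits) {
        long total = to - from + 1; // 11 elements for [0, 10]
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            long start = from + i * total / numSplits;
            long end = from + (i + 1) * total / numSplits - 1;
            // With more splits than elements, some split collapses to
            // start > end, mirroring the "'from' must be <= 'to'" failure.
            if (start > end) {
                throw new IllegalArgumentException("'from' must be <= 'to'");
            }
            splits.add(new long[] {start, end});
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(divide(0, 10, 5).size()); // 5 valid splits
        try {
            divide(0, 10, 12); // mirrors setParallelism(12) + fromSequence(0, 10)
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```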





[jira] [Commented] (FLINK-22003) UnalignedCheckpointITCase fail

2021-03-31 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312302#comment-17312302
 ] 

Arvid Heise commented on FLINK-22003:
-

This one is really strange:

Usually, when we trigger a checkpoint, we see some actions on the source
{noformat}
22:24:51,290 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 1 (type=CHECKPOINT) @ 1616883891289 for job 
bb1c23aad807944d0a54775098106574.
22:24:51,290 [SourceCoordinator-Source: source] INFO  
org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase [] - 
snapshotState EnumeratorState{unassignedSplits=[], numRestarts=0, 
numCompletedCheckpoints=0}
22:24:51,291 [Flink Netty Server (0) Thread 0] TRACE 
org.apache.flink.runtime.io.network.logger.NetworkActionsLogger [] - [Source: 
source (2/5)#0 (8f1ca6eb04b6e2341c658cc0b1ac7c6c)] 
PipelinedSubpartition#pollBuffer Buffer{size=38, hash=924008396} @ 
ResultSubpartitionInfo{partitionIdx=0, subPartitionIdx=0}
{noformat}

In this case, after checkpoint 11 is triggered, nothing happens. 

{noformat}
22:24:54,694 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 11 (type=CHECKPOINT) @ 1616883893885 for job 
bb1c23aad807944d0a54775098106574.
22:24:54,694 [ failing-map (5/5)#4] INFO  
org.apache.flink.runtime.taskmanager.Task[] - failing-map 
(5/5)#4 (7c0e288b2cd57831596d58e8ce31e435) switched from CREATED to DEPLOYING.
{noformat}

Actually, it should have been canceled, as obviously not all tasks are 
running, similar to:

{noformat}
22:24:51,044 [Checkpoint Timer] WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job bb1c23aad807944d0a54775098106574.)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint triggering 
task Source: source (1/5) of job bb1c23aad807944d0a54775098106574 has not being 
executed at the moment. Aborting checkpoint. Failure reason: Not all required 
tasks are currently running.
{noformat}

I'm currently assuming that there is a race condition in the code of 
FLINK-21067.
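The precondition referenced in that last warning can be sketched as a plain-Java check. This is a hedged simplification (the enum and method names are illustrative, not the actual CheckpointCoordinator API): a trigger is only legal while every required task is RUNNING, which is exactly what the still-DEPLOYING failing-map subtask in the log above violates.

```java
import java.util.List;

// Illustrative simplification of the "all required tasks are running"
// precondition; names are not the actual CheckpointCoordinator code.
public class TriggerCheck {

    enum ExecutionState { CREATED, DEPLOYING, RUNNING, FINISHED }

    static boolean canTriggerCheckpoint(List<ExecutionState> requiredTasks) {
        // A trigger attempt should be aborted (and logged, as in the quoted
        // warning) when any required task is not yet RUNNING.
        return requiredTasks.stream().allMatch(s -> s == ExecutionState.RUNNING);
    }

    public static void main(String[] args) {
        // The situation from the log: one subtask still DEPLOYING.
        System.out.println(canTriggerCheckpoint(
                List.of(ExecutionState.RUNNING, ExecutionState.DEPLOYING))); // false
    }
}
```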

> UnalignedCheckpointITCase fail
> --
>
> Key: FLINK-22003
> URL: https://issues.apache.org/jira/browse/FLINK-22003
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Guowei Ma
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15601=logs=119bbba7-f5e3-5e08-e72d-09f1529665de=7dc1f5a9-54e1-502e-8b02-c7df69073cfc=4142
> {code:java}
> [ERROR] execute[parallel pipeline with remote channels, p = 
> 5](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 60.018 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 60000 
> milliseconds
>   at sun.misc.Unsafe.park(Native Method)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>   at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1859)
>   at 
> org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822)
>   at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:138)
>   at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:184)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> 

[jira] [Commented] (FLINK-22003) UnalignedCheckpointITCase fail

2021-03-31 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312550#comment-17312550
 ] 

Arvid Heise commented on FLINK-22003:
-

In general, I have seen long recovery times in these tests from time to time, 
so it's quite possible that it's indeed just a test issue. I had wrongly 
assumed that all tasks need to be running for checkpoints to be triggered.

I already removed the timeout - it's rather arbitrary and we should never add 
it to master. I'm also positive that FLINK-17012 would alleviate a bit of the 
pain by aborting all checkpoints while the source is still recovering.

> UnalignedCheckpointITCase fail
> --
>
> Key: FLINK-22003
> URL: https://issues.apache.org/jira/browse/FLINK-22003
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Guowei Ma
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15601=logs=119bbba7-f5e3-5e08-e72d-09f1529665de=7dc1f5a9-54e1-502e-8b02-c7df69073cfc=4142
> {code:java}
> [ERROR] execute[parallel pipeline with remote channels, p = 
> 5](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 60.018 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 60000 
> milliseconds
>   at sun.misc.Unsafe.park(Native Method)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>   at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1859)
>   at 
> org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822)
>   at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:138)
>   at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:184)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-21873) CoordinatedSourceRescaleITCase.testUpscaling fails on AZP

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-21873:
---

Assignee: Arvid Heise

> CoordinatedSourceRescaleITCase.testUpscaling fails on AZP
> -
>
> Key: FLINK-21873
> URL: https://issues.apache.org/jira/browse/FLINK-21873
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.13.0
>Reporter: Till Rohrmann
>Assignee: Arvid Heise
>Priority: Major
>  Labels: test-stability
> Fix For: 1.14.0
>
>
> The test {{CoordinatedSourceRescaleITCase.testUpscaling}} fails on AZP with
> {code}
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   ... 4 more
> Caused by: java.lang.Exception: successfully restored checkpoint
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:139)
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:126)
>   at 
> org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:161)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
>   at 
> org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:95)
>   at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:275)
>   at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>   at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:408)
>   at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:190)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:624)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:588)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14997=logs=fc5181b0-e452-5c8f-68de-1097947f6483=62110053-334f-5295-a0ab-80dd7e2babbf=22049
> cc [~AHeise]





[jira] [Commented] (FLINK-22132) Test unaligned checkpoints rescaling manually on a real cluster

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316161#comment-17316161
 ] 

Arvid Heise commented on FLINK-22132:
-

Since the task is pretty huge, I'm splitting it into subtasks for defining the 
application and executing the application.

> Test unaligned checkpoints rescaling manually on a real cluster
> ---
>
> Key: FLINK-22132
> URL: https://issues.apache.org/jira/browse/FLINK-22132
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.13.0
>
>
> To test unaligned checkpoints, we should use a few different applications 
> that use different features:
> - Mixing forward/rescale channels with keyby or other shuffle operations
> - Unions
> - 2 or n-ary operators
> - Associated state ((keyed) process function)
> - Correctness verifications
> The sinks should not be mocked but rather should be able to induce a fair 
> amount of backpressure into the system. Then, after induced failure, the user 
> needs to restart from a retained checkpoint with
> - lower
> - same
> - higher degree of parallelism.
> To enable unaligned checkpoints, set 
> - execution.checkpointing.unaligned: true
> - execution.checkpointing.alignment-timeout to 0s, 10s, 1min (for high 
> backpressure)
> The primary objective is to check if all data is recovered properly and if 
> the semantics is correct (does state match input?). 
> The secondary objective is to check if Flink UI shows the information 
> correctly:
> - unaligned checkpoint enabled on job level
> - timeout on job level
> - for each checkpoint, if it's unaligned or not; how much data was written





[jira] [Updated] (FLINK-22132) Test unaligned checkpoints rescaling manually on a real cluster

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22132:

Description: 
To test unaligned checkpoints, we should use a few different applications that 
use different features:
- Mixing forward/rescale channels with keyby or other shuffle operations
- Unions
- 2 or n-ary operators
- Associated state ((keyed) process function)
- Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair 
amount of backpressure into the system. Then, after induced failure, the user 
needs to restart from a retained checkpoint with
- lower
- same
- higher degree of parallelism.

To enable unaligned checkpoints, set 
- execution.checkpointing.unaligned: true
- execution.checkpointing.alignment-timeout to 0s, 10s, 1min (for high 
backpressure)

The primary objective is to check if all data is recovered properly and if the 
semantics is correct (does state match input?). 

The secondary objective is to check if Flink UI shows the information correctly:
- unaligned checkpoint enabled on job level
- timeout on job level
- for each checkpoint, if it's unaligned or not; how much data was written

> Test unaligned checkpoints rescaling manually on a real cluster
> ---
>
> Key: FLINK-22132
> URL: https://issues.apache.org/jira/browse/FLINK-22132
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.13.0
>
>
> To test unaligned checkpoints, we should use a few different applications 
> that use different features:
> - Mixing forward/rescale channels with keyby or other shuffle operations
> - Unions
> - 2 or n-ary operators
> - Associated state ((keyed) process function)
> - Correctness verifications
> The sinks should not be mocked but rather should be able to induce a fair 
> amount of backpressure into the system. Then, after induced failure, the user 
> needs to restart from a retained checkpoint with
> - lower
> - same
> - higher degree of parallelism.
> To enable unaligned checkpoints, set 
> - execution.checkpointing.unaligned: true
> - execution.checkpointing.alignment-timeout to 0s, 10s, 1min (for high 
> backpressure)
> The primary objective is to check if all data is recovered properly and if 
> the semantics is correct (does state match input?). 
> The secondary objective is to check if Flink UI shows the information 
> correctly:
> - unaligned checkpoint enabled on job level
> - timeout on job level
> - for each checkpoint, if it's unaligned or not; how much data was written





[jira] [Created] (FLINK-22136) Devise application for unaligned checkpoint test on cluster

2021-04-07 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-22136:
---

 Summary: Devise application for unaligned checkpoint test on 
cluster
 Key: FLINK-22136
 URL: https://issues.apache.org/jira/browse/FLINK-22136
 Project: Flink
  Issue Type: Sub-task
Reporter: Arvid Heise


To test unaligned checkpoints, we should use a few different applications that 
use different features:

* Mixing forward/rescale channels with keyby or other shuffle operations
* Unions
* 2 or n-ary operators
* Associated state ((keyed) process function)
* Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair 
amount of backpressure into the system. Quite possibly, it would be a good idea 
to have a way to add more backpressure to the sink by running the respective 
system on the cluster and being able to add/remove parallel instances.







[jira] [Created] (FLINK-22137) Execute unaligned checkpoint test on a cluster

2021-04-07 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-22137:
---

 Summary: Execute unaligned checkpoint test on a cluster
 Key: FLINK-22137
 URL: https://issues.apache.org/jira/browse/FLINK-22137
 Project: Flink
  Issue Type: Sub-task
Reporter: Arvid Heise


Start the application and, at some point, cancel it or induce a failure; the 
user then needs to restart from a retained checkpoint with

* lower
* same
* higher degree of parallelism.

To enable unaligned checkpoints, set

* execution.checkpointing.unaligned: true
* execution.checkpointing.alignment-timeout to 0s, 10s, 1min (for high 
backpressure)
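In flink-conf.yaml, the runs above would use settings along these lines (a sketch only; the exact timeout value is varied per run as described, and the 1.13 option names are assumed):

```yaml
# Enable unaligned checkpoints for the test job.
execution.checkpointing.unaligned: true
# Alignment timeout before a checkpoint falls back to unaligned mode;
# repeat the run with 0s, 10s, and 1min to vary backpressure behavior.
execution.checkpointing.alignment-timeout: 10 s
```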

The primary objective is to check if all data is recovered properly and if the 
semantics is correct (does state match input?).

The secondary objective is to check if Flink UI shows the information correctly:

* unaligned checkpoint enabled on job level
* timeout on job level
* for each checkpoint, if it's unaligned or not; how much data was written





[jira] [Updated] (FLINK-22132) Test unaligned checkpoints rescaling manually on a real cluster

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22132:

Description: 
To test unaligned checkpoints, we should use a few different applications that 
use different features.

The sinks should not be mocked but rather should be able to induce a fair 
amount of backpressure into the system. Quite possibly, it would be a good idea 
to have a way to add more backpressure to the sink by running the respective 
system on the cluster and being able to add/remove parallel instances.

The primary objective is to check if all data is recovered properly and if the 
semantics is correct (does state match input?). 

The secondary objective is to check if Flink UI shows the information correctly.

More details in the subtasks.

  was:
To test unaligned checkpoints, we should use a few different applications that 
use different features:
- Mixing forward/rescale channels with keyby or other shuffle operations
- Unions
- 2 or n-ary operators
- Associated state ((keyed) process function)
- Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair 
amount of backpressure into the system. Then, after induced failure, the user 
needs to restart from a retained checkpoint with
- lower
- same
- higher degree of parallelism.

To enable unaligned checkpoints, set 
- execution.checkpointing.unaligned: true
- execution.checkpointing.alignment-timeout to 0s, 10s, 1min (for high 
backpressure)

The primary objective is to check if all data is recovered properly and if the 
semantics is correct (does state match input?). 

The secondary objective is to check if Flink UI shows the information correctly:
- unaligned checkpoint enabled on job level
- timeout on job level
- for each checkpoint, if it's unaligned or not; how much data was written


> Test unaligned checkpoints rescaling manually on a real cluster
> ---
>
> Key: FLINK-22132
> URL: https://issues.apache.org/jira/browse/FLINK-22132
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.13.0
>
>
> To test unaligned checkpoints, we should use a few different applications 
> that use different features.
> The sinks should not be mocked but rather should be able to induce a fair 
> amount of backpressure into the system. Quite possibly, it would be a good 
> idea to have a way to add more backpressure to the sink by running the 
> respective system on the cluster and be able to add/remove parallel instances.
> The primary objective is to check if all data is recovered properly and if 
> the semantics is correct (does state match input?). 
> The secondary objective is to check if Flink UI shows the information 
> correctly.
> More details in the subtasks.





[jira] [Commented] (FLINK-22084) RescalingITCase fails with adaptive scheduler

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316018#comment-17316018
 ] 

Arvid Heise commented on FLINK-22084:
-

We shouldn't support changes in maxParallelism. But in this case, we are 
telling Flink in the second run to derive the value by itself, and that should 
be set to 13.

I'm mostly wondering why it was working before but doesn't work anymore. Are we 
replacing -1 with 128 at the correct place?

Btw, if it's now too complicated to support, I'd also be fine with dropping it 
(since it's super rare), but I still think that this is a valid use case.

> RescalingITCase fails with adaptive scheduler
> -
>
> Key: FLINK-22084
> URL: https://issues.apache.org/jira/browse/FLINK-22084
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Austin Cawley-Edwards
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15934=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=a0a633b8-47ef-5c5a-2806-3c13b9e48228=4472
> {code}
> 2021-03-31T22:16:07.8416407Z [ERROR] 
> testSavepointRescalingOutKeyedStateDerivedMaxParallelism[backend = 
> rocksdb](org.apache.flink.test.checkpointing.RescalingITCase)  Time elapsed: 
> 9.945 s  <<< ERROR!
> 2021-03-31T22:16:07.8417534Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-03-31T22:16:07.8418516Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-03-31T22:16:07.8419281Z  at 
> org.apache.flink.test.util.TestUtils.submitJobAndWaitForResult(TestUtils.java:63)
> 2021-03-31T22:16:07.8420142Z  at 
> org.apache.flink.test.checkpointing.RescalingITCase.testSavepointRescalingKeyedState(RescalingITCase.java:251)
> 2021-03-31T22:16:07.8421173Z  at 
> org.apache.flink.test.checkpointing.RescalingITCase.testSavepointRescalingOutKeyedStateDerivedMaxParallelism(RescalingITCase.java:168)
> 2021-03-31T22:16:07.8421985Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-03-31T22:16:07.8422651Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-03-31T22:16:07.8423649Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-03-31T22:16:07.8424231Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-03-31T22:16:07.8424657Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-03-31T22:16:07.8425147Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-03-31T22:16:07.8425609Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-03-31T22:16:07.8426183Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-03-31T22:16:07.8569060Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2021-03-31T22:16:07.8569781Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-03-31T22:16:07.8570451Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-03-31T22:16:07.8571040Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-03-31T22:16:07.8571604Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-03-31T22:16:07.8572303Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-03-31T22:16:07.8573259Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-03-31T22:16:07.8573975Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-03-31T22:16:07.8574660Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-03-31T22:16:07.8575359Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-03-31T22:16:07.8576037Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-03-31T22:16:07.8576728Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-03-31T22:16:07.8577588Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-03-31T22:16:07.8578181Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-03-31T22:16:07.8578771Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-03-31T22:16:07.8579402Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-03-31T22:16:07.8580061Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-03-31T22:16:07.8580774Z  at 
> 

[jira] [Commented] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316029#comment-17316029
 ] 

Arvid Heise commented on FLINK-22019:
-

For the time being, I'm assuming that this is caused by FLINK-22088 and kind of 
duplicates FLINK-22003.

> UnalignedCheckpointRescaleITCase hangs on azure
> ---
>
> Key: FLINK-22019
> URL: https://issues.apache.org/jira/browse/FLINK-22019
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9347





[jira] [Assigned] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-20816:
---

Assignee: Arvid Heise

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed caused by a failing of {{NotifyCheckpointAbortedITCase}} due to a 
> timeout.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is not used anymore and should have affected only unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Commented] (FLINK-16504) Add a AWS DynamoDB sink

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316107#comment-17316107
 ] 

Arvid Heise commented on FLINK-16504:
-

It seems as if I get proper priority for creating the connector ecosystem in 
1.14, so I'd keep it out for now (it wouldn't make it into 1.13 anyway, and we 
can easily add it in 1.14 at the latest).

> Add a AWS DynamoDB sink
> ---
>
> Key: FLINK-16504
> URL: https://issues.apache.org/jira/browse/FLINK-16504
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Common
>Reporter: Robert Metzger
>Assignee: Yuri Gusev
>Priority: Major
>
> I'm adding this ticket to track the amount of demand for this connector.
> Please comment on this ticket, if you are looking for a DynamoDB connector.





[jira] [Assigned] (FLINK-21797) Performance regression on 03/11/21

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-21797:
---

Assignee: (was: Arvid Heise)

> Performance regression on 03/11/21
> --
>
> Key: FLINK-21797
> URL: https://issues.apache.org/jira/browse/FLINK-21797
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Priority: Major
>
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3,5=tupleKeyBy=2=200=off=on=on





[jira] [Closed] (FLINK-21378) Rescale pointwise connection during unaligned checkpoint recovery

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise closed FLINK-21378.
---
Resolution: Incomplete

The idea is incomplete, see FLINK-21936.

> Rescale pointwise connection during unaligned checkpoint recovery
> -
>
> Key: FLINK-21378
> URL: https://issues.apache.org/jira/browse/FLINK-21378
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> FLINK-19801 added support for rescaling of unaligned checkpoints through 
> virtual channels: A mapping of old to new channel infos helped to create a 
> virtual channel that demultiplexes buffers from different original channel 
> over the same physical channel.
> The calculation of FLINK-19801, however, assumes that subpartition = channel 
> index, which holds for all fully connected exchanges, but not for point-wise 
> connection. For point-wise connections, there are only a few channels per 
> subtask, and each corresponds to one particular subpartition.
> A possible approach is to actually use the subpartition information while 
> constructing {{InflightDataRescalingDescriptor}} in {{TaskStateAssignment}}. 
> Thus, instead of taking subtask index as the channel index, we should take 
> the subpartition as the channel index. The easiest way to implement it is by 
> translating the subtask index to a subpartition index and then calculating 
> the channel index from it.
> For that, the following changes are needed:
> * {{StateAssignmentOperation}} attaches the (upstream/downstream) -> 
> subpartition mapping to all assignments of pointwise exchanges. The 
> information can be derived through {{ExecutionEdge}} -> 
> {{IntermediateResultPartition.partitionNumber}} (note that on execution graph 
> level subpartition is named partition).
> * For non-pointwise exchanges, this mapping is the identity function.
> * {{TaskStateAssignment}} uses this additional lookup to translate subtask 
> mapping to subpartition mappings, which can then be used to calculate the 
> channel indexes both on input and output side.
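The core distinction the description relies on (the subpartition index equals the consumer subtask index only for fully connected exchanges) can be sketched in a standalone way. These are hypothetical helper methods for illustration, not Flink classes, and they assume the consumer parallelism divides evenly by the producer parallelism:

```java
// Illustrative only: hypothetical helpers, not Flink code. Assumes the
// consumer parallelism is an exact multiple of the producer parallelism
// in the pointwise case.
public final class PointwiseChannelMapping {

    // Fully connected exchange: consumer subtask i reads subpartition i of
    // every producer, so subpartition index == subtask index.
    static int fullyConnectedSubpartition(int consumerSubtask) {
        return consumerSubtask;
    }

    // Pointwise exchange (e.g. rescale) with producers < consumers: each
    // producer fans out to (consumers / producers) subpartitions, and
    // consumer i reads subpartition (i % factor) of producer (i / factor).
    // The subpartition index is local, so it no longer equals the subtask
    // index.
    static int pointwiseSubpartition(int consumerSubtask, int producers, int consumers) {
        int factor = consumers / producers;
        return consumerSubtask % factor;
    }

    public static void main(String[] args) {
        // 2 producers -> 4 consumers: consumer subtask 3 reads subpartition 1
        // of producer 1, not subpartition 3.
        System.out.println(pointwiseSubpartition(3, 2, 4)); // prints 1
        System.out.println(fullyConnectedSubpartition(3));  // prints 3
    }
}
```

This is why taking the subtask index as the channel index works for fully connected exchanges but breaks for pointwise connections, as the description notes.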





[jira] [Commented] (FLINK-21880) UnalignedCheckpointRescaleITCase fails on azure

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316001#comment-17316001
 ] 

Arvid Heise commented on FLINK-21880:
-

Merged fix into master as 2b1cf60219ba3ae88c3bd7e2537f726caf00e16d.

> UnalignedCheckpointRescaleITCase fails on azure
> ---
>
> Key: FLINK-21880
> URL: https://issues.apache.org/jira/browse/FLINK-21880
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15037=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9404
> {code}
> Caused by: java.lang.RuntimeException: 
> org.apache.flink.runtime.client.JobInitializationException: Could not start 
> the JobMaster.
>   at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
>   at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>   at 
> java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
>   at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>   at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Caused by: org.apache.flink.runtime.client.JobInitializationException: Could 
> not start the JobMaster.
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMasterServiceSafely(JobManagerRunnerImpl.java:390)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMaster(JobManagerRunnerImpl.java:351)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.verifyJobSchedulingStatusAndStartJobManager(JobManagerRunnerImpl.java:329)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.lambda$grantLeadership$1(JobManagerRunnerImpl.java:306)
>   at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
>   at 
> java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
>   at 
> java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:731)
>   at 
> java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2023)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.grantLeadership(JobManagerRunnerImpl.java:302)
>   at 
> org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService$GrantLeadershipCall.run(EmbeddedLeaderService.java:557)
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: Cannot find meta data file 
> '_metadata' in directory 
> 'file:/tmp/junit3214690414199099741/junit526862574608820717/894d38376ec2541a38da0a1d73831c16/chk-10'.
>  Please try to load the checkpoint/savepoint directly from the metadata file 
> instead of the directory.
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:290)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136)
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1626)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:381)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:308)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:218)
>   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:130)
>   at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:124)
>   at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:108)
>   at 
> 

[jira] [Resolved] (FLINK-21880) UnalignedCheckpointRescaleITCase fails on azure

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21880.
---------------------------------
Fix Version/s: 1.13.0
   Resolution: Fixed

> UnalignedCheckpointRescaleITCase fails on azure
> -----------------------------------------------
>
> Key: FLINK-21880
> URL: https://issues.apache.org/jira/browse/FLINK-21880
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15037&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=5360d54c-8d94-5d85-304e-a89267eb785a&l=9404
> {code}
> Caused by: java.lang.RuntimeException: 
> org.apache.flink.runtime.client.JobInitializationException: Could not start 
> the JobMaster.
>   at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
>   at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>   at 
> java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
>   at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>   at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Caused by: org.apache.flink.runtime.client.JobInitializationException: Could 
> not start the JobMaster.
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMasterServiceSafely(JobManagerRunnerImpl.java:390)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMaster(JobManagerRunnerImpl.java:351)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.verifyJobSchedulingStatusAndStartJobManager(JobManagerRunnerImpl.java:329)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.lambda$grantLeadership$1(JobManagerRunnerImpl.java:306)
>   at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
>   at 
> java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
>   at 
> java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:731)
>   at 
> java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2023)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.grantLeadership(JobManagerRunnerImpl.java:302)
>   at 
> org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService$GrantLeadershipCall.run(EmbeddedLeaderService.java:557)
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: Cannot find meta data file 
> '_metadata' in directory 
> 'file:/tmp/junit3214690414199099741/junit526862574608820717/894d38376ec2541a38da0a1d73831c16/chk-10'.
>  Please try to load the checkpoint/savepoint directly from the metadata file 
> instead of the directory.
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:290)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136)
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1626)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:381)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:308)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:218)
>   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:130)
>   at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:124)
>   at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:108)
>   at 
> 

[jira] [Commented] (FLINK-21873) CoordinatedSourceRescaleITCase.testUpscaling fails on AZP

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316183#comment-17316183
 ] 

Arvid Heise commented on FLINK-21873:
-------------------------------------

The exception actually shows that everything works as expected. However, the 
test currently checks the exception message by identity (==), while we 
apparently received an equal but not identical message (I suspect this is 
related to the interning of strings in Java's string pool). So the proper fix 
is simply to check for equality.

While doing so, I have also switched to Flink's internal ExceptionUtils and 
made the check more lenient.

> CoordinatedSourceRescaleITCase.testUpscaling fails on AZP
> ---------------------------------------------------------
>
> Key: FLINK-21873
> URL: https://issues.apache.org/jira/browse/FLINK-21873
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.13.0
>Reporter: Till Rohrmann
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0
>
>
> The test {{CoordinatedSourceRescaleITCase.testUpscaling}} fails on AZP with
> {code}
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   ... 4 more
> Caused by: java.lang.Exception: successfully restored checkpoint
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:139)
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:126)
>   at 
> org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:161)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
>   at 
> org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:95)
>   at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:275)
>   at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>   at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:408)
>   at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:190)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:624)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:588)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14997&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=62110053-334f-5295-a0ab-80dd7e2babbf&l=22049
> cc [~AHeise]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19801) Add support for virtual channels

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-19801:

Release Note: While recovering from unaligned checkpoints, users can now 
change the parallelism of the job. This change allows users to quickly upscale 
the job under backpressure.

> Add support for virtual channels
> --------------------------------
>
> Key: FLINK-19801
> URL: https://issues.apache.org/jira/browse/FLINK-19801
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> During rescaling of unaligned checkpoints, if state from multiple former 
> channels is read on the input or output side to recover a specific channel, 
> these buffers are multiplexed on the output side and demultiplexed on the 
> input side to guarantee a consistent recovery of spanning records:
> Assume two channels C1 and C2 connect operators A and B, and each channel 
> has one buffer in its output part and one in its input part in which a 
> record spans. Name the buffers O1 for the output buffer of C1, I2 for the 
> input buffer of C2, and so on. After rescaling, both channels become one 
> channel C, and the buffers may be restored as I1, I2, O1, O2.
> Channels use the mapping of FLINK-19533 to infer the need for virtual 
> channels and to distribute the needed resources. Virtual channels are 
> removed on the EndOfChannelRecovery epoch marker.
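The multiplexing scheme described above can be sketched as follows (hypothetical names and types, not Flink's actual runtime classes): each recovered buffer keeps a tag identifying its former channel, so the input side can route buffers into per-channel queues, the "virtual channels", and reassemble spanning records consistently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class VirtualChannelSketch {
    /** A recovered buffer tagged with the id of the channel it came from. */
    record TaggedBuffer(int originalChannel, String payload) {}

    // Output side: interleave all recovered buffers into the single rescaled
    // channel, keeping the original-channel tag on each buffer.
    static List<TaggedBuffer> multiplex(Map<Integer, List<String>> perChannelBuffers) {
        List<TaggedBuffer> stream = new ArrayList<>();
        perChannelBuffers.forEach((channel, buffers) ->
                buffers.forEach(b -> stream.add(new TaggedBuffer(channel, b))));
        return stream;
    }

    // Input side: demultiplex by tag so that a record spanning several buffers
    // is reassembled only from buffers of its original channel.
    static Map<Integer, Queue<String>> demultiplex(List<TaggedBuffer> stream) {
        Map<Integer, Queue<String>> virtualChannels = new HashMap<>();
        for (TaggedBuffer b : stream) {
            virtualChannels
                    .computeIfAbsent(b.originalChannel(), k -> new ArrayDeque<>())
                    .add(b.payload());
        }
        return virtualChannels;
    }

    public static void main(String[] args) {
        // O1 from former channel C1 and O2 from C2, restored into one channel C.
        Map<Integer, List<String>> recovered = Map.of(1, List.of("O1"), 2, List.of("O2"));
        System.out.println(demultiplex(multiplex(recovered)));
    }
}
```

Once a virtual channel's queue is drained it can be torn down, which in the scheme above corresponds to the EndOfChannelRecovery epoch marker.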





[jira] [Assigned] (FLINK-22053) Recovery of a completed split in NumberSequenceSource causes fatal exception.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22053:
-----------------------------------

Assignee: Arvid Heise

> Recovery of a completed split in NumberSequenceSource causes fatal exception.
> -----------------------------------------------------------------------------
>
> Key: FLINK-22053
> URL: https://issues.apache.org/jira/browse/FLINK-22053
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> If a checkpoint happens after the only split is processed, the split is 
> checkpointed with (from > to). Upon recovery this split causes an exception 
> in the coordinator and a subsequent fatal exception.
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>   at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
> ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
> ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> ~[scala-library-2.11.12.jar:?]
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> ~[scala-library-2.11.12.jar:?]
>   ... 12 more
> {noformat}
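A minimal sketch of the failure mode (a simplified, hypothetical class mirroring the precondition in NumberSequenceSplit's constructor): after the reader has emitted the last element, its position is to + 1, so checkpointing the remaining range of a completed split yields from > to, and recreating that split on recovery trips the precondition.

```java
public class CompletedSplitDemo {
    /** Simplified stand-in for NumberSequenceSplit with the same precondition. */
    static final class Split {
        final long from, to;
        Split(long from, long to) {
            if (from > to) { // mirrors Preconditions.checkArgument(from <= to)
                throw new IllegalArgumentException("'from' must be <= 'to'");
            }
            this.from = from;
            this.to = to;
        }
    }

    public static void main(String[] args) {
        long to = 10;
        long current = to + 1; // reader finished the split; position is past 'to'

        // Checkpointing the remaining range of a completed split yields (11, 10),
        // which blows up when the coordinator recreates the split on recovery.
        try {
            new Split(current, to);
        } catch (IllegalArgumentException e) {
            System.out.println("recovery fails: " + e.getMessage());
        }
    }
}
```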





[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22053:

Summary: NumberSequenceSource causes fatal exception when less splits than 
parallelism.  (was: Recovery of a completed split in NumberSequenceSource 
causes fatal exception.)

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> ------------------------------------------------------------------------------
>
> Key: FLINK-22053
> URL: https://issues.apache.org/jira/browse/FLINK-22053
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> If a checkpoint happens after the only split is processed, the split is 
> checkpointed with (from > to). Upon recovery this split causes an exception 
> in the coordinator and a subsequent fatal exception.
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>   at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
> ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
> ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> ~[scala-library-2.11.12.jar:?]
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> ~[scala-library-2.11.12.jar:?]
>   ... 12 more
> {noformat}




