[jira] [Updated] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-06-10 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16821:
-
Priority: Major  (was: Blocker)

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Affects Versions: 1.10.0
>Reporter: Zhijiang
>Assignee: Robert Metzger
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.1, 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> Log output
> {code:java}
> 2020-03-27T00:07:38.9666021Z Running 'Run Kubernetes test'
> 2020-03-27T00:07:38.956Z 
> ==
> 2020-03-27T00:07:38.9677101Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-38967103614
> 2020-03-27T00:07:41.7529865Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.7721475Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.8208394Z Docker version 19.03.8, build afacb8b7f0
> 2020-03-27T00:07:42.4793914Z docker-compose version 1.25.4, build 8d51620a
> 2020-03-27T00:07:42.5359301Z Installing minikube ...
> 2020-03-27T00:07:42.5494076Z   % Total% Received % Xferd  Average Speed   
> TimeTime Time  Current
> 2020-03-27T00:07:42.5494729Z  Dload  Upload   
> Total   SpentLeft  Speed
> 2020-03-27T00:07:42.5498136Z 
> 2020-03-27T00:07:42.6214887Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3467750Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3469636Z 100 52.0M  100 52.0M0 0  65.2M  0 
> --:--:-- --:--:-- --:--:-- 65.2M
> 2020-03-27T00:07:43.4262625Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.4264438Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.4282404Z Starting minikube ...
> 2020-03-27T00:07:43.7749694Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:43.7761742Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:43.7762229Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:43.8202161Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8203353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8568899Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8570685Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8583793Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:48.9017252Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:48.9019347Z   - To fix this, run: minikube start
> 2020-03-27T00:07:48.9031515Z Starting minikube ...
> 2020-03-27T00:07:49.0612601Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:49.0616688Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:49.0620173Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:49.1040676Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1042353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1453522Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1454594Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1468436Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:54.1907713Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.1909876Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.1921479Z Starting minikube ...
> 2020-03-27T00:07:54.3388738Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:54.3395499Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:54.3396443Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:54.3824399Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.3837652Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4203902Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.4204895Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4217866Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 

[jira] [Assigned] (FLINK-18238) RemoteChannelThroughputBenchmark deadlocks

2020-06-10 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18238:


Assignee: Yingjie Cao

> RemoteChannelThroughputBenchmark deadlocks
> --
>
> Key: FLINK-18238
> URL: https://issues.apache.org/jira/browse/FLINK-18238
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Runtime / Network
>Affects Versions: 1.11.0
>Reporter: Piotr Nowojski
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.11.0
>
>
> In the last couple of days 
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the 
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18136) Don't start channel state writing for savepoints (RPC)

2020-06-09 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129018#comment-17129018
 ] 

Zhijiang edited comment on FLINK-18136 at 6/10/20, 5:27 AM:


Merged in master: 6dc624d931858e0ee5da14bc6b41b834c94cbba7

Merged in release-1.11: 7f410fa4f4758e48fcc6c33d8354c21ab02e2dc6


was (Author: zjwang):
Merged in master: 6dc624d931858e0ee5da14bc6b41b834c94cbba7

> Don't start channel state writing for savepoints (RPC)
> --
>
> Key: FLINK-18136
> URL: https://issues.apache.org/jira/browse/FLINK-18136
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.11.0
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> ChannelStateWriter#start should be only called for unaligned checkpoint. 
> While source triggering savepoint, 
> SubtaskCheckpointCoordinator#initCheckpoint is introduced to judge the 
> condition whether to start the internal writer or not. And this new method is 
> also used in other places like CheckpointBarrierUnaligner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18057) SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> but was:<2>

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-18057.

Resolution: Fixed

Merged in release-1.11 : fe229a319fff73512953ea1b55391b29e48db129

Merged in master: 9a7dbfcb8cbc6d72083c03ad83d9df97660f15d4

> SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> 
> but was:<2>
> ---
>
> Key: FLINK-18057
> URL: https://issues.apache.org/jira/browse/FLINK-18057
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0, 1.12.0
>
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2524=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> java.lang.AssertionError: expected:<3> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateTest.testConcurrentReadStateAndProcessAndClose(SingleInputGateTest.java:239)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18063) Fix the race condition for aborting current checkpoint in CheckpointBarrierUnaligner

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18063:
-
Summary: Fix the race condition for aborting current checkpoint in 
CheckpointBarrierUnaligner  (was: Fix the race condition for aborting current 
checkpoint in CheckpointBarrierUnaligner#processEndOfPartition)

> Fix the race condition for aborting current checkpoint in 
> CheckpointBarrierUnaligner
> 
>
> Key: FLINK-18063
> URL: https://issues.apache.org/jira/browse/FLINK-18063
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> There are three aborting scenarios which might encounter race condition:
>     1. CheckpointBarrierUnaligner#processCancellationBarrier
>     2. CheckpointBarrierUnaligner#processEndOfPartition
>     3. AlternatingCheckpointBarrierHandler#processBarrier
> We only consider the pending checkpoint triggered by #processBarrier from 
> task thread to abort it. Actually the checkpoint might also be triggered by 
> #notifyBarrierReceived from netty thread in race condition, so we should also 
> handle properly to abort it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18063) Fix the race condition for aborting current checkpoint in CheckpointBarrierUnaligner#processEndOfPartition

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-18063.

Resolution: Fixed

Merged in master: f672ae53596aa26f1e3ed4b3d24aff2197be749d

Merged in release-1.11: 2fe888c9aff75a3ace18c30764b26cc9f53e2451

> Fix the race condition for aborting current checkpoint in 
> CheckpointBarrierUnaligner#processEndOfPartition
> --
>
> Key: FLINK-18063
> URL: https://issues.apache.org/jira/browse/FLINK-18063
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> There are three aborting scenarios which might encounter race condition:
>     1. CheckpointBarrierUnaligner#processCancellationBarrier
>     2. CheckpointBarrierUnaligner#processEndOfPartition
>     3. AlternatingCheckpointBarrierHandler#processBarrier
> We only consider the pending checkpoint triggered by #processBarrier from 
> task thread to abort it. Actually the checkpoint might also be triggered by 
> #notifyBarrierReceived from netty thread in race condition, so we should also 
> handle properly to abort it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18063) Fix the race condition for aborting checkpoint in CheckpointBarrierUnaligner

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18063:
-
Summary: Fix the race condition for aborting checkpoint in 
CheckpointBarrierUnaligner  (was: Fix the race condition for aborting current 
checkpoint in CheckpointBarrierUnaligner)

> Fix the race condition for aborting checkpoint in CheckpointBarrierUnaligner
> 
>
> Key: FLINK-18063
> URL: https://issues.apache.org/jira/browse/FLINK-18063
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> There are three aborting scenarios which might encounter race condition:
>     1. CheckpointBarrierUnaligner#processCancellationBarrier
>     2. CheckpointBarrierUnaligner#processEndOfPartition
>     3. AlternatingCheckpointBarrierHandler#processBarrier
> We only consider the pending checkpoint triggered by #processBarrier from 
> task thread to abort it. Actually the checkpoint might also be triggered by 
> #notifyBarrierReceived from netty thread in race condition, so we should also 
> handle properly to abort it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18063) Fix the race condition for aborting current checkpoint in CheckpointBarrierUnaligner#processEndOfPartition

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18063:
-
Description: 
There are three aborting scenarios which might encounter race condition:

    1. CheckpointBarrierUnaligner#processCancellationBarrier

    2. CheckpointBarrierUnaligner#processEndOfPartition

    3. AlternatingCheckpointBarrierHandler#processBarrier

We only consider the pending checkpoint triggered by #processBarrier from task 
thread to abort it. Actually the checkpoint might also be triggered by 
#notifyBarrierReceived from netty thread in race condition, so we should also 
handle properly to abort it.

  was:
In the handle of CheckpointBarrierUnaligner#processEndOfPartition, it only 
aborts the current checkpoint by judging the condition of pending checkpoint 
from task thread processing, so it will miss one scenario that checkpoint 
triggered by notifyBarrierReceived from netty thread.

The proper fix should also judge the pending checkpoint inside 
ThreadSafeUnaligner in order to abort it and reset internal variables in case.


> Fix the race condition for aborting current checkpoint in 
> CheckpointBarrierUnaligner#processEndOfPartition
> --
>
> Key: FLINK-18063
> URL: https://issues.apache.org/jira/browse/FLINK-18063
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> There are three aborting scenarios which might encounter race condition:
>     1. CheckpointBarrierUnaligner#processCancellationBarrier
>     2. CheckpointBarrierUnaligner#processEndOfPartition
>     3. AlternatingCheckpointBarrierHandler#processBarrier
> We only consider the pending checkpoint triggered by #processBarrier from 
> task thread to abort it. Actually the checkpoint might also be triggered by 
> #notifyBarrierReceived from netty thread in race condition, so we should also 
> handle properly to abort it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17260) StreamingKafkaITCase failure on Azure

2020-06-09 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129065#comment-17129065
 ] 

Zhijiang commented on FLINK-17260:
--

another instance 
[https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/3003/logs/143]

> StreamingKafkaITCase failure on Azure
> -
>
> Key: FLINK-17260
> URL: https://issues.apache.org/jira/browse/FLINK-17260
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Tests
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Roman Khachatryan
>Assignee: Aljoscha Krettek
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> [https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/7544/logs/165]
>  
> {code:java}
> 2020-04-16T00:12:32.2848429Z [INFO] Running 
> org.apache.flink.tests.util.kafka.StreamingKafkaITCase
> 2020-04-16T00:14:47.9100927Z [ERROR] Tests run: 3, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 135.621 s <<< FAILURE! - in 
> org.apache.flink.tests.util.k afka.StreamingKafkaITCase
> 2020-04-16T00:14:47.9103036Z [ERROR] testKafka[0: 
> kafka-version:0.10.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase)
>   Time elapsed: 46.222 s  <<<  FAILURE!
> 2020-04-16T00:14:47.9104033Z java.lang.AssertionError: 
> expected:<[elephant,27,64213]> but was:<[]>
> 2020-04-16T00:14:47.9104638Zat org.junit.Assert.fail(Assert.java:88)
> 2020-04-16T00:14:47.9105148Zat 
> org.junit.Assert.failNotEquals(Assert.java:834)
> 2020-04-16T00:14:47.9105701Zat 
> org.junit.Assert.assertEquals(Assert.java:118)
> 2020-04-16T00:14:47.9106239Zat 
> org.junit.Assert.assertEquals(Assert.java:144)
> 2020-04-16T00:14:47.9107177Zat 
> org.apache.flink.tests.util.kafka.StreamingKafkaITCase.testKafka(StreamingKafkaITCase.java:162)
> 2020-04-16T00:14:47.9107845Zat 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-04-16T00:14:47.9108434Zat 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-04-16T00:14:47.9109318Zat 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-04-16T00:14:47.9109914Zat 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-04-16T00:14:47.9110434Zat 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-04-16T00:14:47.9110985Zat 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-04-16T00:14:47.9111548Zat 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-04-16T00:14:47.9112083Zat 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-04-16T00:14:47.9112629Zat 
> org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-04-16T00:14:47.9113145Zat 
> org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-04-16T00:14:47.9113637Zat 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-04-16T00:14:47.9114072Zat 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-04-16T00:14:47.9114490Zat 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-04-16T00:14:47.9115256Zat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-04-16T00:14:47.9115791Zat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-04-16T00:14:47.9116292Zat 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-04-16T00:14:47.9116736Zat 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-04-16T00:14:47.9117779Zat 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-04-16T00:14:47.9118274Zat 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-04-16T00:14:47.9118766Zat 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-04-16T00:14:47.9119204Zat 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-04-16T00:14:47.9119625Zat 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-04-16T00:14:47.9120005Zat 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2020-04-16T00:14:47.9120428Zat 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-04-16T00:14:47.9120876Zat 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-04-16T00:14:47.9121350Zat 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-04-16T00:14:47.9121805Zat 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 

[jira] [Updated] (FLINK-18136) Don't start channel state writing for savepoints (RPC)

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18136:
-
Description: ChannelStateWriter#start should be only called for unaligned 
checkpoint. While source triggering savepoint, 
SubtaskCheckpointCoordinator#initCheckpoint is introduced to judge the 
condition whether to start the internal writer or not. And this new method is 
also used in other places like CheckpointBarrierUnaligner.

> Don't start channel state writing for savepoints (RPC)
> --
>
> Key: FLINK-18136
> URL: https://issues.apache.org/jira/browse/FLINK-18136
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.11.0
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> ChannelStateWriter#start should be only called for unaligned checkpoint. 
> While source triggering savepoint, 
> SubtaskCheckpointCoordinator#initCheckpoint is introduced to judge the 
> condition whether to start the internal writer or not. And this new method is 
> also used in other places like CheckpointBarrierUnaligner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18136) Don't start channel state writing for savepoints (RPC)

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-18136.

Resolution: Fixed

Merged in master: 6dc624d931858e0ee5da14bc6b41b834c94cbba7

> Don't start channel state writing for savepoints (RPC)
> --
>
> Key: FLINK-18136
> URL: https://issues.apache.org/jira/browse/FLINK-18136
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.11.0
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18050) Fix the bug of recycling buffer twice once exception in ChannelStateWriteRequestDispatcher#dispatch

2020-06-09 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128958#comment-17128958
 ] 

Zhijiang edited comment on FLINK-18050 at 6/9/20, 7:36 AM:
---

Merged in master: 

ed7b0b1bea84a10ee45d10343f239cd183659a74, 

f2dd4b8500a82532dae17087c227ce34e1aeac9b

Merged in release-1.11: 

a233c0ff82273ca59bb1decdb1ffb6020d27ccfd,  

822e01b613b0b6821383f3cd5b0357054242b6a9


was (Author: zjwang):
Merged in master: ed7b0b1bea84a10ee45d10343f239cd183659a74, 

f2dd4b8500a82532dae17087c227ce34e1aeac9b

Merged in release-1.11: a233c0ff82273ca59bb1decdb1ffb6020d27ccfd,  
822e01b613b0b6821383f3cd5b0357054242b6a9

> Fix the bug of recycling buffer twice once exception in 
> ChannelStateWriteRequestDispatcher#dispatch
> ---
>
> Key: FLINK-18050
> URL: https://issues.apache.org/jira/browse/FLINK-18050
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Roman Khachatryan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
> checkpoint, which would write abort request into `ChannelStateWriter`.
> The abort request will be executed before other write output request in the 
> queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
> dispatcher executes the next write output request to access the stream, it 
> will throw ClosedByInterruptException to make dispatcher thread exit.
> In this process, the underlying buffers for current write output request will 
> be recycled twice. 
>  * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
> part, which can cover both exception and normal cases.
>  * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
> `request.cancel(e)`  to recycle the underlying buffers again in the case of 
> exception.
> The effect of this bug can cause further exception in the network shuffle 
> process, which references the same buffer as above, then this exception will 
> send to the downstream side to make it failure.
>  
> This bug can be reproduced easily via running 
> UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18050) Fix the bug of recycling buffer twice once exception in ChannelStateWriteRequestDispatcher#dispatch

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-18050.

Resolution: Fixed

Merged in master: ed7b0b1bea84a10ee45d10343f239cd183659a74, 

f2dd4b8500a82532dae17087c227ce34e1aeac9b

Merged in release-1.11: a233c0ff82273ca59bb1decdb1ffb6020d27ccfd,  
822e01b613b0b6821383f3cd5b0357054242b6a9

> Fix the bug of recycling buffer twice once exception in 
> ChannelStateWriteRequestDispatcher#dispatch
> ---
>
> Key: FLINK-18050
> URL: https://issues.apache.org/jira/browse/FLINK-18050
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Roman Khachatryan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
> checkpoint, which would write abort request into `ChannelStateWriter`.
> The abort request will be executed before other write output request in the 
> queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
> dispatcher executes the next write output request to access the stream, it 
> will throw ClosedByInterruptException to make dispatcher thread exit.
> In this process, the underlying buffers for current write output request will 
> be recycled twice. 
>  * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
> part, which can cover both exception and normal cases.
>  * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
> `request.cancel(e)`  to recycle the underlying buffers again in the case of 
> exception.
> The effect of this bug can cause further exception in the network shuffle 
> process, which references the same buffer as above, then this exception will 
> send to the downstream side to make it failure.
>  
> This bug can be reproduced easily via running 
> UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18057) SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> but was:<2>

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18057:
-
Affects Version/s: (was: 1.12.0)
   1.11.0

> SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> 
> but was:<2>
> ---
>
> Key: FLINK-18057
> URL: https://issues.apache.org/jira/browse/FLINK-18057
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2524=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> java.lang.AssertionError: expected:<3> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateTest.testConcurrentReadStateAndProcessAndClose(SingleInputGateTest.java:239)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18057) SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> but was:<2>

2020-06-09 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18057:
-
Fix Version/s: 1.12.0

> SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> 
> but was:<2>
> ---
>
> Key: FLINK-18057
> URL: https://issues.apache.org/jira/browse/FLINK-18057
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0, 1.12.0
>
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2524=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> java.lang.AssertionError: expected:<3> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateTest.testConcurrentReadStateAndProcessAndClose(SingleInputGateTest.java:239)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17978) Test Hadoop dependency change

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-17978:


Assignee: Yang Wang

> Test Hadoop dependency change
> -
>
> Key: FLINK-17978
> URL: https://issues.apache.org/jira/browse/FLINK-17978
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Till Rohrmann
>Assignee: Yang Wang
>Priority: Critical
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> Test the Hadoop dependency change:
> * Run Flink with HBase/ORC (maybe add e2e test)
> * Validate meaningful exception message if Hadoop dependency is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18118) Some SQL Jobs with two input operators are loosing data with unaligned checkpoints

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-18118.

Resolution: Fixed

After syncing with [~AHeise], all the blocker issues for two inputs in UC were 
already resolved in other separate tickets, so I will close this one.

> Some SQL Jobs with two input operators are loosing data with unaligned 
> checkpoints
> --
>
> Key: FLINK-18118
> URL: https://issues.apache.org/jira/browse/FLINK-18118
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Table SQL / Runtime
>Affects Versions: 1.11.0
>Reporter: Jark Wu
>Assignee: Arvid Heise
>Priority: Blocker
> Fix For: 1.11.0
>
>
> After trying to enable unaligned checkpoints by default, a lot of Blink 
> streaming SQL/Table API tests containing joins or set operations are throwing 
> errors that are indicating we are loosing some data (full records, without 
> deserialisation errors). Example errors:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   JoinITCase.testFullJoinWithEqualPk:775 expected: 3,3, null,4, null,5)> but was:
> [ERROR]   JoinITCase.testStreamJoinWithSameRecord:391 expected: 1,1,1,1, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3, 4,4,4,4, 4,4,4,4, 5,5,5,5, 
> 5,5,5,5)> but was:
> [ERROR]   SemiAntiJoinStreamITCase.testAntiJoin:352 expected:<0> but was:<1>
> [ERROR]   SetOperatorsITCase.testIntersect:55 expected: 2,2,Hello, 3,2,Hello world)> but was:
> [ERROR]   JoinITCase.testJoinPushThroughJoin:1272 expected: 2,1,Hello, 2,1,Hello world)> but was:
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-17768) UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel is instable

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-17768.

Resolution: Fixed

There are at-least 5-6 bugs causing this unstable issue, and most of them were 
already resolved. The last pending ticket 
https://issues.apache.org/jira/browse/FLINK-17869 would enable this ITCase, so 
i would close this ticket now.

> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel
>  is instable
> -
>
> Key: FLINK-17768
> URL: https://issues.apache.org/jira/browse/FLINK-17768
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Dian Fu
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel
>  and shouldPerformUnalignedCheckpointOnParallelRemoteChannel failed in azure:
> {code}
> 2020-05-16T12:41:32.3546620Z [ERROR] 
> shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel(org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)
>   Time elapsed: 18.865 s  <<< ERROR!
> 2020-05-16T12:41:32.3548739Z java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2020-05-16T12:41:32.3550177Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-05-16T12:41:32.3551416Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2020-05-16T12:41:32.3552959Z  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1665)
> 2020-05-16T12:41:32.3554979Z  at 
> org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:74)
> 2020-05-16T12:41:32.3556584Z  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1645)
> 2020-05-16T12:41:32.3558068Z  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1627)
> 2020-05-16T12:41:32.3559431Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:158)
> 2020-05-16T12:41:32.3560954Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel(UnalignedCheckpointITCase.java:145)
> 2020-05-16T12:41:32.3562203Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-05-16T12:41:32.3563433Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-05-16T12:41:32.3564846Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-05-16T12:41:32.3565894Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-05-16T12:41:32.3566870Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-05-16T12:41:32.3568064Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-05-16T12:41:32.3569727Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-05-16T12:41:32.3570818Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-05-16T12:41:32.3571840Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2020-05-16T12:41:32.3572771Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-05-16T12:41:32.3574008Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-05-16T12:41:32.3575406Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-05-16T12:41:32.3576476Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-05-16T12:41:32.3577253Z  at java.lang.Thread.run(Thread.java:748)
> 2020-05-16T12:41:32.3578228Z Caused by: 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2020-05-16T12:41:32.3579520Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147)
> 2020-05-16T12:41:32.3580935Z  at 
> org.apache.flink.client.program.PerJobMiniClusterFactory$PerJobMiniClusterJobClient.lambda$getJobExecutionResult$2(PerJobMiniClusterFactory.java:186)
> 2020-05-16T12:41:32.3582361Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2020-05-16T12:41:32.3583456Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-05-16T12:41:32.3584816Z  

[jira] [Assigned] (FLINK-18173) Bundle flink-csv,flink-json,flink-avro jars in lib

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18173:


Assignee: Jingsong Lee

> Bundle flink-csv,flink-json,flink-avro jars in lib
> --
>
> Key: FLINK-18173
> URL: https://issues.apache.org/jira/browse/FLINK-18173
> Project: Flink
>  Issue Type: Bug
>  Components: Build System, Table SQL / API
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
> Fix For: 1.11.0
>
>
> The biggest problem for distributions I see is the variety of problems caused 
> by users' lack of format dependency.
> These three formats are very small and no third party dependence, and they 
> are widely used by table users.
> Actually, we don't have any other built-in table formats now... In total 
> 151K...
>  
> 73K flink-avro-1.10.0.jar
> 36K flink-csv-1.10.0.jar
> 42K flink-json-1.10.0.jar
>  We can just bundle them in "flink/lib/".
> It not solve all problems and it is independent of "fat" and "slim". But also 
> improve usability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18109) Manually test external resource framework with GPUDriver

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18109:


Assignee: Yangze Guo

> Manually test external resource framework with GPUDriver
> 
>
> Key: FLINK-18109
> URL: https://issues.apache.org/jira/browse/FLINK-18109
> Project: Flink
>  Issue Type: Sub-task
>Affects Versions: 1.11.0
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Blocker
> Fix For: 1.11.0
>
>
> We need to:
>  - Test it in standalone mode.
>  ** The discovery script works well in default mode
>  ** The discovery script works well in coordination mode.
>  - Test it with Yarn 2.10 and 3.1.
>  ** GPU resources are allocated successfully.
>  - Test it with Kubernetes
>  ** GPU resources are allocated successfully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18089) Add the network zero-copy test into the azure E2E pipeline

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18089:
-
Priority: Major  (was: Blocker)

> Add the network zero-copy test into the azure E2E pipeline
> --
>
> Key: FLINK-18089
> URL: https://issues.apache.org/jira/browse/FLINK-18089
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The zero-copy E2E test added in 
> [Flink-10742|https://issues.apache.org/jira/browse/FLINK-10742] is only added 
> to the deprecated travis pipeline previously. It should be added into the 
> Azure test pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18009) Select an invalid column crashes the SQL CLI

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-18009.

Resolution: Duplicate

> Select an invalid column crashes the SQL CLI
> 
>
> Key: FLINK-18009
> URL: https://issues.apache.org/jira/browse/FLINK-18009
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Client
>Reporter: Rui Li
>Assignee: Nicholas Jiang
>Priority: Blocker
> Fix For: 1.11.0
>
>
> To reproduce: just select a non-existing column from table in SQL CLI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18009) Select an invalid column crashes the SQL CLI

2020-06-08 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128092#comment-17128092
 ] 

Zhijiang commented on FLINK-18009:
--

After syncing with [~nicholasjiang],  i will close this issue now to avoid 
tracing it every release sync. If it is not resolved after FLINK-17893, then we 
can reopen it again.

> Select an invalid column crashes the SQL CLI
> 
>
> Key: FLINK-18009
> URL: https://issues.apache.org/jira/browse/FLINK-18009
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Client
>Reporter: Rui Li
>Assignee: Nicholas Jiang
>Priority: Blocker
> Fix For: 1.11.0
>
>
> To reproduce: just select a non-existing column from table in SQL CLI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17733) Add documentation for real-time hive

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17733:
-
Affects Version/s: 1.11.0

> Add documentation for real-time hive
> 
>
> Key: FLINK-17733
> URL: https://issues.apache.org/jira/browse/FLINK-17733
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.11.0
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
> Fix For: 1.11.0, 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17733) Add documentation for real-time hive

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-17733:


Assignee: Jingsong Lee

> Add documentation for real-time hive
> 
>
> Key: FLINK-17733
> URL: https://issues.apache.org/jira/browse/FLINK-17733
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17733) Add documentation for real-time hive

2020-06-08 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17733:
-
Fix Version/s: 1.12.0

> Add documentation for real-time hive
> 
>
> Key: FLINK-17733
> URL: https://issues.apache.org/jira/browse/FLINK-17733
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
> Fix For: 1.11.0, 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16375) Remove references to registerTableSource/Sink methods from documentation

2020-06-08 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128067#comment-17128067
 ] 

Zhijiang commented on FLINK-16375:
--

[~aljoscha] [~dwysakowicz]  [~lzljs3620320] Could you sync who can take over 
this unassinged issue?

> Remove references to registerTableSource/Sink methods from documentation
> 
>
> Key: FLINK-16375
> URL: https://issues.apache.org/jira/browse/FLINK-16375
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation, Table SQL / API
>Reporter: Dawid Wysakowicz
>Priority: Blocker
> Fix For: 1.11.0
>
>
> We should remove mentions of the registerTableSouce/Sink methods from 
> documentation and replace them with suggested approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18071) CoordinatorEventsExactlyOnceITCase.checkListContainsSequence fails on CI

2020-06-05 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18071:


Assignee: Yun Gao

> CoordinatorEventsExactlyOnceITCase.checkListContainsSequence fails on CI
> 
>
> Key: FLINK-18071
> URL: https://issues.apache.org/jira/browse/FLINK-18071
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Robert Metzger
>Assignee: Yun Gao
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0, 1.12.0
>
>
> CI: 
> https://dev.azure.com/georgeryan1322/Flink/_build/results?buildId=330=logs=6e58d712-c5cc-52fb-0895-6ff7bd56c46b=f30a8e80-b2cf-535c-9952-7f521a4ae374
> {code}
> [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.795 
> s <<< FAILURE! - in 
> org.apache.flink.runtime.operators.coordination.CoordinatorEventsExactlyOnceITCase
> [ERROR] 
> test(org.apache.flink.runtime.operators.coordination.CoordinatorEventsExactlyOnceITCase)
>   Time elapsed: 4.647 s  <<< FAILURE!
> java.lang.AssertionError: List did not contain expected sequence of 200 
> elements, but was: [152, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 
> 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 
> 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 
> 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 
> 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 
> 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 
> 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 
> 198, 199]
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.flink.runtime.operators.coordination.CoordinatorEventsExactlyOnceITCase.failList(CoordinatorEventsExactlyOnceITCase.java:160)
>   at 
> org.apache.flink.runtime.operators.coordination.CoordinatorEventsExactlyOnceITCase.checkListContainsSequence(CoordinatorEventsExactlyOnceITCase.java:148)
>   at 
> org.apache.flink.runtime.operators.coordination.CoordinatorEventsExactlyOnceITCase.test(CoordinatorEventsExactlyOnceITCase.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18057) SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> but was:<2>

2020-06-05 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18057:


Assignee: Ying Cao

> SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> 
> but was:<2>
> ---
>
> Key: FLINK-18057
> URL: https://issues.apache.org/jira/browse/FLINK-18057
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Assignee: Ying Cao
>Priority: Major
>  Labels: test-stability
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2524=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> java.lang.AssertionError: expected:<3> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateTest.testConcurrentReadStateAndProcessAndClose(SingleInputGateTest.java:239)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18057) SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> but was:<2>

2020-06-05 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18057:


Assignee: Yingjie Cao  (was: Ying Cao)

> SingleInputGateTest.testConcurrentReadStateAndProcessAndClose: expected:<3> 
> but was:<2>
> ---
>
> Key: FLINK-18057
> URL: https://issues.apache.org/jira/browse/FLINK-18057
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: test-stability
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2524=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> java.lang.AssertionError: expected:<3> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateTest.testConcurrentReadStateAndProcessAndClose(SingleInputGateTest.java:239)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17769) Wrong order of log events on a task failure

2020-06-05 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-17769:


Assignee: Yuan Mei

> Wrong order of log events on a task failure
> ---
>
> Key: FLINK-17769
> URL: https://issues.apache.org/jira/browse/FLINK-17769
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Reporter: Robert Metzger
>Assignee: Yuan Mei
>Priority: Critical
> Fix For: 1.11.0
>
>
> In this example, errors from the {{close()}} method call are logged before 
> the {{switched from RUNNING to FAILED}} log line with the actual exception 
> (which is confusing, because the exceptions coming from {{close()}} could be 
> considered as the failure root cause, because they are first in the log)
> {code}
> 2020-05-14 10:12:42,660 INFO  
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer [] - 
> Started Kinesis producer instance for region 'eu-central-1'
> 2020-05-14 10:12:42,660 DEBUG 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - 
> Creating operator state backend for 
> StreamSource_cbc357ccb763df2852fee8c4fc7d55f2_(1/1) with empty state.
> 2020-05-14 10:12:42,823 INFO  
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer [] - 
> Closing producer
> 2020-05-14 10:12:42,823 INFO  
> org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer [] - 
> Flushing outstanding 2 records
> 2020-05-14 10:12:42,826 ERROR 
> org.apache.flink.streaming.runtime.tasks.StreamTask  [] - Error 
> during disposal of stream operator.
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException:
>  The child process has been shutdown and can no longer accept messages.
> 2020-05-14 10:12:42,834 WARN  org.apache.flink.runtime.taskmanager.Task   
>  [] - Source: Custom Source -> Sink: Unnamed (1/1) 
> (4a49aea047aeb3e67cf79c788df0e558) switched from RUNNING to FAILED.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18116) E2E performance test

2020-06-04 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18116:
-
Priority: Blocker  (was: Major)

> E2E performance test
> 
>
> Key: FLINK-18116
> URL: https://issues.apache.org/jira/browse/FLINK-18116
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, API / DataStream, API / State Processor, 
> Build System, Client / Job Submission
>Affects Versions: 1.11.0
>Reporter: Aihua Li
>Assignee: Aihua Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> it's mainly to verify the performance don't less than 1.10 version by 
> checking the metrics of end-to-end performance test,such as qps,latency .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18115) StabilityTest

2020-06-04 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18115:
-
Affects Version/s: (was: 1.10.0)
   1.11.0

> StabilityTest
> -
>
> Key: FLINK-18115
> URL: https://issues.apache.org/jira/browse/FLINK-18115
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, API / State Processor, Build System, Client 
> / Job Submission
>Affects Versions: 1.11.0
>Reporter: Aihua Li
>Assignee: Aihua Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> It mainly checks the flink job can recover from  various unabnormal 
> situations including disk full, network interruption, zk unable to connect, 
> rpc message timeout, etc. 
> If job can't be recoverd it means test failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18115) StabilityTest

2020-06-04 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18115:
-
Priority: Blocker  (was: Major)

> StabilityTest
> -
>
> Key: FLINK-18115
> URL: https://issues.apache.org/jira/browse/FLINK-18115
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, API / State Processor, Build System, Client 
> / Job Submission
>Affects Versions: 1.10.0
>Reporter: Aihua Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> It mainly checks the flink job can recover from  various unabnormal 
> situations including disk full, network interruption, zk unable to connect, 
> rpc message timeout, etc. 
> If job can't be recoverd it means test failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18115) StabilityTest

2020-06-04 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18115:


Assignee: Aihua Li

> StabilityTest
> -
>
> Key: FLINK-18115
> URL: https://issues.apache.org/jira/browse/FLINK-18115
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, API / State Processor, Build System, Client 
> / Job Submission
>Affects Versions: 1.10.0
>Reporter: Aihua Li
>Assignee: Aihua Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> It mainly checks the flink job can recover from  various unabnormal 
> situations including disk full, network interruption, zk unable to connect, 
> rpc message timeout, etc. 
> If job can't be recoverd it means test failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18116) E2E performance test

2020-06-04 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18116:


Assignee: Aihua Li

> E2E performance test
> 
>
> Key: FLINK-18116
> URL: https://issues.apache.org/jira/browse/FLINK-18116
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, API / DataStream, API / State Processor, 
> Build System, Client / Job Submission
>Affects Versions: 1.11.0
>Reporter: Aihua Li
>Assignee: Aihua Li
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> it's mainly to verify the performance don't less than 1.10 version by 
> checking the metrics of end-to-end performance test,such as qps,latency .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18112) Single Task Failure Recovery Prototype

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18112:


Assignee: Yuan Mei

> Single Task Failure Recovery Prototype
> --
>
> Key: FLINK-18112
> URL: https://issues.apache.org/jira/browse/FLINK-18112
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Coordination, Runtime 
> / Network
>Affects Versions: 1.12.0
> Environment: Build a prototype of single task failure recovery to 
> address and answer the following questions:
> Step 1: Scheduling part, restart a single node without restarting the 
> upstream or downstream nodes.
> Step 2: Checkpointing part, as my understanding of how regional failover 
> works, this part might not need modification.
> Step 3: Network part
>   - how the recovered node able to link to the upstream ResultPartitions, and 
> continue getting data
>   - how the downstream node able to link to the recovered node, and continue 
> getting node
>   - how different netty transit mode affects the results
>   - what if the failed node buffered data pool is full
> Step 4: Failover process verification
>Reporter: Yuan Mei
>Assignee: Yuan Mei
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived

2020-06-03 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120851#comment-17120851
 ] 

Zhijiang edited comment on FLINK-17994 at 6/4/20, 2:36 AM:
---

Merged in release-1.11 : 64ca88ac989ee7525cb821670f293404b7b30d2d

Merged in master: aa882c102f1b5d67f0ecb4efaaf9d067b4d8d424


was (Author: zjwang):
Merged in release-1.11 : 64ca88ac989ee7525cb821670f293404b7b30d2d

Will cherry-pick to master later and then update the commit info, close it now 
for not showing it in release-1.11 blockers.

> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and 
> #notifyBarrierReceived
> ---
>
> Key: FLINK-17994
> URL: https://issues.apache.org/jira/browse/FLINK-17994
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> The race condition issue happens as follow:
>  * ch1 is received from network for one input channel by netty thread and 
> schedule the ch1 into mailbox via #notifyBarrierReceived
>  * ch2 is received from network for another input channel by netty thread, 
> but before calling #notifyBarrierReceived this barrier was inserted into 
> channel's data queue in advance. Then it would cause task thread process ch2 
> earlier than #notifyBarrierReceived by netty thread.
>  * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
>  * After that, the previous scheduled ch1 is performed from mailbox by task 
> thread, then it causes the IllegalArgumentException inside 
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
> initial assumption that checkpoint is executed in incremental way for aligned 
> mode.
> The key problem is that we can not remove the checkpoint action from mailbox 
> queue before the next checkpoint is going to execute now. One possible 
> solution is that we record the previous aborted checkpoint id inside 
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued 
> checkpoint inside mailbox is executing, it will exit directly if found the 
> checkpoint id was already aborted before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18088) Umbrella for testing features in release-1.11.0

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18088:


Assignee: Zhijiang

> Umbrella for testing features in release-1.11.0 
> 
>
> Key: FLINK-18088
> URL: https://issues.apache.org/jira/browse/FLINK-18088
> Project: Flink
>  Issue Type: Test
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> This is the umbrella issue for tracing the testing progress of all the 
> related features in release-1.11.0, either the way of e2e or manually testing 
> in cluster, to confirm the features work in practice with good quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18088) Umbrella for testing features in release-1.11.0

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18088:
-
Labels: release-testing  (was: )

> Umbrella for testing features in release-1.11.0 
> 
>
> Key: FLINK-18088
> URL: https://issues.apache.org/jira/browse/FLINK-18088
> Project: Flink
>  Issue Type: Test
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> This is the umbrella issue for tracing the testing progress of all the 
> related features in release-1.11.0, either the way of e2e or manually testing 
> in cluster, to confirm the features work in practice with good quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18089) Add the zero-copy test into the azure E2E pipeline

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18089:
-
Parent: FLINK-18088
Issue Type: Sub-task  (was: Test)

> Add the zero-copy test into the azure E2E pipeline
> --
>
> Key: FLINK-18089
> URL: https://issues.apache.org/jira/browse/FLINK-18089
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The zero-copy E2E test added in 
> [Flink-10742|https://issues.apache.org/jira/browse/FLINK-10742] is only added 
> to the deprecated travis pipeline previously. It should be added into the 
> Azure test pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18089) Add the zero-copy test into the azure E2E pipeline

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18089:
-
Priority: Blocker  (was: Major)

> Add the zero-copy test into the azure E2E pipeline
> --
>
> Key: FLINK-18089
> URL: https://issues.apache.org/jira/browse/FLINK-18089
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The zero-copy E2E test added in 
> [Flink-10742|https://issues.apache.org/jira/browse/FLINK-10742] is only added 
> to the deprecated travis pipeline previously. It should be added into the 
> Azure test pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18089) Add the network zero-copy test into the azure E2E pipeline

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18089:
-
Summary: Add the network zero-copy test into the azure E2E pipeline  (was: 
Add the zero-copy test into the azure E2E pipeline)

> Add the network zero-copy test into the azure E2E pipeline
> --
>
> Key: FLINK-18089
> URL: https://issues.apache.org/jira/browse/FLINK-18089
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The zero-copy E2E test added in 
> [Flink-10742|https://issues.apache.org/jira/browse/FLINK-10742] is only added 
> to the deprecated travis pipeline previously. It should be added into the 
> Azure test pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18089) Add the zero-copy test into the azure E2E pipeline

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18089:


Assignee: Yun Gao

> Add the zero-copy test into the azure E2E pipeline
> --
>
> Key: FLINK-18089
> URL: https://issues.apache.org/jira/browse/FLINK-18089
> Project: Flink
>  Issue Type: Test
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
> Fix For: 1.11.0
>
>
> The zero-copy E2E test added in 
> [Flink-10742|https://issues.apache.org/jira/browse/FLINK-10742] is only added 
> to the deprecated travis pipeline previously. It should be added into the 
> Azure test pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18089) Add the zero-copy test into the azure E2E pipeline

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18089:
-
Labels: release-testing  (was: )

> Add the zero-copy test into the azure E2E pipeline
> --
>
> Key: FLINK-18089
> URL: https://issues.apache.org/jira/browse/FLINK-18089
> Project: Flink
>  Issue Type: Test
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The zero-copy E2E test added in 
> [Flink-10742|https://issues.apache.org/jira/browse/FLINK-10742] is only added 
> to the deprecated travis pipeline previously. It should be added into the 
> Azure test pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18085) Manually test application mode for standalone, Yarn, K8s

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18085:
-
Labels: release-testing  (was: )

> Manually test application mode for standalone, Yarn, K8s
> 
>
> Key: FLINK-18085
> URL: https://issues.apache.org/jira/browse/FLINK-18085
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yang Wang
>Priority: Critical
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> We have introduced the application mode from 1.11. It need to be tested in a 
> real cluster with following functionality check.
>  * Application normally start and finish
>  * Application failed exceptionally
>  * HA configured, kill jobmanager and job should recover from latest 
> checkpoint
>  
> For Yarn deployment, we also need to verify it could work with provided lib 
> and remote user jars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18085) Manually test application mode for standalone, Yarn, K8s

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18085:
-
Parent: FLINK-18088
Issue Type: Sub-task  (was: Test)

> Manually test application mode for standalone, Yarn, K8s
> 
>
> Key: FLINK-18085
> URL: https://issues.apache.org/jira/browse/FLINK-18085
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> We have introduced the application mode from 1.11. It need to be tested in a 
> real cluster with following functionality check.
>  * Application normally start and finish
>  * Application failed exceptionally
>  * HA configured, kill jobmanager and job should recover from latest 
> checkpoint
>  
> For Yarn deployment, we also need to verify it could work with provided lib 
> and remote user jars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18085) Manually test application mode for standalone, Yarn, K8s

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18085:
-
Parent: (was: FLINK-16654)
Issue Type: Test  (was: Sub-task)

> Manually test application mode for standalone, Yarn, K8s
> 
>
> Key: FLINK-18085
> URL: https://issues.apache.org/jira/browse/FLINK-18085
> Project: Flink
>  Issue Type: Test
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> We have introduced the application mode from 1.11. It need to be tested in a 
> real cluster with following functionality check.
>  * Application normally start and finish
>  * Application failed exceptionally
>  * HA configured, kill jobmanager and job should recover from latest 
> checkpoint
>  
> For Yarn deployment, we also need to verify it could work with provided lib 
> and remote user jars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18085) Manually test application mode for standalone, Yarn, K8s

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18085:


Assignee: Yang Wang

> Manually test application mode for standalone, Yarn, K8s
> 
>
> Key: FLINK-18085
> URL: https://issues.apache.org/jira/browse/FLINK-18085
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> We have introduced the application mode from 1.11. It need to be tested in a 
> real cluster with following functionality check.
>  * Application normally start and finish
>  * Application failed exceptionally
>  * HA configured, kill jobmanager and job should recover from latest 
> checkpoint
>  
> For Yarn deployment, we also need to verify it could work with provided lib 
> and remote user jars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18078) E2E tests manually for Hive streaming dim join

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18078:
-
Parent: FLINK-18088
Issue Type: Sub-task  (was: Test)

> E2E tests manually for Hive streaming dim join
> --
>
> Key: FLINK-18078
> URL: https://issues.apache.org/jira/browse/FLINK-18078
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Hive, Tests
>Reporter: Jingsong Lee
>Assignee: Rui Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18078) E2E tests manually for Hive streaming dim join

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18078:
-
Parent: (was: FLINK-18021)
Issue Type: Test  (was: Sub-task)

> E2E tests manually for Hive streaming dim join
> --
>
> Key: FLINK-18078
> URL: https://issues.apache.org/jira/browse/FLINK-18078
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Hive, Tests
>Reporter: Jingsong Lee
>Assignee: Rui Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18078) E2E tests manually for Hive streaming dim join

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18078:
-
Labels: release-testing  (was: )

> E2E tests manually for Hive streaming dim join
> --
>
> Key: FLINK-18078
> URL: https://issues.apache.org/jira/browse/FLINK-18078
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Hive, Tests
>Reporter: Jingsong Lee
>Assignee: Rui Li
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18077) E2E tests manually for Hive streaming source

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18077:
-
Parent: FLINK-18088
Issue Type: Sub-task  (was: Test)

> E2E tests manually for Hive streaming source
> 
>
> Key: FLINK-18077
> URL: https://issues.apache.org/jira/browse/FLINK-18077
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Hive, Tests
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18077) E2E tests manually for Hive streaming source

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18077:
-
Parent: (was: FLINK-18021)
Issue Type: Test  (was: Sub-task)

> E2E tests manually for Hive streaming source
> 
>
> Key: FLINK-18077
> URL: https://issues.apache.org/jira/browse/FLINK-18077
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Hive, Tests
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18077) E2E tests manually for Hive streaming source

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18077:
-
Labels: release-testing  (was: )

> E2E tests manually for Hive streaming source
> 
>
> Key: FLINK-18077
> URL: https://issues.apache.org/jira/browse/FLINK-18077
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / Hive, Tests
>Reporter: Jingsong Lee
>Assignee: Jingsong Lee
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18079) KafkaShuffle Manual Tests

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18079:


Assignee: Yuan Mei

> KafkaShuffle Manual Tests
> -
>
> Key: FLINK-18079
> URL: https://issues.apache.org/jira/browse/FLINK-18079
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / DataStream, Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Yuan Mei
>Assignee: Yuan Mei
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> Manual Tests and Results to demonstrate KafkaShuffle working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17813) Manually test unaligned checkpoints on a cluster

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17813:
-
Labels: release-testing  (was: )

> Manually test unaligned checkpoints on a cluster
> 
>
> Key: FLINK-17813
> URL: https://issues.apache.org/jira/browse/FLINK-17813
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Network
>Reporter: Piotr Nowojski
>Assignee: Roman Khachatryan
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18079) KafkaShuffle Manual Tests

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18079:
-
Parent: FLINK-18088
Issue Type: Sub-task  (was: Improvement)

> KafkaShuffle Manual Tests
> -
>
> Key: FLINK-18079
> URL: https://issues.apache.org/jira/browse/FLINK-18079
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / DataStream, Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Yuan Mei
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> Manual Tests and Results to demonstrate KafkaShuffle working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18088) Umbrella for testing features in release-1.11.0

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18088:
-
Summary: Umbrella for testing features in release-1.11.0   (was: Umbrella 
for features testing in release-1.11.0 )

> Umbrella for testing features in release-1.11.0 
> 
>
> Key: FLINK-18088
> URL: https://issues.apache.org/jira/browse/FLINK-18088
> Project: Flink
>  Issue Type: Test
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Priority: Blocker
> Fix For: 1.11.0
>
>
> This is the umbrella issue for tracing the testing progress of all the 
> related features in release-1.11.0, either the way of e2e or manually testing 
> in cluster, to confirm the features work in practice with good quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17813) Manually test unaligned checkpoints on a cluster

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17813:
-
Parent: FLINK-18088
Issue Type: Sub-task  (was: Test)

> Manually test unaligned checkpoints on a cluster
> 
>
> Key: FLINK-17813
> URL: https://issues.apache.org/jira/browse/FLINK-17813
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Network
>Reporter: Piotr Nowojski
>Assignee: Roman Khachatryan
>Priority: Blocker
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17813) Manually test unaligned checkpoints on a cluster

2020-06-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17813:
-
Parent: (was: FLINK-14551)
Issue Type: Test  (was: Sub-task)

> Manually test unaligned checkpoints on a cluster
> 
>
> Key: FLINK-17813
> URL: https://issues.apache.org/jira/browse/FLINK-17813
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Checkpointing, Runtime / Network
>Reporter: Piotr Nowojski
>Assignee: Roman Khachatryan
>Priority: Blocker
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-18088) Umbrella for features testing in release-1.11.0

2020-06-03 Thread Zhijiang (Jira)
Zhijiang created FLINK-18088:


 Summary: Umbrella for features testing in release-1.11.0 
 Key: FLINK-18088
 URL: https://issues.apache.org/jira/browse/FLINK-18088
 Project: Flink
  Issue Type: Test
Affects Versions: 1.11.0
Reporter: Zhijiang
 Fix For: 1.11.0


This is the umbrella issue for tracing the testing progress of all the related 
features in release-1.11.0, either the way of e2e or manually testing in 
cluster, to confirm the features work in practice with good quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

2020-06-02 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119309#comment-17119309
 ] 

Zhijiang edited comment on FLINK-17992 at 6/3/20, 2:33 AM:
---

Merged in release-1.11: 34e6d22bdd179796daf6df46738d85303a839704

Merged in master: 371f3de5371afb78d465315098bebec6ed36656b


was (Author: zjwang):
Merged in release-1.11: 34e6d22bdd179796daf6df46738d85303a839704

Pick it to master and release-1.10 later and then update the commit infos.

> Exception from RemoteInputChannel#onBuffer should not fail the whole 
> NetworkClientHandler
> -
>
> Key: FLINK-17992
> URL: https://issues.apache.org/jira/browse/FLINK-17992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> RemoteInputChannel#onBuffer is invoked by 
> CreditBasedPartitionRequestClientHandler while receiving and decoding the 
> network data. #onBuffer can throw exceptions which would tag the error in 
> client handler and fail all the added input channels inside handler. Then it 
> would cause a tricky potential issue as following.
> If the RemoteInputChannel is canceling by canceler thread, then the task 
> thread might exit early than canceler thread terminate. That means the 
> PartitionRequestClient might not be closed (triggered by canceler thread) 
> while the new task attempt is already deployed into this TaskManger. 
> Therefore the new task might reuse the previous PartitionRequestClient while 
> requesting partitions, but note that the respective client handler was 
> already tagged an error before during above RemoteInputChannel#onBuffer. It 
> will cause the next round unnecessary failover.
> It is hard to find this potential issue in production because it can be 
> restored normal finally after one or more additional failover. We find this 
> potential problem from UnalignedCheckpointITCase because it will define the 
> precise restart times within configured failures.
> The solution is to only fail the respective task when its internal 
> RemoteInputChannel#onBuffer throws any exceptions instead of failing the 
> whole channels inside client handler, then the client is still health and can 
> also be reused by other input channels as long as it is not released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-18063) Fix the race condition for aborting current checkpoint in CheckpointBarrierUnaligner#processEndOfPartition

2020-06-02 Thread Zhijiang (Jira)
Zhijiang created FLINK-18063:


 Summary: Fix the race condition for aborting current checkpoint in 
CheckpointBarrierUnaligner#processEndOfPartition
 Key: FLINK-18063
 URL: https://issues.apache.org/jira/browse/FLINK-18063
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.11.0
Reporter: Zhijiang
Assignee: Zhijiang
 Fix For: 1.11.0, 1.12.0


In the handle of CheckpointBarrierUnaligner#processEndOfPartition, it only 
aborts the current checkpoint by judging the condition of pending checkpoint 
from task thread processing, so it will miss one scenario that checkpoint 
triggered by notifyBarrierReceived from netty thread.

The proper fix should also judge the pending checkpoint inside 
ThreadSafeUnaligner in order to abort it and reset internal variables in case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18050) Fix the bug of recycling buffer twice once exception in ChannelStateWriteRequestDispatcher#dispatch

2020-06-02 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18050:


Assignee: Arvid Heise  (was: Zhijiang)

> Fix the bug of recycling buffer twice once exception in 
> ChannelStateWriteRequestDispatcher#dispatch
> ---
>
> Key: FLINK-18050
> URL: https://issues.apache.org/jira/browse/FLINK-18050
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Arvid Heise
>Priority: Blocker
> Fix For: 1.11.0, 1.12.0
>
>
> When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
> checkpoint, which would write abort request into `ChannelStateWriter`.
> The abort request will be executed before other write output request in the 
> queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
> dispatcher executes the next write output request to access the stream, it 
> will throw ClosedByInterruptException to make dispatcher thread exit.
> In this process, the underlying buffers for current write output request will 
> be recycled twice. 
>  * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
> part, which can cover both exception and normal cases.
>  * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
> `request.cancel(e)`  to recycle the underlying buffers again in the case of 
> exception.
> The effect of this bug can cause further exception in the network shuffle 
> process, which references the same buffer as above, then this exception will 
> send to the downstream side to make it failure.
>  
> This bug can be reproduced easily via running 
> UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (FLINK-17823) Resolve the race condition while releasing RemoteInputChannel

2020-06-02 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17823:
-
Comment: was deleted

(was: Merged in master: 8c7c7267be95cddd7122d2b97e5334f5db4cc37c)

> Resolve the race condition while releasing RemoteInputChannel
> -
>
> Key: FLINK-17823
> URL: https://issues.apache.org/jira/browse/FLINK-17823
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> RemoteInputChannel#releaseAllResources might be called by canceler thread. 
> Meanwhile, the task thread can also call RemoteInputChannel#getNextBuffer. 
> There probably cause two potential problems:
>  * Task thread might get null buffer after canceler thread already released 
> all the buffers, then it might cause misleading NPE in getNextBuffer.
>  * Task thread and canceler thread might pull the same buffer concurrently, 
> which causes unexpected exception when the same buffer is recycled twice.
> The solution is to properly synchronize the buffer queue in release method to 
> avoid the same buffer pulled by both canceler thread and task thread. And in 
> getNextBuffer method, we add some explicit checks to avoid misleading NPE and 
> hint some valid exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17823) Resolve the race condition while releasing RemoteInputChannel

2020-06-02 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113714#comment-17113714
 ] 

Zhijiang edited comment on FLINK-17823 at 6/2/20, 8:03 AM:
---

Merged in release-1.11: 3eb1075ded64da20e6f7a5bc268f455eaf6573eb

Merged in master: 8c7c7267be95cddd7122d2b97e5334f5db4cc37c


was (Author: zjwang):
Merged in release-1.11: 3eb1075ded64da20e6f7a5bc268f455eaf6573eb

Will merge to master later and update the info.

> Resolve the race condition while releasing RemoteInputChannel
> -
>
> Key: FLINK-17823
> URL: https://issues.apache.org/jira/browse/FLINK-17823
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> RemoteInputChannel#releaseAllResources might be called by canceler thread. 
> Meanwhile, the task thread can also call RemoteInputChannel#getNextBuffer. 
> There probably cause two potential problems:
>  * Task thread might get null buffer after canceler thread already released 
> all the buffers, then it might cause misleading NPE in getNextBuffer.
>  * Task thread and canceler thread might pull the same buffer concurrently, 
> which causes unexpected exception when the same buffer is recycled twice.
> The solution is to properly synchronize the buffer queue in release method to 
> avoid the same buffer pulled by both canceler thread and task thread. And in 
> getNextBuffer method, we add some explicit checks to avoid misleading NPE and 
> hint some valid exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17823) Resolve the race condition while releasing RemoteInputChannel

2020-06-02 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123475#comment-17123475
 ] 

Zhijiang commented on FLINK-17823:
--

Merged in master: 8c7c7267be95cddd7122d2b97e5334f5db4cc37c

> Resolve the race condition while releasing RemoteInputChannel
> -
>
> Key: FLINK-17823
> URL: https://issues.apache.org/jira/browse/FLINK-17823
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> RemoteInputChannel#releaseAllResources might be called by canceler thread. 
> Meanwhile, the task thread can also call RemoteInputChannel#getNextBuffer. 
> There probably cause two potential problems:
>  * Task thread might get null buffer after canceler thread already released 
> all the buffers, then it might cause misleading NPE in getNextBuffer.
>  * Task thread and canceler thread might pull the same buffer concurrently, 
> which causes unexpected exception when the same buffer is recycled twice.
> The solution is to properly synchronize the buffer queue in release method to 
> avoid the same buffer pulled by both canceler thread and task thread. And in 
> getNextBuffer method, we add some explicit checks to avoid misleading NPE and 
> hint some valid exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18056) Hive file sink throws exception when the target in-progress file exists.

2020-06-02 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-18056:


Assignee: Yun Gao

> Hive file sink throws exception when the target in-progress file exists.
> 
>
> Key: FLINK-18056
> URL: https://issues.apache.org/jira/browse/FLINK-18056
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Connectors / Hive
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Blocker
> Fix For: 1.11.0
>
>
> Currently after failover or restart, the Hive file sink will try to overwrite 
> the data since the last checkpoint, however, currently neither the 
> in-progress file is deleted nor hive uses the overwritten mode, thus an 
> exception occurs after restarting:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
>  failed to create file 
> /user/hive/warehouse/datalake/dt=2020-06-01/hr=22/.part-0-10.inprogress for 
> DFSClient_NONMAPREDUCE_-1017064593_62 for client 100.96.206.42 because 
> current leaseholder is trying to recreate file.
> {code}
> The full stack of the exception is
> {code:java}
> org.apache.flink.connectors.hive.FlinkHiveException: 
> org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create 
> Hive RecordWriter
> at 
> org.apache.flink.connectors.hive.write.HiveWriterFactory.createRecordWriter(HiveWriterFactory.java:159)
> at 
> org.apache.flink.connectors.hive.write.HiveBulkWriterFactory.create(HiveBulkWriterFactory.java:47)
> at 
> org.apache.flink.formats.hadoop.bulk.HadoopPathBasedPartFileWriter$HadoopPathBasedBucketWriter.openNewInProgressFile(HadoopPathBasedPartFileWriter.java:234)
> at 
> org.apache.flink.formats.hadoop.bulk.HadoopPathBasedPartFileWriter$HadoopPathBasedBucketWriter.openNewInProgressFile(HadoopPathBasedPartFileWriter.java:207)
> at 
> org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.rollPartFile(Bucket.java:209)
> at 
> org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.write(Bucket.java:200)
> at 
> org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.onElement(Buckets.java:284)
> at 
> org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSinkHelper.onElement(StreamingFileSinkHelper.java:104)
> at 
> org.apache.flink.table.filesystem.stream.StreamingFileWriter.processElement(StreamingFileWriter.java:118)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:672)
> at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
> at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
> at StreamExecCalc$16.processElement(Unknown Source)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:672)
> at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
> at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
> at 
> org.apache.flink.table.runtime.operators.wmassigners.WatermarkAssignerOperator.processElement(WatermarkAssignerOperator.java:123)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:672)
> at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
> at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
> at StreamExecCalc$2.processElement(Unknown Source)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
> at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
> at 
> 

[jira] [Updated] (FLINK-17182) RemoteInputChannelTest.testConcurrentOnSenderBacklogAndRecycle fail on azure

2020-06-01 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17182:
-
Priority: Major  (was: Critical)

> RemoteInputChannelTest.testConcurrentOnSenderBacklogAndRecycle fail on azure
> 
>
> Key: FLINK-17182
> URL: https://issues.apache.org/jira/browse/FLINK-17182
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Dawid Wysakowicz
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7546=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=d2c1c472-9d7b-5913-b8e4-461f3092fb7a
> {code}
> [ERROR] Tests run: 21, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 3.943 s <<< FAILURE! - in 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannelTest
> [ERROR] 
> testConcurrentOnSenderBacklogAndRecycle(org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannelTest)
>   Time elapsed: 0.011 s  <<< FAILURE!
> java.lang.AssertionError: There should be 248 buffers available in channel. 
> expected:<248> but was:<238>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannelTest.testConcurrentOnSenderBacklogAndRecycle(RemoteInputChannelTest.java:869)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18050) Fix the bug of recycling buffer twice once exception in ChannelStateWriteRequestDispatcher#dispatch

2020-06-01 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18050:
-
Description: 
When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
checkpoint, which would write abort request into `ChannelStateWriter`.

The abort request will be executed before other write output request in the 
queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
dispatcher executes the next write output request to access the stream, it will 
throw ClosedByInterruptException to make dispatcher thread exit.

In this process, the underlying buffers for current write output request will 
be recycled twice. 
 * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
part, which can cover both exception and normal cases.
 * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
`request.cancel(e)`  to recycle the underlying buffers again in the case of 
exception.

The effect of this bug can cause further exception in the network shuffle 
process, which references the same buffer as above, then this exception will 
send to the downstream side to make it failure.

 

This bug can be reproduced easily via running 
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.

  was:
During ChannelStateWriteRequestDispatcherImpl#dispatch, `request.cancel(e)` is 
called to recycle the internal buffer of request once exception happens.

But for the case of requesting write output, the buffers would be also finally 
recycled inside ChannelStateCheckpointWriter#write no matter exceptions or not. 
So the buffers in request will be recycled twice in the case of exception, 
which would cause further exceptions in the network shuffle process to 
reference the same buffer.

This bug can be reproduced easily via running 
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.


> Fix the bug of recycling buffer twice once exception in 
> ChannelStateWriteRequestDispatcher#dispatch
> ---
>
> Key: FLINK-18050
> URL: https://issues.apache.org/jira/browse/FLINK-18050
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
> Fix For: 1.11.0, 1.12.0
>
>
> When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
> checkpoint, which would write abort request into `ChannelStateWriter`.
> The abort request will be executed before other write output request in the 
> queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
> dispatcher executes the next write output request to access the stream, it 
> will throw ClosedByInterruptException to make dispatcher thread exit.
> In this process, the underlying buffers for current write output request will 
> be recycled twice. 
>  * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
> part, which can cover both exception and normal cases.
>  * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
> `request.cancel(e)`  to recycle the underlying buffers again in the case of 
> exception.
> The effect of this bug can cause further exception in the network shuffle 
> process, which references the same buffer as above, then this exception will 
> send to the downstream side to make it failure.
>  
> This bug can be reproduced easily via running 
> UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-18050) Fix the bug of recycling buffer twice once exception in ChannelStateWriteRequestDispatcher#dispatch

2020-06-01 Thread Zhijiang (Jira)
Zhijiang created FLINK-18050:


 Summary: Fix the bug of recycling buffer twice once exception in 
ChannelStateWriteRequestDispatcher#dispatch
 Key: FLINK-18050
 URL: https://issues.apache.org/jira/browse/FLINK-18050
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.11.0
Reporter: Zhijiang
Assignee: Zhijiang
 Fix For: 1.11.0, 1.12.0


During ChannelStateWriteRequestDispatcherImpl#dispatch, `request.cancel(e)` is 
called to recycle the internal buffer of request once exception happens.

But for the case of requesting write output, the buffers would be also finally 
recycled inside ChannelStateCheckpointWriter#write no matter exceptions or not. 
So the buffers in request will be recycled twice in the case of exception, 
which would cause further exceptions in the network shuffle process to 
reference the same buffer.

This bug can be reproduced easily via running 
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived

2020-06-01 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-17994.

Resolution: Fixed

Merged in release-1.11 : 64ca88ac989ee7525cb821670f293404b7b30d2d

Will cherry-pick to master later and then update the commit info, close it now 
for not showing it in release-1.11 blockers.

> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and 
> #notifyBarrierReceived
> ---
>
> Key: FLINK-17994
> URL: https://issues.apache.org/jira/browse/FLINK-17994
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> The race condition issue happens as follow:
>  * ch1 is received from network for one input channel by netty thread and 
> schedule the ch1 into mailbox via #notifyBarrierReceived
>  * ch2 is received from network for another input channel by netty thread, 
> but before calling #notifyBarrierReceived this barrier was inserted into 
> channel's data queue in advance. Then it would cause task thread process ch2 
> earlier than #notifyBarrierReceived by netty thread.
>  * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
>  * After that, the previous scheduled ch1 is performed from mailbox by task 
> thread, then it causes the IllegalArgumentException inside 
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
> initial assumption that checkpoint is executed in incremental way for aligned 
> mode.
> The key problem is that we can not remove the checkpoint action from mailbox 
> queue before the next checkpoint is going to execute now. One possible 
> solution is that we record the previous aborted checkpoint id inside 
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued 
> checkpoint inside mailbox is executing, it will exit directly if found the 
> checkpoint id was already aborted before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived

2020-06-01 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17994:
-
Affects Version/s: 1.11.0

> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and 
> #notifyBarrierReceived
> ---
>
> Key: FLINK-17994
> URL: https://issues.apache.org/jira/browse/FLINK-17994
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> The race condition issue happens as follow:
>  * ch1 is received from network for one input channel by netty thread and 
> schedule the ch1 into mailbox via #notifyBarrierReceived
>  * ch2 is received from network for another input channel by netty thread, 
> but before calling #notifyBarrierReceived this barrier was inserted into 
> channel's data queue in advance. Then it would cause task thread process ch2 
> earlier than #notifyBarrierReceived by netty thread.
>  * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
>  * After that, the previous scheduled ch1 is performed from mailbox by task 
> thread, then it causes the IllegalArgumentException inside 
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
> initial assumption that checkpoint is executed in incremental way for aligned 
> mode.
> The key problem is that we can not remove the checkpoint action from mailbox 
> queue before the next checkpoint is going to execute now. One possible 
> solution is that we record the previous aborted checkpoint id inside 
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued 
> checkpoint inside mailbox is executing, it will exit directly if found the 
> checkpoint id was already aborted before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived

2020-06-01 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17994:
-
Fix Version/s: 1.12.0

> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and 
> #notifyBarrierReceived
> ---
>
> Key: FLINK-17994
> URL: https://issues.apache.org/jira/browse/FLINK-17994
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> The race condition issue happens as follow:
>  * ch1 is received from network for one input channel by netty thread and 
> schedule the ch1 into mailbox via #notifyBarrierReceived
>  * ch2 is received from network for another input channel by netty thread, 
> but before calling #notifyBarrierReceived this barrier was inserted into 
> channel's data queue in advance. Then it would cause task thread process ch2 
> earlier than #notifyBarrierReceived by netty thread.
>  * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
>  * After that, the previous scheduled ch1 is performed from mailbox by task 
> thread, then it causes the IllegalArgumentException inside 
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
> initial assumption that checkpoint is executed in incremental way for aligned 
> mode.
> The key problem is that we can not remove the checkpoint action from mailbox 
> queue before the next checkpoint is going to execute now. One possible 
> solution is that we record the previous aborted checkpoint id inside 
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued 
> checkpoint inside mailbox is executing, it will exit directly if found the 
> checkpoint id was already aborted before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

2020-05-29 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17992:
-
Affects Version/s: 1.11.0

> Exception from RemoteInputChannel#onBuffer should not fail the whole 
> NetworkClientHandler
> -
>
> Key: FLINK-17992
> URL: https://issues.apache.org/jira/browse/FLINK-17992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> RemoteInputChannel#onBuffer is invoked by 
> CreditBasedPartitionRequestClientHandler while receiving and decoding the 
> network data. #onBuffer can throw exceptions which would tag the error in 
> client handler and fail all the added input channels inside handler. Then it 
> would cause a tricky potential issue as following.
> If the RemoteInputChannel is canceling by canceler thread, then the task 
> thread might exit early than canceler thread terminate. That means the 
> PartitionRequestClient might not be closed (triggered by canceler thread) 
> while the new task attempt is already deployed into this TaskManger. 
> Therefore the new task might reuse the previous PartitionRequestClient while 
> requesting partitions, but note that the respective client handler was 
> already tagged an error before during above RemoteInputChannel#onBuffer. It 
> will cause the next round unnecessary failover.
> It is hard to find this potential issue in production because it can be 
> restored normal finally after one or more additional failover. We find this 
> potential problem from UnalignedCheckpointITCase because it will define the 
> precise restart times within configured failures.
> The solution is to only fail the respective task when its internal 
> RemoteInputChannel#onBuffer throws any exceptions instead of failing the 
> whole channels inside client handler, then the client is still health and can 
> also be reused by other input channels as long as it is not released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

2020-05-29 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119309#comment-17119309
 ] 

Zhijiang edited comment on FLINK-17992 at 5/29/20, 6:42 AM:


Merged in release-1.11: 34e6d22bdd179796daf6df46738d85303a839704

Pick it to master and release-1.10 later and then update the commit infos.


was (Author: zjwang):
Merged in release-1.11: 34e6d22bdd179796daf6df46738d85303a839704

Pick it to master later and then update the commit info.

> Exception from RemoteInputChannel#onBuffer should not fail the whole 
> NetworkClientHandler
> -
>
> Key: FLINK-17992
> URL: https://issues.apache.org/jira/browse/FLINK-17992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> RemoteInputChannel#onBuffer is invoked by 
> CreditBasedPartitionRequestClientHandler while receiving and decoding the 
> network data. #onBuffer can throw exceptions which would tag the error in 
> client handler and fail all the added input channels inside handler. Then it 
> would cause a tricky potential issue as following.
> If the RemoteInputChannel is canceling by canceler thread, then the task 
> thread might exit early than canceler thread terminate. That means the 
> PartitionRequestClient might not be closed (triggered by canceler thread) 
> while the new task attempt is already deployed into this TaskManger. 
> Therefore the new task might reuse the previous PartitionRequestClient while 
> requesting partitions, but note that the respective client handler was 
> already tagged an error before during above RemoteInputChannel#onBuffer. It 
> will cause the next round unnecessary failover.
> It is hard to find this potential issue in production because it can be 
> restored normal finally after one or more additional failover. We find this 
> potential problem from UnalignedCheckpointITCase because it will define the 
> precise restart times within configured failures.
> The solution is to only fail the respective task when its internal 
> RemoteInputChannel#onBuffer throws any exceptions instead of failing the 
> whole channels inside client handler, then the client is still health and can 
> also be reused by other input channels as long as it is not released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

2020-05-29 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-17992.

Resolution: Fixed

Merged in release-1.11: 34e6d22bdd179796daf6df46738d85303a839704

Pick it to master later and then update the commit info.

> Exception from RemoteInputChannel#onBuffer should not fail the whole 
> NetworkClientHandler
> -
>
> Key: FLINK-17992
> URL: https://issues.apache.org/jira/browse/FLINK-17992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> RemoteInputChannel#onBuffer is invoked by 
> CreditBasedPartitionRequestClientHandler while receiving and decoding the 
> network data. #onBuffer can throw exceptions which would tag the error in 
> client handler and fail all the added input channels inside handler. Then it 
> would cause a tricky potential issue as following.
> If the RemoteInputChannel is canceling by canceler thread, then the task 
> thread might exit early than canceler thread terminate. That means the 
> PartitionRequestClient might not be closed (triggered by canceler thread) 
> while the new task attempt is already deployed into this TaskManger. 
> Therefore the new task might reuse the previous PartitionRequestClient while 
> requesting partitions, but note that the respective client handler was 
> already tagged an error before during above RemoteInputChannel#onBuffer. It 
> will cause the next round unnecessary failover.
> It is hard to find this potential issue in production because it can be 
> restored normal finally after one or more additional failover. We find this 
> potential problem from UnalignedCheckpointITCase because it will define the 
> precise restart times within configured failures.
> The solution is to only fail the respective task when its internal 
> RemoteInputChannel#onBuffer throws any exceptions instead of failing the 
> whole channels inside client handler, then the client is still health and can 
> also be reused by other input channels as long as it is not released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

2020-05-29 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17992:
-
Fix Version/s: 1.12.0

> Exception from RemoteInputChannel#onBuffer should not fail the whole 
> NetworkClientHandler
> -
>
> Key: FLINK-17992
> URL: https://issues.apache.org/jira/browse/FLINK-17992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> RemoteInputChannel#onBuffer is invoked by 
> CreditBasedPartitionRequestClientHandler while receiving and decoding the 
> network data. #onBuffer can throw exceptions which would tag the error in 
> client handler and fail all the added input channels inside handler. Then it 
> would cause a tricky potential issue as following.
> If the RemoteInputChannel is canceling by canceler thread, then the task 
> thread might exit early than canceler thread terminate. That means the 
> PartitionRequestClient might not be closed (triggered by canceler thread) 
> while the new task attempt is already deployed into this TaskManger. 
> Therefore the new task might reuse the previous PartitionRequestClient while 
> requesting partitions, but note that the respective client handler was 
> already tagged an error before during above RemoteInputChannel#onBuffer. It 
> will cause the next round unnecessary failover.
> It is hard to find this potential issue in production because it can be 
> restored normal finally after one or more additional failover. We find this 
> potential problem from UnalignedCheckpointITCase because it will define the 
> precise restart times within configured failures.
> The solution is to only fail the respective task when its internal 
> RemoteInputChannel#onBuffer throws any exceptions instead of failing the 
> whole channels inside client handler, then the client is still health and can 
> also be reused by other input channels as long as it is not released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17230) Fix incorrect returned address of Endpoint for the ClusterIP Service

2020-05-28 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118789#comment-17118789
 ] 

Zhijiang edited comment on FLINK-17230 at 5/28/20, 3:11 PM:


I think it makes sense to cover it in release-1.11. Could you change the type 
field as bug and make the priority as blocker?


was (Author: zjwang):
I think it makes sense to cover it in release-1.11 and i changed the type field 
as bug.

> Fix incorrect returned address of Endpoint for the ClusterIP Service
> 
>
> Key: FLINK-17230
> URL: https://issues.apache.org/jira/browse/FLINK-17230
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> At the moment, when the type of the external Service is set to {{ClusterIP}}, 
> we return an incorrect address 
> {{KubernetesUtils.getInternalServiceName(clusterId) + "." + nameSpace}} for 
> the Endpoint.
> This ticket aims to fix this bug by returning 
> {{KubernetesUtils.getRestServiceName(clusterId) + "." + nameSpace}} instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17230) Fix incorrect returned address of Endpoint for the ClusterIP Service

2020-05-28 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118789#comment-17118789
 ] 

Zhijiang commented on FLINK-17230:
--

I think it makes sense to cover it in release-1.11 and i changed the type field 
as bug.

> Fix incorrect returned address of Endpoint for the ClusterIP Service
> 
>
> Key: FLINK-17230
> URL: https://issues.apache.org/jira/browse/FLINK-17230
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> At the moment, when the type of the external Service is set to {{ClusterIP}}, 
> we return an incorrect address 
> {{KubernetesUtils.getInternalServiceName(clusterId) + "." + nameSpace}} for 
> the Endpoint.
> This ticket aims to fix this bug by returning 
> {{KubernetesUtils.getRestServiceName(clusterId) + "." + nameSpace}} instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived

2020-05-28 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17994:
-
Description: 
The race condition issue happens as follow:
 * ch1 is received from network for one input channel by netty thread and 
schedule the ch1 into mailbox via #notifyBarrierReceived
 * ch2 is received from network for another input channel by netty thread, but 
before calling #notifyBarrierReceived this barrier was inserted into channel's 
data queue in advance. Then it would cause task thread process ch2 earlier than 
#notifyBarrierReceived by netty thread.
 * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
 * After that, the previous scheduled ch1 is performed from mailbox by task 
thread, then it causes the IllegalArgumentException inside 
SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the initial 
assumption that checkpoint is executed in incremental way for aligned mode.

The key problem is that we can not remove the checkpoint action from mailbox 
queue before the next checkpoint is going to execute now. One possible solution 
is that we record the previous aborted checkpoint id inside 
SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued 
checkpoint inside mailbox is executing, it will exit directly if found the 
checkpoint id was already aborted before.

  was:
The race condition issue happens as follow:
 * ch1 is received from network by netty thread and schedule the ch1 into 
mailbox via #notifyBarrierReceived
 * ch2 is received from network by netty thread, but before calling 
#notifyBarrierReceived this barrier was inserted into channel's data queue in 
advance. Then it would cause task thread process ch2 earlier than 
#notifyBarrierReceived by netty thread.
 * Task thread would execute checkpoint for ch2 directly because ch2 > ch1.
 * After that, the previous scheduled ch1 is performed from mailbox by task 
thread, then it causes the IllegalArgumentException inside 
SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
assumption that checkpoint is executed in incremental way. 

One possible solution for this race condition is inserting the received barrier 
into channel's data queue after calling #notifyBarrierReceived, then we can 
make the assumption that the checkpoint is always triggered by netty thread, to 
simplify the current situation that checkpoint might be triggered either by 
task thread or netty thread. 

To do so we can also avoid accessing #notifyBarrierReceived method by task 
thread while processing the barrier to simplify the logic inside 
CheckpointBarrierUnaligner.


> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and 
> #notifyBarrierReceived
> ---
>
> Key: FLINK-17994
> URL: https://issues.apache.org/jira/browse/FLINK-17994
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
> Fix For: 1.11.0
>
>
> The race condition issue happens as follow:
>  * ch1 is received from network for one input channel by netty thread and 
> schedule the ch1 into mailbox via #notifyBarrierReceived
>  * ch2 is received from network for another input channel by netty thread, 
> but before calling #notifyBarrierReceived this barrier was inserted into 
> channel's data queue in advance. Then it would cause task thread process ch2 
> earlier than #notifyBarrierReceived by netty thread.
>  * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
>  * After that, the previous scheduled ch1 is performed from mailbox by task 
> thread, then it causes the IllegalArgumentException inside 
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
> initial assumption that checkpoint is executed in incremental way for aligned 
> mode.
> The key problem is that we can not remove the checkpoint action from mailbox 
> queue before the next checkpoint is going to execute now. One possible 
> solution is that we record the previous aborted checkpoint id inside 
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued 
> checkpoint inside mailbox is executing, it will exit directly if found the 
> checkpoint id was already aborted before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived

2020-05-27 Thread Zhijiang (Jira)
Zhijiang created FLINK-17994:


 Summary: Fix the race condition between 
CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived
 Key: FLINK-17994
 URL: https://issues.apache.org/jira/browse/FLINK-17994
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Reporter: Zhijiang
Assignee: Zhijiang
 Fix For: 1.11.0


The race condition issue happens as follow:
 * ch1 is received from network by netty thread and schedule the ch1 into 
mailbox via #notifyBarrierReceived
 * ch2 is received from network by netty thread, but before calling 
#notifyBarrierReceived this barrier was inserted into channel's data queue in 
advance. Then it would cause task thread process ch2 earlier than 
#notifyBarrierReceived by netty thread.
 * Task thread would execute checkpoint for ch2 directly because ch2 > ch1.
 * After that, the previous scheduled ch1 is performed from mailbox by task 
thread, then it causes the IllegalArgumentException inside 
SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
assumption that checkpoint is executed in incremental way. 

One possible solution for this race condition is inserting the received barrier 
into channel's data queue after calling #notifyBarrierReceived, then we can 
make the assumption that the checkpoint is always triggered by netty thread, to 
simplify the current situation that checkpoint might be triggered either by 
task thread or netty thread. 

To do so we can also avoid accessing #notifyBarrierReceived method by task 
thread while processing the barrier to simplify the logic inside 
CheckpointBarrierUnaligner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17869) Fix the race condition of aborting unaligned checkpoint

2020-05-27 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17869:
-
Description: 
On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:

start -> in progress/abort -> stop.

The ChannelStateWriteResult is created during #start, and removed by #abort or 
#stop processes. There are some potential race conditions here:
 * #start is called while receiving the first barrier by netty thread and 
schedule to execute the checkpoint
 * The task thread might process cancel checkpoint and call #abort before 
performing the above respective checkpoint
 * The checkpoint can still be executed by task thread afterwards even thought 
the above abort happened before, because we can not remove the checkpoint 
action from mailbox during aborting.
 * While checkpoint executing, it will call `ChannelStateWriter#getWriteResult` 
then it would cause `IllegalStateException` because the respective result was 
already removed in advance during handling #abort method before.
 * Therefore it will cause unnecessary task failure during performing checkpoint

I guess we do not want to fail the task when one checkpoint is aborted by 
design. And the illegal state check during ChannelStateWriter#getWriteResult 
was mainly proposed for normal process validation I guess.

If we do not remove the ChannelStateWriteResult while handling #abort and rely 
on #stop to remove it, then it might probably exist another scenario that the 
checkpoint will never be performed after #start (we have another mechanism to 
exit the triggering checkpoint in advance if the abort is sent by 
CheckpointCoordinator), then the legacy ChannelStateWriteResult will be 
retained inside ChannelStateWriter long time.

Maybe the potential option to fix this issue is to let 
SubtaskCheckpointCoordinatorImpl handle the exception from 
ChannelStateWriter#getWriteResult properly to not fail the task in the aborted 
case.

  was:
On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:

start -> in progress/abort -> stop.

We must guarantee that #abort should be queued after #start, otherwise the 
aborted checkpoint might be started later again in the case of race condition.

There are two cases might trigger abort checkpoint:
 * CheckpointBarrierUnaligner#processEndOfPartition
 * CheckpointBarrierUnaligner#processCancellationBarrier

The current condition to execute abort checkpoint for above both cases is based 
on #isCheckpointPending(), which can not cover all the cases. The unaligned 
checkpoint might be triggered either by task thread via 
CheckpointBarrierUnaligner#processBarrier or netty thread via 
ThreadSafeUnaligner#notifyBarrierReceived. Anyway we should maintain the 
current triggered checkpoint id in order to handle both above abort cases 
properly.

Another bug is that during ChannelStateWriterImpl#abort, we should not remove 
the respective ChannelStateWriteResult. Otherwise it would throw 
IllegalArgumentException when ChannelStateWriterImpl#getWriteResult in the 
process of checkpoint. ChannelStateWriteResult should be created at #start 
method and only removed at #stop method.


> Fix the race condition of aborting unaligned checkpoint
> ---
>
> Key: FLINK-17869
> URL: https://issues.apache.org/jira/browse/FLINK-17869
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
> Fix For: 1.11.0
>
>
> On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:
> start -> in progress/abort -> stop.
> The ChannelStateWriteResult is created during #start, and removed by #abort 
> or #stop processes. There are some potential race conditions here:
>  * #start is called while receiving the first barrier by netty thread and 
> schedule to execute the checkpoint
>  * The task thread might process cancel checkpoint and call #abort before 
> performing the above respective checkpoint
>  * The checkpoint can still be executed by task thread afterwards even 
> thought the above abort happened before, because we can not remove the 
> checkpoint action from mailbox during aborting.
>  * While checkpoint executing, it will call 
> `ChannelStateWriter#getWriteResult` then it would cause 
> `IllegalStateException` because the respective result was already removed in 
> advance during handling #abort method before.
>  * Therefore it will cause unnecessary task failure during performing 
> checkpoint
> I guess we do not want to fail the task when one checkpoint is aborted by 
> design. And the illegal state check during ChannelStateWriter#getWriteResult 
> was mainly proposed for normal process validation I guess.
> If we do not remove the 

[jira] [Created] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

2020-05-27 Thread Zhijiang (Jira)
Zhijiang created FLINK-17992:


 Summary: Exception from RemoteInputChannel#onBuffer should not 
fail the whole NetworkClientHandler
 Key: FLINK-17992
 URL: https://issues.apache.org/jira/browse/FLINK-17992
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.10.1, 1.10.0
Reporter: Zhijiang
Assignee: Zhijiang
 Fix For: 1.11.0


RemoteInputChannel#onBuffer is invoked by 
CreditBasedPartitionRequestClientHandler while receiving and decoding the 
network data. #onBuffer can throw exceptions which would tag the error in 
client handler and fail all the added input channels inside handler. Then it 
would cause a tricky potential issue as following.

If the RemoteInputChannel is canceling by canceler thread, then the task thread 
might exit early than canceler thread terminate. That means the 
PartitionRequestClient might not be closed (triggered by canceler thread) while 
the new task attempt is already deployed into this TaskManger. Therefore the 
new task might reuse the previous PartitionRequestClient while requesting 
partitions, but note that the respective client handler was already tagged an 
error before during above RemoteInputChannel#onBuffer. It will cause the next 
round unnecessary failover.

It is hard to find this potential issue in production because it can be 
restored normal finally after one or more additional failover. We find this 
potential problem from UnalignedCheckpointITCase because it will define the 
precise restart times within configured failures.

The solution is to only fail the respective task when its internal 
RemoteInputChannel#onBuffer throws any exceptions instead of failing the whole 
channels inside client handler, then the client is still health and can also be 
reused by other input channels as long as it is not released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15507) Activate local recovery for RocksDB backends by default

2020-05-27 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-15507:


Assignee: Zakelly Lan  (was: Zhijiang)

> Activate local recovery for RocksDB backends by default
> ---
>
> Key: FLINK-15507
> URL: https://issues.apache.org/jira/browse/FLINK-15507
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Stephan Ewen
>Assignee: Zakelly Lan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> For the RocksDB state backend, local recovery has no overhead when 
> incremental checkpoints are used. 
> It should be activated by default, because it greatly helps with recovery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15507) Activate local recovery for RocksDB backends by default

2020-05-27 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-15507:


Assignee: Zhijiang  (was: Zakelly Lan)

> Activate local recovery for RocksDB backends by default
> ---
>
> Key: FLINK-15507
> URL: https://issues.apache.org/jira/browse/FLINK-15507
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Stephan Ewen
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> For the RocksDB state backend, local recovery has no overhead when 
> incremental checkpoints are used. 
> It should be activated by default, because it greatly helps with recovery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-17820) Memory threshold is ignored for channel state

2020-05-27 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-17820.


Merged in release-1.11: 05b97924572594ef244236a2f328177a2ec84fc4

Merged in master: 8b4fe87a74d3ec631350ebac4dfdf69094c802e3

> Memory threshold is ignored for channel state
> -
>
> Key: FLINK-17820
> URL: https://issues.apache.org/jira/browse/FLINK-17820
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Task
>Affects Versions: 1.11.0
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Config parameter state.backend.fs.memory-threshold is ignored for channel 
> state. Causing each subtask to have a file per checkpoint. Regardless of the 
> size of channel state (of this subtask).
> This also causes slow cleanup and delays the next checkpoint.
>  
> The problem is that {{ChannelStateCheckpointWriter.finishWriteAndResult}} 
> calls flush(); which actually flushes the data on disk.
>  
> From FSDataOutputStream.flush Javadoc:
> A completed flush does not mean that the data is necessarily persistent. Data 
> persistence can is only assumed after calls to close() or sync().
>  
> Possible solutions:
> 1. not to flush in {{ChannelStateCheckpointWriter.finishWriteAndResult (which 
> can lead to data loss in a wrapping stream).}}
> {{2. change }}{{FsCheckpointStateOutputStream.flush behavior}}
> {{3. wrap }}{{FsCheckpointStateOutputStream to prevent flush}}{{}}{{}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17572) Remove checkpoint alignment buffered metric from webui

2020-05-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116388#comment-17116388
 ] 

Zhijiang commented on FLINK-17572:
--

Already assigned to you, welcome to contribute!

> Remove checkpoint alignment buffered metric from webui
> --
>
> Key: FLINK-17572
> URL: https://issues.apache.org/jira/browse/FLINK-17572
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.0
>Reporter: Yingjie Cao
>Assignee: Nicholas Jiang
>Priority: Minor
> Fix For: 1.12.0
>
>
> After FLINK-16404, we never cache buffers while checkpoint barrier alignment, 
> so the checkpoint alignment buffered metric will be always 0, we should 
> remove it directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17572) Remove checkpoint alignment buffered metric from webui

2020-05-25 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-17572:


Assignee: Nicholas Jiang

> Remove checkpoint alignment buffered metric from webui
> --
>
> Key: FLINK-17572
> URL: https://issues.apache.org/jira/browse/FLINK-17572
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.0
>Reporter: Yingjie Cao
>Assignee: Nicholas Jiang
>Priority: Minor
> Fix For: 1.12.0
>
>
> After FLINK-16404, we never cache buffers while checkpoint barrier alignment, 
> so the checkpoint alignment buffered metric will be always 0, we should 
> remove it directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17565) Bump fabric8 version from 4.5.2 to 4.9.2

2020-05-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116373#comment-17116373
 ] 

Zhijiang commented on FLINK-17565:
--

Since this fix can resolve some known bugs, it is in compliance with the 
release rule. Makes sense for me. :) 

> Bump fabric8 version from 4.5.2 to 4.9.2
> 
>
> Key: FLINK-17565
> URL: https://issues.apache.org/jira/browse/FLINK-17565
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Canbin Zheng
>Assignee: Canbin Zheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> Currently, we are using a version of 4.5.2, it's better that we upgrade it to 
> 4.9.2, some of the reasons are as follows:
>  # It removed the use of reapers manually doing cascade deletion of 
> resources, leave it up to Kubernetes APIServer, which solves the issue of 
> FLINK-17566, more info: 
> [https://github.com/fabric8io/kubernetes-client/issues/1880]
>  # It introduced a regression in building Quantity values in 4.7.0, release 
> note [https://github.com/fabric8io/kubernetes-client/issues/1953].
>  # It provided better support for K8s 1.17, release note: 
> [https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0].
> For more release notes, please refer to [fabric8 
> releases|https://github.com/fabric8io/kubernetes-client/releases].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17916) Provide API to separate KafkaShuffle's Producer and Consumer to different jobs

2020-05-25 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-17916:


Assignee: Yuan Mei

> Provide API to separate KafkaShuffle's Producer and Consumer to different jobs
> --
>
> Key: FLINK-17916
> URL: https://issues.apache.org/jira/browse/FLINK-17916
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream, Connectors / Kafka
>Affects Versions: 1.11.0
>Reporter: Yuan Mei
>Assignee: Yuan Mei
>Priority: Major
> Fix For: 1.11.0
>
>
> Follow up of FLINK-15670
> *Separate sink (producer) and source (consumer) to different jobs*
>  * In the same job, a sink and a source are recovered independently according 
> to regional failover. However, they share the same checkpoint coordinator and 
> correspondingly, share the same global checkpoint snapshot.
>  * That says if the consumer fails, the producer can not commit written data 
> because of two-phase commit set-up (the producer needs a checkpoint-complete 
> signal to complete the second stage)
>  * Same applies to the producer
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15670) Kafka Shuffle: uses Kafka as a message bus to shuffle and persist data at the same time

2020-05-24 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-15670:
-
Summary: Kafka Shuffle: uses Kafka as a message bus to shuffle and persist 
data at the same time  (was: Provide a Kafka Source/Sink pair that aligns 
Kafka's Partitions and Flink's KeyGroups)

> Kafka Shuffle: uses Kafka as a message bus to shuffle and persist data at the 
> same time
> ---
>
> Key: FLINK-15670
> URL: https://issues.apache.org/jira/browse/FLINK-15670
> Project: Flink
>  Issue Type: New Feature
>  Components: API / DataStream, Connectors / Kafka
>Reporter: Stephan Ewen
>Assignee: Yuan Mei
>Priority: Major
>  Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This Source/Sink pair would serve two purposes:
> 1. You can read topics that are already partitioned by key and process them 
> without partitioning them again (avoid shuffles)
> 2. You can use this to shuffle through Kafka, thereby decomposing the job 
> into smaller jobs and independent pipelined regions that fail over 
> independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17869) Fix the race condition of aborting unaligned checkpoint

2020-05-22 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17869:
-
Description: 
On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:

start -> in progress/abort -> stop.

We must guarantee that #abort should be queued after #start, otherwise the 
aborted checkpoint might be started later again in the case of race condition.

There are two cases might trigger abort checkpoint:
 * CheckpointBarrierUnaligner#processEndOfPartition
 * CheckpointBarrierUnaligner#processCancellationBarrier

The current condition to execute abort checkpoint for above both cases is based 
on #isCheckpointPending(), which can not cover all the cases. The unaligned 
checkpoint might be triggered either by task thread via 
CheckpointBarrierUnaligner#processBarrier or netty thread via 
ThreadSafeUnaligner#notifyBarrierReceived. Anyway we should maintain the 
current triggered checkpoint id in order to handle both above abort cases 
properly.

Another bug is that during ChannelStateWriterImpl#abort, we should not remove 
the respective ChannelStateWriteResult. Otherwise it would throw 
IllegalArgumentException when ChannelStateWriterImpl#getWriteResult in the 
process of checkpoint. ChannelStateWriteResult should be created at #start 
method and only removed at #stop method.

  was:
On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:

start -> in progress/abort -> stop.

We must guarantee that #abort should be queued after #start, otherwise the 
aborted checkpoint might be started later again in the case of race condition.

There are two cases might trigger abort checkpoint:
 * One is CheckpointBarrierUnaligner#processEndOfPartition, which should abort 
all the current and future checkpoints, no need to judge the condition 
`isCheckpointPending()` as current code did. 
 * Another is CheckpointBarrierUnaligner#processCancellationBarrier, which 
should only abort the respective checkpoint id if already triggered before.

The unaligned checkpoint might be triggered either by task thread or netty 
thread inside ThreadSafeUnaligner. Anyway we should know the current triggered 
checkpoint id in order to handle both above cases properly.

Another bug is that during ChannelStateWriterImpl#abort, we should not remove 
the respective ChannelStateWriteResult. Otherwise it would throw 
IllegalArgumentException when ChannelStateWriterImpl#getWriteResult in the 
process of checkpoint. ChannelStateWriteResult should be created at #start 
method and only removed at #stop method.


> Fix the race condition of aborting unaligned checkpoint
> ---
>
> Key: FLINK-17869
> URL: https://issues.apache.org/jira/browse/FLINK-17869
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
> Fix For: 1.11.0
>
>
> On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:
> start -> in progress/abort -> stop.
> We must guarantee that #abort should be queued after #start, otherwise the 
> aborted checkpoint might be started later again in the case of race condition.
> There are two cases might trigger abort checkpoint:
>  * CheckpointBarrierUnaligner#processEndOfPartition
>  * CheckpointBarrierUnaligner#processCancellationBarrier
> The current condition to execute abort checkpoint for above both cases is 
> based on #isCheckpointPending(), which can not cover all the cases. The 
> unaligned checkpoint might be triggered either by task thread via 
> CheckpointBarrierUnaligner#processBarrier or netty thread via 
> ThreadSafeUnaligner#notifyBarrierReceived. Anyway we should maintain the 
> current triggered checkpoint id in order to handle both above abort cases 
> properly.
> Another bug is that during ChannelStateWriterImpl#abort, we should not remove 
> the respective ChannelStateWriteResult. Otherwise it would throw 
> IllegalArgumentException when ChannelStateWriterImpl#getWriteResult in the 
> process of checkpoint. ChannelStateWriteResult should be created at #start 
> method and only removed at #stop method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17873) Add check for max concurrent checkpoints under UC mode

2020-05-21 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-17873:


Assignee: Yuan Mei

> Add check for max concurrent checkpoints under UC mode
> --
>
> Key: FLINK-17873
> URL: https://issues.apache.org/jira/browse/FLINK-17873
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Yuan Mei
>Assignee: Yuan Mei
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, the UC mode only supports max concurrent checkpoint number = 1.
> So we need to check whether the configured max allowed checkpoints are more 
> than 1 under the UC mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-17823) Resolve the race condition while releasing RemoteInputChannel

2020-05-21 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang resolved FLINK-17823.
--
Resolution: Fixed

Merged in release-1.11: 3eb1075ded64da20e6f7a5bc268f455eaf6573eb

Will merge to master later and update the info.

> Resolve the race condition while releasing RemoteInputChannel
> -
>
> Key: FLINK-17823
> URL: https://issues.apache.org/jira/browse/FLINK-17823
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.11.0
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> RemoteInputChannel#releaseAllResources might be called by canceler thread. 
> Meanwhile, the task thread can also call RemoteInputChannel#getNextBuffer. 
> There probably cause two potential problems:
>  * Task thread might get null buffer after canceler thread already released 
> all the buffers, then it might cause misleading NPE in getNextBuffer.
>  * Task thread and canceler thread might pull the same buffer concurrently, 
> which causes unexpected exception when the same buffer is recycled twice.
> The solution is to properly synchronize the buffer queue in release method to 
> avoid the same buffer pulled by both canceler thread and task thread. And in 
> getNextBuffer method, we add some explicit checks to avoid misleading NPE and 
> hint some valid exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17869) Fix the race condition of aborting unaligned checkpoint

2020-05-21 Thread Zhijiang (Jira)
Zhijiang created FLINK-17869:


 Summary: Fix the race condition of aborting unaligned checkpoint
 Key: FLINK-17869
 URL: https://issues.apache.org/jira/browse/FLINK-17869
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Reporter: Zhijiang
Assignee: Zhijiang
 Fix For: 1.11.0


On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:

start -> in progress/abort -> stop.

We must guarantee that #abort should be queued after #start, otherwise the 
aborted checkpoint might be started later again in the case of race condition.

There are two cases might trigger abort checkpoint:
 * One is CheckpointBarrierUnaligner#processEndOfPartition, which should abort 
all the current and future checkpoints, no need to judge the condition 
`isCheckpointPending()` as current code did. 
 * Another is CheckpointBarrierUnaligner#processCancellationBarrier, which 
should only abort the respective checkpoint id if already triggered before.

The unaligned checkpoint might be triggered either by task thread or netty 
thread inside ThreadSafeUnaligner. Anyway we should know the current triggered 
checkpoint id in order to handle both above cases properly.

Another bug is that during ChannelStateWriterImpl#abort, we should not remove 
the respective ChannelStateWriteResult. Otherwise it would throw 
IllegalArgumentException when ChannelStateWriterImpl#getWriteResult in the 
process of checkpoint. ChannelStateWriteResult should be created at #start 
method and only removed at #stop method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-17315) UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointMassivelyParallel failed in timeout

2020-05-21 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-17315.

Resolution: Duplicate

> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointMassivelyParallel 
> failed in timeout
> -
>
> Key: FLINK-17315
> URL: https://issues.apache.org/jira/browse/FLINK-17315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Tests
>Reporter: Zhijiang
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Build: 
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=45cc9205-bdb7-5b54-63cd-89fdc0983323]
> logs
> {code:java}
> 2020-04-21T20:25:23.1139147Z [ERROR] Errors: 
> 2020-04-21T20:25:23.1140908Z [ERROR]   
> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointMassivelyParallel:80->execute:87
>  » TestTimedOut
> 2020-04-21T20:25:23.1141383Z [INFO] 
> 2020-04-21T20:25:23.1141675Z [ERROR] Tests run: 1525, Failures: 0, Errors: 1, 
> Skipped: 36
> {code}
>  
> I run it in my local machine and it almost takes about 40 seconds to finish, 
> so the configured 90 seconds timeout seems not enough in heavy load 
> environment sometimes. Maybe we can remove the timeout in tests since azure 
> already configured to monitor the timeout.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17315) UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointMassivelyParallel failed in timeout

2020-05-21 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112838#comment-17112838
 ] 

Zhijiang commented on FLINK-17315:
--

I close this ticket and try to resolve all the issues in 
[FLINK-17768|https://issues.apache.org/jira/browse/FLINK-17768]

> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointMassivelyParallel 
> failed in timeout
> -
>
> Key: FLINK-17315
> URL: https://issues.apache.org/jira/browse/FLINK-17315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Tests
>Reporter: Zhijiang
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Build: 
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=45cc9205-bdb7-5b54-63cd-89fdc0983323]
> logs
> {code:java}
> 2020-04-21T20:25:23.1139147Z [ERROR] Errors: 
> 2020-04-21T20:25:23.1140908Z [ERROR]   
> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointMassivelyParallel:80->execute:87
>  » TestTimedOut
> 2020-04-21T20:25:23.1141383Z [INFO] 
> 2020-04-21T20:25:23.1141675Z [ERROR] Tests run: 1525, Failures: 0, Errors: 1, 
> Skipped: 36
> {code}
>  
> I run it in my local machine and it almost takes about 40 seconds to finish, 
> so the configured 90 seconds timeout seems not enough in heavy load 
> environment sometimes. Maybe we can remove the timeout in tests since azure 
> already configured to monitor the timeout.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17826) Add missing custom query support on new jdbc connector

2020-05-20 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112398#comment-17112398
 ] 

Zhijiang commented on FLINK-17826:
--

If it is regarded as a bug fix, then it fits the rule to make it in 
release-1.11. We should adjust the type field as bug in Jira board accordingly.

> Add missing custom query support on new jdbc connector
> --
>
> Key: FLINK-17826
> URL: https://issues.apache.org/jira/browse/FLINK-17826
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / JDBC
>Reporter: Jark Wu
>Assignee: Flavio Pompermaier
>Priority: Major
> Fix For: 1.11.0
>
>
> In FLINK-17361, we added custom query on JDBC tables, but missing to add the 
> same ability on new jdbc connector (i.e. 
> {{JdbcDynamicTableSourceSinkFactory}}). 
> In the new jdbc connector, maybe we should call it {{scan.query}} to keep 
> consistent with other scan options, besides we need to make {{"table-name"}} 
> optional, but add validation that "table-name" and "scan.query" shouldn't 
> both be empty, and "table-name" must not be empty when used as sink.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


<    1   2   3   4   5   6   7   8   9   10   >