[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2021-04-23 Thread Chen Qin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330890#comment-17330890
 ] 

Chen Qin edited comment on FLINK-10052 at 4/23/21, 4:13 PM:


here is another exception we observed in another job, may or may not caused by 
this pr.

{code:java}
2021-04-23 11:09:03,388 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e26_1617655625710_8692_01_000115 because: ResourceManager leader 
changed to new address null
2021-04-23 11:09:03,391 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- 
USER_AGGREGATE_STATE.user_signal_v2.SINK-async (200/360) 
(bf815073df08c3426bf41b63d74510fb) switched from RUNNING to FAILED on 
container_e26_1617655625710_8692_01_000115 @ .ec2.pin220.com 
(dataPort=46719).
org.apache.flink.util.FlinkException: ResourceManager leader changed to new 
address null
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:539)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612)
at akka.actor.ActorCell.invoke(ActorCell.scala:581)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
at akka.dispatch.Mailbox.run(Mailbox.scala:229)
at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}



was (Author: foxss):
here is another exception we observed in another job after apply this pr

{code:java}
2021-04-23 11:09:03,388 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e26_1617655625710_8692_01_000115 because: ResourceManager leader 
changed to new address null
2021-04-23 11:09:03,391 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- 
USER_AGGREGATE_STATE.user_signal_v2.SINK-async (200/360) 
(bf815073df08c3426bf41b63d74510fb) switched from RUNNING to FAILED on 
container_e26_1617655625710_8692_01_000115 @ .ec2.pin220.com 
(dataPort=46719).
org.apache.flink.util.FlinkException: ResourceManager leader changed to new 
address null
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at 

[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2021-04-23 Thread Chen Qin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330858#comment-17330858
 ] 

Chen Qin edited comment on FLINK-10052 at 4/23/21, 4:04 PM:


run load testing on pr, seems suspended message no longer trigger leadership 
lost and job restart. At same time, found following exception when job restarts 
caused by other user jar issue.

{code:java}
2021-04-21 18:24:44,639 ERROR 
org.apache.flink.runtime.taskexecutor.TaskExecutor- Fatal error 
occurred in TaskExecutor akka.tcp://flink@xxx:33435/user/rpc/taskmanager_0.
org.apache.flink.util.FlinkException: Unhandled error in 
ZooKeeperLeaderRetrievalService:Background operation retry gave up
at 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.unhandledError(ZooKeeperLeaderRetrievalService.java:208)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at 
org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss
at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
... 10 more
{code}


was (Author: foxss):
run load testing on pr, found following exception when job restarts.

{code:java}
2021-04-21 18:24:44,639 ERROR 
org.apache.flink.runtime.taskexecutor.TaskExecutor- Fatal error 
occurred in TaskExecutor akka.tcp://flink@xxx:33435/user/rpc/taskmanager_0.
org.apache.flink.util.FlinkException: Unhandled error in 
ZooKeeperLeaderRetrievalService:Background operation retry gave up
at 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.unhandledError(ZooKeeperLeaderRetrievalService.java:208)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at 
org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
at 

[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2021-03-30 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311547#comment-17311547
 ] 

Andrew.D.lin edited comment on FLINK-10052 at 3/30/21, 2:21 PM:


Note change here![FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677]

I think it should not be increased logic here,[handleStateChange|#L152]],Here 
is just to add logs at the beginning.

Connection State processing should be managed by 
[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (support session and Standard)


was (Author: andrew_lin):
Note change here!FLINK-18677

I think it should not be increased logic here,[handleStateChange|#L152]],Here 
is just to add logs at the beginning.

Connection State processing should be managed by 
[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (support session and Standard)

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Zili Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2021-03-30 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311547#comment-17311547
 ] 

Andrew.D.lin edited comment on FLINK-10052 at 3/30/21, 2:21 PM:


Note change here!FLINK-18677

I think it should not be increased logic here,[handleStateChange|#L152]],Here 
is just to add logs at the beginning.

Connection State processing should be managed by 
[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (support session and Standard)


was (Author: andrew_lin):
Note change here![FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677]

I think it should not be increased logic 
here,[handleStateChange|[https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152]|https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152],]Here
 is just to add logs at the beginning.

Connection State processing should be managed 
by[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (support session and Standard)

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Zili Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887268#comment-16887268
 ] 

lamber-ken edited comment on FLINK-10052 at 7/17/19 5:28 PM:
-

[~Tison] yes, I checke the shaded class file by +javap -v+ command.

The main thing is that maven-shaded-plugin help us relocate 
+org.apache.zookeeper.ClientCnxn$EventThread+

I'll update PR#9066. It's welcome if you can help me to review the pr. 


was (Author: lamber-ken):
[~Tison] yes, I checke the shaded class file by +javap -v+ command, I'll update 
PR#9066. It's welcome if you can help me to review the pr.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886699#comment-16886699
 ] 

lamber-ken edited comment on FLINK-10052 at 7/17/19 6:01 AM:
-

[~Tison] (y), I have some points need to talk with you.

First, for your first point, I thought it yesterday and wont create a new 
curator Jira like your CURATOR-532 that user can manually config ZooKeeper3.4.x 
Compatibility, but I give up that idea, because I found that it also needs to 
reflect +org.apache.zookeeper.ClientCnxn$EventThread+ which may throw 
ClassNotFoundException because of shading. Click it for more detail 
[InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java].

Second, for your second point, I am not familiar with LeaderSeclector currently 
and I'm learning about it. I also think it is a ideally way we can just use 
SessionConnectionStateErrorPolicy directly in curator-4.x

Third, I don't understand the meaning of a flink scope leader latch

 


was (Author: lamber-ken):
[~Tison] (y), I have some points need to talk with you.

First, for your first point, I thought it yesterday and wont create a new 
curator Jira like your CURATOR-532 that use can manually config ZooKeeper3.4.x 
Compatibility, but I give up that idea, because I found that it also needs to 
reflect +org.apache.zookeeper.ClientCnxn$EventThread+ which may throw 
ClassNotFoundException because of shading. Click it for more detail 
[InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java].

Second, for your second point, I am not familiar with LeaderSeclector currently 
and I'm learning about it. I also think it is a ideally way we can just use 
SessionConnectionStateErrorPolicy directly in curator-4.x

Third, I don't understand the meaning of a flink scope leader latch

 

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-16 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886663#comment-16886663
 ] 

lamber-ken edited comment on FLINK-10052 at 7/17/19 4:06 AM:
-

[~Tison], right, it's a better way to upgrate curator dependcy to fix this 
ideally, but there's a problem that curator-4.x detect the version of zookeeper 
by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or 
not, like bellow.
{code:java}
Class.forName("org.apache.admin.ZooKeeperAdmin");
{code}
But flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+ , so it'll detect 
failed.

I think two ways to fix this issue,
First, rewrite +LeaderLatch#handleStateChange+ at flink-shaded-curator 
moduleflink, like [PR#9066|https://github.com/apache/flink/pull/9066].
Second, it also could be achievable by using Curator's LeaderSelector instead 
of the LeaderLatch as mentioned in issue description 







was (Author: lamber-ken):
[~Tison], right, it's a better way to upgrate curator dependcy to fix this 
ideally, but there's a problem that curator-4.x detect the version of zookeeper 
by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or 
not, like bellow.
{code:java}
Class.forName("org.apache.admin.ZooKeeperAdmin");
{code}
But flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+ , so it'll detect 
failed.

I think two ways to fix this issue,
First, rewrite +LeaderLatch#handleStateChange+ at flink-shaded-curator 
moduleflink, like [PR#9066|https://github.com/apache/flink/pull/9066].
Seconde, it also could be achievable by using Curator's LeaderSelector instead 
of the LeaderLatch as mentioned in issue description 






> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-13 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884567#comment-16884567
 ] 

lamber-ken edited comment on FLINK-10052 at 7/14/19 5:30 AM:
-

Hi, all. 

BTW, if we upgrade curator dependency to 4.x, there's a problem that 
curator-4.x detect the version of zookeeper by test whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not.

But flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+  So it will not fix 
this issue by upgrading curator's version.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].


was (Author: lamber-ken):
Hi, all. 

BTW, if we upgrade curator dependency to 4.x, there's a problem that 
curator-4.x detect the version of zookeeper by test whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not.

But flink-runtime module will shade +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+  So it will not fix 
this issue by upgrading curator's version.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-13 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884567#comment-16884567
 ] 

lamber-ken edited comment on FLINK-10052 at 7/14/19 5:29 AM:
-

Hi, all. 

BTW, if we upgrade curator dependency to 4.x, there's a problem that 
curator-4.x detect the version of zookeeper by test whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not.

But flink-runtime module will shade +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+  So it will not fix 
this issue by upgrading curator's version.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].


was (Author: lamber-ken):
Hi, all. 

BTW, if we upgrade curator dependency to 4.x, there's a problem that 
curator-4.x detect the version of zookeeper by test 

whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not. But 
flink-runtime module will shade +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+  So it will not fix 
this issue by upgrading curator's version.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-10 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882303#comment-16882303
 ] 

lamber-ken edited comment on FLINK-10052 at 7/10/19 5:46 PM:
-

Thanks for remind me [~elevy] that I created a duplicated jira 
[FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189], I think may 
nobody solved this issuse before because of latest code of flink in github.

I solve this problem by rewrite +LeaderLatch#handleStateChange+ at 
flink-shaded-curator module, and any suggestion will be welcome, thanks.


was (Author: lamber-ken):
Hi all, I'm very sorry that I created a duplicated jira 
[FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189], I think may 
nobody solved this issuse before because of latest code of flink in github. 

I solve this problem by rewrite +LeaderLatch#handleStateChange+ at 
flink-shaded-curator module, and any suggestion will be welcome, thanks.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2018-09-06 Thread JIRA


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605551#comment-16605551
 ] 

Dominik Wosiński edited comment on FLINK-10052 at 9/6/18 9:48 AM:
--

This may be also possibly related to : 
[FLINK-5996|https://issues.apache.org/jira/browse/FLINK-5996]


was (Author: wosinsan):
https://issues.apache.org/jira/browse/FLINK-5996

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2018-09-06 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605551#comment-16605551
 ] 

Chesnay Schepler edited comment on FLINK-10052 at 9/6/18 9:48 AM:
--

https://issues.apache.org/jira/browse/FLINK-5996


was (Author: wosinsan):
[#https://issues.apache.org/jira/browse/FLINK-5996]

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2018-08-03 Thread Elias Levy (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568832#comment-16568832
 ] 

Elias Levy edited comment on FLINK-10052 at 8/3/18 9:46 PM:


[~till.rohrmann] as I mentioned in FLINK-10011, it may not be necessary to 
replace the {{LeaderLatch}} Curator recipe to avoid loosing leadership during 
temporary connection failures.

The Curator error handling 
[documentation|https://curator.apache.org/errors.html] talks about a 
{{SessionConnectionStateErrorPolicy}} that treats {{SUSPENDED}} and {{LOST}} 
connection states differently.  And this 
[test|https://github.com/apache/curator/blob/d502dde1c4601b2abc6d831d764561a73316bf00/curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java#L72-L146]
 shows how leadership is not lost with a {{LeaderLatch}} and that policy.  The 
[code|https://github.com/apache/curator/blob/ed3082ecfebc332ba96da7a5bda4508a1985db6e/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625-L631]
 implementing the policy.  And [this 
shows|https://github.com/apache/curator/blob/5920c744508afd678a20309313e1b8d78baac0c4/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L298-L314]
 that Curator will inject a session expiration even while it is in 
{{SUSPENDED}} state, so that a disconnected client won't continue to think it 
is leader past its session expiration.

So it is possible that all we need to do is call 
{{connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())}} in the 
{{CuratorFrameworkFactory}}.


was (Author: elevy):
[~till.rohrmann] as I mentioned in FLINK-10011, it may not be necessary to 
replace the {{LeaderLatch}} Curator recipe to avoid loosing leadership during 
temporary connection failures.

The Curator error handling 
[documentation|https://curator.apache.org/errors.html] talks about a 
{{SessionConnectionStateErrorPolicy}} that treats {{SUSPENDED }}and {{LOST}} 
connection states differently.  And this 
[test|https://github.com/apache/curator/blob/d502dde1c4601b2abc6d831d764561a73316bf00/curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java#L72-L146]
 shows how leadership is not lost with a {{LeaderLatch}} and that policy.  The 
[code|https://github.com/apache/curator/blob/ed3082ecfebc332ba96da7a5bda4508a1985db6e/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625-L631]
 implementing the policy.  And [this 
shows|https://github.com/apache/curator/blob/5920c744508afd678a20309313e1b8d78baac0c4/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L298-L314]
 that Curator will inject a session expiration even while it is in {{SUSPENDED 
}}state, so that a disconnected client won't continue to think it is leader 
past its session expiration.

So it is possible that all we need to do is call 
{{connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())}} in the 
{{CuratorFrameworkFactory}}.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)