[ https://issues.apache.org/jira/browse/FLINK-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932660#comment-16932660 ]

Aaron Levin commented on FLINK-14112:
-------------------------------------

[~Tison] [~till.rohrmann] thanks for responding so quickly! I agree that it's 
unlikely someone will delete the znodes in {{/flink}}, but figured that in the 
rare case it does happen it might be nice to hard fail. If you decide to 
{{WONTFIX}}, I understand! :) 

[~Tison] there were many lines like the following in the logs:
{noformat}
[2019-08-21 21:38:48.549762] 2019-08-21 21:38:48,549 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Closing 
TaskExecutor connection 62be23badd5a51b757c221cc750881cb because: 
ResourceManager leader changed to new address null
{noformat}
and
{noformat}
[2019-08-21 21:39:11.008135] 2019-08-21 21:39:11,007 WARN  
akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed 
with java.net.ConnectException: Connection refused: 
qa-flinkjobmanager--087a757cd7fe67436.northwest.stripe.io/10.100.46.70:6123
{noformat}
The second example is {{WARN}}, so we could potentially configure the logger to 
suppress it (a sketch follows). Anyway, I think most of the log churn is likely 
due to the way {{null}} is handled in this code.
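For the {{WARN}} suppression, something like this in {{log4j.properties}} should 
do it (a sketch; assumes the stock log4j 1.x setup that ships with Flink 1.8/1.9):
{noformat}
# Raise the threshold for the noisy Akka transport logger so the
# "Remote connection to [null] failed" warnings are dropped.
log4j.logger.akka.remote.transport.netty.NettyTransport=ERROR
{noformat}
And roughly what I mean by "hard fail" (illustrative only: 
{{LeaderRetrievalListener}} is Flink's real interface, but 
{{FailFastLeaderListener}} and the exit-on-null behaviour are made up here, not 
what Flink actually does):
{code:java}
import java.util.UUID;

import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;

/**
 * Hypothetical listener that treats a revoked (null) leader address as fatal
 * instead of logging "leader changed to new address null" and waiting forever.
 */
public class FailFastLeaderListener implements LeaderRetrievalListener {

    @Override
    public void notifyLeaderAddress(String leaderAddress, UUID leaderSessionID) {
        if (leaderAddress == null) {
            // Exit so an external supervisor (systemd, Kubernetes, ...)
            // restarts the process with a clean slate.
            System.err.println("Leader address revoked; exiting for supervised restart.");
            System.exit(1);
        }
        // ... otherwise reconnect to the new leader as usual ...
    }

    @Override
    public void handleError(Exception exception) {
        // Errors from the retrieval service (e.g. a lost ZooKeeper connection)
        // are treated as fatal too in this sketch.
        exception.printStackTrace();
        System.exit(1);
    }
}
{code}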

> Removing zookeeper state should cause the task manager and job managers to 
> restart
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-14112
>                 URL: https://issues.apache.org/jira/browse/FLINK-14112
>             Project: Flink
>          Issue Type: Wish
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.1, 1.9.0
>            Reporter: Aaron Levin
>            Priority: Minor
>
> Suppose you have a Flink application running on a cluster with the following 
> configuration:
> {noformat}
> high-availability.zookeeper.path.root: /flink
> {noformat}
> Now suppose you delete all the znodes within {{/flink}} (see the zkCli sketch 
> below). I experienced the following:
>  * massive amount of logging
>  * application did not restart
>  * task manager did not crash or restart
>  * job manager did not crash or restart
> From this state I had to restart all the task managers and all the job 
> managers in order for the Flink application to recover.
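> For reference, the deletion can be reproduced with ZooKeeper's CLI (a sketch; 
> {{deleteall}} needs ZooKeeper 3.5+, on 3.4 use {{rmr}} instead, and the server 
> address below is a placeholder):
> {noformat}
> zkCli.sh -server zookeeper-host:2181
> deleteall /flink
> {noformat}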
> It would be desirable for the Task Managers and Job Managers to crash if the 
> znode is not available (though perhaps you all have thought about this more 
> deeply than I!)


