I don’t have logs available yet, but I do have some information from ZK.

The culprit appears to be the /flink/default/leader/dispatcher_lock znode.

I took a look at the dispatcher code here: 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785

And it looks to me like, when leadership is granted, the dispatcher performs job 
recovery for all jobs before it writes the new leader information to the 
/flink/default/leader/dispatcher_lock znode.
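
In rough pseudocode (this is my own paraphrase of the linked method, so the names 
and structure below are mine and not necessarily the real ones), the flow seems 
to be:

    // My paraphrase of Dispatcher#grantLeadership -- not the actual Flink code.
    public void grantLeadership(final UUID newLeaderSessionID) {
        // 1. Recover every persisted JobGraph from the submitted job graph store
        //    (S3 in our case). With ~250 jobs this looks like the slow part.
        Collection<JobGraph> recoveredJobs = recoverJobs();

        // 2. Only after recovery completes is leadership confirmed, which is what
        //    writes the new leader info to /flink/default/leader/dispatcher_lock.
        leaderElectionService.confirmLeaderSessionID(newLeaderSessionID);

        // 3. The recovered jobs are then run under the new fencing token
        //    (runRecoveredJobs is just a placeholder name).
        runRecoveredJobs(recoveredJobs, newLeaderSessionID);
    }

If that reading is right, the dispatcher_lock won't be written until every one of 
our ~250 job graphs has been fetched from S3, which would line up with the delay 
we're seeing.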

So this leaves me with three questions:

1) Why does the web monitor specifically have to wait for the dispatcher?
2) Is there a reason why the dispatcher can’t write the lock until after job 
recovery?
3) Is there anything I can/should be doing to speed up job recovery?

Thanks!

-Joey

On Aug 2, 2018, at 9:24 AM, Joey Echeverria <jechever...@splunk.com> wrote:

Thanks for the tips, Gary and Vino. I’ll try to reproduce it with test data and 
see if I can post some logs.

I’ll also watch the leader znode to see whether the election isn’t happening or 
whether the leader information just isn’t being retrieved.

Thanks!

-Joey

On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:

Hi Joey,

If the other components (e.g., Dispatcher, ResourceManager) are able to finish
the leader election in a timely manner, I currently do not see a reason why it
should take the REST server 20-45 minutes.

You can check the contents of znode /flink/.../leader/rest_server_lock to see
if there is indeed no leader, or if the leader information cannot be retrieved
from ZooKeeper.
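
If it helps, something like the following can dump the znode contents. This is 
just a sketch: the quorum address and znode path are placeholders, and it assumes 
the leader address and session ID are written with ObjectOutputStream#writeUTF / 
#writeObject, which is how it looks in ZooKeeperLeaderElectionService.

    import java.io.ByteArrayInputStream;
    import java.io.ObjectInputStream;
    import java.util.UUID;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class RestLeaderCheck {
        public static void main(String[] args) throws Exception {
            // Quorum address and znode path are placeholders for your environment.
            final String quorum = "zookeeper:2181";
            final String path = "/flink/default/leader/rest_server_lock";

            final CuratorFramework client = CuratorFrameworkFactory.newClient(
                    quorum, new ExponentialBackoffRetry(1000, 3));
            client.start();
            try {
                // Throws KeeperException.NoNodeException if the znode does not exist yet.
                final byte[] data = client.getData().forPath(path);
                if (data == null || data.length == 0) {
                    System.out.println("znode exists but contains no leader information yet");
                    return;
                }
                try (ObjectInputStream in =
                        new ObjectInputStream(new ByteArrayInputStream(data))) {
                    // Assumed serialization format: leader address followed by session ID.
                    System.out.println("leader address:    " + in.readUTF());
                    System.out.println("leader session id: " + (UUID) in.readObject());
                }
            } finally {
                client.close();
            }
        }
    }

If the data never appears, the election itself is the problem; if it is there but 
the REST handler still returns 503, the retrieval side is worth a closer look.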

If you can reproduce this in a staging environment with some test jobs, I'd
like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).

Best,
Gary

On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com> wrote:
I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single Job 
Manager running. I’m using Zookeeper to store the fencing/leader information 
and S3 to store the job manager state. We’ve been running around 250 streaming 
jobs, and we’ve noticed that if the job manager pod is deleted, it takes 
something like 20-45 minutes for the job manager’s REST endpoints and web UI to 
become available. Until they become available, we get a 503 response from the 
HTTP server with the message “Could not retrieve the redirect address of the 
current leader. Please try to refresh.”
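
For reference, the window can be measured by just polling the REST API after the 
pod is deleted; a rough sketch, where the service name and port are placeholders 
for our Kubernetes setup:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.time.Duration;
    import java.time.Instant;

    public class RestAvailabilityProbe {
        public static void main(String[] args) throws Exception {
            // Placeholder address for the JobManager service exposed in Kubernetes.
            final URL overview = new URL("http://flink-jobmanager:8081/overview");
            final Instant start = Instant.now();
            while (true) {
                int code;
                final HttpURLConnection conn = (HttpURLConnection) overview.openConnection();
                conn.setConnectTimeout(5_000);
                conn.setReadTimeout(5_000);
                try {
                    code = conn.getResponseCode();
                } catch (Exception e) {
                    code = -1; // connection refused while the pod is restarting
                } finally {
                    conn.disconnect();
                }
                if (code == 200) {
                    System.out.println("REST endpoint available after "
                            + Duration.between(start, Instant.now()));
                    return;
                }
                System.out.println("still unavailable (HTTP " + code + ")");
                Thread.sleep(10_000);
            }
        }
    }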

Has anyone else run into this?

Are there any configuration settings I should be looking at to speed up the 
availability of the HTTP endpoints?

Thanks!

-Joey


