Hi Joey,

Thank you for finding these issues and creating the JIRAs for them.
Thanks, vino.

2018-08-07 8:18 GMT+08:00 Joey Echeverria <jechever...@splunk.com>:

> Thanks for the ping Vino.
>
> I created two JIRAs for the first two items:
>
> 1) https://issues.apache.org/jira/browse/FLINK-10077
> 2) https://issues.apache.org/jira/browse/FLINK-10078
>
> Regarding (3), we're doing some testing with different options for the
> state storage. I'll report back if we find anything significant there.
>
> -Joey
>
> On Aug 6, 2018, at 8:47 AM, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Joey,
>
> Did you create these JIRA issues based on Till's suggestion?
>
> If you didn't create them or you don't know how to do it, I can do it for
> you. But I won't do it right away, I will wait for a while.
>
> Thanks, vino.
>
> 2018-08-03 17:23 GMT+08:00 Till Rohrmann <trohrm...@apache.org>:
>
>> Hi Joey,
>>
>> your analysis is correct. Currently, the Dispatcher will first try to
>> recover all jobs before it confirms the leadership.
>>
>> 1) The Dispatcher provides much of the relevant information you see in
>> the web-ui. Without a leading Dispatcher, the web-ui cannot show much
>> information. This could be changed so that, when no Dispatcher is leader,
>> we simply cannot display certain information (number of running jobs, job
>> details, etc.). Could you create a JIRA issue to fix this problem?
>>
>> 2) The reason the Dispatcher first tries to recover the jobs before
>> confirming the leadership is that it wants to restore its internal state
>> before it becomes accessible to other components and, thus, before that
>> state can change. For example, the following problem could arise: assume
>> that you submit a job to the cluster. The cluster receives the JobGraph
>> and persists it in ZooKeeper. Before the Dispatcher can acknowledge the
>> job submission, it fails. The client sees the failure and tries to
>> re-submit the job. Now the Dispatcher is restarted and starts recovering
>> the persisted jobs. If we don't wait for this to complete, the retried
>> job submission could succeed first simply because it is faster. This
>> would, however, let the job recovery fail, because the Dispatcher is
>> already executing this job (due to the re-submission) and the assumption
>> is that recovered jobs are submitted first.
>>
>> The same applies if you submit a modified job with the same JobID as a
>> persisted job. Which job should the system then execute, the old one or
>> the newly submitted one? By waiting for the recovery to complete first,
>> we give precedence to the persisted jobs.
>>
>> One could also solve this problem slightly differently, by only blocking
>> job submission while a recovery is happening. However, one should check
>> that no other RPCs change the internal state in a way that interferes
>> with the job recovery.
>>
>> Could you maybe open a JIRA issue for solving this problem?
>>
>> 3) The job recovery is mainly limited by the connection to your
>> persistent storage system (HDFS or S3, I assume) where the JobGraphs are
>> stored. Alternatively, you could split the executed jobs across multiple
>> Flink clusters in order to decrease the number of jobs that need to be
>> recovered in case of a failure.
>>
>> Thanks a lot for reporting and analysing this problem. This is definitely
>> something we should improve!
>>
>> Cheers,
>> Till
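A minimal sketch of the alternative Till describes above (confirm leadership right away, but park job submissions until recovery has finished) could look roughly like this. All names here (RecoveryGatedDispatcher, recoverPersistedJobs, confirmLeadership, executeJob, and the placeholder JobGraph/Acknowledge types) are hypothetical and this is not Flink's actual Dispatcher code:

    import java.util.concurrent.CompletableFuture;

    public class RecoveryGatedDispatcher {

        // Completed once all persisted JobGraphs have been recovered.
        private final CompletableFuture<Void> recoveryDone = new CompletableFuture<>();

        // Called when leadership is granted: recovery runs asynchronously and the
        // leader information is written immediately, instead of after recovery.
        public void grantLeadership() {
            CompletableFuture
                    .runAsync(this::recoverPersistedJobs)
                    .whenComplete((ignored, error) -> {
                        if (error != null) {
                            recoveryDone.completeExceptionally(error);
                        } else {
                            recoveryDone.complete(null);
                        }
                    });
            confirmLeadership(); // REST/web UI can answer right away
        }

        // Submissions arriving during recovery only proceed after recoveryDone,
        // so a recovered job with the same JobID still takes precedence.
        public CompletableFuture<Acknowledge> submitJob(JobGraph jobGraph) {
            return recoveryDone.thenCompose(ignored -> executeJob(jobGraph));
        }

        // --- placeholders, assumed only for the sketch ---
        private void recoverPersistedJobs() { /* read persisted JobGraphs from ZooKeeper/S3 */ }
        private void confirmLeadership() { /* write leader info to the leader znode */ }
        private CompletableFuture<Acknowledge> executeJob(JobGraph jobGraph) {
            return CompletableFuture.completedFuture(new Acknowledge());
        }
        static class JobGraph {}
        static class Acknowledge {}
    }

With this ordering the REST endpoints could answer immediately after a failover, while a duplicate re-submission still queues behind recovery, so the persisted JobGraph with the same JobID keeps precedence, which is the property Till wants to preserve.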
>> On Fri, Aug 3, 2018 at 5:48 AM vino yang <yanghua1...@gmail.com> wrote:
>>
>>> Hi Joey,
>>>
>>> Good question!
>>>
>>> I will copy it to Till and Chesnay, who know this part of the
>>> implementation.
>>>
>>> Thanks, vino.
>>>
>>> 2018-08-03 11:09 GMT+08:00 Joey Echeverria <jechever...@splunk.com>:
>>>
>>>> I don't have logs available yet, but I do have some information from
>>>> ZK.
>>>>
>>>> The culprit appears to be the /flink/default/leader/dispatcher_lock
>>>> znode.
>>>>
>>>> I took a look at the dispatcher code here:
>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785
>>>>
>>>> It looks to me like, when leadership is granted, the Dispatcher performs
>>>> job recovery on all jobs before it writes the new leader information to
>>>> the /flink/default/leader/dispatcher_lock znode.
>>>>
>>>> So this leaves me with three questions:
>>>>
>>>> 1) Why does the web monitor specifically have to wait for the
>>>> dispatcher?
>>>> 2) Is there a reason why the dispatcher can't write the lock until
>>>> after job recovery?
>>>> 3) Is there anything I can/should be doing to speed up job recovery?
>>>>
>>>> Thanks!
>>>>
>>>> -Joey
>>>>
>>>> On Aug 2, 2018, at 9:24 AM, Joey Echeverria <jechever...@splunk.com>
>>>> wrote:
>>>>
>>>> Thanks for the tips, Gary and Vino. I'll try to reproduce it with test
>>>> data and see if I can post some logs.
>>>>
>>>> I'll also watch the leader znode to see if the election isn't happening
>>>> or if the leader information isn't being retrieved.
>>>>
>>>> Thanks!
>>>>
>>>> -Joey
>>>>
>>>> On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:
>>>>
>>>> Hi Joey,
>>>>
>>>> If the other components (e.g., Dispatcher, ResourceManager) are able to
>>>> finish the leader election in a timely manner, I currently do not see a
>>>> reason why it should take the REST server 20-45 minutes.
>>>>
>>>> You can check the contents of the znode
>>>> /flink/.../leader/rest_server_lock to see if there is indeed no leader,
>>>> or if the leader information cannot be retrieved from ZooKeeper.
>>>>
>>>> If you can reproduce this in a staging environment with some test jobs,
>>>> I'd like to see the ClusterEntrypoint/JobManager logs (perhaps on debug
>>>> level).
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria
>>>> <jechever...@splunk.com> wrote:
>>>>
>>>>> I'm running Flink 1.5.0 in Kubernetes with HA enabled, but only a
>>>>> single Job Manager running. I'm using ZooKeeper to store the
>>>>> fencing/leader information and S3 to store the job manager state.
>>>>> We've been running around 250 or so streaming jobs, and we've noticed
>>>>> that if the job manager pod is deleted, it takes something like 20-45
>>>>> minutes for the job manager's REST endpoints and web UI to become
>>>>> available. Until they become available, we get a 503 response from the
>>>>> HTTP server with the message "Could not retrieve the redirect address
>>>>> of the current leader. Please try to refresh."
>>>>>
>>>>> Has anyone else run into this?
>>>>>
>>>>> Are there any configuration settings I should be looking at to speed
>>>>> up the availability of the HTTP endpoints?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Joey
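For Gary's suggestion of checking the leader znode contents, a minimal sketch using Apache Curator might look like the following. The znode path comes from the thread; the assumption that the leader address and session id are stored via plain Java serialization is only an assumption, and even if it does not hold, checking whether the znode exists and is non-empty already answers whether a leader has been confirmed:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    import java.io.ByteArrayInputStream;
    import java.io.ObjectInputStream;
    import java.util.UUID;

    public class LeaderZnodeCheck {
        public static void main(String[] args) throws Exception {
            String connectString = args.length > 0 ? args[0] : "localhost:2181";
            String path = args.length > 1 ? args[1] : "/flink/default/leader/rest_server_lock";

            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    connectString, new ExponentialBackoffRetry(1000, 3));
            client.start();
            try {
                // No znode at all -> no leader has been written yet.
                if (client.checkExists().forPath(path) == null) {
                    System.out.println("znode does not exist -> no leader written yet");
                    return;
                }
                byte[] data = client.getData().forPath(path);
                if (data == null || data.length == 0) {
                    System.out.println("znode exists but is empty -> leader not confirmed yet");
                    return;
                }
                // Assumption: leader address and session id were written with plain Java
                // serialization (writeUTF + writeObject). If that does not hold, the
                // non-empty check above is still the interesting signal.
                try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                    String leaderAddress = in.readUTF();
                    UUID leaderSessionId = (UUID) in.readObject();
                    System.out.println("leader address:    " + leaderAddress);
                    System.out.println("leader session id: " + leaderSessionId);
                }
            } finally {
                client.close();
            }
        }
    }

Run it against the same ZooKeeper quorum Flink uses, e.g. "java LeaderZnodeCheck zk-host:2181 /flink/default/leader/rest_server_lock", or point it at the dispatcher_lock path mentioned above to watch when the Dispatcher confirms leadership relative to job recovery.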