Hi,

Unfortunately, the behaviour I described does indeed happen every time. As
mentioned in my first email, I am running on Kubernetes, so some things may
differ from a plain standalone cluster.

Any ideas for workarounds are welcome, as this problem basically prevents
me from using HA.

Thanks,
Encho

On Wed, Aug 29, 2018 at 5:15 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Encho,
>
> From your description, it sounds like there may be an additional bug.
>
> About your description:
>
> *- Start both job managers*
> *- Start a batch job in JobManager 1 and let it finish*
> *The jobgraphs in both Zookeeper and HDFS remained.*
>
> Does it happen every time?
>
> In the Standalone cluster, the problems we encountered were sporadic.
>
> Thanks, vino.
>
> Encho Mishinev <encho.mishi...@gmail.com> wrote on Tue, Aug 28, 2018 at 8:07 PM:
>
>> Hello Till,
>>
>> I spent a few more hours testing and looking at the logs, and it seems
>> like there's a more general problem here. While the two job managers are
>> active, neither of them can properly delete jobgraphs. The problem I
>> described above arises because Kubernetes restarts JobManager 1 quickly
>> after I manually kill it, so by the time I stop the job on JobManager 2,
>> both are alive again.
>>
>> I did a very simple test:
>>
>> - Start both job managers
>> - Start a batch job in JobManager 1 and let it finish
>> Result: the jobgraphs in both ZooKeeper and HDFS remained.
>>
>> On the other hand if we do:
>>
>> - Start only JobManager 1 (again in HA mode)
>> - Start a batch job and let it finish
>> Result: the jobgraphs in both ZooKeeper and HDFS are deleted fine.
>>
>> It seems like the standby manager still holds some kind of lock on the
>> jobgraphs. Do you think that's possible? Have you seen a similar problem?
>> The only logs that appear on the standby manager while waiting are of the
>> type:
>>
>> 2018-08-28 11:54:10,789 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).
>>
>> Note that this log appears on the standby jobmanager immediately when a
>> new job is submitted to the active jobmanager.
>> Also note that the blobs and checkpoints are cleared fine. The problem
>> affects only the jobgraphs, both in ZooKeeper and in HDFS.
>>
>> Trying to access the UI of the standby manager redirects to the active
>> one, so it is not a problem of them not knowing who the leader is. Do you
>> have any ideas?
>>
>> Thanks a lot,
>> Encho
>>
>> On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Encho,
>>>
>>> thanks a lot for reporting this issue. The problem arises whenever the
>>> old leader maintains the connection to ZooKeeper. If this is the case, then
>>> ephemeral nodes which we create to protect against faulty delete operations
>>> are not removed and consequently the new leader is not able to delete the
>>> persisted job graph. So one thing to check is whether the old JM still has
>>> an open connection to ZooKeeper. The next thing to check is the session
>>> timeout of your ZooKeeper cluster. If you stop the job within the session
>>> timeout, then it is also not guaranteed that ZooKeeper has detected that
>>> the ephemeral nodes of the old JM must be deleted. In order to understand
>>> this better it would be helpful if you could tell us the timing of the
>>> different actions.
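>>>
>>> To check the first point, here is a minimal sketch of an inspection tool
>>> (not part of Flink; it assumes the Apache Curator client on the classpath,
>>> and the connect string, the /flink/default root and the job id below are
>>> placeholders taken from the logs in this thread). It lists the child nodes
>>> under a job graph node, which presumably are the lock/ephemeral nodes
>>> mentioned above, and prints their ephemeral owner, i.e. the ZooKeeper
>>> session that created them:
>>>
>>> import org.apache.curator.framework.CuratorFramework;
>>> import org.apache.curator.framework.CuratorFrameworkFactory;
>>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>> import org.apache.zookeeper.data.Stat;
>>>
>>> public class JobGraphLockCheck {
>>>     public static void main(String[] args) throws Exception {
>>>         // Assumed values - adjust to your high-availability.zookeeper.* settings.
>>>         String connectString = "zookeeper:2181";
>>>         String jobGraphPath =
>>>             "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15";
>>>
>>>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>             connectString, new ExponentialBackoffRetry(1000, 3));
>>>         client.start();
>>>         try {
>>>             // A non-zero ephemeralOwner means the child is an ephemeral node
>>>             // and gives the session id of the client (JobManager) that created it.
>>>             for (String child : client.getChildren().forPath(jobGraphPath)) {
>>>                 Stat stat = client.checkExists().forPath(jobGraphPath + "/" + child);
>>>                 long owner = (stat == null) ? 0 : stat.getEphemeralOwner();
>>>                 System.out.printf("%s ephemeralOwner=0x%x%n", child, owner);
>>>             }
>>>         } finally {
>>>             client.close();
>>>         }
>>>     }
>>> }
>>>
>>> If an ephemeral owner listed there still matches the old JobManager's
>>> session, that would confirm the old JM kept its ZooKeeper connection.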
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 28, 2018 at 8:17 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Encho,
>>>>
>>>> As a temporary workaround, you could monitor the specific JobID under
>>>> ZooKeeper's "/jobgraphs" path to check whether it has been cleaned up (a
>>>> rough sketch of such a check follows below).
>>>> Another option is to modify the source code and change the cleanup to a
>>>> synchronous call, but Flink's operations on that ZooKeeper path need to
>>>> acquire the corresponding lock, so this is risky and not recommended.
>>>> I think this problem may be solved in the next version. It depends on
>>>> Till.
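>>>>
>>>> For the monitoring idea, here is a rough sketch (again not part of Flink;
>>>> it assumes Curator on the classpath, the default /flink/default ZooKeeper
>>>> root, and a placeholder connect string) that polls the job's node under
>>>> /jobgraphs after the job has finished and warns if it is still there:
>>>>
>>>> import org.apache.curator.framework.CuratorFramework;
>>>> import org.apache.curator.framework.CuratorFrameworkFactory;
>>>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>>>
>>>> public class JobGraphCleanupMonitor {
>>>>     public static void main(String[] args) throws Exception {
>>>>         String jobId = args[0];  // e.g. 83bfa359ca59ce1d4635e18e16651e15
>>>>         String path = "/flink/default/jobgraphs/" + jobId;
>>>>
>>>>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>>             "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
>>>>         client.start();
>>>>         try {
>>>>             // Give the asynchronous cleanup up to 60 seconds to remove the node.
>>>>             long deadline = System.currentTimeMillis() + 60_000;
>>>>             while (System.currentTimeMillis() < deadline
>>>>                     && client.checkExists().forPath(path) != null) {
>>>>                 Thread.sleep(1000);
>>>>             }
>>>>             if (client.checkExists().forPath(path) != null) {
>>>>                 System.err.println("Jobgraph " + path + " was NOT cleaned up");
>>>>             } else {
>>>>                 System.out.println("Jobgraph " + path + " was cleaned up");
>>>>             }
>>>>         } finally {
>>>>             client.close();
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> Running something like this right after a job finishes or is cancelled
>>>> would at least tell you which jobgraphs need manual cleanup before the
>>>> next failover.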
>>>>
>>>> Thanks, vino.
>>>>
>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Tue, Aug 28, 2018 at 1:17 PM:
>>>>
>>>>> Thank you very much for the info! Will keep track of the progress.
>>>>>
>>>>> In the meantime is there any viable workaround? It seems like HA
>>>>> doesn't really work due to this bug.
>>>>>
>>>>> On Tue, Aug 28, 2018 at 4:52 AM vino yang <yanghua1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> A note on the implementation mechanism:
>>>>>> Flink uses ZooKeeper to store the JobGraph (the job's description and
>>>>>> metadata) as the basis for job recovery.
>>>>>> However, in previous implementations this information may not be
>>>>>> cleaned up properly, because it is deleted asynchronously by a
>>>>>> background thread.
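>>>>>>
>>>>>> Just to illustrate the general failure mode (this is not Flink's actual
>>>>>> code, only a generic sketch): when a delete is handed to a background
>>>>>> executor in a fire-and-forget way, nothing retries it if the task fails
>>>>>> or if the process shuts down or loses leadership before the task runs,
>>>>>> so the stored job graph can simply stay behind.
>>>>>>
>>>>>> import java.util.concurrent.ExecutorService;
>>>>>> import java.util.concurrent.Executors;
>>>>>>
>>>>>> public class AsyncCleanupSketch {
>>>>>>     public static void main(String[] args) {
>>>>>>         ExecutorService cleanupExecutor = Executors.newSingleThreadExecutor();
>>>>>>
>>>>>>         // Fire-and-forget: the returned Future is ignored, so a failure
>>>>>>         // inside the task goes unnoticed and is never retried.
>>>>>>         cleanupExecutor.submit(() -> deleteJobGraph("<job-id>"));
>>>>>>
>>>>>>         // If the process exits or loses leadership around here, the
>>>>>>         // pending delete may never run and the job graph remains.
>>>>>>         cleanupExecutor.shutdownNow();
>>>>>>     }
>>>>>>
>>>>>>     private static void deleteJobGraph(String jobId) {
>>>>>>         // Stand-in for the real ZooKeeper/HDFS cleanup.
>>>>>>         System.out.println("Deleting job graph " + jobId);
>>>>>>     }
>>>>>> }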
>>>>>>
>>>>>> Thanks, vino.
>>>>>>
>>>>>> vino yang <yanghua1...@gmail.com> wrote on Tue, Aug 28, 2018 at 9:49 AM:
>>>>>>
>>>>>>> Hi Encho,
>>>>>>>
>>>>>>> This is a problem already known to the Flink community; you can
>>>>>>> track its progress through FLINK-10011 [1]. Till is currently fixing
>>>>>>> this issue.
>>>>>>>
>>>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-10011
>>>>>>>
>>>>>>> Thanks, vino.
>>>>>>>
>>>>>>> Encho Mishinev <encho.mishi...@gmail.com> wrote on Mon, Aug 27, 2018 at 10:13 PM:
>>>>>>>
>>>>>>>> I am running Flink 1.5.3 with two job managers and two task
>>>>>>>> managers in Kubernetes along with HDFS and Zookeeper in 
>>>>>>>> high-availability
>>>>>>>> mode.
>>>>>>>>
>>>>>>>> My problem occurs after the following actions:
>>>>>>>> - Upload a .jar file to jobmanager-1
>>>>>>>> - Run a streaming job from the jar on jobmanager-1
>>>>>>>> - Wait for 1 or 2 checkpoints to succeed
>>>>>>>> - Kill pod of jobmanager-1
>>>>>>>> After a short delay, jobmanager-2 takes leadership and correctly
>>>>>>>> restores the job and continues it
>>>>>>>> - Stop job from jobmanager-2
>>>>>>>>
>>>>>>>> At this point all seems well, but the problem is that jobmanager-2
>>>>>>>> does not clean up anything that was left behind by jobmanager-1. This
>>>>>>>> means that job graphs remain in both HDFS and ZooKeeper, which later
>>>>>>>> obstructs all work of both managers: after any restart they
>>>>>>>> unsuccessfully try to restore a now non-existent job and fail over and
>>>>>>>> over again.
>>>>>>>>
>>>>>>>> I am quite certain that jobmanager-2 does not know about any of
>>>>>>>> jobmanager-1’s files since the Zookeeper logs reveal that it tries to
>>>>>>>> duplicate job folders:
>>>>>>>>
>>>>>>>> 2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 type:create
>>>>>>>> cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error
>>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>>> Error:KeeperErrorCode = NodeExists for
>>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>>>
>>>>>>>> 2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 type:create
>>>>>>>> cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error
>>>>>>>> Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>> Error:KeeperErrorCode = NodeExists for
>>>>>>>> /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>
>>>>>>>> Also jobmanager-2 attempts to delete the jobgraphs folder in
>>>>>>>> ZooKeeper when the job is stopped, but fails since there are leftover
>>>>>>>> files in it from jobmanager-1:
>>>>>>>>
>>>>>>>> 2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 type:delete
>>>>>>>> cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error
>>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>> Error:KeeperErrorCode = Directory not empty for
>>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>>
>>>>>>>> I’ve noticed that when restoring the job, it seems like
>>>>>>>> jobmanager-2 does not get anything more than the job ID, while it
>>>>>>>> perhaps needs some metadata? Here is the log that seems suspicious to
>>>>>>>> me:
>>>>>>>>
>>>>>>>> 2018-08-27 13:09:18,113 INFO
>>>>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>>>>>> Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).
>>>>>>>>
>>>>>>>> All other logs in jobmanager-2 seem fine; it doesn’t appear to be
>>>>>>>> aware that it is overwriting anything or failing to delete properly.
>>>>>>>>
>>>>>>>> My question is: what is the intended way for the job managers to
>>>>>>>> correctly exchange metadata in HA mode, and why is it not working for
>>>>>>>> me?
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>>
