Hi Encho,

From your description, it sounds like there may be additional bugs.

About your description:

*- Start both job managers*
*- Start a batch job in JobManager 1 and let it finish*
*The jobgraphs in both Zookeeper and HDFS remained.*

Does this happen every time?

In the Standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <encho.mishi...@gmail.com> wrote:

> Hello Till,
>
> I spent a few more hours testing and looking at the logs, and it seems like
> there's a more general problem here: while both job managers are active,
> neither of them can properly delete jobgraphs. The problem I described
> above arises because Kubernetes restarts JobManager 1 quickly after I
> manually kill it, so by the time I stop the job on JobManager 2, both are
> alive again.
>
> I did a very simple test:
>
> - Start both job managers
> - Start a batch job in JobManager 1 and let it finish
> The jobgraphs in both Zookeeper and HDFS remained.
>
> On the other hand if we do:
>
> - Start only JobManager 1 (again in HA mode)
> - Start a batch job and let it finish
> The jobgraphs in both Zookeeper and HDFS are deleted fine.
>
> It seems like the standby manager still leaves some kind of lock on the
> jobgraphs. Do you think that's possible? Have you seen a similar problem?
> The only logs that appear on the standby manager while waiting are of the
> type:
>
> 2018-08-28 11:54:10,789 INFO
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
> Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).
>
> Note that this log line appears on the standby jobmanager immediately when
> a new job is submitted to the active jobmanager.
> Also note that blobs and checkpoints are cleaned up fine; the problem
> affects only the jobgraphs, both in ZooKeeper and in HDFS.
>
> Trying to access the UI of the standby manager redirects to the active
> one, so it is not a problem of them not knowing who the leader is. Do you
> have any ideas?
>
> Thanks a lot,
> Encho
>
> On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Encho,
>>
>> thanks a lot for reporting this issue. The problem arises whenever the
>> old leader maintains the connection to ZooKeeper. If this is the case, then
>> ephemeral nodes which we create to protect against faulty delete operations
>> are not removed and consequently the new leader is not able to delete the
>> persisted job graph. So one thing to check is whether the old JM still has
>> an open connection to ZooKeeper. The next thing to check is the session
>> timeout of your ZooKeeper cluster. If you stop the job within the session
>> timeout, then it is also not guaranteed that ZooKeeper has detected that
>> the ephemeral nodes of the old JM must be deleted. In order to understand
>> this better it would be helpful if you could tell us the timing of the
>> different actions.
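>>
>> To check the first point from the ZooKeeper side, here is a rough sketch
>> (untested; the connect string "zookeeper:2181" is an assumption, and the
>> HA root "/flink/default" should match your configuration) that lists the
>> lock nodes under each jobgraph znode and shows whether any of them are
>> ephemeral nodes still owned by some session:
>>
>> import org.apache.zookeeper.ZooKeeper;
>> import org.apache.zookeeper.data.Stat;
>>
>> public class CheckJobGraphLocks {
>>     public static void main(String[] args) throws Exception {
>>         ZooKeeper zk = new ZooKeeper("zookeeper:2181", 60000, event -> {});
>>         String jobGraphs = "/flink/default/jobgraphs";
>>         for (String jobId : zk.getChildren(jobGraphs, false)) {
>>             String jobPath = jobGraphs + "/" + jobId;
>>             for (String child : zk.getChildren(jobPath, false)) {
>>                 Stat stat = zk.exists(jobPath + "/" + child, false);
>>                 if (stat != null && stat.getEphemeralOwner() != 0) {
>>                     // a non-zero ephemeralOwner means some JobManager
>>                     // session still holds this lock node, which blocks
>>                     // deleting the parent jobgraph znode
>>                     System.out.println(jobPath + " locked by session 0x"
>>                             + Long.toHexString(stat.getEphemeralOwner()));
>>                 }
>>             }
>>         }
>>         zk.close();
>>     }
>> }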
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 28, 2018 at 8:17 AM vino yang <yanghua1...@gmail.com> wrote:
>>
>>> Hi Encho,
>>>
>>> As a temporary workaround, you can check whether a job has been cleaned
>>> up by monitoring its specific JobID under ZooKeeper's "/jobgraphs" path
>>> (see the sketch below).
>>> Another option is to modify the source code and crudely change the
>>> cleanup to a synchronous form, but since Flink's operations on the
>>> ZooKeeper paths need to acquire the corresponding lock, that is dangerous
>>> and not recommended.
>>> I think this problem may be fixed in the next version; that depends on
>>> Till.
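>>>
>>> For the first workaround, a minimal sketch (untested; the connect string
>>> and the HA root path "/flink/default" are assumptions, adjust them to
>>> your cluster) that polls ZooKeeper until the jobgraph znode of a given
>>> JobID disappears:
>>>
>>> import org.apache.zookeeper.ZooKeeper;
>>>
>>> public class WaitForJobGraphCleanup {
>>>     public static void main(String[] args) throws Exception {
>>>         // args[0] is the JobID to watch, e.g. 83bfa359ca59ce1d4635e18e16651e15
>>>         String path = "/flink/default/jobgraphs/" + args[0];
>>>         ZooKeeper zk = new ZooKeeper("zookeeper:2181", 60000, event -> {});
>>>         while (zk.exists(path, false) != null) {
>>>             System.out.println(path + " still present, waiting...");
>>>             Thread.sleep(5000);
>>>         }
>>>         System.out.println(path + " is gone, so the jobgraph was cleaned up");
>>>         zk.close();
>>>     }
>>> }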
>>>
>>> Thanks, vino.
>>>
>>> On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <encho.mishi...@gmail.com> wrote:
>>>
>>>> Thank you very much for the info! Will keep track of the progress.
>>>>
>>>> In the meantime, is there any viable workaround? It seems like HA
>>>> doesn't really work due to this bug.
>>>>
>>>> On Tue, Aug 28, 2018 at 4:52 AM vino yang <yanghua1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Some background on the implementation:
>>>>> Flink uses ZooKeeper to store the JobGraph (the job's description and
>>>>> metadata) as the basis for job recovery.
>>>>> However, the current implementation may fail to clean this information
>>>>> up properly, because the deletion is done asynchronously by a background
>>>>> thread.
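>>>>>
>>>>> To illustrate what I mean (this is not Flink's actual code, just a
>>>>> sketch of the fire-and-forget pattern): the delete runs on a background
>>>>> thread, and if it fails, for example because another JobManager's
>>>>> ephemeral lock node still exists under the path, nothing retries it and
>>>>> the stale jobgraph znode simply remains:
>>>>>
>>>>> import java.util.concurrent.ExecutorService;
>>>>> import java.util.concurrent.Executors;
>>>>> import org.apache.zookeeper.ZooKeeper;
>>>>>
>>>>> public class AsyncCleanupSketch {
>>>>>     private static final ExecutorService CLEANUP =
>>>>>             Executors.newSingleThreadExecutor();
>>>>>
>>>>>     // Fire-and-forget deletion: the caller never learns whether it worked.
>>>>>     static void removeJobGraphAsync(ZooKeeper zk, String jobId) {
>>>>>         CLEANUP.execute(() -> {
>>>>>             try {
>>>>>                 // fails with KeeperException.NotEmptyException if another
>>>>>                 // JobManager still holds an ephemeral lock node underneath
>>>>>                 zk.delete("/flink/default/jobgraphs/" + jobId, -1);
>>>>>             } catch (Exception e) {
>>>>>                 // exception is swallowed; the stale znode remains
>>>>>             }
>>>>>         });
>>>>>     }
>>>>> }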
>>>>>
>>>>> Thanks, vino.
>>>>>
>>>>> On Tue, Aug 28, 2018 at 9:49 AM vino yang <yanghua1...@gmail.com> wrote:
>>>>>
>>>>>> Hi Encho,
>>>>>>
>>>>>> This is a problem already known to the Flink community; you can track
>>>>>> its progress through FLINK-10011 [1]. Till is currently working on a
>>>>>> fix.
>>>>>>
>>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-10011
>>>>>>
>>>>>> Thanks, vino.
>>>>>>
>>>>>> On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <encho.mishi...@gmail.com> wrote:
>>>>>>
>>>>>>> I am running Flink 1.5.3 with two job managers and two task managers
>>>>>>> in Kubernetes along with HDFS and Zookeeper in high-availability mode.
>>>>>>>
>>>>>>> My problem occurs after the following actions:
>>>>>>> - Upload a .jar file to jobmanager-1
>>>>>>> - Run a streaming job from the jar on jobmanager-1
>>>>>>> - Wait for 1 or 2 checkpoints to succeed
>>>>>>> - Kill pod of jobmanager-1
>>>>>>> After a short delay, jobmanager-2 takes leadership and correctly
>>>>>>> restores the job and continues it
>>>>>>> - Stop job from jobmanager-2
>>>>>>>
>>>>>>> At this point all seems well, but the problem is that jobmanager-2
>>>>>>> does not clean up anything that was left behind by jobmanager-1. This
>>>>>>> means that job graphs remain in both HDFS and ZooKeeper, and they later
>>>>>>> obstruct both managers: after any restart they unsuccessfully try to
>>>>>>> restore a non-existent job and fail over and over again.
>>>>>>>
>>>>>>> I am quite certain that jobmanager-2 does not know about any of
>>>>>>> jobmanager-1’s files since the Zookeeper logs reveal that it tries to
>>>>>>> duplicate job folders:
>>>>>>>
>>>>>>> 2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 type:create
>>>>>>> cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error
>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>> Error:KeeperErrorCode = NodeExists for
>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77
>>>>>>>
>>>>>>> 2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 type:create
>>>>>>> cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error
>>>>>>> Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>> Error:KeeperErrorCode = NodeExists for
>>>>>>> /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>
>>>>>>> Also jobmanager-2 attempts to delete the jobgraphs folder in
>>>>>>> Zookeeper when the job is stopped, but fails since there are leftover 
>>>>>>> files
>>>>>>> in it from jobmanager-1:
>>>>>>>
>>>>>>> 2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0
>>>>>>> cport:2181)::PrepRequestProcessor@648] - Got user-level
>>>>>>> KeeperException when processing sessionid:0x1657aa15e480033 type:delete
>>>>>>> cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error
>>>>>>> Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>> Error:KeeperErrorCode = Directory not empty for
>>>>>>> /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
>>>>>>>
>>>>>>> I’ve noticed that when restoring the job, it seems like jobmanager-2
>>>>>>> does not get anything more than jobID, while it perhaps needs some
>>>>>>> metadata? Here is the log that seems suspicious to me:
>>>>>>>
>>>>>>> 2018-08-27 13:09:18,113 INFO
>>>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>>>>> Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).
>>>>>>>
>>>>>>> All other logs on jobmanager-2 seem fine; it doesn’t seem to be aware
>>>>>>> that it’s overwriting anything or failing to delete properly.
>>>>>>>
>>>>>>> My question is - what is the intended way for the job managers to
>>>>>>> correctly exchange metadata in HA mode and why is it not working for me?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>
>>>>>>
