Re: Jobmanager trying to be registered for Zombie Job

Matthias Pohl Mon, 25 Apr 2022 23:50:47 -0700

Hi Peter,
based on our analysis the issue already existed before 1.15, yes. We
couldn't come up with any other explanation. It was just never reported... or
we overlooked an older ticket.


Matthias

On Mon, Apr 25, 2022 at 6:21 PM Peter Schrott <pe...@bluerootlabs.io> wrote:

> Hi Matthias,
>
> You are welcome & thanks a lot for your help too!
>
> It's not quite clear to me: the bug has been there since 1.13.6 but was not
> reported until now (FLINK-27354 is a new ticket)?
>
> Best, Peter
>
>
> On Mon, Apr 25, 2022 at 5:48 PM Matthias Pohl <matth...@ververica.com>
> wrote:
>
>> Thanks again, Peter for sharing your logs. I looked into the issue with
>> the help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
>> issue. We couldn't come up with a reason why it should have popped up just
>> now with 1.15. The bug itself is already present in 1.14. You can find more
>> details on the investigation in FLINK-27354 [1] itself.
>>
>> Best,
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl <matth...@ververica.com>
>> wrote:
>>
>>> Thanks Peter, we're looking into it...
>>>
>>> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> sorry for the late reply. It took me quite some time to get the logs
>>>> out of the system. I have attached them now.
>>>>
>>>> These are the logs of 2 jobmanagers and 2 taskmanagers. It can be seen on
>>>> jm 1 that the job starts crashing and recovering a few times. This happens
>>>> until 2022-04-20 12:12:14,607. After that, the behavior described above
>>>> can be seen.
>>>>
>>>> I hope this helps.
>>>>
>>>> Best, Peter
>>>>
>>>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <matth...@ververica.com>
>>>> wrote:
>>>>
>>>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to
>>>>> connect to the RM while shutting down the JobMaster.
>>>>>
>>>>> This doesn't explain your issue though, Peter. It's still unclear why
>>>>> the JobMaster is still around as stated in my previous email.
>>>>>
>>>>> Matthias
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>>>
>>>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <matth...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> Just by looking through the code, it appears that these logs could be
>>>>>> produced while stopping the job. The ResourceManager sends a confirmation
>>>>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>>>>> the JobMaster is still around to process the request, it would try to
>>>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>>>> mode already and wouldn't need to re-establish the connection). This method
>>>>>> would have been swallowed otherwise if the JobMaster was already terminated.
>>>>>>
>>>>>> The only explanation I can come up with right now (without having any
>>>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>>>> that it would be helpful to look at the logs to see whether there is some
>>>>>> other issue that prevents the JobMaster from stopping entirely.
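
The shutdown race described in this message can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyJobMaster`, `guard_on_shutdown`, and the method names are hypothetical and are not Flink's classes or API. It shows a JobMaster-like object that is still alive while shutting down, receives the ResourceManager's disconnect confirmation, and either retries the registration (the behavior considered a bug above) or ignores the message.

```python
class ToyJobMaster:
    """Toy stand-in for a JobMaster; hypothetical, not Flink's actual class."""

    def __init__(self, guard_on_shutdown):
        # guard_on_shutdown is an illustrative switch, not a Flink option.
        self.guard_on_shutdown = guard_on_shutdown
        self.shutting_down = False
        self.registration_attempts = 0

    def stop(self):
        # Enter shutdown mode. In the scenario from the thread the shutdown
        # does not finish, so the object keeps processing messages.
        self.shutting_down = True

    def on_resource_manager_disconnect(self):
        # Invoked when the ResourceManager confirms the disconnect.
        if self.guard_on_shutdown and self.shutting_down:
            return  # nothing to re-establish while shutting down
        self.registration_attempts += 1  # models a reconnect attempt


if __name__ == "__main__":
    unguarded = ToyJobMaster(guard_on_shutdown=False)
    unguarded.stop()
    unguarded.on_resource_manager_disconnect()

    guarded = ToyJobMaster(guard_on_shutdown=True)
    guarded.stop()
    guarded.on_resource_manager_disconnect()

    print(unguarded.registration_attempts, guarded.registration_attempts)
```

The unguarded variant records a reconnect attempt after `stop()`, while the guarded one does not. How the real issue is handled is tracked in FLINK-27354; this model makes no claim about that change.
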
>>>>>>
>>>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <
>>>>>> matth...@ververica.com> wrote:
>>>>>>
>>>>>>> ...if possible it would be good to get debug rather than only info
>>>>>>> logs. Did you encounter anything odd in the TaskManager logs as well?
>>>>>>> Sharing those might be of value as well.
>>>>>>>
>>>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <
>>>>>>> matth...@ververica.com> wrote:
>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>> thanks for sharing. That doesn't sound right. Could you provide the
>>>>>>>> entire jobmanager logs?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Matthias
>>>>>>>>
>>>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <
>>>>>>>> pe...@bluerootlabs.io> wrote:
>>>>>>>>
>>>>>>>>> Hi Flink-Users,
>>>>>>>>>
>>>>>>>>> I am not sure if this does something to my cluster or not. But
>>>>>>>>> since updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>>>>
>>>>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762@akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>>>>
>>>>>>>>> as many times as the job's parallelism (here 10 times). These
>>>>>>>>> logs are triggered every 5 minutes.
>>>>>>>>>
>>>>>>>>> Then they are followed by:
>>>>>>>>>
>>>>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762@akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>>>>
>>>>>>>>> also 10 log entries.
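
To check how often this pattern repeats in a jobmanager log, the two messages can be counted with a short script. This is a minimal sketch: the regular expressions only follow the two INFO messages quoted above, the rest of a real log line (timestamp, logger name) is not assumed, and the host name in the sample is a hypothetical placeholder.

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted in this thread.
REGISTERING = re.compile(r"Registering job manager (\w+)@(\S+) for job (\w+)\.")
FAILED = re.compile(r"Registration of job manager (\w+)@(\S+) failed\.")


def count_registrations(lines):
    """Return (attempts per job id, failures per job manager id)."""
    attempts = Counter()
    failures = Counter()
    for line in lines:
        match = REGISTERING.search(line)
        if match:
            attempts[match.group(3)] += 1
            continue
        match = FAILED.search(line)
        if match:
            failures[match.group(1)] += 1
    return attempts, failures


if __name__ == "__main__":
    jm = "ab7db9ff0ebd26b3b89c3e2e56684762"
    job = "5566648d9b1aac6c1a1b78187fd56975"
    # Hypothetical address; the real one is elided in the thread.
    addr = "akka.tcp://flink@example-host:40015/user/rpc/jobmanager_2"
    sample = [f"INFO: Registering job manager {jm}@{addr} for job {job}."] * 10
    sample += [f"INFO: Registration of job manager {jm}@{addr} failed."] * 10
    attempts, failures = count_registrations(sample)
    print(attempts[job], failures[jm])
```

Note that the failure message carries only the job manager id, not the job id, so the two counters are keyed differently.
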
>>>>>>>>>
>>>>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975).
>>>>>>>>> It was a long-running SQL streaming job, started on Apr 13th on a
>>>>>>>>> standalone cluster. After some recovery attempts it finally failed
>>>>>>>>> (using the failover strategy) on the 20th Apr (yesterday) for good.
>>>>>>>>> Then those logs started to appear. Now there was no other job running
>>>>>>>>> on my cluster anymore, but the logs appeared every 5 minutes until I
>>>>>>>>> restarted this jobmanager service.
>>>>>>>>>
>>>>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>>>>
>>>>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>>>>
>>>>>>>>> Thanks & Best
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>
