Re: Jobmanager trying to be registered for Zombie Job

2022-04-26 Thread Matthias Pohl
Hi Peter,
based on our analysis, yes, the issue already existed before 1.15. We
couldn't come up with any other explanation. It was just never reported... or
we didn't notice an older ticket.

Matthias

Re: Jobmanager trying to be registered for Zombie Job

2022-04-25 Thread Peter Schrott
Hi Matthias,

You are welcome & thanks a lot for your help too!

One thing is not quite clear to me: has the bug been there since 1.13.6 and
simply not been reported yet (FLINK-27354 is a new ticket)?

Best, Peter



Re: Jobmanager trying to be registered for Zombie Job

2022-04-25 Thread Matthias Pohl
Thanks again, Peter, for sharing your logs. I looked into the issue with the
help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
issue. We couldn't come up with a reason why it should have popped up just
now with 1.15. The bug itself is already present in 1.14. You can find more
details on the investigation in FLINK-27354 [1] itself.

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354


-- 

Matthias Pohl | Engineer


Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
Wehner


Re: Jobmanager trying to be registered for Zombie Job

2022-04-25 Thread Matthias Pohl
Thanks Peter, we're looking into it...

On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott 
wrote:

> Hi,
>
> sorry for the late reply. It took me quite some time to get the logs out
> of the system. I have attached them now.
>
> These are the logs of 2 JobManagers and 2 TaskManagers. On jm 1 you can see
> that the job starts crashing and recovering a few times. This happens
> until 2022-04-20 12:12:14,607. After that, the behavior described above can
> be seen.
>
> I hope this helps.
>
> Best, Peter
>


Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect to
the RM while shutting down the JobMaster.

This doesn't explain your issue though, Peter. It's still unclear why the
JobMaster is still around as stated in my previous email.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354



Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
Just from looking through the code, it appears that these logs could be
produced while the job is being stopped. At the end of that process, the
ResourceManager sends a confirmation back to the JobMaster that the JobMaster
has been disconnected. If the JobMaster is still around to process this call,
it tries to reconnect (I'd consider that a bug because the JobMaster is already
in shutdown mode and doesn't need to re-establish the connection). If the
JobMaster had already terminated, the call would simply have been swallowed.

The only explanation I can come up with right now (without having any logs)
is that stopping the JobMaster didn't finish for some reason. To check that, it
would be helpful to look at the logs to see whether there is some other
issue that prevents the JobMaster from stopping entirely.
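
The sequence described above can be pictured with a small toy model. This is
only a sketch and not Flink code: ShutdownReconnectSketch, ToyJobMaster, the
State enum and the guardWhileStopping flag are all hypothetical names made up
for illustration, and the guard is just one conceivable way to avoid the
reconnect, not necessarily what FLINK-27354 ends up doing.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names are taken from the Flink code base. */
public class ShutdownReconnectSketch {

    enum State { RUNNING, STOPPING, TERMINATED }

    /** Hypothetical stand-in for the JobMaster side of the RM connection. */
    static class ToyJobMaster {
        State state = State.RUNNING;
        final List<String> events = new ArrayList<>();
        private final boolean guardWhileStopping;

        ToyJobMaster(boolean guardWhileStopping) {
            this.guardWhileStopping = guardWhileStopping;
        }

        /** Invoked when the (toy) ResourceManager confirms the disconnect. */
        void onDisconnectConfirmed() {
            if (state == State.TERMINATED) {
                // Nobody is left to handle the call, so it is swallowed.
                return;
            }
            if (guardWhileStopping && state == State.STOPPING) {
                // Assumed guard: a stopping JobMaster has no reason to reconnect.
                events.add("ignored");
                return;
            }
            // Without the guard, even a stopping JobMaster starts a new registration.
            events.add("reconnect");
        }
    }

    public static void main(String[] args) {
        ToyJobMaster unguarded = new ToyJobMaster(false);
        unguarded.state = State.STOPPING;
        unguarded.onDisconnectConfirmed();
        System.out.println("unguarded: " + unguarded.events);

        ToyJobMaster guarded = new ToyJobMaster(true);
        guarded.state = State.STOPPING;
        guarded.onDisconnectConfirmed();
        System.out.println("guarded:   " + guarded.events);
    }
}
```

In this model the unguarded instance records a reconnect while it is stopping,
which corresponds to the repeated registration attempts in the logs, whereas a
terminated instance records nothing at all.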



Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
...if possible, it would be good to get debug rather than only info logs.
Did you encounter anything odd in the TaskManager logs as well? Sharing
those might also be of value.



Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
Hi Peter,
thanks for sharing. That doesn't sound right. Could you provide the entire
jobmanager logs?

Best,
Matthias



Jobmanager trying to be registered for Zombie Job

2022-04-21 Thread Peter Schrott
Hi Flink-Users,

I am not sure if this does something to my cluster or not. But since
updating to Flink 1.15 (atm rc4) I find the following logs:

INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
@akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
for job 5566648d9b1aac6c1a1b78187fd56975.

This entry appears as many times as the job's parallelism (here 10 times).
These logs are triggered every 5 minutes.

Then they are followed by:

INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
@akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
failed.

also 10 log entries.

I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975): it
was a long-running SQL streaming job, started on Apr 13th on a standalone
cluster. After some recovery attempts it finally failed for good (using the
failover strategy) on Apr 20th (yesterday). Then those logs started to
appear. By then no other job was running on my cluster anymore, but the
logs kept appearing every 5 minutes until I restarted this jobmanager service.

This job is just an example; it happens to other jobs too.

It's just INFO logs but it does not look healthy either.

Thanks & Best
Peter
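
For anyone who wants to check their own logs for the pattern described in this
message, a small scan along the following lines may help. It is a minimal
sketch under a few assumptions: the class name RegistrationLogCounter and the
helper countByJob are made up, each log entry is expected on a single line, and
the regular expression only follows the wording of the "Registering job
manager ... for job ..." line quoted above, so it may need adjusting for other
log layouts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegistrationLogCounter {

    // Follows the wording of the INFO line quoted in the message. Assumes one
    // entry per line and a 32-character hex job ID after "for job".
    private static final Pattern REGISTERING =
            Pattern.compile("Registering job manager \\S+ for job ([0-9a-f]{32})");

    /** Counts "Registering job manager" entries per job ID, in order of first appearance. */
    static Map<String, Integer> countByJob(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = REGISTERING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RegistrationLogCounter <log file>");
            return;
        }
        // Prints one line per job ID: "<job id> <number of registration attempts>".
        countByJob(Files.readAllLines(Path.of(args[0])))
                .forEach((job, n) -> System.out.println(job + " " + n));
    }
}
```

A job ID that keeps collecting registration attempts after the job has already
failed for good would match the behavior reported here.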