Re: Job Manager getting restarted while restarting task manager

yu'an huang Sun, 16 Oct 2022 18:01:01 -0700

Are you able to reproduce this scenario? Did you accidentally send a kill
signal to the job manager process?


On Thu, 13 Oct 2022 at 4:02 PM, Puneet Duggal <puneetduggal1...@gmail.com>
wrote:

> Hi,
>
> We use session deployment mode with an HA setup. Currently we have 3 job
> managers and 3 task managers running on Flink version 1.12.1. Please find
> attached the complete job manager logs.
>
> On 13-Oct-2022, at 7:28 AM, Xintong Song <tonysong...@gmail.com> wrote:
>
> I meant your jobmanager also received a SIGTERM signal, and you would need
> to figure out where it comes from.
>
> To be specific, this line of log:
>
>> 2022-10-11 22:11:21,683 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED
>> SIGNAL 15: SIGTERM. Shutting down as requested.
>>
>
> I believe this is from the jobmanager log, as `ClusterEntrypoint` is a
> class used only by the jobmanager.
>
> Best,
> Xintong
>
>
>
> On Thu, Oct 13, 2022 at 9:06 AM yu'an huang <h.yuan...@gmail.com> wrote:
>
>> Hi,
>>
>> Which deployment mode do you use? What is the Flink version?
>> I think killing TaskManagers won't make the JobManager restart. You can
>> provide the whole log as an attachment so we can investigate.
>>
>> On Wed, 12 Oct 2022 at 6:01 PM, Puneet Duggal <puneetduggal1...@gmail.com>
>> wrote:
>>
>>> Hi Xintong Song,
>>>
>>> Thanks for your immediate reply. Yes, I do restart the task manager via
>>> the kill command followed by a flink restart, because I have seen cases
>>> where a simple flink restart does not pick up the latest configuration.
>>> But what I am confused about is why killing the task manager process and
>>> then restarting it causes the job manager to stop and restart.
>>>
>>> Regards,
>>> Puneet
>>>
>>>
>>> On 12-Oct-2022, at 7:33 AM, Xintong Song <tonysong...@gmail.com> wrote:
>>>
>>> The log shows that the jobmanager received a SIGTERM signal from an
>>> external source. Depending on how you deploy Flink, that could be a
>>> 'kill <PID>' command, a Kubernetes pod removal / eviction, etc. You may
>>> want to check where the signal came from.
>>>
>>> Best,
>>> Xintong
>>>
>>>
>>>
>>> On Wed, Oct 12, 2022 at 6:26 AM Puneet Duggal <
>>> puneetduggal1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am facing an issue where restarting a task manager after some
>>>> configuration changes also causes the leader job manager to restart,
>>>> even though the task manager itself restarts successfully with the
>>>> updated configuration. Pasting the leader job manager logs here:
>>>>
>>>>
>>>> 2022-10-11 22:11:02,207 WARN  akka.remote.ReliableDeliverySupervisor
>>>>                    [] - Association with remote system [
>>>> akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for
>>>> [50] ms. Reason: [Disassociated]
>>>> 2022-10-11 22:11:02,411 WARN
>>>> akka.remote.transport.netty.NettyTransport                   [] - Remote
>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>> refused: /<TM-IP>:35376
>>>> 2022-10-11 22:11:02,413 WARN  akka.remote.ReliableDeliverySupervisor
>>>>                    [] - Association with remote system [
>>>> akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for
>>>> [50] ms. Reason: [Association failed with [
>>>> akka.tcp://flink@<TM-IP>:35376]] Caused by:
>>>> [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
>>>> 2022-10-11 22:11:02,682 WARN
>>>> akka.remote.transport.netty.NettyTransport                   [] - Remote
>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>> refused: /<TM-IP>:35376
>>>> 2022-10-11 22:11:02,683 WARN  akka.remote.ReliableDeliverySupervisor
>>>>                    [] - Association with remote system [
>>>> akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for
>>>> [50] ms. Reason: [Association failed with [
>>>> akka.tcp://flink@<TM-IP>:35376]] Caused by:
>>>> [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
>>>> 2022-10-11 22:11:12,702 WARN
>>>> akka.remote.transport.netty.NettyTransport                   [] - Remote
>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>> refused: /<TM-IP>:35376
>>>> 2022-10-11 22:11:12,703 WARN  akka.remote.ReliableDeliverySupervisor
>>>>                    [] - Association with remote system [
>>>> akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for
>>>> [50] ms. Reason: [Association failed with [
>>>> akka.tcp://flink@<TM-IP>:35376]] Caused by:
>>>> [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
>>>> 2022-10-11 22:11:21,683 INFO
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED
>>>> SIGNAL 15: SIGTERM. Shutting down as requested.
>>>> 2022-10-11 22:11:21,687 INFO  org.apache.flink.runtime.blob.BlobServer
>>>>                    [] - Stopped BLOB server at 0.0.0.0:33887
>>>>
>>>>
>>>> Regards,
>>>> Puneet
>>>>
>>>>
>>>>
>>>
>
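
The check described in this thread (finding the "RECEIVED SIGNAL" entry and
seeing which class logged it) can be sketched as a small script. This is
only an illustration: the regular expression is an assumption based on the
log excerpt quoted above, the name find_signals is hypothetical, and the
sample lines are unwrapped (and the first one shortened) from that excerpt.
A different logging layout would need a different pattern.

```python
import re

# Assumed line layout, taken from the excerpt in this thread:
#   <date> <time>,<ms> <LEVEL>  <logger class>   [] - <message>
SIGNAL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)\s+\[\]\s+-\s+"
    r"RECEIVED SIGNAL (?P<num>\d+): (?P<name>\w+)"
)


def find_signals(lines):
    """Return (timestamp, logger, signal number, signal name) per match."""
    hits = []
    for line in lines:
        m = SIGNAL_RE.match(line)
        if m:
            hits.append((m["ts"], m["logger"], int(m["num"]), m["name"]))
    return hits


# Sample lines adapted from the log pasted in this thread.
sample = [
    "2022-10-11 22:11:12,703 WARN  akka.remote.ReliableDeliverySupervisor"
    " [] - Association with remote system"
    " [akka.tcp://flink@<TM-IP>:35376] has failed",
    "2022-10-11 22:11:21,683 INFO "
    " org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -"
    " RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.",
]

for hit in find_signals(sample):
    print(hit)
```

The timestamp of each hit can then be compared against shell history,
service manager logs, or orchestrator events on the jobmanager host to
narrow down what sent the signal. The log itself does not record the
sender.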
