Re: Why is task manager shutting down?

Congxian Qiu Fri, 30 Sep 2022 04:46:00 -0700

Hi,
    You can increase the timeout with the configuration key
`task.cancellation.timeout`[1]. The code that implements this logic is here[2].
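
For example, here is a minimal sketch of the setting in `conf/flink-conf.yaml`. It assumes the documented default of 180000 ms, which matches the 180 seconds in your log. The value 300000 is only an illustration; pick whatever comfortably covers your slowest insert.

```yaml
# conf/flink-conf.yaml on each TaskManager
# Timeout in milliseconds after which a stuck task cancellation
# becomes a fatal TaskManager error. Default: 180000 (180 s).
# 300000 (5 min) is an illustrative value, not a recommendation.
task.cancellation.timeout: 300000
```

According to the configuration docs [1], a value of 0 deactivates the watchdog. The TaskManagers have to be restarted to pick up a change in this file.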


[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#task-cancellation-timeout
[2]
https://github.com/apache/flink/blob/f543b8ac690b1dee58bc3cb345a1c8ad0db0941e/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1775

Best,
Congxian


On Thu, Sep 29, 2022 at 19:04, John Smith <java.dev....@gmail.com> wrote:

> Sorry, I mean the 180 seconds. Where does Flink decide that 180 seconds is
> the cutoff point, and can I increase it?
>
> On Thu., Sep. 29, 2022, 7:02 a.m. John Smith, <java.dev....@gmail.com>
> wrote:
>
>> Is there a way to increase the 30 seconds to 60? Where is that 30 second
>> timeout set?
>>
>> I have a JDBC query timeout, but at some point at night the insert takes a
>> bit longer because of index rebuilding.
>>
>> On Wed., Sep. 28, 2022, 5:02 a.m. Congxian Qiu, <qcx978132...@gmail.com>
>> wrote:
>>
>>> Hi John
>>>
>>> Yes, the whole TaskManager exited because the task did not react to
>>> the cancelling signal in time.
>>>
>>> ```
>>>
>>> 2022-08-30 09:14:22,138 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Task did not exit gracefully within 180 + seconds.
>>> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
>>>     at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.12-1.14.4.jar:1.14.4]
>>>     at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
>>> 2022-08-30 09:14:22,139 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Fatal error occurred while executing the TaskManager. Shutting it down...
>>>
>>> ```
>>>
>>>
>>> The task's stack trace was logged as below when cancelling the sink task:
>>>
>>> ```
>>>
>>> 2022-08-30 09:14:22,135 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Task 'Sink: jdbc (1/1)#359' did not react to cancelling signal - notifying TM; it is stuck for 180 seconds in method:
>>>  java.net.SocketInputStream.socketRead0(Native Method)
>>> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>> java.net.SocketInputStream.read(SocketInputStream.java:171)
>>> java.net.SocketInputStream.read(SocketInputStream.java:141)
>>> com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
>>> com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
>>> com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
>>> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
>>> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
>>> com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
>>> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2979)
>>> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
>>> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
>>> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.execute(SQLServerPreparedStatement.java:505)
>>> com.xxxxxx.common.flink.connectors.jdbc.xxxxxxJdbcJsonOutputFormat.flush(xxxxxxJdbcJsonOutputFormat.java:111)
>>> com.xxxxxx.common.flink.connectors.jdbc.xxxxxxJdbcJsonSink.snapshotState(xxxxxxJdbcJsonSink.java:33)
>>> ```
>>>
>>>
>>> Best,
>>> Congxian
>>>
>>>
>>> On Fri, Sep 23, 2022 at 23:35, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Sorry new file:
>>>> https://www.dropbox.com/s/mm9521crwvevzgl/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
>>>>
>>>> On Fri, Sep 23, 2022 at 11:26 AM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi I have attached the logs here...
>>>>>
>>>>>
>>>>> https://www.dropbox.com/s/12gwlps52lvxdhz/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
>>>>>
>>>>> 1- It looks like a timeout issue. Can someone confirm?
>>>>> 2- The task manager is restarted, since I have restart on failure
>>>>> configured in systemd. But after a few restarts it seems to stop.
>>>>> Does that mean systemd has an internal counter of how many times it
>>>>> will restart a service before it gives up?
>>>>>
>>>>
