Hi,

You can set the key `task.cancellation.timeout` [1] to increase the timeout. The value is in milliseconds, and the default of 180000 is the 180 seconds you see in the log. The code for this logic is here [2].
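As a rough sketch of what this could look like, assuming your TaskManagers read their settings from `flink-conf.yaml`; the value 300000 is only an illustrative choice, not a recommendation and not something taken from your logs:

```yaml
# flink-conf.yaml on the TaskManager hosts (assumed location of your config)

# Time in milliseconds a task may take to react to cancellation before the
# watchdog reports a fatal error and the TaskManager shuts down.
# Default is 180000 (180 s); 300000 (5 min) here is just an example value.
task.cancellation.timeout: 300000
```

The TaskManagers have to be restarted to pick up the new value. Note that a larger timeout only gives the blocked JDBC call more time to return; it does not make the sink respond to the cancel signal any sooner.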
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#task-cancellation-timeout
[2] https://github.com/apache/flink/blob/f543b8ac690b1dee58bc3cb345a1c8ad0db0941e/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1775

Best,
Congxian

On Thu, Sep 29, 2022 at 19:04, John Smith <java.dev....@gmail.com> wrote:

> Sorry I mean the 180 seconds. Where does flink decide that 180 seconds is
> the cutoff point... And can I increase it.
>
> On Thu., Sep. 29, 2022, 7:02 a.m. John Smith, <java.dev....@gmail.com>
> wrote:
>
>> Is there a way to increase the 30 seconds to 60? Where is that 30 second
>> timeout set?
>>
>> I have jdbc query timeout but at some point at night the insert takes a
>> bit longer cause of index rebuilding.
>>
>> On Wed., Sep. 28, 2022, 5:02 a.m. Congxian Qiu, <qcx978132...@gmail.com>
>> wrote:
>>
>>> Hi John
>>>
>>> Yes, the whole TaskManager exited because the task did not react to
>>> the cancelling signal in time:
>>>
>>> ```
>>> 2022-08-30 09:14:22,138 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Task did not exit gracefully within 180 + seconds.
>>> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
>>>     at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.12-1.14.4.jar:1.14.4]
>>>     at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
>>> 2022-08-30 09:14:22,139 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
>>> ```
>>>
>>> The task's stack was logged as below when cancelling the sink task:
>>>
>>> ```
>>> 2022-08-30 09:14:22,135 WARN org.apache.flink.runtime.taskmanager.Task [] - Task 'Sink: jdbc (1/1)#359' did not react to cancelling signal - notifying TM; it is stuck for 180 seconds in method:
>>> java.net.SocketInputStream.socketRead0(Native Method)
>>> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>> java.net.SocketInputStream.read(SocketInputStream.java:171)
>>> java.net.SocketInputStream.read(SocketInputStream.java:141)
>>> com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
>>> com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
>>> com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
>>> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
>>> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
>>> com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
>>> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2979)
>>> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
>>> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
>>> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.execute(SQLServerPreparedStatement.java:505)
>>> com.xxxxxx.common.flink.connectors.jdbc.xxxxxxJdbcJsonOutputFormat.flush(xxxxxxJdbcJsonOutputFormat.java:111)
>>> com.xxxxxx.common.flink.connectors.jdbc.xxxxxxJdbcJsonSink.snapshotState(xxxxxxJdbcJsonSink.java:33)
>>> ```
>>>
>>> Best,
>>> Congxian
>>>
>>> On Fri, Sep 23, 2022 at 23:35, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Sorry new file:
>>>> https://www.dropbox.com/s/mm9521crwvevzgl/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
>>>>
>>>> On Fri, Sep 23, 2022 at 11:26 AM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi I have attached the logs here...
>>>>>
>>>>> https://www.dropbox.com/s/12gwlps52lvxdhz/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
>>>>>
>>>>> 1- It looks like a timeout issue. Can someone confirm?
>>>>> 2- The task manager is restarted, since I have restart on failure in
>>>>> SystemD. But it seems after a few restarts it stops. Does it mean that
>>>>> SystemD has an internal counter of how many times it will restart a
>>>>> service before it doesn't do it anymore?