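Sometimes a job like this runs for more than two hours, and my only option is to cancel it. The cancellation never completes normally, though: every time, it ends up bringing down the TaskManager. The error is as follows: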
2021-01-31 23:04:23,677 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
    at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1685) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-01-31 23:04:23,685 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
    at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1685) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-01-31 23:04:23,686 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Stopping TaskExecutor akka.tcp://flink@10.13.69.52:45901/user/rpc/taskmanager_0.
2021-01-31 23:04:23,686 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close ResourceManager connection 1bd159f361d86e77d17e261ab44b5128.
2021-01-31 23:04:23,689 WARN org.apache.flink.runtime.taskmanager.Task [] - Task 'Source: HiveSource-snmpprobe.p_port_traffic_5m -> Calc(select=[binaryid AS id, ver, CAST(2021-01-31 21:45:00:TIMESTAMP(6)) AS coltime, CAST(in_octets) AS in_octets, CAST(out_octets) AS out_octets, CAST(bi_octets) AS bi_octets, CAST(unimax_octets) AS unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, CAST((() DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) AS crtime], where=[(coltime = 2021-01-31 21:45:00:TIMESTAMP(9))]) -> Sink: Sink(table=[myhive.prod_mysql_zqzynetdb.p_port_traffic_5m], fields=[id, ver, coltime, in_octets, out_octets, bi_octets, unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, crtime]) (1/1)#0' did not react to cancelling signal for 30 seconds, but is stuck in method:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:171)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    com.mysql.cj.protocol.ReadAheadInputStream.fill(ReadAheadInputStream.java:107)
    com.mysql.cj.protocol.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:150)
    com.mysql.cj.protocol.ReadAheadInputStream.read(ReadAheadInputStream.java:180)
    java.io.FilterInputStream.read(FilterInputStream.java:133)
    com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
    com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:63)
    com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:45)
    com.mysql.cj.protocol.a.TimeTrackingPacketReader.readHeader(TimeTrackingPacketReader.java:52)
    com.mysql.cj.protocol.a.TimeTrackingPacketReader.readHeader(TimeTrackingPacketReader.java:41)
    com.mysql.cj.protocol.a.MultiPacketReader.readHeader(MultiPacketReader.java:54)
    com.mysql.cj.protocol.a.MultiPacketReader.readHeader(MultiPacketReader.java:44)
    com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:538)
    com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:708)
    com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:647)
    com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:946)
    com.mysql.cj.NativeSession.execSQL(NativeSession.java:1075)
    com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:930)
    com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1092)
    com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchSerially(ClientPreparedStatement.java:832)
    com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:435)
    com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:796)
    org.apache.flink.connector.jdbc.statement.FieldNamedPreparedStatementImpl.executeBatch(FieldNamedPreparedStatementImpl.java:65)
    org.apache.flink.connector.jdbc.internal.executor.TableSimpleStatementExecutor.executeBatch(TableSimpleStatementExecutor.java:64)
    org.apache.flink.connector.jdbc.internal.executor.TableBufferReducedStatementExecutor.executeBatch(TableBufferReducedStatementExecutor.java:101)
    org.apache.flink.connector.jdbc.internal.JdbcBatchingOutputFormat.attemptFlush(JdbcBatchingOutputFormat.java:216)
    org.apache.flink.connector.jdbc.internal.JdbcBatchingOutputFormat.flush(JdbcBatchingOutputFormat.java:184)
    org.apache.flink.connector.jdbc.internal.JdbcBatchingOutputFormat.writeRecord(JdbcBatchingOutputFormat.java:167)
    org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.invoke(OutputFormatSinkFunction.java:87)
    org.apache.flink.streaming.api.functions.sink.SinkFunction.invoke(SinkFunction.java:49)
    org.apache.flink.table.runtime.operators.sink.SinkOperator.processElement(SinkOperator.java:72)
    org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:112)
    org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:93)
    org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39)
    org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50)
    org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28)
    BatchCalc$597.processElement(Unknown Source)
    org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:112)
    org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:93)
    org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39)
    org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:160)
    org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
    org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
    org.apache.flink.connector.file.src.impl.FileSourceRecordEmitter.emitRecord(FileSourceRecordEmitter.java:45)
    org.apache.flink.connector.file.src.impl.FileSourceRecordEmitter.emitRecord(FileSourceRecordEmitter.java:35)
    org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:128)
    org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:275)
    org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:67)
    org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
    org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:395)
    org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$267/2023964146.runDefaultAction(Unknown Source)
    org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:191)
    org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:609)
    org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
    org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
    org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
    java.lang.Thread.run(Thread.java:748)
2021-01-31 23:04:23,691 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close JobManager connection for job 13bc5fa5addba5772e9425161b48a2e3.
2021-01-31 23:04:23,691 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally Source: HiveSource-snmpprobe.p_port_traffic_5m -> Calc(select=[binaryid AS id, ver, CAST(2021-01-31 21:45:00:TIMESTAMP(6)) AS coltime, CAST(in_octets) AS in_octets, CAST(out_octets) AS out_octets, CAST(bi_octets) AS bi_octets, CAST(unimax_octets) AS unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, CAST((() DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) AS crtime], where=[(coltime = 2021-01-31 21:45:00:TIMESTAMP(9))]) -> Sink: Sink(table=[myhive.prod_mysql_zqzynetdb.p_port_traffic_5m], fields=[id, ver, coltime, in_octets, out_octets, bi_octets, unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, crtime]) (1/1)#0 (7f4ca2467b4b31e2476bb7f5b93f6d33).
2021-01-31 23:04:23,691 INFO org.apache.flink.runtime.taskmanager.Task [] - Task Source: HiveSource-snmpprobe.p_port_traffic_5m -> Calc(select=[binaryid AS id, ver, CAST(2021-01-31 21:45:00:TIMESTAMP(6)) AS coltime, CAST(in_octets) AS in_octets, CAST(out_octets) AS out_octets, CAST(bi_octets) AS bi_octets, CAST(unimax_octets) AS unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, CAST((() DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) AS crtime], where=[(coltime = 2021-01-31 21:45:00:TIMESTAMP(9))]) -> Sink: Sink(table=[myhive.prod_mysql_zqzynetdb.p_port_traffic_5m], fields=[id, ver, coltime, in_octets, out_octets, bi_octets, unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, crtime]) (1/1)#0 is already in state CANCELING
2021-01-31 23:04:23,693 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:22, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=172.000mb (180355070 bytes), taskOffHeapMemory=0 bytes, managedMemory=128.000mb (134217730 bytes), networkMemory=16.000mb (16777216 bytes)}, allocationId: b70921707d488586bce0319feb054ebc, jobId: 13bc5fa5addba5772e9425161b48a2e3).
2021-01-31 23:04:23,693 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally Source: HiveSource-snmpprobe.p_port_traffic_5m -> Calc(select=[binaryid AS id, ver, CAST(2021-01-31 21:45:00:TIMESTAMP(6)) AS coltime, CAST(in_octets) AS in_octets, CAST(out_octets) AS out_octets, CAST(bi_octets) AS bi_octets, CAST(unimax_octets) AS unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, CAST((() DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) AS crtime], where=[(coltime = 2021-01-31 21:45:00:TIMESTAMP(9))]) -> Sink: Sink(table=[myhive.prod_mysql_zqzynetdb.p_port_traffic_5m], fields=[id, ver, coltime, in_octets, out_octets, bi_octets, unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, crtime]) (1/1)#0 (7f4ca2467b4b31e2476bb7f5b93f6d33).
2021-01-31 23:04:23,693 INFO org.apache.flink.runtime.taskmanager.Task [] - Task Source: HiveSource-snmpprobe.p_port_traffic_5m -> Calc(select=[binaryid AS id, ver, CAST(2021-01-31 21:45:00:TIMESTAMP(6)) AS coltime, CAST(in_octets) AS in_octets, CAST(out_octets) AS out_octets, CAST(bi_octets) AS bi_octets, CAST(unimax_octets) AS unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, CAST((() DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) AS crtime], where=[(coltime = 2021-01-31 21:45:00:TIMESTAMP(9))]) -> Sink: Sink(table=[myhive.prod_mysql_zqzynetdb.p_port_traffic_5m], fields=[id, ver, coltime, in_octets, out_octets, bi_octets, unimax_octets, in_speed, out_speed, bi_speed, unimax_speed, in_util, out_util, bi_util, unimax_util, inout_ratio, bandwidth, origin, crtime]) (1/1)#0 is already in state CANCELING
2021-01-31 23:04:23,695 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:22, state:RELEASING, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=172.000mb (180355070 bytes), taskOffHeapMemory=0 bytes, managedMemory=128.000mb (134217730 bytes), networkMemory=16.000mb (16777216 bytes)}, allocationId: b70921707d488586bce0319feb054ebc, jobId: 13bc5fa5addba5772e9425161b48a2e3).
2021-01-31 23:04:33,693 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2021-01-31 23:04:33,693 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2021-01-31 23:04:33,693 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-01-31 23:04:33,695 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-b62b4fe6-a247-48c2-b1c2-f813ac4d2a78
2021-01-31 23:04:33,697 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /tmp/flink-io-a8733545-1457-42f4-892a-779141bc4ce5
2021-01-31 23:04:33,697 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /tmp/flink-netty-shuffle-3889e83c-89e2-436e-8e30-2997eaf8cd21
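The WARN entry shows the task thread parked in `java.net.SocketInputStream.socketRead0`, waiting for MySQL to answer a batch insert issued by the JDBC sink (`JdbcBatchingOutputFormat.flush` -> `executeBatch`). As far as I understand, Flink cancels a task by interrupting its thread, and a blocking read on a classic `java.net.Socket` in a platform thread ignores interrupts, which would explain why the task never reacts and the 180-second watchdog then kills the TaskManager. The plain-JDK sketch below illustrates that behavior under stated assumptions: a local peer that accepts a connection but never replies stands in for an unresponsive MySQL server, and the class and method names are hypothetical. It compares interrupting a blocked reader with bounding the read via `Socket.setSoTimeout`. MySQL Connector/J's `socketTimeout` connection property (milliseconds, 0 = no timeout by default) is, to my knowledge, the driver-level counterpart, but whether it is the right fix for this particular job is an assumption.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class: a local peer that accepts a connection but never
// replies stands in for an unresponsive MySQL server.
public class BlockedReadDemo {

    /** Returns true if a thread blocked in a socket read is still alive after interrupt(). */
    static boolean stillBlockedAfterInterrupt() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket peer = server.accept()) { // accepted, but never writes anything
            Thread reader = new Thread(() -> {
                try {
                    client.getInputStream().read(); // no SO_TIMEOUT: waits indefinitely
                } catch (IOException e) {
                    // reached only when the socket is closed at the end of the try block
                }
            });
            reader.setDaemon(true);
            reader.start();
            Thread.sleep(200);       // let the reader block in the native read
            reader.interrupt();      // roughly what task cancellation does
            reader.join(1000);       // give it up to one second to exit
            return reader.isAlive(); // true means the interrupt was ignored
        }
    }

    /** Returns true if a read with SO_TIMEOUT set gives up with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(timeoutMillis); // bounds every blocking read on this socket
            try {
                client.getInputStream().read();
                return false; // unexpected: the peer never sends data
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("still blocked after interrupt: " + stillBlockedAfterInterrupt());
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

On a JDK using ordinary platform threads, both lines should report `true`: the interrupt leaves the reader stuck, while the timed read fails fast with an exception that the calling code (here, the sink) could handle.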