Re: Why is task manager shutting down?
Hi,

You can configure the key `task.cancellation.timeout` [1] to increase the timeout. The code that implements this logic is here [2].

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#task-cancellation-timeout
[2] https://github.com/apache/flink/blob/f543b8ac690b1dee58bc3cb345a1c8ad0db0941e/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1775

Best,
Congxian

On Thu, Sep 29, 2022 at 19:04, John Smith wrote:
> Sorry I mean the 180 seconds. Where does flink decide that 180 seconds is the cutoff point... And can I increase it.
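For reference, in Flink 1.14 (the version in this thread) the configuration reference lists two millisecond-valued keys for the cancellation watchdog: `task.cancellation.interval` (default 30000), the time between successive cancellation attempts, which would account for the "stuck for 30 seconds" warnings, and `task.cancellation.timeout` (default 180000), after which a stuck task becomes a fatal TaskManager error. The sketch below only renders the two `flink-conf.yaml` entries from durations. The class name `CancellationSettings` and the 5-minute value are illustrative assumptions, not a recommendation.

```java
import java.time.Duration;

final class CancellationSettings {
    // Key names as listed in the Flink configuration reference; values are in milliseconds.
    static final String INTERVAL_KEY = "task.cancellation.interval";
    static final String TIMEOUT_KEY = "task.cancellation.timeout";

    /** Renders both settings as flink-conf.yaml lines. */
    static String render(Duration interval, Duration timeout) {
        if (interval.isNegative() || timeout.isNegative()) {
            throw new IllegalArgumentException("durations must not be negative");
        }
        return INTERVAL_KEY + ": " + interval.toMillis() + "\n"
                + TIMEOUT_KEY + ": " + timeout.toMillis() + "\n";
    }

    public static void main(String[] args) {
        // Keep the default 30 s interval and raise the fatal timeout
        // from 180 s to 5 min (an illustrative value only).
        System.out.print(render(Duration.ofSeconds(30), Duration.ofMinutes(5)));
    }
}
```

Raising the timeout only gives the stuck call more time to return. The task is still blocked inside the JDBC driver until the query finishes or fails.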
Re: Why is task manager shutting down?
Sorry, I meant the 180 seconds. Where does Flink decide that 180 seconds is the cutoff point, and can I increase it?

On Thu., Sep. 29, 2022, 7:02 a.m., John Smith wrote:
> Is there a way to increase the 30 seconds to 60? Where is that 30 second timeout set?
>
> I have jdbc query timeout but at some point at night the insert takes a bit longer cause of index rebuilding.
Re: Why is task manager shutting down?
Is there a way to increase the 30 seconds to 60? Where is that 30-second timeout set?

I have a JDBC query timeout, but at some point during the night the insert takes a bit longer because of index rebuilding.

On Wed., Sep. 28, 2022, 5:02 a.m., Congxian Qiu wrote:
> Hi John
>
> Yes, the whole TaskManager exited because the task did not react to cancelling signal in time
>
> [...]
Re: Why is task manager shutting down?
Hi John,

Yes, the whole TaskManager exited because the task did not react to the cancelling signal in time:

```
2022-08-30 09:14:22,138 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
    at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.12-1.14.4.jar:1.14.4]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
2022-08-30 09:14:22,139 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
```

The stack of the sink task was logged as follows while it was being cancelled:

```
2022-08-30 09:14:22,135 WARN org.apache.flink.runtime.taskmanager.Task [] - Task 'Sink: jdbc (1/1)#359' did not react to cancelling signal - notifying TM; it is stuck for 180 seconds in method:
 java.net.SocketInputStream.socketRead0(Native Method)
 java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
 java.net.SocketInputStream.read(SocketInputStream.java:171)
 java.net.SocketInputStream.read(SocketInputStream.java:141)
 com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
 com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
 com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:592)
 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
 com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
 com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2979)
 com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
 com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.execute(SQLServerPreparedStatement.java:505)
 com.xx.common.flink.connectors.jdbc.xxJdbcJsonOutputFormat.flush(xxJdbcJsonOutputFormat.java:111)
 com.xx.common.flink.connectors.jdbc.xxJdbcJsonSink.snapshotState(xxJdbcJsonSink.java:33)
```

Best,
Congxian

On Fri, Sep 23, 2022 at 23:35, John Smith wrote:
> Sorry new file: https://www.dropbox.com/s/mm9521crwvevzgl/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
Re: Why is task manager shutting down?
Sorry, new file: https://www.dropbox.com/s/mm9521crwvevzgl/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0

On Fri, Sep 23, 2022 at 11:26 AM John Smith wrote:
> Hi I have attached the logs here...
>
> https://www.dropbox.com/s/12gwlps52lvxdhz/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0
>
> [...]
Why is task manager shutting down?
Hi, I have attached the logs here:

https://www.dropbox.com/s/12gwlps52lvxdhz/flink-flink-taskexecutor-274-flink-prod-v-task-0001.log?dl=0

1- It looks like a timeout issue. Can someone confirm?
2- The task manager is restarted, since I have restart-on-failure configured in systemd. But after a few restarts it seems to stop. Does that mean systemd keeps an internal counter of how many times it will restart a service before it gives up?
Re: Task manager shutting down.
Actually, what's happening is that there's a nightly indexing job. When we call the insert, it takes longer than the specified checkpoint threshold. JDBC will happily keep waiting for a response from the DB until it's done. So the checkpoint threshold is reached and the job tries to shut down and restart, but the job is blocked in the JDBC driver, which causes all kinds of crazy exceptions, as you can see in the logs.

So a stopgap solution was to call setQueryTimeout with a value a bit shorter than the checkpoint threshold. This allows the job to fail "gracefully" and restart until indexing is done.

1- We can review the indexing policy and whether it's required nightly, which just means that instead of the job failing every night it will fail only when the indexing happens.
2- The other option is to figure out a way to pause the job, maybe through cron and savepoints. But that seems overly complicated.

On Wed, May 4, 2022 at 1:40 PM Martijn Visser wrote:
> Hi John,
>
> In an ideal scenario you would be able to leverage Flink's backpressure mechanism. That would effectively slow down the processing until the reason for backpressure has been resolved.
>
> [...]
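The stopgap described above can be sketched as follows. This is a minimal illustration, not the poster's actual sink code: the class `QueryTimeouts`, its method names, and the 30-second safety margin are hypothetical. The only real API used is `java.sql.Statement.setQueryTimeout(int seconds)`, where a value of 0 means no limit.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.Duration;

final class QueryTimeouts {
    /** Returns a query timeout in whole seconds that is shorter than the checkpoint timeout by the margin. */
    static int secondsBelow(Duration checkpointTimeout, Duration margin) {
        long seconds = checkpointTimeout.minus(margin).getSeconds();
        if (seconds <= 0) {
            // 0 would mean "no limit" for setQueryTimeout, which defeats the purpose.
            throw new IllegalArgumentException("margin must be smaller than the checkpoint timeout");
        }
        return Math.toIntExact(seconds);
    }

    /** Applies the derived timeout to a statement before it is executed. */
    static void apply(PreparedStatement statement, Duration checkpointTimeout, Duration margin)
            throws SQLException {
        statement.setQueryTimeout(secondsBelow(checkpointTimeout, margin));
    }

    public static void main(String[] args) {
        // A 10-minute checkpoint timeout with a 30-second margin (both assumed values).
        System.out.println(secondsBelow(Duration.ofMinutes(10), Duration.ofSeconds(30))); // prints 570
    }
}
```

Called before `execute()` in the sink's flush path, this lets a driver that honors the setting raise an `SQLException` instead of blocking indefinitely, so the job can fail and retry as described.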
Re: Task manager shutting down.
Hi John,

In an ideal scenario you would be able to leverage Flink's backpressure mechanism. That would effectively slow down the processing until the reason for backpressure has been resolved. However, given that indexing happens after you've written your result to the sink, from a Flink perspective the action is completed. Perhaps someone else has a different idea on how to achieve this.

Best regards,

Martijn

On Wed, 4 May 2022 at 19:31, John Smith wrote:
> So I know specifically, it's the indexing and I put setQueryTimeout. So the job fails. And goes into retry. That's fine.
>
> But just wondering is there a way to pause the stream at a specified time/checkpoint and then resume after a specified time?
Re: Task manager shutting down.
So I know specifically that it's the indexing, and I added setQueryTimeout. So the job fails and goes into retry. That's fine.

But I'm just wondering: is there a way to pause the stream at a specified time/checkpoint and then resume after a specified time?

On Wed, May 4, 2022 at 10:23 AM Martijn Visser wrote:
> Hi John,
>
> It is generic, but each database has its own dialect implementation because they all have their differences unfortunately :)
>
> I wish I knew how I could help you out here. Perhaps some of the JDBC maintainers could chip in.
Re: Task manager shutting down.
Hi John,

It is generic, but each database has its own dialect implementation because they all have their differences, unfortunately :)

I wish I knew how I could help you out here. Perhaps some of the JDBC maintainers could chip in.

Best regards,

Martijn

On Sun, 1 May 2022 at 04:06, John Smith wrote:
> Plus in a way isn't the flink-jdbc connector kinda generic? At least the older one didn't seem to be server specific.
Re: Task manager shutting down.
Plus, in a way, isn't the flink-jdbc connector kind of generic? At least the older one didn't seem to be server-specific.

On Sat, Apr 30, 2022 at 10:04 PM John Smith wrote:
> Hi Martin, is there anything I need to check for?
Re: Task manager shutting down.
Hi Martijn, is there anything I need to check for?

On Tue, Apr 26, 2022 at 9:50 PM John Smith wrote:
> Yeah based off the flink JDBC output format...
Re: Task manager shutting down.
Yeah, based off the Flink JDBC output format...

On Tue, Apr 26, 2022 at 10:05 AM Martijn Visser wrote:
> Hi John,
>
> Have you built your own JDBC MSSQL source or sink or perhaps a CDC driver?
> Because I'm not aware of a Flink Microsoft SQL Server JDBC driver.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
Re: Task manager shutting down.
Hi John,

Have you built your own JDBC MSSQL source or sink or perhaps a CDC driver? Because I'm not aware of a Flink Microsoft SQL Server JDBC driver.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

On Tue, 26 Apr 2022 at 16:01, John Smith wrote:
> Hi running 1.14.4
>
> Logs included:
> https://www.dropbox.com/s/8zjndt5rzd9o80f/flink-flink-taskexecutor-138-task-0002.log?dl=0
>
> 1- My task managers shut down with: Terminating TaskManagerRunner with
> exit code 1.
> 2- It seems to happen at the same time every day. Which leads me to
> believe it's our database indexing (See below for reasoning of this).
> 3- Most of our jobs are ETL from Kafka to SQL Server.
> 4- We see the following exceptions in the logs:
> - Task 'Sink: jdbc (1/1)#10' did not react to cancelling signal -
> interrupting; it is stuck for 30 seconds in method:
> ... com.microsoft.sqlserver.jdbc.TDSChannel ...
> - Sink: jdbc (1/1)#9 (3aaf6d8a45df6c43198bc8297b42354c) switched
> from RUNNING to FAILED with failure cause:
> org.apache.flink.util.FlinkException: Disconnect from JobManager
> responsible for ...
> 5- Also seeing this: Failed to close consumer network client with type
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient
> java.lang.NoClassDefFoundError:
> org/apache/kafka/common/network/Selector$CloseMode
>
> So what I'm guessing is happening is the indexing is blocking the job and
> the task manager cannot cleanly remove the job and finally after a while it
> decides to shut down completely?
>
> Is there a way to pause the stream and restart at a later time knowing
> that this happens always at the same wall clock time? Or maybe allow the
> JDBC to cleanly shutdown with a timeout?
Task manager shutting down.
Hi, I'm running Flink 1.14.4.

Logs included:
https://www.dropbox.com/s/8zjndt5rzd9o80f/flink-flink-taskexecutor-138-task-0002.log?dl=0

1- My task managers shut down with: Terminating TaskManagerRunner with exit code 1.
2- It seems to happen at the same time every day, which leads me to believe it's our database indexing (see below for my reasoning).
3- Most of our jobs are ETL from Kafka to SQL Server.
4- We see the following exceptions in the logs:
   - Task 'Sink: jdbc (1/1)#10' did not react to cancelling signal - interrupting; it is stuck for 30 seconds in method: ... com.microsoft.sqlserver.jdbc.TDSChannel ...
   - Sink: jdbc (1/1)#9 (3aaf6d8a45df6c43198bc8297b42354c) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for ...
5- Also seeing this: Failed to close consumer network client with type org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient java.lang.NoClassDefFoundError: org/apache/kafka/common/network/Selector$CloseMode

So my guess is that the indexing blocks the job, the task manager cannot cleanly remove the job, and after a while it decides to shut down completely?

Is there a way to pause the stream and restart it at a later time, given that this always happens at the same wall-clock time? Or maybe allow the JDBC sink to shut down cleanly with a timeout?
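For readers following the thread: the log messages quoted here (stuck for 30 seconds while being interrupted, and, as reported elsewhere in the thread, not exiting within 180 seconds) suggest a two-stage pattern in which the task thread is interrupted repeatedly and the whole process is given up on once a longer deadline passes. Below is a minimal, self-contained Java sketch of that pattern. It is not Flink's actual code. The names `CancellationWatchdogSketch`, `cancel`, `startStuckTask` and `startResponsiveTask` are made up for illustration, the stuck task only simulates a blocking socket read that does not react to `Thread.interrupt()`, and the 30/180 values are milliseconds standing in for the seconds seen in the logs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CancellationWatchdogSketch {

    /** Result of trying to cancel a task thread. */
    public enum Outcome { EXITED_GRACEFULLY, FATAL_TIMEOUT }

    /**
     * Interrupts the task every intervalMs until it exits. If it is still
     * alive after timeoutMs, gives up and reports FATAL_TIMEOUT, which is
     * the point where a real process would be shut down.
     */
    public static Outcome cancel(Thread task, long intervalMs, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (task.isAlive()) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return Outcome.FATAL_TIMEOUT;
            }
            task.interrupt();
            try {
                // join(0) would wait forever, so always wait at least 1 ms
                task.join(Math.max(1, Math.min(intervalMs, remainingMs)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("canceller was interrupted", e);
            }
        }
        return Outcome.EXITED_GRACEFULLY;
    }

    /** Hypothetical task that reacts to interruption, like a well-behaved sink. */
    public static Thread startResponsiveTask() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // cancellation observed: fall through and let the thread end
            }
        }, "responsive-sink");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /**
     * Hypothetical task that swallows interrupts until released, standing in
     * for a thread blocked in a socket read on a slow database.
     */
    public static Thread startStuckTask(CountDownLatch release) {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    release.await();
                    return;
                } catch (InterruptedException ignored) {
                    // simulate a blocking call that does not react to interrupts
                }
            }
        }, "stuck-sink");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        long intervalMs = 30;   // stands in for the 30 s seen in the log
        long timeoutMs = 180;   // stands in for the 180 s seen in the log
        System.out.println(cancel(startResponsiveTask(), intervalMs, timeoutMs));
        CountDownLatch release = new CountDownLatch(1);
        System.out.println(cancel(startStuckTask(release), intervalMs, timeoutMs));
        release.countDown(); // let the simulated "query" finish
    }
}
```

Run as is, this should print EXITED_GRACEFULLY for the responsive task and FATAL_TIMEOUT for the stuck one. In Flink the corresponding settings appear to be `task.cancellation.interval` and `task.cancellation.timeout`. Raising them only buys time, so a JDBC-level limit such as `Statement.setQueryTimeout` is probably still needed if the sink is to return on its own while the database is busy.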