Re: Operator metrics do not get unregistered after job finishes
Hi Vino, The log shows no problems. The problem can be reproduced easily. I created https://issues.apache.org/jira/browse/FLINK-10300 <https://issues.apache.org/jira/browse/FLINK-10300>. Best, Helmut > On 18. Aug 2018, at 04:53, vino yang wrote: > > Hi Helmut, > > Is the metrics of all the sub task instances of a job not unregistered, or > part of it is not unregistered. Is there any exception log information > available? > > Please feel free to create a JIRA issue and clearly describe your problem. > > Thanks, vino. > > Helmut Zechmann mailto:hel...@adeven.com>> 于2018年8月17日周五 > 下午11:14写道: > Hi all, > > > we are using flink 1.5.2 in batch mode with prometheus monitoring. > > We noticed that a few metrics do not get unregistered after a job is finished: > > flink_taskmanager_job_task_operator_numRecordsIn > flink_taskmanager_job_task_operator_numRecordsInPerSecond > flink_taskmanager_job_task_operator_numRecordsOut > flink_taskmanager_job_task_operator_numRecordsOutPerSecond > > > Those metrics stay in the taksmanager metrics list until the task manger gets > restarted. > > Our metrics config is: > > metrics.reporters: prom > metrics.reporter.prom.class: > org.apache.flink.metrics.prometheus.PrometheusReporter > metrics.reporter.prom.port: 7000-7001 > > metrics.scope.jm <http://metrics.scope.jm/>: flink..jobmanager > metrics.scope.tm <http://metrics.scope.tm/>: flink..taskmanager. > metrics.scope.jm.job: flink..jobmanager. > metrics.scope.tm.job: flink..taskmanager.. > metrics.scope.task: > flink..taskmanager > metrics.scope.operator: > flink..taskmanager > > > Since we run many batch jobs, this makes prometheus monitoring unusable for > us. Is this a known issue? > > > Best, > > Helmut
Re: Flink Jobmanager Failover in HA mode
Hi Dominik, all jobs on the cluster (batch only jobs without state) where in status FINISHED. Best, Helmut On Fri, Aug 17, 2018 at 8:04 PM Dominik Wosiński wrote: > I have faced this issue, but in 1.4.0 IIRC. This seems to be related to > https://issues.apache.org/jira/browse/FLINK-10011. What was the status of > the jobs when the main Job Manager has been stopped ? > > 2018-08-17 17:08 GMT+02:00 Helmut Zechmann : > >> Hi all, >> >> we have a problem with flink 1.5.2 high availability in standalone mode. >> >> We have two jobmanagers running. When I shut down the main job manager, >> the failover job manager encounters an error during failover. >> >> Logs: >> >> >> 2018-08-17 14:38:16,478 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system [akka.tcp:// >> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50] >> ms. Reason: [Disassociated] >> 2018-08-17 14:38:31,449 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: >> seg-1.adjust.com/178.162.219.66:29095 >> 2018-08-17 <http://seg-1.adjust.com/178.162.219.66:29095%0D2018-08-17> >> 14:38:31,451 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system [akka.tcp:// >> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50] >> ms. Reason: [Association failed with [akka.tcp:// >> fl...@seg-1.adjust.com:29095]] Caused by: [Connection refused: >> seg-1.adjust.com/178.162.219.66:29095] >> 2018-08-17 14:38:41,379 ERROR >> org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler >> - Could not retrieve the redirect address. >> java.util.concurrent.CompletionException: >> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp:// >> fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [1 >> ms]. Sender[null] sent message of type >> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". >> at >> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) >> [... shortened ...] >> Caused by: akka.pattern.AskTimeoutException: Ask timed out on >> [Actor[akka.tcp:// >> fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [1 >> ms]. Sender[null] sent message of type >> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". >> at >> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) >> ... 9 more >> 2018-08-17 14:38:48,005 INFO >> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- >> http://seg-2.adjust.com:8083 was granted leadership with >> leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e >> 2018-08-17 14:38:48,005 INFO >> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - >> ResourceManager akka.tcp:// >> fl...@seg-2.adjust.com:30169/user/resourcemanager was granted leadership >> with fencing token 8de829de14876a367a80d37194b944ee >> 2018-08-17 14:38:48,006 INFO >> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - >> Starting the SlotManager. >> 2018-08-17 14:38:48,007 INFO >> org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher >> akka.tcp://fl...@seg-2.adjust.com:30169/user/dispatcher was granted >> leadership with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5 >> 2018-08-17 14:38:48,007 INFO >> org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering >> all persisted jobs. >> 2018-08-17 14:38:48,021 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null). >> 2018-08-17 14:38:48,028 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null). >> 2018-08-17 14:38:48,035 ERROR >> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error >> occurred in the cluster entrypoint. >> java.lang.RuntimeException: >> org.apache.flink.runtime.client.JobExecutionException: Could not set up >> JobManager >> at >> org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) >> [... shortened ...] >> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could >> not set up JobManager >> at >> org.apache.flink.runtime.jobmaster.JobManagerRunner.(Jo
Operator metrics do not get unregistered after job finishes
Hi all, we are using flink 1.5.2 in batch mode with prometheus monitoring. We noticed that a few metrics do not get unregistered after a job is finished: flink_taskmanager_job_task_operator_numRecordsIn flink_taskmanager_job_task_operator_numRecordsInPerSecond flink_taskmanager_job_task_operator_numRecordsOut flink_taskmanager_job_task_operator_numRecordsOutPerSecond Those metrics stay in the taksmanager metrics list until the task manger gets restarted. Our metrics config is: metrics.reporters: prom metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter metrics.reporter.prom.port: 7000-7001 metrics.scope.jm: flink..jobmanager metrics.scope.tm: flink..taskmanager. metrics.scope.jm.job: flink..jobmanager. metrics.scope.tm.job: flink..taskmanager.. metrics.scope.task: flink..taskmanager metrics.scope.operator: flink..taskmanager Since we run many batch jobs, this makes prometheus monitoring unusable for us. Is this a known issue? Best, Helmut
Flink Jobmanager Failover in HA mode
Hi all, we have a problem with flink 1.5.2 high availability in standalone mode. We have two jobmanagers running. When I shut down the main job manager, the failover job manager encounters an error during failover. Logs: 2018-08-17 14:38:16,478 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50] ms. Reason: [Disassociated] 2018-08-17 14:38:31,449 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: seg-1.adjust.com/178.162.219.66:29095 2018-08-17 14:38:31,451 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@seg-1.adjust.com:29095]] Caused by: [Connection refused: seg-1.adjust.com/178.162.219.66:29095] 2018-08-17 14:38:41,379 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) [... shortened ...] Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) ... 9 more 2018-08-17 14:38:48,005 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- http://seg-2.adjust.com:8083 was granted leadership with leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e 2018-08-17 14:38:48,005 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://fl...@seg-2.adjust.com:30169/user/resourcemanager was granted leadership with fencing token 8de829de14876a367a80d37194b944ee 2018-08-17 14:38:48,006 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager. 2018-08-17 14:38:48,007 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://fl...@seg-2.adjust.com:30169/user/dispatcher was granted leadership with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5 2018-08-17 14:38:48,007 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs. 2018-08-17 14:38:48,021 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null). 2018-08-17 14:38:48,028 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null). 2018-08-17 14:38:48,035 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint. java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) [... shortened ...] Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:176) at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936) at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291) at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281) at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38) ... 21 more Caused by: java.lang.Exception: Cannot set up the user code libraries: /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory) at org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:134) ... 25 more Caused by: java.io.FileNotFoundException: /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory) at java.io.FileInputStream.open0(Native Method) [... shortened ...
Fwd: Processing Sorted Input Datasets
Hi all, Helmut Zechmann helmut.zechm...@mailbox.org www.helmutzechmann.com 0151 27527950 we want to use flink batch to merge records from two or more datasets using groupBy. The input datasets are already sorted since they have been written out sorted by some other job. Is it possible to tell flink that it does not have to re-sort the data again? Best, Helmut