Hi Chesnay, sorry for the late reply. I did not have time to look into this sooner
I did what you suggested. Added some logging to the PrometheusReporter like this : https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a And deployed a custom build of the reporter to our test environment. I managed to reproduce the issue like this 1. Deploy job A : it lands on worker 1 2. Deploy job B : it lands on worker 1, take note of the job id 3. Redeploy job b by canceling it from a savepoint and deploying it again from the savepoint : it lands on worker 3 4. Execute curl -s http://localhost:9249/metrics | grep "job id from step 2" on worker 1. The metrics are still exposed even though the job is canceled I attached a piece of the log to the email. What I notice is that the two jobs register metrics with the same scoped metric name. In this case flink_taskmanager_job_task_buffers_inputQueueLength. The prometheus exporter seems to use reference counting for the metrics and the metrics will only be removed when the count is 0, canceling job B will lower the counter by 5 but because job A still is deployed the count is not 1 so the metric never gets unregistered Canceling job A will remove the lingering metrics from the old job B It seems to me that this is a bug and that the childs that are being added in notifyOfAddedMetric <https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163> should be removed in notifyOfRemovedMetric Can you confirm this ? --Jelmer On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler <ches...@apache.org> wrote: > I remember that another user reported something similar, but he wasn't > using the PrometheusReporter. see > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html > > We couldn't find the cause, but my suspicion was FLINK-8946 which will be > fixed in 1.4.3 . > You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see > whether this actually caused it. > > Alternatively, if you can reproduce this it would be immensely helpful if > you could modify the PrometheusReporter and log all notifications about > added or removed metrics. > > On 15.06.2018 15:42, Till Rohrmann wrote: > > Hi, > > this sounds very strange. I just tried it out locally with with a standard > metric and the Prometheus metrics seem to be unregistered after the job has > reached a terminal state. Thus, it looks as if the standard metrics are > properly removed from `CollectorRegistry.defaultRegistry`. Could you check > the log files whether they contain anything suspicious about a failed > metric deregistration a la `There was a problem unregistering metric`? > > I've also pulled in Chesnay who knows more about the metric reporters. > > Cheers, > Till > > On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com> wrote: > >> Hi >> >> We are using flink-metrics-prometheus for reporting on apache flink 1.4.2 >> >> And I am looking into an issue where it seems that somehow in some cases >> the metrics registered >> by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup >> (flink_taskmanager_job_task_buffers_outPoolUsage etc) are not being >> unregistered in prometheus in case of a job restart >> >> Eventually this seems to cause a java.lang.NoClassDefFoundError: >> org/apache/kafka/common/metrics/stats/Rate$1 error when a new version of >> the job is deployed because the jar file >> in /tmp/blobStore-foo/job_bar/blob_p-baz-qux has been removed upon >> deployment of the new job but the url classloader still points to it and it >> cannot find stats/Rate$1 (some synthetically generated code generated by >> the java compiler because its a switch on an enum) >> >> Has anybody come across this issue ? Has it possibly been fixed in 1.5 ? >> Can somebody any pointers as to where to look to tackle this ? >> >> Attached screenshot shows what classloader that cannot be garbage >> collected with the gc root >> >> >
2018-06-26 15:21:41.602 [marshalled-with-bot-header-for-dev -> Sink: kafka-sink-for-dev (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,odin-router-v2,marshalled-with-bot-header-for-dev -> Sink: kafka-sink-for-dev,0,buffers and metric group variables {<job_id>=58f009823d425ec30a95fcab9b91929f, <task_id>=cc2765d0460902a818200cb341f809c2, <task_attempt_id>=d00c5dc5565d99849c4d996057cdf99f, <host>=flinkworker001, <task_name>=marshalled-with-bot-header-for-dev -> Sink: kafka-sink-for-dev, <task_attempt_num>=0, <job_name>=odin-router-v2, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 0 2018-06-26 15:21:41.605 [bots-for-dev -> Map -> Sink: bots-sink-for-dev (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,odin-router-v2,bots-for-dev -> Map -> Sink: bots-sink-for-dev,0,buffers and metric group variables {<job_id>=58f009823d425ec30a95fcab9b91929f, <task_id>=83df09636bb350560b5f98563cc74f57, <task_attempt_id>=ffa5f9398bde80c4e85d12038cdb9b69, <host>=flinkworker001, <task_name>=bots-for-dev -> Map -> Sink: bots-sink-for-dev, <task_attempt_num>=0, <job_name>=odin-router-v2, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 1 2018-06-26 15:21:41.612 [Source: kafka-source-for-dev -> marshalling-for-dev -> (Flat Map, pii-and-nonpii-for-dev) (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,odin-router-v2,Source: kafka-source-for-dev -> marshalling-for-dev -> (Flat Map, pii-and-nonpii-for-dev),0,buffers and metric group variables {<job_id>=58f009823d425ec30a95fcab9b91929f, <task_id>=e88427a3f2292fd83a881281ca32e317, <task_attempt_id>=19880799af3abd1ddd396aac1f2e4f4e, <host>=flinkworker001, <task_name>=Source: kafka-source-for-dev -> marshalling-for-dev -> (Flat Map, pii-and-nonpii-for-dev), <task_attempt_num>=0, <job_name>=odin-router-v2, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 2 2018-06-26 15:23:20.502 [Source: kafka-source-reaction-views -> filtered-views -> views (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: kafka-source-reaction-views -> filtered-views -> views,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=590884759e085b7988e6295889a2fffe, <task_attempt_id>=0de7367b4b0415b7c8c79bb73fe429b3, <host>=flinkworker001, <task_name>=Source: kafka-source-reaction-views -> filtered-views -> views, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 3 2018-06-26 15:23:20.507 [Source: kafka-source-ad-deletes -> cleanup-queue-items (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: kafka-source-ad-deletes -> cleanup-queue-items,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=8daabebbb8d55449bdc9d59d2d0e3677, <task_attempt_id>=6f6bdbbc9f6e88c71f792e6cfaa4b579, <host>=flinkworker001, <task_name>=Source: kafka-source-ad-deletes -> cleanup-queue-items, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 4 2018-06-26 15:23:20.511 [buffered-unique-views-per-day -> Sink: cassandra-sink (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,buffered-unique-views-per-day -> Sink: cassandra-sink,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=59fde8bd4b4482cfb22134e20521abd9, <task_attempt_id>=0eb245932abda4866c85d7d06da71f3c, <host>=flinkworker001, <task_name>=buffered-unique-views-per-day -> Sink: cassandra-sink, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 5 2018-06-26 15:23:20.513 [unique-views-per-day -> Map -> Sink: late-views (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,unique-views-per-day -> Map -> Sink: late-views,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=04c3abd3d7d44bb11ea475260ffec7a2, <task_attempt_id>=d29bc78d9a3bfe6bba7fa44071daf780, <host>=flinkworker001, <task_name>=unique-views-per-day -> Map -> Sink: late-views, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 6 2018-06-26 15:23:20.515 [delayed-cleanup-queue-items -> Map -> Sink: delete-counters-cassandra-sink (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric added with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,delayed-cleanup-queue-items -> Map -> Sink: delete-counters-cassandra-sink,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=627d91b17d0d23202e3f0f00f27e584b, <task_attempt_id>=2347d3b9b2f4174dae124b2a7c8a759b, <host>=flinkworker001, <task_name>=delayed-cleanup-queue-items -> Map -> Sink: delete-counters-cassandra-sink, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 7 2018-06-26 15:27:35.337 [buffered-unique-views-per-day -> Sink: cassandra-sink (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric removed with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,buffered-unique-views-per-day -> Sink: cassandra-sink,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=59fde8bd4b4482cfb22134e20521abd9, <task_attempt_id>=0eb245932abda4866c85d7d06da71f3c, <host>=flinkworker001, <task_name>=buffered-unique-views-per-day -> Sink: cassandra-sink, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 8 2018-06-26 15:27:35.344 [delayed-cleanup-queue-items -> Map -> Sink: delete-counters-cassandra-sink (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric removed with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,delayed-cleanup-queue-items -> Map -> Sink: delete-counters-cassandra-sink,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=627d91b17d0d23202e3f0f00f27e584b, <task_attempt_id>=2347d3b9b2f4174dae124b2a7c8a759b, <host>=flinkworker001, <task_name>=delayed-cleanup-queue-items -> Map -> Sink: delete-counters-cassandra-sink, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 7 2018-06-26 15:27:35.356 [unique-views-per-day -> Map -> Sink: late-views (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric removed with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,unique-views-per-day -> Map -> Sink: late-views,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=04c3abd3d7d44bb11ea475260ffec7a2, <task_attempt_id>=d29bc78d9a3bfe6bba7fa44071daf780, <host>=flinkworker001, <task_name>=unique-views-per-day -> Map -> Sink: late-views, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 6 2018-06-26 15:27:35.361 [Source: kafka-source-ad-deletes -> cleanup-queue-items (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric removed with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: kafka-source-ad-deletes -> cleanup-queue-items,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=8daabebbb8d55449bdc9d59d2d0e3677, <task_attempt_id>=6f6bdbbc9f6e88c71f792e6cfaa4b579, <host>=flinkworker001, <task_name>=Source: kafka-source-ad-deletes -> cleanup-queue-items, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 5 2018-06-26 15:27:35.366 [Source: kafka-source-reaction-views -> filtered-views -> views (1/1)] INFO org.apache.flink.metrics.prometheus.PrometheusReporter - Metric removed with metricName inputQueueLength and scopedMetricName flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope components flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: kafka-source-reaction-views -> filtered-views -> views,0,buffers and metric group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, <task_id>=590884759e085b7988e6295889a2fffe, <task_attempt_id>=0de7367b4b0415b7c8c79bb73fe429b3, <host>=flinkworker001, <task_name>=Source: kafka-source-reaction-views -> filtered-views -> views, <task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, <tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 4