Hi Chesnay, sorry for the late reply. I did not have time to look into this
sooner

I did what you suggested. Added some logging to the PrometheusReporter like
this :

https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a

And deployed a custom build of the reporter to our test environment.

I managed to reproduce the issue like this

1. Deploy job A : it lands on worker 1
2. Deploy job B : it lands on worker 1, take note of the job id
3. Redeploy job b by canceling it from a savepoint and deploying it again
from the savepoint : it lands on worker 3
4. Execute curl -s http://localhost:9249/metrics | grep "job id from step
2" on worker 1. The metrics are still exposed even though the job is
canceled

I attached a piece of the log to the email. What I notice is that the two
jobs register metrics with the same scoped metric name. In this case
flink_taskmanager_job_task_buffers_inputQueueLength.

The prometheus exporter seems to use reference counting for the metrics and
the metrics will only be removed when the count is 0, canceling job B will
lower the counter by 5 but because job A still is deployed the count is not
1 so the metric never gets unregistered

Canceling job A will remove the lingering metrics from the old job B

It seems to me that this is a bug and that the childs that are being added
in notifyOfAddedMetric
<https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163>
should be removed in notifyOfRemovedMetric

Can you confirm this ?


--Jelmer



On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler <ches...@apache.org> wrote:

> I remember that another user reported something similar, but he wasn't
> using the PrometheusReporter. see
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html
>
> We couldn't find the cause, but my suspicion was FLINK-8946 which will be
> fixed in 1.4.3 .
> You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see
> whether this actually caused it.
>
> Alternatively, if you can reproduce this it would be immensely helpful if
> you could modify the PrometheusReporter and log all notifications about
> added or removed metrics.
>
> On 15.06.2018 15:42, Till Rohrmann wrote:
>
> Hi,
>
> this sounds very strange. I just tried it out locally with with a standard
> metric and the Prometheus metrics seem to be unregistered after the job has
> reached a terminal state. Thus, it looks as if the standard metrics are
> properly removed from `CollectorRegistry.defaultRegistry`. Could you check
> the log files whether they contain anything suspicious about a failed
> metric deregistration a la `There was a problem unregistering metric`?
>
> I've also pulled in Chesnay who knows more about the metric reporters.
>
> Cheers,
> Till
>
> On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com> wrote:
>
>> Hi
>>
>> We are using flink-metrics-prometheus for reporting on apache flink 1.4.2
>>
>> And I am looking into an issue where it seems that somehow in some cases
>> the metrics registered
>> by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
>> (flink_taskmanager_job_task_buffers_outPoolUsage etc)  are not being
>> unregistered in prometheus in case of a job restart
>>
>> Eventually this seems to cause a java.lang.NoClassDefFoundError:
>> org/apache/kafka/common/metrics/stats/Rate$1 error when a new version of
>> the job is deployed  because the jar file
>> in /tmp/blobStore-foo/job_bar/blob_p-baz-qux has been removed upon
>> deployment of the new job but the url classloader still points to it and it
>> cannot find stats/Rate$1 (some synthetically generated code generated by
>> the java compiler because its a switch on an enum)
>>
>> Has anybody come across this issue ? Has it possibly been fixed in 1.5 ?
>> Can somebody any pointers as to where to look to tackle this ?
>>
>> Attached screenshot shows what classloader that cannot be garbage
>> collected with the gc root
>>
>>
>
2018-06-26 15:21:41.602 [marshalled-with-bot-header-for-dev -> Sink: 
kafka-sink-for-dev (1/1)] INFO  
org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric added with 
metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,odin-router-v2,marshalled-with-bot-header-for-dev -> 
Sink: kafka-sink-for-dev,0,buffers and metric group variables 
{<job_id>=58f009823d425ec30a95fcab9b91929f, 
<task_id>=cc2765d0460902a818200cb341f809c2, 
<task_attempt_id>=d00c5dc5565d99849c4d996057cdf99f, <host>=flinkworker001, 
<task_name>=marshalled-with-bot-header-for-dev -> Sink: kafka-sink-for-dev, 
<task_attempt_num>=0, <job_name>=odin-router-v2, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 0
2018-06-26 15:21:41.605 [bots-for-dev -> Map -> Sink: bots-sink-for-dev (1/1)] 
INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric added 
with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components flinkworker001,taskmanager,odin-router-v2,bots-for-dev -> Map -> 
Sink: bots-sink-for-dev,0,buffers and metric group variables 
{<job_id>=58f009823d425ec30a95fcab9b91929f, 
<task_id>=83df09636bb350560b5f98563cc74f57, 
<task_attempt_id>=ffa5f9398bde80c4e85d12038cdb9b69, <host>=flinkworker001, 
<task_name>=bots-for-dev -> Map -> Sink: bots-sink-for-dev, 
<task_attempt_num>=0, <job_name>=odin-router-v2, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 1
2018-06-26 15:21:41.612 [Source: kafka-source-for-dev -> marshalling-for-dev -> 
(Flat Map, pii-and-nonpii-for-dev) (1/1)] INFO  
org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric added with 
metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components flinkworker001,taskmanager,odin-router-v2,Source: 
kafka-source-for-dev -> marshalling-for-dev -> (Flat Map, 
pii-and-nonpii-for-dev),0,buffers and metric group variables 
{<job_id>=58f009823d425ec30a95fcab9b91929f, 
<task_id>=e88427a3f2292fd83a881281ca32e317, 
<task_attempt_id>=19880799af3abd1ddd396aac1f2e4f4e, <host>=flinkworker001, 
<task_name>=Source: kafka-source-for-dev -> marshalling-for-dev -> (Flat Map, 
pii-and-nonpii-for-dev), <task_attempt_num>=0, <job_name>=odin-router-v2, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 2
2018-06-26 15:23:20.502 [Source: kafka-source-reaction-views -> filtered-views 
-> views (1/1)] INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - 
Metric added with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: 
kafka-source-reaction-views -> filtered-views -> views,0,buffers and metric 
group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=590884759e085b7988e6295889a2fffe, 
<task_attempt_id>=0de7367b4b0415b7c8c79bb73fe429b3, <host>=flinkworker001, 
<task_name>=Source: kafka-source-reaction-views -> filtered-views -> views, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 3
2018-06-26 15:23:20.507 [Source: kafka-source-ad-deletes -> cleanup-queue-items 
(1/1)] INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric 
added with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: 
kafka-source-ad-deletes -> cleanup-queue-items,0,buffers and metric group 
variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=8daabebbb8d55449bdc9d59d2d0e3677, 
<task_attempt_id>=6f6bdbbc9f6e88c71f792e6cfaa4b579, <host>=flinkworker001, 
<task_name>=Source: kafka-source-ad-deletes -> cleanup-queue-items, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 4
2018-06-26 15:23:20.511 [buffered-unique-views-per-day -> Sink: cassandra-sink 
(1/1)] INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric 
added with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,buffered-unique-views-per-day
 -> Sink: cassandra-sink,0,buffers and metric group variables 
{<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=59fde8bd4b4482cfb22134e20521abd9, 
<task_attempt_id>=0eb245932abda4866c85d7d06da71f3c, <host>=flinkworker001, 
<task_name>=buffered-unique-views-per-day -> Sink: cassandra-sink, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 5
2018-06-26 15:23:20.513 [unique-views-per-day -> Map -> Sink: late-views (1/1)] 
INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric added 
with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,unique-views-per-day
 -> Map -> Sink: late-views,0,buffers and metric group variables 
{<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=04c3abd3d7d44bb11ea475260ffec7a2, 
<task_attempt_id>=d29bc78d9a3bfe6bba7fa44071daf780, <host>=flinkworker001, 
<task_name>=unique-views-per-day -> Map -> Sink: late-views, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 6
2018-06-26 15:23:20.515 [delayed-cleanup-queue-items -> Map -> Sink: 
delete-counters-cassandra-sink (1/1)] INFO  
org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric added with 
metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,delayed-cleanup-queue-items
 -> Map -> Sink: delete-counters-cassandra-sink,0,buffers and metric group 
variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=627d91b17d0d23202e3f0f00f27e584b, 
<task_attempt_id>=2347d3b9b2f4174dae124b2a7c8a759b, <host>=flinkworker001, 
<task_name>=delayed-cleanup-queue-items -> Map -> Sink: 
delete-counters-cassandra-sink, <task_attempt_num>=0, 
<job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 7
2018-06-26 15:27:35.337 [buffered-unique-views-per-day -> Sink: cassandra-sink 
(1/1)] INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric 
removed with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,buffered-unique-views-per-day
 -> Sink: cassandra-sink,0,buffers and metric group variables 
{<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=59fde8bd4b4482cfb22134e20521abd9, 
<task_attempt_id>=0eb245932abda4866c85d7d06da71f3c, <host>=flinkworker001, 
<task_name>=buffered-unique-views-per-day -> Sink: cassandra-sink, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 8
2018-06-26 15:27:35.344 [delayed-cleanup-queue-items -> Map -> Sink: 
delete-counters-cassandra-sink (1/1)] INFO  
org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric removed with 
metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,delayed-cleanup-queue-items
 -> Map -> Sink: delete-counters-cassandra-sink,0,buffers and metric group 
variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=627d91b17d0d23202e3f0f00f27e584b, 
<task_attempt_id>=2347d3b9b2f4174dae124b2a7c8a759b, <host>=flinkworker001, 
<task_name>=delayed-cleanup-queue-items -> Map -> Sink: 
delete-counters-cassandra-sink, <task_attempt_num>=0, 
<job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 7
2018-06-26 15:27:35.356 [unique-views-per-day -> Map -> Sink: late-views (1/1)] 
INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric removed 
with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,unique-views-per-day
 -> Map -> Sink: late-views,0,buffers and metric group variables 
{<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=04c3abd3d7d44bb11ea475260ffec7a2, 
<task_attempt_id>=d29bc78d9a3bfe6bba7fa44071daf780, <host>=flinkworker001, 
<task_name>=unique-views-per-day -> Map -> Sink: late-views, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 6
2018-06-26 15:27:35.361 [Source: kafka-source-ad-deletes -> cleanup-queue-items 
(1/1)] INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - Metric 
removed with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: 
kafka-source-ad-deletes -> cleanup-queue-items,0,buffers and metric group 
variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=8daabebbb8d55449bdc9d59d2d0e3677, 
<task_attempt_id>=6f6bdbbc9f6e88c71f792e6cfaa4b579, <host>=flinkworker001, 
<task_name>=Source: kafka-source-ad-deletes -> cleanup-queue-items, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 5
2018-06-26 15:27:35.366 [Source: kafka-source-reaction-views -> filtered-views 
-> views (1/1)] INFO  org.apache.flink.metrics.prometheus.PrometheusReporter  - 
Metric removed with metricName inputQueueLength and scopedMetricName 
flink_taskmanager_job_task_buffers_inputQueueLength and metricGroup scope 
components 
flinkworker001,taskmanager,mim-advertisement-counters-flink-v2-cdata,Source: 
kafka-source-reaction-views -> filtered-views -> views,0,buffers and metric 
group variables {<job_id>=be6b25487ae34283fbe64b4ad0cb1fa9, 
<task_id>=590884759e085b7988e6295889a2fffe, 
<task_attempt_id>=0de7367b4b0415b7c8c79bb73fe429b3, <host>=flinkworker001, 
<task_name>=Source: kafka-source-reaction-views -> filtered-views -> views, 
<task_attempt_num>=0, <job_name>=mim-advertisement-counters-flink-v2-cdata, 
<tm_id>=9fce323ec41b9537e85d6ed52c38cdb2, <subtask_index>=0}. Count is 4

Reply via email to