Hi, this sounds very strange. I just tried it out locally with a standard metric, and the Prometheus metrics are unregistered once the job reaches a terminal state. So it looks as if standard metrics are properly removed from `CollectorRegistry.defaultRegistry`. Could you check the log files for anything suspicious about a failed metric deregistration, e.g. `There was a problem unregistering metric`?
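For reference, the two checks above could look roughly like this on a standalone setup. The log file pattern and the scrape port are assumptions (9249 is the commonly used default port for the Prometheus reporter); adjust both to your installation:

```shell
# Search the TaskManager logs for a failed metric deregistration
# (log path/pattern is illustrative; adjust to your installation).
grep "There was a problem unregistering metric" log/flink-*-taskmanager-*.log

# After the job has reached a terminal state, check whether the task
# I/O metrics are still exposed on the reporter's scrape endpoint
# (port 9249 assumed here; use whatever metrics.reporter port you configured).
curl -s http://localhost:9249/metrics | grep flink_taskmanager_job_task_buffers
```

If the second command still prints metric families for the terminated job, the deregistration did not happen on the Prometheus side.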
I've also pulled in Chesnay, who knows more about the metric reporters.

Cheers,
Till

On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com> wrote:

> Hi
>
> We are using flink-metrics-prometheus for reporting on Apache Flink 1.4.2,
> and I am looking into an issue where, in some cases, the metrics registered
> by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
> (flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not being
> unregistered from Prometheus on a job restart.
>
> Eventually this seems to cause a java.lang.NoClassDefFoundError:
> org/apache/kafka/common/metrics/stats/Rate$1 when a new version of the job
> is deployed: the jar file in /tmp/blobStore-foo/job_bar/blob_p-baz-qux is
> removed upon deployment of the new job, but the URL classloader still
> points to it and can no longer find Rate$1 (a class synthetically generated
> by the Java compiler because of a switch on an enum).
>
> Has anybody come across this issue? Has it possibly been fixed in 1.5?
> Can somebody give any pointers as to where to look to tackle this?
>
> The attached screenshot shows the classloader that cannot be garbage
> collected, together with its GC root.
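For anyone following along, the leak mechanism described in the quoted mail can be reproduced in isolation. This is a minimal sketch with made-up names (`Metric`, `registry`), not Flink or Prometheus code: any long-lived registry that keeps a strong reference to an object tied to the job's URLClassLoader prevents that loader (and the deleted blob jar behind it) from ever being garbage collected.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class ClassLoaderLeakSketch {
    // Stands in for a process-wide registry (like CollectorRegistry.defaultRegistry)
    // that outlives individual jobs.
    static final List<Object> registry = new ArrayList<>();

    // A metric that holds a strong reference to the job's classloader.
    static class Metric {
        final ClassLoader owner;
        Metric(ClassLoader owner) { this.owner = owner; }
    }

    public static void main(String[] args) {
        URLClassLoader jobLoader = new URLClassLoader(new URL[0]);
        registry.add(new Metric(jobLoader)); // registered but never unregistered

        WeakReference<URLClassLoader> ref = new WeakReference<>(jobLoader);
        jobLoader = null; // job is "gone", but the registry still reaches the loader
        System.gc();

        // The loader cannot be collected while the registry entry exists.
        System.out.println("collected: " + (ref.get() == null));
    }
}
```

Removing the stale entry from the registry (i.e. proper metric deregistration) is exactly what breaks this reference chain and lets the loader be collected.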