Hi, this sounds very strange. I just tried it out locally with a standard metric, and the Prometheus metrics are unregistered once the job reaches a terminal state. So it looks as if standard metrics are properly removed from `CollectorRegistry.defaultRegistry`. Could you check the log files for anything suspicious about a failed metric deregistration, e.g. `There was a problem unregistering metric`?
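For reference, the two checks above could look roughly like this on a standalone setup. The log file pattern and the scrape port are assumptions (9249 is the commonly used default port for the Prometheus reporter); adjust both to your installation:

```shell
# Search the TaskManager logs for a failed metric deregistration
# (log path/pattern is illustrative; adjust to your installation).
grep "There was a problem unregistering metric" log/flink-*-taskmanager-*.log

# After the job has reached a terminal state, check whether the task
# I/O metrics are still exposed on the reporter's scrape endpoint
# (port 9249 assumed here; use whatever metrics.reporter port you configured).
curl -s http://localhost:9249/metrics | grep flink_taskmanager_job_task_buffers
```

If the second command still prints metric families for the terminated job, the deregistration did not happen on the Prometheus side.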
I've also pulled in Chesnay, who knows more about the metric reporters.

Cheers,
Till

On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com> wrote:

> Hi
>
> We are using flink-metrics-prometheus for reporting on Apache Flink 1.4.2,
> and I am looking into an issue where, in some cases, the metrics registered
> by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
> (flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not being
> unregistered from Prometheus on a job restart.
>
> Eventually this seems to cause a java.lang.NoClassDefFoundError:
> org/apache/kafka/common/metrics/stats/Rate$1 when a new version of the job
> is deployed: the jar file in /tmp/blobStore-foo/job_bar/blob_p-baz-qux is
> removed upon deployment of the new job, but the URL classloader still
> points to it and can no longer find Rate$1 (a class synthetically generated
> by the Java compiler because of a switch on an enum).
>
> Has anybody come across this issue? Has it possibly been fixed in 1.5?
> Can somebody give any pointers as to where to look to tackle this?
>
> The attached screenshot shows the classloader that cannot be garbage
> collected, together with its GC root.
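For anyone following along, the leak mechanism described in the quoted mail can be reproduced in isolation. This is a minimal sketch with made-up names (`Metric`, `registry`), not Flink or Prometheus code: any long-lived registry that keeps a strong reference to an object tied to the job's URLClassLoader prevents that loader (and the deleted blob jar behind it) from ever being garbage collected.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class ClassLoaderLeakSketch {
    // Stands in for a process-wide registry (like CollectorRegistry.defaultRegistry)
    // that outlives individual jobs.
    static final List<Object> registry = new ArrayList<>();

    // A metric that holds a strong reference to the job's classloader.
    static class Metric {
        final ClassLoader owner;
        Metric(ClassLoader owner) { this.owner = owner; }
    }

    public static void main(String[] args) {
        URLClassLoader jobLoader = new URLClassLoader(new URL[0]);
        registry.add(new Metric(jobLoader)); // registered but never unregistered

        WeakReference<URLClassLoader> ref = new WeakReference<>(jobLoader);
        jobLoader = null; // job is "gone", but the registry still reaches the loader
        System.gc();

        // The loader cannot be collected while the registry entry exists.
        System.out.println("collected: " + (ref.get() == null));
    }
}
```

Removing the stale entry from the registry (i.e. proper metric deregistration) is exactly what breaks this reference chain and lets the loader be collected.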