I remember that another user reported something similar, but he wasn't using the PrometheusReporter. See http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html

We couldn't find the cause, but my suspicion was FLINK-8946, which will be fixed in 1.4.3. You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see whether this actually caused it.

Alternatively, if you can reproduce this, it would be immensely helpful if you could modify the PrometheusReporter and log all notifications about added or removed metrics.
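As a rough idea of what such logging could look like, here is a minimal sketch. It does not depend on Flink: `MetricLifecycleLog` and its method names are hypothetical, and the assumption is that you would call `added(...)` and `removed(...)` from the reporter's `notifyOfAddedMetric` and `notifyOfRemovedMetric` callbacks, passing whatever scoped name the reporter computes for the metric.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Logger;

// Hypothetical helper, not part of Flink or the PrometheusReporter.
// It logs every add/remove notification and remembers which metric
// names were added but never removed.
final class MetricLifecycleLog {
    private static final Logger LOG =
            Logger.getLogger(MetricLifecycleLog.class.getName());

    // Names that have been added and not yet removed.
    private final Set<String> registered = new TreeSet<>();

    // Call from the reporter's "metric added" callback.
    synchronized void added(String scopedName) {
        boolean fresh = registered.add(scopedName);
        LOG.info((fresh ? "ADDED " : "ADDED (duplicate) ") + scopedName);
    }

    // Call from the reporter's "metric removed" callback.
    synchronized void removed(String scopedName) {
        boolean known = registered.remove(scopedName);
        LOG.info((known ? "REMOVED " : "REMOVED (never added) ") + scopedName);
    }

    // Snapshot of metrics that are still registered, e.g. to dump
    // after the job has reached a terminal state.
    synchronized Set<String> stillRegistered() {
        return new TreeSet<>(registered);
    }
}
```

Dumping `stillRegistered()` after a job restart should show whether the TaskIOMetricGroup metrics ever received a removal notification, or whether the notification arrived but the deregistration itself failed.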

On 15.06.2018 15:42, Till Rohrmann wrote:
Hi,

this sounds very strange. I just tried it out locally with a standard metric and the Prometheus metrics seem to be unregistered after the job has reached a terminal state. Thus, it looks as if the standard metrics are properly removed from `CollectorRegistry.defaultRegistry`. Could you check whether the log files contain anything suspicious about a failed metric deregistration, such as `There was a problem unregistering metric`?

I've also pulled in Chesnay who knows more about the metric reporters.

Cheers,
Till

On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com> wrote:

    Hi

    We are using flink-metrics-prometheus for reporting on Apache
    Flink 1.4.2.

    I am looking into an issue where, in some cases, the metrics
    registered by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
    (flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not
    being unregistered in Prometheus when a job restarts.

    Eventually this seems to cause a java.lang.NoClassDefFoundError:
    org/apache/kafka/common/metrics/stats/Rate$1 error when a new
    version of the job is deployed, because the jar file
    in /tmp/blobStore-foo/job_bar/blob_p-baz-qux has been removed upon
    deployment of the new job, but the URL classloader still points to
    it and cannot find stats/Rate$1 (a synthetic class generated by
    the Java compiler because the code switches on an enum).

    Has anybody come across this issue? Has it possibly been fixed in
    1.5? Can somebody give any pointers as to where to look to tackle this?

    The attached screenshot shows which classloader cannot be garbage
    collected, along with its GC root.

