I remember that another user reported something similar, but he wasn't
using the PrometheusReporter. See
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html
We couldn't find the cause, but my suspicion was FLINK-8946, which will
be fixed in 1.4.3.
You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see
whether the problem goes away, which would tell us whether this was
actually the cause.
Alternatively, if you can reproduce this, it would be immensely helpful
if you could modify the PrometheusReporter to log all notifications
about added or removed metrics.
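
The kind of logging meant here could be sketched roughly as follows. This
is a self-contained stand-in, not the actual PrometheusReporter:
MetricListener and LoggingMetricListener are hypothetical names, and the
callbacks take a plain String scope instead of Flink's MetricGroup so
that the sketch compiles without Flink on the classpath. In a real
reporter the same bookkeeping would go into notifyOfAddedMetric and
notifyOfRemovedMetric.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Simplified stand-in for a reporter's add/remove callbacks (assumption, not Flink's interface). */
interface MetricListener {
    void notifyOfAddedMetric(Object metric, String metricName, String scope);
    void notifyOfRemovedMetric(Object metric, String metricName, String scope);
}

/** Logs every notification and tracks which metrics were added but never removed. */
class LoggingMetricListener implements MetricListener {
    // Number of outstanding registrations per fully scoped metric name.
    private final Map<String, Integer> liveCounts = new TreeMap<>();
    private final List<String> events = new ArrayList<>();

    @Override
    public synchronized void notifyOfAddedMetric(Object metric, String metricName, String scope) {
        String key = scope + "." + metricName;
        liveCounts.merge(key, 1, Integer::sum);
        record("ADDED", key, metric);
    }

    @Override
    public synchronized void notifyOfRemovedMetric(Object metric, String metricName, String scope) {
        String key = scope + "." + metricName;
        // Decrement the count and drop the entry once it reaches zero.
        liveCounts.computeIfPresent(key, (k, n) -> n > 1 ? n - 1 : null);
        record("REMOVED", key, metric);
    }

    private void record(String action, String key, Object metric) {
        // The classloader is logged because the reported leak is a classloader leak.
        String line = action + " " + key + " (" + metric.getClass().getName()
                + ", loader=" + metric.getClass().getClassLoader() + ")";
        events.add(line);
        System.out.println(line);
    }

    /** Snapshot of all logged notifications, in order. */
    public synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    /** Names that have been added more often than removed. */
    public synchronized Set<String> stillRegistered() {
        return new TreeSet<>(liveCounts.keySet());
    }
}
```

Once the job has reached a terminal state, anything still returned by
stillRegistered() would be a candidate for a metric that was never
unregistered.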
On 15.06.2018 15:42, Till Rohrmann wrote:
Hi,
this sounds very strange. I just tried it out locally with a
standard metric, and the Prometheus metrics seem to be unregistered
after the job has reached a terminal state. Thus, it looks as if the
standard metrics are properly removed from
`CollectorRegistry.defaultRegistry`. Could you check whether the log
files contain anything suspicious about a failed metric
deregistration a la `There was a problem unregistering metric`?
I've also pulled in Chesnay, who knows more about the metric reporters.
Cheers,
Till
On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com
<mailto:jkupe...@gmail.com>> wrote:
Hi,
We are using flink-metrics-prometheus for reporting on Apache
Flink 1.4.2.
I am looking into an issue where it seems that in some cases the
metrics registered
by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
(flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not
being unregistered in Prometheus when a job restarts.
Eventually this seems to cause a java.lang.NoClassDefFoundError:
org/apache/kafka/common/metrics/stats/Rate$1 error when a new
version of the job is deployed. The jar file
in /tmp/blobStore-foo/job_bar/blob_p-baz-qux has been removed upon
deployment of the new job, but the URL classloader still points to
it and can no longer find stats/Rate$1 (a synthetic class
generated by the Java compiler because the code switches on an enum).
Has anybody come across this issue? Has it possibly been fixed in
1.5? Can somebody give any pointers as to where to look to tackle this?
The attached screenshot shows the classloader that cannot be garbage
collected, together with its GC root.