Great work on debugging this; you're exactly right.
The children we add to the collector have to be removed individually
when a metric is unregistered.
If the collector is an io.prometheus.client.Gauge we can use the
#remove() method. For histograms we will have to modify our
HistogramSummaryProxy class to allow removing individual histograms.
I've filed FLINK-9665 <https://issues.apache.org/jira/browse/FLINK-9665>.
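Not knowing the exact shape of HistogramSummaryProxy, here is a minimal sketch of the idea; the class and method names below are hypothetical stand-ins, not the real Flink or Prometheus classes. The point is simply to key each child histogram by its scoped name so a single one can be dropped on unregister, analogous to Gauge#remove on the Prometheus side:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for HistogramSummaryProxy: instead of only
// supporting bulk removal, it tracks each child histogram by its
// scoped metric name so a single child can be removed on unregister.
class HistogramProxySketch {
    private final Map<String, Object> histogramsByName = new ConcurrentHashMap<>();

    void addChild(String scopedName, Object histogram) {
        histogramsByName.put(scopedName, histogram);
    }

    // Analogous to Gauge#remove(labelValues...) in the Prometheus client.
    boolean removeChild(String scopedName) {
        return histogramsByName.remove(scopedName) != null;
    }

    int size() {
        return histogramsByName.size();
    }
}
```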
On 26.06.2018 17:28, jelmer wrote:
Hi Chesnay, sorry for the late reply; I did not have time to look into
this sooner.
I did what you suggested and added some logging to the PrometheusReporter,
like this:
https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a
And deployed a custom build of the reporter to our test environment.
I managed to reproduce the issue like this:
1. Deploy job A: it lands on worker 1.
2. Deploy job B: it lands on worker 1; take note of the job id.
3. Redeploy job B by canceling it with a savepoint and deploying it
again from the savepoint: it lands on worker 3.
4. On worker 1, execute curl -s http://localhost:9249/metrics | grep "job id from
step 2". The metrics are still exposed even though the job
is canceled.
I attached a piece of the log to the email. What I notice is that the
two jobs register metrics with the same scoped metric name, in this
case flink_taskmanager_job_task_buffers_inputQueueLength.
The Prometheus reporter seems to use reference counting for the
metrics, and a metric is only removed when its count reaches 0.
Canceling job B lowers the counter by 5, but because job A is still
deployed the count never reaches 0, so the metric never gets unregistered.
Canceling job A as well removes the lingering metrics from the old job B.
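The behaviour described above can be boiled down to a small sketch. This is a simplified model I wrote to illustrate the failure mode, not the actual reporter code: with a single shared counter per scoped name, the collector only disappears once every job using that name is gone.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a reporter that reference-counts collectors by
// scoped metric name. Two jobs registering the same name share one
// counter, so canceling one job leaves the collector (including the
// canceled job's children) registered until the count reaches 0.
class RefCountedRegistrySketch {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void register(String scopedName) {
        refCounts.merge(scopedName, 1, Integer::sum);
    }

    // The collector is only dropped when its count reaches 0.
    void unregister(String scopedName) {
        refCounts.computeIfPresent(scopedName, (k, v) -> v > 1 ? v - 1 : null);
    }

    boolean isExposed(String scopedName) {
        return refCounts.containsKey(scopedName);
    }
}
```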
It seems to me that this is a bug, and that the children that are being
added in notifyOfAddedMetric
<https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163>
should be removed in notifyOfRemovedMetric.
Can you confirm this?
--Jelmer
On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler <ches...@apache.org
<mailto:ches...@apache.org>> wrote:
I remember that another user reported something similar, but he
wasn't using the PrometheusReporter; see
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html
We couldn't find the cause, but my suspicion was FLINK-8946, which
will be fixed in 1.4.3.
You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and
see whether this actually caused it.
Alternatively, if you can reproduce this it would be immensely
helpful if you could modify the PrometheusReporter and log all
notifications about added or removed metrics.
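For illustration, this is roughly the kind of logging meant here, written against a hypothetical minimal reporter interface rather than Flink's actual org.apache.flink.metrics.reporter.MetricReporter (whose notify methods take more parameters): a decorator that logs every add/remove notification before delegating.

```java
// Hypothetical minimal reporter interface for this sketch; the real
// Flink MetricReporter has richer signatures (Metric, name, group).
interface ReporterSketch {
    void notifyOfAddedMetric(String scopedName);
    void notifyOfRemovedMetric(String scopedName);
}

// Decorator that logs every notification before delegating, so the
// add/remove pairing per scoped name can be checked in the logs.
class LoggingReporterSketch implements ReporterSketch {
    private final ReporterSketch delegate;

    LoggingReporterSketch(ReporterSketch delegate) {
        this.delegate = delegate;
    }

    @Override
    public void notifyOfAddedMetric(String scopedName) {
        System.out.println("added metric: " + scopedName);
        delegate.notifyOfAddedMetric(scopedName);
    }

    @Override
    public void notifyOfRemovedMetric(String scopedName) {
        System.out.println("removed metric: " + scopedName);
        delegate.notifyOfRemovedMetric(scopedName);
    }
}
```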
On 15.06.2018 15:42, Till Rohrmann wrote:
Hi,
this sounds very strange. I just tried it out locally with a
standard metric, and the Prometheus metrics seem to be
unregistered after the job has reached a terminal state. Thus, it
looks as if the standard metrics are properly removed from
`CollectorRegistry.defaultRegistry`. Could you check the log
files for anything suspicious about a failed
metric deregistration, à la `There was a problem unregistering
metric`?
I've also pulled in Chesnay who knows more about the metric
reporters.
Cheers,
Till
On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkupe...@gmail.com
<mailto:jkupe...@gmail.com>> wrote:
Hi
We are using flink-metrics-prometheus for reporting on Apache
Flink 1.4.2, and I am looking into an issue where, in
some cases, the metrics registered
by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
(flink_taskmanager_job_task_buffers_outPoolUsage etc.) are
not unregistered from Prometheus after a job restart.
Eventually this seems to cause
a java.lang.NoClassDefFoundError:
org/apache/kafka/common/metrics/stats/Rate$1 when a new
version of the job is deployed: the jar file
in /tmp/blobStore-foo/job_bar/blob_p-baz-qux is removed
upon deployment of the new job, but the URL classloader still
points to it and can no longer find stats/Rate$1 (a synthetic
class generated by the Java compiler because of a switch on an enum).
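For reference, Rate$1 is the kind of synthetic holder class javac emits for a switch over an enum. A minimal demo of the pattern is below; the class and enum names are illustrative, and the exact synthetic class name is compiler-dependent. If the jar backing a live classloader is deleted, loading that synthetic class lazily later fails with NoClassDefFoundError.

```java
// A switch on an enum makes javac emit a synthetic class (typically
// EnumSwitchDemo$1 here) holding an int[] $SwitchMap lookup table.
// That class is loaded lazily, so it can be the first thing to fail
// once the backing jar disappears from disk.
class EnumSwitchDemo {
    enum State { RUNNING, CANCELED }

    static String describe(State s) {
        switch (s) {            // this switch triggers the synthetic class
            case RUNNING:  return "running";
            case CANCELED: return "canceled";
            default:       return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(State.RUNNING));
    }
}
```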
Has anybody come across this issue? Has it possibly been
fixed in 1.5? Can somebody give me any pointers as to where to look
to tackle this?
The attached screenshot shows the classloader that cannot be
garbage collected, along with its GC root.