The config looks OK to me. On the Flink side I cannot find an explanation why only some metrics disappear.
The only explanation I could come up with at the moment is that FLINK-8946 is triggered, all metrics are (officially) unregistered, but the reporter doesn't remove some of them (i.e. all job-related ones). Due to FLINK-8946 no new metrics would be registered after the JM restart, but the old metrics would continue to be reported.
To verify this I would add logging statements to the notifyOfAddedMetric/notifyOfRemovedMetric methods, to check whether Flink attempts to unregister all metrics or only some of them.
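Something along these lines; this is only a sketch (the class name and the report() body are placeholders, not your actual com.newrelic.flink.NewRelicReporter), but the MetricReporter/Scheduled callbacks are the ones Flink calls:

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingReporter implements MetricReporter, Scheduled {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingReporter.class);

    @Override
    public void open(MetricConfig config) {
        // configuration handling of the real reporter would go here
    }

    @Override
    public void close() {
    }

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        // log the fully qualified name, so (re-)registrations after the JM restart show up in the logs
        LOG.info("Added metric {}", group.getMetricIdentifier(metricName));
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        // log removals, to see whether Flink unregisters all metrics or only a subset
        LOG.info("Removed metric {}", group.getMetricIdentifier(metricName));
    }

    @Override
    public void report() {
        // the actual reporting to the backend would go here
    }
}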
On 05.06.2018 02:02, Nikolas Davis wrote:
Fabian,
It does look like it may be related. I'll add a comment. After digging
a bit more I found that the crash and lack of metrics were
precipitated by the JobManager instance crashing and cycling, which
caused the job to restart.
Chesnay,
I didn't see anything interesting in our logs. Our reporter config is
fairly straightforward (I think):
metrics.reporter.nr.class: com.newrelic.flink.NewRelicReporter
metrics.reporter.nr.interval: 60 SECONDS
metrics.reporters: nr
Nik Davis
Software Engineer
New Relic
On Mon, Jun 4, 2018 at 1:56 AM, Chesnay Schepler <ches...@apache.org> wrote:
Can you show us the metrics-related configuration parameters in
flink-conf.yaml?
Please also check the logs for any warnings from the MetricGroup
and MetricRegistry classes.
On 04.06.2018 10:44, Fabian Hueske wrote:
Hi Nik,
Can you have a look at this JIRA ticket [1] and check if it is related to the problems you are facing?
If so, would you mind leaving a comment there?
Thank you,
Fabian
[1] https://issues.apache.org/jira/browse/FLINK-8946
2018-05-31 4:41 GMT+02:00 Nikolas Davis <nda...@newrelic.com>:
We keep track of metrics by using the value of MetricGroup::getMetricIdentifier, which returns the fully qualified metric name. The query that we use to monitor metrics filters for metric IDs that match '%Status.JVM.Memory%'. As long as the new metrics come online via the MetricReporter interface, I think the chart would be continuous; we would just see the old JVM memory metrics cycle into new metrics.
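Roughly, the tracking amounts to something like this (a simplified sketch, not our actual reporter code; the identifier in the comment just shows the default task manager scope as an example):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricGroup;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MetricIndex {

    private final Map<String, Metric> metricsById = new ConcurrentHashMap<>();

    public void add(Metric metric, String metricName, MetricGroup group) {
        // getMetricIdentifier returns the fully qualified name, e.g. something like
        // "<host>.taskmanager.<tm_id>.Status.JVM.Memory.Heap.Used" with the default scope
        metricsById.put(group.getMetricIdentifier(metricName), metric);
    }

    public void remove(Metric metric, String metricName, MetricGroup group) {
        metricsById.remove(group.getMetricIdentifier(metricName));
    }

    public boolean isJvmMemoryMetric(String identifier) {
        // the equivalent of the '%Status.JVM.Memory%' query filter
        return identifier.contains("Status.JVM.Memory");
    }
}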
Nik Davis
Software Engineer
New Relic
On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <aj...@yelp.com> wrote:
How are your metrics dimensionalized/named? Task managers often have UIDs generated for them, and the task ID dimension will change on restart. If you name your metrics based on this task_id, there would be a discontinuity with the old metric.
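If I remember the scope formats correctly, which dimensions end up in the metric name is controlled in flink-conf.yaml; purely as an illustration (the commented defaults are from the docs, the variant below just drops the ephemeral IDs):

# defaults, including the ephemeral <tm_id>:
#   metrics.scope.tm:   <host>.taskmanager.<tm_id>
#   metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
# a variant keyed only on stable dimensions:
metrics.scope.tm: <host>.taskmanager
metrics.scope.task: <host>.taskmanager.<job_name>.<task_name>.<subtask_index>

Note that dropping <tm_id> means two task managers on the same host would report under the same name, so this is only meant to show the mechanism.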
On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <nda...@newrelic.com> wrote:
Howdy,
We are seeing our task manager JVM metrics disappear
over time. This last time we correlated it to our job
crashing and restarting. I wasn't able to grab the
failing exception to share. Any thoughts?
We track metrics through the MetricReporter interface. As far as I can tell this more or less only affects the JVM metrics, i.e. most/all other metrics continue reporting fine as the job is automatically restarted.
Nik Davis
Software Engineer
New Relic