Both your suggestions sound good, would be great to create JIRAs for them.
Could you replace the task scope format with the one below and try again?
metrics.scope.task: flink.tm.<tm_id>.<job_id>.<task_id>.<subtask_index>
This scope doesn't contain any special characters except the periods.
If you receive task metrics with this scope, then there are some other
special characters we need to filter out.
Filtering characters in the StatsDReporter is always a bit icky though,
since it supports many storage backends with different requirements.
The last resort would be to filter out *all* special characters.
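To show what I mean by a last-resort filter, here is a rough sketch (illustrative Python, not the actual StatsDReporter code; the whitelist of allowed characters is my assumption, not what Flink uses):

```python
import re

# Assumed whitelist: letters, digits, dots, dashes, underscores.
# Everything outside it gets replaced with an underscore.
SAFE = re.compile(r"[^A-Za-z0-9._-]")

def sanitize(metric_name):
    """Replace every character outside the whitelist with an underscore."""
    return SAFE.sub("_", metric_name)

print(sanitize("flink.tm.host:1.My Job.Flat Map.0"))
# → flink.tm.host_1.My_Job.Flat_Map.0
```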
On 13.06.2017 13:41, Dail, Christopher wrote:
Responses to your questions:
1. Did this work with the same setup before 1.3?
I have not tested it with another version. I started working on the
metrics stuff with a snapshot of 1.3 and moved to the release.
2. Are all task/operator metrics available in the metrics tab of the
dashboard?
Yes, the metrics are seen from the dashboard.
3. Are there any warnings in the TaskManager logs from the
MetricRegistry or StatsDReporter?
No, I am not seeing any errors in the logs related to metrics.
> My *guess* would be that the operator/task metrics contain characters
that either StatsD or Telegraf don't allow, which causes them to be
dropped.
This was my original thought too. I did find two separate issues with
the metrics Flink outputs and I was planning on filing JIRA tickets on
these. They are:
- Flink does not escape spaces. I had a space in the job name, which
messed up the metrics. I have a workaround for this, but it is
probably something Flink should escape.
- Flink is outputting a value of “n/a” for
lastCheckpointExternalPath. A gauge value needs to be a float, so
Telegraf does not like this. It logs an error and continues, ignoring
that metric.
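For context, the StatsD line format is name:value|type, so both issues are visible in a small sketch (illustrative Python, not Flink's code; the escaping rules here are an assumption):

```python
def escape_name(name):
    """Replace characters that break the StatsD line format (spaces, ':', '|')."""
    for bad in (" ", ":", "|"):
        name = name.replace(bad, "_")
    return name

def format_gauge(name, value):
    """Emit a StatsD gauge line, or None when the value isn't numeric.

    Strings like "n/a" are dropped rather than sent, since a gauge
    value must parse as a number on the receiving side.
    """
    try:
        num = float(value)
    except (TypeError, ValueError):
        return None
    return "%s:%s|g" % (escape_name(name), num)

print(format_gauge("flink.tm.tm1.My Job.numRecordsIn", 42))
# → flink.tm.tm1.My_Job.numRecordsIn:42.0|g
print(format_gauge("lastCheckpointExternalPath", "n/a"))
# → None
```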
Note that even with these accounted for, I am still not seeing the
task/operator metrics. I ran a tcpdump to see exactly what is coming
through. Searching through that dump, I don’t see any of the metrics I
was looking for.
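In case it helps anyone reproduce this, a small UDP listener does the same job as tcpdump for inspecting what the reporter sends (a quick Python sketch; 8125 is just the port from my config below):

```python
import socket

def recv_statsd_lines(sock, count=1):
    """Receive `count` UDP datagrams and return the StatsD lines they carry."""
    lines = []
    for _ in range(count):
        data, _addr = sock.recvfrom(65535)
        lines.extend(data.decode("utf-8", errors="replace").splitlines())
    return lines

# Usage: bind to the reporter's target port and print what arrives, e.g.
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("0.0.0.0", 8125))
#   print(recv_statsd_lines(sock, count=10))
```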
I guess a few things to note. This is the application I am running:
https://github.com/chrisdail/pravega-samples/blob/master/flink-examples/src/main/scala/io/pravega/examples/flink/iot/TurbineHeatProcessor.scala
Also, I am running this in DC/OS 1.9 trying to integrate with DC/OS
metrics.
Thanks
Chris
*From: *Chesnay Schepler <ches...@apache.org>
*Date: *Tuesday, June 13, 2017 at 5:26 AM
*To: *"user@flink.apache.org" <user@flink.apache.org>
*Subject: *Re: Task and Operator Metrics in Flink 1.3
The scopes look OK to me.
Let's try to narrow down the problem areas a bit:
1. Did this work with the same setup before 1.3?
2. Are all task/operator metrics available in the metrics tab of the
dashboard?
3. Are there any warnings in the TaskManager logs from the
MetricRegistry or StatsDReporter?
My *guess* would be that the operator/task metrics contain characters
that either StatsD or Telegraf don't allow, which causes them to be
dropped.
On 12.06.2017 20:32, Dail, Christopher wrote:
I’m using the Flink 1.3.0 release and am not seeing all of the
metrics that I would expect. I have Flink configured to write out
metrics via StatsD, and I am consuming them with Telegraf. Initially
I thought this was an issue with Telegraf parsing the generated data,
but I dumped all of the metrics going into Telegraf using tcpdump and
found that a bunch of data I expected was missing.
I’m using this as a reference for what metrics I expect:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/metrics.html
I see all of the JobManager and TaskManager level metrics. Things
like Status.JVM.* are coming through. The TaskManager Status.Network
metrics are there (but not the Task-level buffer metrics). The
‘Cluster’ metrics are there.
The IO section there covers task- and operator-level metrics (like
what is available on the dashboard). I’m not seeing any of these
metrics coming through when using StatsD.
I’m configuring Flink with this configuration:
metrics.reporters: statsd
metrics.reporter.statsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.statsd.host: hostname
metrics.reporter.statsd.port: 8125
# Customized Scopes
metrics.scope.jm: flink.jm
metrics.scope.jm.job: flink.jm.<job_name>
metrics.scope.tm: flink.tm.<tm_id>
metrics.scope.tm.job: flink.tm.<tm_id>.<job_name>
metrics.scope.task: flink.tm.<tm_id>.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: flink.tm.<tm_id>.<job_name>.<operator_name>.<subtask_index>
I have tried with and without specifically setting the
metrics.scope values.
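To illustrate how those scope patterns expand (a toy substitution, not Flink's actual implementation), note that variables like <job_name> and <task_name> can pull arbitrary characters into the metric name:

```python
def expand_scope(pattern, variables):
    """Substitute <var> placeholders in a metrics.scope pattern (toy version)."""
    for key, value in variables.items():
        pattern = pattern.replace("<%s>" % key, value)
    return pattern

# Example values are made up; a job or task name containing spaces
# ends up verbatim in the emitted metric name.
scope = expand_scope(
    "flink.tm.<tm_id>.<job_name>.<task_name>.<subtask_index>",
    {"tm_id": "tm-1", "job_name": "My Job",
     "task_name": "Flat Map", "subtask_index": "0"},
)
print(scope)  # → flink.tm.tm-1.My Job.Flat Map.0
```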
Is anyone else having similar issues with metrics in 1.3?
Thanks
*Chris Dail*
Director, Software Engineering
*Dell EMC* | Infrastructure Solutions Group
mobile +1 506 863 4675
christopher.d...@dell.com <mailto:christopher.d...@dell.com>