[jira] [Commented] (FLINK-6911) StatsD Metrics name should escape spaces

2017-06-26 Thread David Brinegar (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064103#comment-16064103
 ] 

David Brinegar commented on FLINK-6911:
---

FLINK-7009 tries to address this as well as the metric length issue.  I'm 
curious to get your input.

> StatsD Metrics name should escape spaces 
> -
>
> Key: FLINK-6911
> URL: https://issues.apache.org/jira/browse/FLINK-6911
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.3.0
> Environment: StatsD Metrics with Telegraf server
>Reporter: Chris Dail
>
> The StatsDReporter does not escape spaces in the metric name. It is generally 
> accepted that spaces in the metric name are a bad idea:
> https://stackoverflow.com/questions/29674488/whitespace-in-statsd-metric-name
> It should also be noted that the FlinkStatsDReporter was based on the 
> ReadyTalk StatsD implementation (this is indicated in the comment). Note that 
> the ReadyTalk implementation does replace whitespace:
> https://github.com/ReadyTalk/metrics-statsd/blob/master/metrics-statsd-common/src/main/java/com/readytalk/metrics/StatsD.java#L129
> Specifically, I am integrating with Telegraf. It actually splits the name on 
> spaces and treats these as (name, value, timestamp). It ignores everything 
> except the name.
> https://github.com/influxdata/telegraf/blob/master/plugins/parsers/graphite/parser.go#L225
> Initially I found this issue when I had a space in the job name. Flink 
> encodes the job name into the metrics as is, so when I sent these to 
> Telegraf, all of the job-level metrics ended up in the same bucket in 
> Telegraf.
> Flink also uses things like "Sink- " and "Source- " to encode 
> source/sink. These do not work with Telegraf either. I end up with metrics 
> that look like this inside Telegraf:
> {noformat}
> taskmanager_5e453417d87c755da6311b1940cc602f_TurbineHeatProcessor_examples_turbineHeatTest_Sink-
> {noformat}
> The actual name is truncated after the space.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-6464) Metric name is not stable

2017-06-26 Thread David Brinegar (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064097#comment-16064097
 ] 

David Brinegar commented on FLINK-6464:
---

Nice find!  FLINK-7009 tries to address this by removing the instance IDs, then 
using a hash of the remaining stable part of the string as a compressed metric 
name.  So the above would convert into something like "TriggerWin_abcdef12", 
which is at least the same every time you run the job, and short enough that 
metric systems can handle it without truncation or conversion problems.  In the 
end, though, it is only a shorter, more stable default name, not particularly 
readable in itself.  Thoughts?

> Metric name is not stable
> -
>
> Key: FLINK-6464
> URL: https://issues.apache.org/jira/browse/FLINK-6464
> Project: Flink
>  Issue Type: Bug
>  Components: DataStream API, Metrics
>Affects Versions: 1.2.0
>Reporter: Andrey
>
> Currently, according to the documentation 
> (https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html),
> operator metrics are constructed using the following pattern:
> , 
> For some operators, "operator_name" can contain the output of a default 
> toString implementation. For example:
> {code}
> TriggerWindow(TumblingProcessingTimeWindows(3000), 
> ListStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer@c65792d4},
>  xxx.Trigger@665fe457, WindowedStream.apply(WindowedStream.java:521)) -> 
> Sink: Unnamed
> {code}
> The part "@c65792d4" changes every time the job is restarted or cancelled. 
> As a consequence, it is not possible to store metrics over a long period.
> Expected:
> * ensure all operators return human-readable, non-default names OR
> * change the way TriggerWindow generates its name.





[jira] [Created] (FLINK-7009) dogstatsd mode in statsd reporter

2017-06-26 Thread David Brinegar (JIRA)
David Brinegar created FLINK-7009:
-

 Summary: dogstatsd mode in statsd reporter
 Key: FLINK-7009
 URL: https://issues.apache.org/jira/browse/FLINK-7009
 Project: Flink
  Issue Type: Improvement
  Components: Metrics
Affects Versions: 1.4.0
 Environment: org.apache.flink.metrics.statsd.StatsDReporter
Reporter: David Brinegar
 Fix For: 1.4.0


The current statsd reporter can only report a subset of Flink metrics owing to 
the manner in which Flink variables are handled, mainly because of invalid 
characters and metric names that are too long.  As an option, it would be quite 
useful to have a stricter, dogstatsd-compliant output.  Dogstatsd metrics are 
tagged, should be shorter than 200 characters including tag names and values, 
and should be alphanumeric plus underbar, delimited by periods.  As a further 
pragmatic restriction, negative and other invalid values should be ignored 
rather than sent to the backend.  These restrictions play well with a broad set 
of collectors and time series databases.
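
For readers unfamiliar with the wire format: a DogStatsD datagram carries a 
metric name, a value, a type, and an optional tag list, laid out as 
name:value|type|#tag:value,...  A minimal sketch of emitting Flink variables as 
tags might look like the following.  The class name, method name, metric name, 
and tag values are made up for illustration and are not part of the existing 
StatsDReporter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class DogStatsDLine {

    /** Formats one gauge as name:value|g|#tag:value,... (tags omitted when empty). */
    public static String gauge(String name, long value, Map<String, String> tags) {
        StringBuilder sb = new StringBuilder();
        sb.append(name).append(':').append(value).append("|g");
        if (!tags.isEmpty()) {
            // The tag section starts with "|#" and separates tags with commas.
            StringJoiner joined = new StringJoiner(",", "|#", "");
            for (Map.Entry<String, String> tag : tags.entrySet()) {
                joined.add(tag.getKey() + ":" + tag.getValue());
            }
            sb.append(joined);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical Flink variables reported as tags instead of name segments.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("job_name", "wordcount");
        tags.put("tm_id", "5e453417");
        // prints: taskmanager.numRecordsIn:42|g|#job_name:wordcount,tm_id:5e453417
        System.out.println(gauge("taskmanager.numRecordsIn", 42L, tags));
    }
}
```

Moving the variables into tags keeps the metric name itself short, which is 
what makes the 200 character budget realistic.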

This mode would:

* convert output to ASCII alphanumeric characters with underbar, delimited by 
periods.  Runs of invalid characters within a metric segment would be collapsed 
to a single underbar.
* report all Flink variables as tags
* compress overly long segments, say over 50 chars, to a symbolic 
representation of the metric name, to preserve the unique metric time series 
but avoid downstream truncation
* shorten 32-character Flink IDs such as tm_id, task_id, job_id, and 
task_attempt_id to their first 8 characters, again to preserve enough 
distinction amongst metrics while trimming up to 96 characters from the metric
* remove object references from names, such as the instance hash id of the 
serializer
* drop negative or invalid numeric values such as "n/a", "-1" (used for 
unknowns like JVM.Memory.NonHeap.Max), and "-9223372036854775808" (used for 
unknowns like currentLowWaterMark)
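
As a rough sketch of three of the rules above (collapsing invalid characters, 
shortening IDs, and dropping unusable values), something along these lines 
could work.  The class and method names are hypothetical, and the code assumes 
that Flink IDs are 32 lowercase hex characters, as in the tm_id seen in 
FLINK-6911.

```java
import java.util.regex.Pattern;

public class DogStatsDSanitizer {

    // Any run of characters outside ASCII alphanumerics and underbar.
    private static final Pattern INVALID_RUN = Pattern.compile("[^A-Za-z0-9_]+");
    // Assumption: Flink IDs are exactly 32 lowercase hex characters.
    private static final Pattern FLINK_ID = Pattern.compile("[0-9a-f]{32}");

    /** Collapses each run of invalid characters in a segment into one underbar. */
    public static String sanitizeSegment(String segment) {
        return INVALID_RUN.matcher(segment).replaceAll("_");
    }

    /** Keeps the first 8 characters of a 32-character hex ID; other values pass through. */
    public static String shortenId(String value) {
        return FLINK_ID.matcher(value).matches() ? value.substring(0, 8) : value;
    }

    /** Returns true only for values that parse as finite, non-negative numbers. */
    public static boolean isReportable(Object value) {
        double d;
        if (value instanceof Number) {
            d = ((Number) value).doubleValue();
        } else {
            try {
                d = Double.parseDouble(String.valueOf(value));
            } catch (NumberFormatException e) {
                return false; // e.g. "n/a"
            }
        }
        return !Double.isNaN(d) && !Double.isInfinite(d) && d >= 0;
    }
}
```

With this, sanitizeSegment("Sink: Unnamed") yields "Sink_Unnamed", and both 
-1 and Long.MIN_VALUE are rejected by isReportable.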

With these in place, it becomes quite reasonable to support LatencyGauge 
metrics as well.


One idea for symbolic compression is to take the first 10 valid characters plus 
a hash of the long name.  For example, an operator_name value like this:

{code:java}
TriggerWindow(TumblingProcessingTimeWindows(5000), 
ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer@f3395ffa, 
reduceFunction=org.apache.flink.streaming.examples.socket.SocketWindowWordCount$1@4201c465}, 
ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java-301))
{code}

would first drop the instance references.  The stable version would be:
 
{code:java}
TriggerWindow(TumblingProcessingTimeWindows(5000), 
ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer, 
reduceFunction=org.apache.flink.streaming.examples.socket.SocketWindowWordCount$1}, 
ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java-301))
{code}

and then the compressed name would be the first ten valid characters plus the 
hash of the stable string:

{code}
TriggerWin_d8c007da
{code}
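
The two steps just described (drop instance references, then prefix plus hash) 
could be sketched roughly as below.  This is only an illustration: the class 
and method names are hypothetical, CRC32 is an arbitrary choice of hash since 
none is specified here, and the 50 character threshold is the figure suggested 
earlier in this description.  Its output will therefore not necessarily match 
the example value shown above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.zip.CRC32;

public class MetricNameCompressor {

    // "@" followed by up to 8 hex digits, as produced by a default toString().
    private static final Pattern INSTANCE_REF = Pattern.compile("@[0-9a-f]{1,8}\\b");
    private static final Pattern INVALID = Pattern.compile("[^A-Za-z0-9_]");
    private static final int MAX_SEGMENT_LENGTH = 50; // suggested threshold
    private static final int PREFIX_LENGTH = 10;

    /** Removes instance references so the name is the same on every run. */
    public static String stripInstanceRefs(String name) {
        return INSTANCE_REF.matcher(name).replaceAll("");
    }

    /** Returns the stable name, or prefix_hash when the stable name is too long. */
    public static String compress(String name) {
        String stable = stripInstanceRefs(name);
        if (stable.length() <= MAX_SEGMENT_LENGTH) {
            return stable; // short names are left for the normal sanitizer
        }
        // First ten valid characters of the stable name.
        String valid = INVALID.matcher(stable).replaceAll("");
        String prefix = valid.substring(0, Math.min(PREFIX_LENGTH, valid.length()));
        // Assumption: CRC32 over the UTF-8 bytes, rendered as 8 hex digits.
        CRC32 crc = new CRC32();
        crc.update(stable.getBytes(StandardCharsets.UTF_8));
        return prefix + "_" + String.format("%08x", crc.getValue());
    }
}
```

Because the hash is computed over the stable string rather than the raw name, 
two runs of the same job produce the same compressed name even though the 
serializer instance hashes differ.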

This is just one way of dealing with unruly default names.  The main point is 
to preserve the metrics so that they are valid, avoid truncation, and can be 
aggregated along other dimensions, even if this particular dimension is hard to 
parse after the compression.


