OK, so, today it's more about metrics reporting at the end of the pipeline
execution? That means we can't really monitor a pipeline while it's running.

Correct?

Regards
JB

On 02/17/2017 07:36 PM, Ben Chambers wrote:
I don't think it is possible for there to be a general mechanism for
pushing metrics out during the execution of a pipeline. The Metrics API
suggests that metrics should be reported both as values across all attempts
and as values across only successful attempts. The latter requires runner
involvement to ensure that a given metric value is atomically incremented
(or checkpointed) when the bundle it was reported in is committed.
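
For reference, a minimal sketch of the two kinds of values, assuming the
Java SDK's Metrics API as it stands today (method names may evolve, and
committed values can be unsupported on some runners):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;

class MetricsExample {
  // A user metric reported from a DoFn; the runner decides when the
  // committed value is updated (i.e. on bundle commit).
  static class CountingFn extends DoFn<String, String> {
    private final Counter elements = Metrics.counter("my-io", "elements-read");

    @ProcessElement
    public void processElement(ProcessContext c) {
      elements.inc();
      c.output(c.element());
    }
  }

  static void printCounters(Pipeline pipeline) {
    PipelineResult result = pipeline.run();
    MetricQueryResults metrics =
        result.metrics().queryMetrics(MetricsFilter.builder().build());
    for (MetricResult<Long> counter : metrics.counters()) {
      // attempted() counts every bundle attempt; committed() covers only
      // the successfully committed ones.
      System.out.println(counter.name() + ": attempted=" + counter.attempted()
          + ", committed=" + counter.committed());
    }
  }
}
```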

Aviem has already implemented Metrics support for the Spark runner. I am
working on support for the Dataflow runner.

On Fri, Feb 17, 2017 at 7:50 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi guys,

As I'm back from vacation, I'm back on this topic ;)

It's a great discussion, and I think the metric coverage for IOs is
good.

However, there's a point that we only discussed briefly in the thread,
and I think it's an important one (maybe more important, in terms of
roadmap, than the metrics actually provided ;)).

Assuming we have pipelines, PTransforms, IOs, ... using the Metric API,
how do we expose the metrics to end users?

A first approach would be to bind a JMX MBean server per pipeline and
expose the metrics via MBeans. I don't think it's a good idea, for the
following reasons:
1. It's not easy to know where the pipeline is actually executed, and
so, not easy to find the MBean server URI.
2. For the same reason, we can have port binding errors.
3. Even if it could work for unbounded/streaming pipelines (as they are
always "running"), it's not really applicable to bounded/batch
pipelines, as their lifetime is "limited" ;)

So, I think the "push" approach is better: during the execution, the
pipeline "internally" collects the metrics and pushes them to a backend.
The "push" could be a kind of sink. For instance, the metric "records"
can be sent to a Kafka topic, or directly to Elasticsearch or whatever.
The metrics backend can then deal with alerting, reporting, etc.

Basically, we have to define two things (see the sketch after this list):
1. The "appender" where the metrics have to be sent (and the
corresponding configuration to connect, like the Kafka or Elasticsearch
location).
2. The format of the metric data (for instance, a JSON format).
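
Purely as an illustration, the SPI could look like the following. All the
names here (MetricRecord, MetricsAppender, the JSON shape) are hypothetical,
not an existing API:

```java
import java.util.Map;

// Hypothetical: a metric record as it would be pushed to a backend.
// A JSON rendering of one record could look like:
// {"timestamp": 1487350000000, "pipeline": "my-pipeline",
//  "step": "KafkaIO.Read", "name": "elements-read", "value": 42}
class MetricRecord {
  long timestamp;
  String pipeline;
  String step;
  String name;
  Object value;
}

// Hypothetical appender contract, analogous to Decanter appenders.
interface MetricsAppender {
  // Called once with connection properties (e.g. Kafka brokers,
  // Elasticsearch URL) taken from the configuration.
  void configure(Map<String, String> properties);

  // Called periodically (or on bundle commit) with fresh metric records,
  // which the appender stores or forwards to its backend.
  void append(Iterable<MetricRecord> records);

  void close();
}
```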

In Apache Karaf, I created something similar named Decanter:

http://blog.nanthrax.net/2015/07/monitoring-and-alerting-with-apache-karaf-decanter/

http://karaf.apache.org/manual/decanter/latest-1/

Decanter provides collectors that harvest the metrics (like JMX MBean
attributes, log messages, ...). For Beam, the collection would basically
be the Metric API already used by the pipeline parts.
The metric records are then sent to a dispatcher, which sends them to an
appender. The appenders store or forward the metric records to a backend
(Elasticsearch, Cassandra, Kafka, JMS, Redis, ...).

I think it would make sense to provide the configuration and the Metric
"appender" via the pipeline options, something like the sketch below.
As it's not really runner-specific, it could be part of the Metric API
(or SPI, in that case).
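
As a hypothetical sketch (the option names are made up, but the
PipelineOptions pattern is the standard one from the Java SDK):

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical options, read by the metrics SPI rather than by any
// particular runner.
public interface MetricsAppenderOptions extends PipelineOptions {
  @Description("Appender used to push metric records, e.g. 'kafka' or 'elasticsearch'")
  String getMetricsAppender();
  void setMetricsAppender(String value);

  @Description("Appender-specific location, e.g. Kafka brokers or an Elasticsearch URL")
  String getMetricsAppenderLocation();
  void setMetricsAppenderLocation(String value);
}
```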

WDYT?

Regards
JB

On 02/15/2017 09:22 AM, Stas Levin wrote:
+1 to making the IO metrics (e.g. producers, consumers) available as part
of the Beam pipeline metrics tree for debugging and visibility.

As it has already been mentioned, many IO clients have a metrics mechanism
in place, so in these cases I think it could be beneficial to mirror their
metrics under the relevant subtree of the Beam metrics tree.
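
As an illustration, mirroring a Kafka consumer's client metrics could look
roughly like this, assuming a Gauge metric type is available in the Metrics
API (the "KafkaIO" namespace and the helper method are made up; the Kafka
client calls are from its public API):

```java
import java.util.Map;
import org.apache.beam.sdk.metrics.Gauge;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

class KafkaMetricsMirror {
  // Inside a Kafka-reading source/DoFn: copy the client's own metrics
  // into the Beam metrics tree under an IO-specific namespace.
  static void mirrorConsumerMetrics(KafkaConsumer<?, ?> consumer) {
    for (Map.Entry<MetricName, ? extends Metric> entry :
        consumer.metrics().entrySet()) {
      MetricName name = entry.getKey();
      // Gauges take long values; Kafka reports doubles, so this truncates.
      Gauge gauge = Metrics.gauge("KafkaIO." + name.group(), name.name());
      gauge.set((long) entry.getValue().value());
    }
  }
}
```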

On Wed, Feb 15, 2017 at 12:04 AM Amit Sela <amitsel...@gmail.com> wrote:

I think this is a great discussion and I'd like to relate to some of the
points raised here, and raise some of my own.

First of all, I think we should be careful here not to cross boundaries.
IOs naturally have many metrics, and Beam should avoid "taking over"
those. IO metrics should focus on what's relevant to the pipeline:
input/output rate, and backlog (for UnboundedSources, where it exists in
bytes, though for monitoring purposes we might want to consider
#messages).

I don't agree that we should skip doing this in Sources/Sinks and go
directly to SplittableDoFn: the IO API is familiar and well known, and as
long as we keep it, it should be treated as a first-class citizen.

As for enable/disable: if IOs focus on pipeline-related metrics, I think
we should be fine, though this could also vary between runners.

Finally, "split metrics" are interesting to consider because, on one
hand, they affect the pipeline directly (unbalanced partitions in Kafka
may cause backlog), but on the other hand this sits on that fine line of
responsibilities (Kafka monitoring would probably be able to tell you
that partitions are not balanced).

My 2 cents, cheers!

On Tue, Feb 14, 2017 at 8:46 PM Raghu Angadi <rang...@google.com.invalid>
wrote:

On Tue, Feb 14, 2017 at 9:21 AM, Ben Chambers <bchamb...@google.com.invalid>
wrote:


* I also think there are data-source-specific metrics that a given IO
will want to expose (i.e., things like Kafka backlog for a topic).


UnboundedSource has an API for backlog. It is better for Beam/runners to
handle backlog as well.
Of course there will be some source-specific metrics too (errors, I/O
ops, etc.).
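
For reference, that API lives on UnboundedSource.UnboundedReader; a sketch
of a reader reporting it (the reader class and field are illustrative, the
overridden method and the BACKLOG_UNKNOWN constant are from the SDK):

```java
import org.apache.beam.sdk.io.UnboundedSource.UnboundedReader;

// Sketch: an unbounded reader reporting its backlog so the runner can
// surface and act on it without any source-specific metric.
abstract class BacklogAwareReader extends UnboundedReader<byte[]> {
  // Updated by the reader as it learns the remaining bytes for its split.
  private long splitBacklogBytes = BACKLOG_UNKNOWN;

  @Override
  public long getSplitBacklogBytes() {
    // Return the latest estimate, or BACKLOG_UNKNOWN if unavailable.
    return splitBacklogBytes;
  }
}
```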




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
