Thanks for keeping this discussion going, Etienne. I can help investigate
what it would take to add support for the Dataflow runner. I've filed
BEAM-3926 [1] to track this.

Is there a @ValidatesRunner integration test [2] that can be used to verify
that the functionality has been correctly implemented for a new runner?

[1] https://issues.apache.org/jira/browse/BEAM-3926
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
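
To make the question concrete, here is a rough sketch of what I'd expect
such a test to look like. This is only a sketch: the TestMetricsSink
assertion at the end and the way a test sink would be configured via
pipelineOptions are hypothetical, since that wiring is exactly what I'd
like to verify.

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;
    import org.junit.runner.RunWith;
    import org.junit.runners.JUnit4;

    @RunWith(JUnit4.class)
    public class MetricsPusherValidationTest {
      @Rule public final transient TestPipeline pipeline = TestPipeline.create();

      @Category(ValidatesRunner.class)
      @Test
      public void testAttemptedMetricsArePushed() throws Exception {
        pipeline
            .apply(Create.of(1L, 2L, 3L))
            .apply(
                ParDo.of(
                    new DoFn<Long, Long>() {
                      // Beam user metric that the pusher should export.
                      private final Counter counter =
                          Metrics.counter("test", "elements");

                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        counter.inc();
                        c.output(c.element());
                      }
                    }));
        pipeline.run().waitUntilFinish();
        // Hypothetical: assert that the sink configured via pipelineOptions
        // received the final attempted value of the counter.
        // assertEquals(3L, TestMetricsSink.getCounterValue("elements"));
      }
    }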


On Mon, Mar 26, 2018 at 8:54 AM Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi Etienne,
>
> as we might want to keep the runners consistent on such a feature, I
> think it makes sense to have this in the Dataflow runner.
>
> Especially since, if it's not used by end users, there's no impact on the
> runner.
>
> So, +1 to add MetricsPusher to the Dataflow runner.
>
> My $0.01
>
> Regards
> JB
>
> On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> > Hi guys,
> > As part of the work below, I need the help of the Google Dataflow
> > engine maintainers:
> >
> > AFAIK, Dataflow being a cloud-hosted engine, the related runner is very
> > different from the others: it just submits a job to the cloud-hosted
> > engine, so there is no access to the metrics containers, etc., from the
> > runner. So I think that the MetricsPusher (the component responsible
> > for merging metrics and pushing them to a sink backend) must not be
> > instantiated in DataflowRunner; otherwise it would be more of a client
> > (driver) piece of code, and we would lose all the benefit of being
> > close to the execution engine (among other things, instrumentation of
> > the execution of the pipelines).
> > *I think that the MetricsPusher needs to be instantiated in the actual
> > Dataflow engine.*
> >
> > As a side note, a Java implementation of the MetricsPusher is available
> > in the PR, in runners-core.
> >
> > We need a serialized (language-agnostic) version of the MetricsPusher
> > that SDKs will generate out of the user pipelineOptions and that will
> > be sent to runners for instantiation.
> >
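> > To make that concrete, here is a hypothetical sketch (not code from the
> > PR) of the kind of payload an SDK could build from the user
> > pipelineOptions and send to the engine for instantiation:
> >
> >     import java.io.Serializable;
> >
> >     // Hypothetical, illustrative shape only: a language-agnostic
> >     // description of the MetricsPusher that the engine would use to
> >     // instantiate it on its side.
> >     public class MetricsPusherConfig implements Serializable {
> >       public final String sinkClassName; // e.g. an HTTP metrics sink
> >       public final String sinkUrl;       // backend to push metrics to
> >       public final long periodSeconds;   // push frequency
> >
> >       public MetricsPusherConfig(
> >           String sinkClassName, String sinkUrl, long periodSeconds) {
> >         this.sinkClassName = sinkClassName;
> >         this.sinkUrl = sinkUrl;
> >         this.periodSeconds = periodSeconds;
> >       }
> >     }
> >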
> > *Are you guys willing to instantiate it in the Dataflow engine? I think
> > it is important that Dataflow is wired up, in addition to Flink and
> > Spark, for this new feature to ramp up (adding it to other runners) in
> > Beam.*
> > *If you agree, can someone from Google help with that?*
> >
> > Thanks
> > Etienne
> >
> >
> > On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
> >>
> >> Hi all,
> >>
> >> Just to let you know that I have just submitted the PR [1]:
> >>
> >> This PR adds the MetricsPusher discussed in scenario 3.b of this
> >> document [2]. It merges and pushes Beam metrics at a configurable (via
> >> pipelineOptions) frequency to a configurable sink. By default the sink
> >> is a DummySink, which is also useful for tests. There is also an
> >> HttpMetricsSink available that pushes the JSON-serialized metrics
> >> results to an HTTP backend. In the future we could imagine many Beam
> >> sinks that write to Ganglia, Graphite, ... Please note that these are
> >> not IOs, because the feature needs to be transparent to the user.
> >>
> >> The pusher only supports attempted metrics for now.
> >>
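> >> For those who haven't opened the PR, the sink contract is essentially
> >> a single callback taking a snapshot of the queried metrics (simplified
> >> sketch; the exact signatures in the PR may differ slightly):
> >>
> >>     import org.apache.beam.sdk.metrics.MetricQueryResults;
> >>
> >>     // Simplified sketch of the sink contract: the pusher periodically
> >>     // queries the pipeline metrics and hands the snapshot to the
> >>     // configured sink (DummySink, HttpMetricsSink, and in the future
> >>     // Graphite, Ganglia, ...).
> >>     public interface MetricsSink {
> >>       void writeMetrics(MetricQueryResults metrics) throws Exception;
> >>     }
> >>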
> >> This feature is hosted in the runners-core module (as discussed in the
> >> doc) so that it can be used by all the runners. I have wired it up
> >> with the Spark and Flink runners (this part was quite painful :) ).
> >>
> >> If the PR is merged, I hope that runner experts will wire this feature
> >> up in the other runners. Your help is needed, guys :)
> >>
> >> Also, there is nothing related to portability in this PR for now.
> >>
> >> Best!
> >>
> >> Etienne
> >>
> >> [1] https://github.com/apache/beam/pull/4548
> >>
> >> [2] https://s.apache.org/runner_independent_metrics_extraction
> >>
> >>
> >> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I sketched a little doc [1] about this subject. It sums up the
> >>> differences between the runners regarding metrics extraction and
> >>> proposes some possible designs for a runner-agnostic extraction of
> >>> the metrics.
> >>>
> >>> It is a two-page doc; can you please comment on it, and correct it if
> >>> needed?
> >>>
> >>> Thanks
> >>>
> >>> Etienne
> >>>
> >>> [1]: *https://s.apache.org/runner_independent_metrics_extraction*
> >>>
> >>>
> >>> On 27/11/2017 at 18:17, Ben Chambers wrote:
> >>>> I think discussing a runner-agnostic way of configuring how metrics
> >>>> are extracted is a great idea -- thanks for bringing it up, Etienne!
> >>>>
> >>>> Using a thread that polls the pipeline result relies on the program
> >>>> that created and submitted the pipeline continuing to run (e.g., no
> >>>> machine faults, network problems, etc.). For many applications, this
> >>>> isn't a good model (a streaming pipeline may run for weeks, a batch
> >>>> pipeline may be automatically run every hour, etc.).
> >>>>
> >>>> Etienne's proposal of having something the runner pushes metrics to
> >>>> has the benefit of running in the same cluster as the pipeline, and
> >>>> thus has the same reliability benefits.
> >>>>
> >>>> As noted, it would require runners to ensure that metrics were
> >>>> pushed into the extractor, but from there it would allow a general
> >>>> configuration of how metrics are extracted from the pipeline and
> >>>> exposed to some external services.
> >>>>
> >>>> Providing something that the runners could push metrics into and
> >>>> have them automatically exported seems like it would have several
> >>>> benefits:
> >>>>   1. It would provide a single way to configure how metrics are
> >>>> actually exported.
> >>>>   2. It would allow the runners to ensure it was reliably executed.
> >>>>   3. It would allow the runner to report system metrics directly
> >>>> (e.g., if a runner wanted to report the watermark, it could push
> >>>> that in directly).
> >>>>
> >>>> -- Ben
> >>>>
> >>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Etienne forgot to mention that we started a PoC about that.
> >>>>>
> >>>>> What I started is to wrap the Pipeline creation to include a thread
> >>>>> that periodically polls the metrics in the pipeline result (it's
> >>>>> what I proposed when I compared with Karaf Decanter some time ago).
> >>>>> Then, this thread marshalls the collected metrics and sends them to
> >>>>> a sink. In the end, it means that the harvested metrics data will be
> >>>>> stored in a backend (for instance Elasticsearch).
> >>>>>
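> >>>>> In code, the PoC is roughly the following (simplified sketch: the
> >>>>> names are illustrative, and executor shutdown when the pipeline
> >>>>> finishes is omitted); MetricsSink stands for a sink interface like
> >>>>> the one sketched earlier in this thread:
> >>>>>
> >>>>>     import java.util.concurrent.Executors;
> >>>>>     import java.util.concurrent.ScheduledExecutorService;
> >>>>>     import java.util.concurrent.TimeUnit;
> >>>>>     import org.apache.beam.sdk.Pipeline;
> >>>>>     import org.apache.beam.sdk.PipelineResult;
> >>>>>     import org.apache.beam.sdk.metrics.MetricsFilter;
> >>>>>
> >>>>>     // Sketch of the PipelineWithMetric wrapper: run the pipeline
> >>>>>     // and fork a thread that periodically polls
> >>>>>     // PipelineResult#metrics() and hands the snapshot to a sink
> >>>>>     // (e.g. Elasticsearch).
> >>>>>     public class PipelineWithMetric {
> >>>>>       public static PipelineResult run(
> >>>>>           Pipeline pipeline, MetricsSink sink, long periodSeconds) {
> >>>>>         PipelineResult result = pipeline.run();
> >>>>>         ScheduledExecutorService executor =
> >>>>>             Executors.newSingleThreadScheduledExecutor();
> >>>>>         executor.scheduleAtFixedRate(
> >>>>>             () -> {
> >>>>>               try {
> >>>>>                 // Poll all metrics and push the snapshot to the sink.
> >>>>>                 sink.writeMetrics(
> >>>>>                     result.metrics().queryMetrics(
> >>>>>                         MetricsFilter.builder().build()));
> >>>>>               } catch (Exception e) {
> >>>>>                 // PoC behavior: log and keep polling.
> >>>>>                 e.printStackTrace();
> >>>>>               }
> >>>>>             },
> >>>>>             0, periodSeconds, TimeUnit.SECONDS);
> >>>>>         return result;
> >>>>>       }
> >>>>>     }
> >>>>>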
> >>>>> The pro of this approach is that it doesn't require any change in
> >>>>> the core; it's up to the user to use the PipelineWithMetric wrapper.
> >>>>>
> >>>>> The con is that the user needs to explicitly use the
> >>>>> PipelineWithMetric wrapper.
> >>>>>
> >>>>> IMHO, it's good enough, as users can decide to poll metrics for some
> >>>>> pipelines and not for others.
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I came across this ticket:
> >>>>>> https://issues.apache.org/jira/browse/BEAM-2456. I know that the
> >>>>>> metrics subject has already been discussed a lot, but I would like
> >>>>>> to revive the discussion.
> >>>>>>
> >>>>>> The aim in this ticket is to avoid relying on the runner to provide
> >>>>>> the metrics, because the runners don't all have the same
> >>>>>> capabilities regarding metrics. The idea in the ticket is to still
> >>>>>> use the Beam metrics API (and not others like Codahale, as was
> >>>>>> discussed some time ago) and to provide a way to extract the
> >>>>>> metrics with a polling thread that would be forked by a
> >>>>>> PipelineWithMetrics (so, almost invisible to the end user) and then
> >>>>>> push them to a sink (such as an HTTP REST sink, a Graphite sink, or
> >>>>>> anything else...). Nevertheless, a polling thread might not work
> >>>>>> for all the runners, because some might not make the metrics
> >>>>>> available before the end of the pipeline. Also, forking a thread
> >>>>>> would be a bit unconventional, so it could be provided as a Beam
> >>>>>> SDK extension.
> >>>>>>
> >>>>>> Another way, to avoid polling, would be to push metric values to a
> >>>>>> sink when they are updated, but I don't know if that is feasible in
> >>>>>> a runner-independent way.
> >>>>>>
> >>>>>> WDYT about the ideas in this ticket?
> >>>>>>
> >>>>>> Best,
> >>>>>> Etienne
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> [email protected]
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>>
> >>>
> >>
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Got feedback? http://go/swegner-feedback