Thanks for keeping this discussion going, Etienne. I can help investigate what it would take to add support for the Dataflow runner. I've filed BEAM-3926 to track it [1].

Is there a @ValidatesRunner integration test [2] that can be used to verify that the functionality has been correctly implemented for a new runner?

[1] https://issues.apache.org/jira/browse/BEAM-3926
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
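For reference, runner validation tests in Beam are ordinary JUnit tests tagged with @Category(ValidatesRunner.class) that each runner's validation suite picks up. A rough sketch of what a counter test in that style looks like is below; it is only an illustration of the pattern (simplified from the existing metrics tests), not the exact test that would gate BEAM-3926:

    import static org.hamcrest.MatcherAssert.assertThat;
    import static org.hamcrest.Matchers.equalTo;

    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.MetricNameFilter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.metrics.MetricsFilter;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class AttemptedMetricsIT {

      @Rule public final transient TestPipeline pipeline = TestPipeline.create();

      // Static DoFn so it does not capture the (non-serializable) test instance.
      static class CountingFn extends DoFn<Integer, Integer> {
        private final Counter elements = Metrics.counter("ns", "elements");

        @ProcessElement
        public void processElement(ProcessContext c) {
          elements.inc();
          c.output(c.element());
        }
      }

      @Category(ValidatesRunner.class)
      @Test
      public void attemptedCounterIsReported() {
        pipeline.apply(Create.of(1, 2, 3)).apply(ParDo.of(new CountingFn()));

        PipelineResult result = pipeline.run();
        result.waitUntilFinish();

        // Query the counter back through the standard metrics API and check
        // the attempted value reported by the runner under test.
        MetricQueryResults metrics =
            result.metrics().queryMetrics(
                MetricsFilter.builder()
                    .addNameFilter(MetricNameFilter.named("ns", "elements"))
                    .build());
        for (MetricResult<Long> counter : metrics.getCounters()) {
          assertThat(counter.getAttempted(), equalTo(3L));
        }
      }
    }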
On Mon, Mar 26, 2018 at 8:54 AM Jean-Baptiste Onofré <[email protected]> wrote:
> Hi Etienne,
>
> As we might want to keep the runners consistent on such a feature, I think it
> makes sense to have this in the Dataflow runner.
>
> Especially since, if it's not used by end users, there's no impact on the runner.
>
> So, +1 to add the MetricsPusher to the Dataflow runner.
>
> My $0.01
>
> Regards
> JB
>
> On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> > Hi guys,
> >
> > As part of the work below I need the help of the Google Dataflow engine maintainers:
> >
> > AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different
> > from the others: it just submits a job to the cloud-hosted engine, so there is no
> > access to the metrics containers, etc., from the runner. So I think that the
> > MetricsPusher (the component responsible for merging metrics and pushing them to a
> > sink backend) must not be instantiated in DataflowRunner, otherwise it would be
> > more of a client (driver) piece of code and we would lose all the benefit of being
> > close to the execution engine (among other things, instrumentation of the
> > execution of the pipelines).
> > *I think that the MetricsPusher needs to be instantiated in the actual Dataflow
> > engine.*
> >
> > As a side note, a Java implementation of the MetricsPusher is available in the PR
> > in runner-core.
> >
> > We need a serialized (language-agnostic) version of the MetricsPusher that SDKs
> > will generate out of the user pipelineOptions and that will be sent to runners
> > for instantiation.
> >
> > *Are you guys willing to instantiate it in the Dataflow engine? I think it is
> > important that Dataflow is wired up in addition to Flink and Spark for this new
> > feature to ramp up (adding it to other runners) in Beam.*
> > *If you agree, can someone from Google help on that?*
> >
> > Thanks
> > Etienne
> >
> > Le mercredi 31 janvier 2018 à 14:01 +0100, Etienne Chauchot a écrit :
> >> Hi all,
> >>
> >> Just to let you know that I have just submitted the PR [1]:
> >>
> >> This PR adds a MetricsPusher, discussed in this document [2] in scenario 3.b.
> >> It merges and pushes Beam metrics at a configurable frequency (via
> >> pipelineOptions) to a configurable sink. By default the sink is a DummySink,
> >> which is also useful for tests. There is also an HttpMetricsSink available that
> >> pushes the JSON-serialized metrics results to an HTTP backend. We could imagine
> >> in the future many Beam sinks to write to Ganglia, Graphite, ... Please note
> >> that these are not IOs because the feature needed to be transparent to the user.
> >>
> >> The pusher only supports attempted metrics for now.
> >>
> >> This feature is hosted in the runner-core module (as discussed in the doc) to
> >> be used by all the runners. I have wired it up with the Spark and Flink runners
> >> (this part was quite painful :) )
> >>
> >> If the PR is merged, I hope that runner experts will wire this feature up in
> >> the other runners. Your help is needed, guys :)
> >>
> >> Besides, there is nothing related to portability for now in this PR.
> >>
> >> Best!
> >>
> >> Etienne
> >>
> >> [1] https://github.com/apache/beam/pull/4548
> >> [2] https://s.apache.org/runner_independent_metrics_extraction
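To make the sink idea above concrete, here is a rough sketch of what a custom sink pushing counters to Graphite could look like. The MetricsSink interface shown here (a single writeMetrics method taking the queried results) is an assumption inferred from the description above, not necessarily the exact API in the PR, and the Graphite host, port and metric prefix are made up:

    import java.io.PrintWriter;
    import java.net.Socket;
    import java.time.Instant;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;

    // Hypothetical interface inferred from the PR description; not the actual API.
    interface MetricsSink {
      void writeMetrics(MetricQueryResults metricQueryResults) throws Exception;
    }

    // Hypothetical sink writing counters using the plaintext Graphite protocol.
    public class GraphiteMetricsSink implements MetricsSink {
      private final String host = "graphite.example.com";  // made-up endpoint
      private final int port = 2003;

      @Override
      public void writeMetrics(MetricQueryResults results) throws Exception {
        long now = Instant.now().getEpochSecond();
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
          for (MetricResult<Long> counter : results.getCounters()) {
            // Plaintext Graphite protocol: "<metric path> <value> <timestamp>\n"
            out.printf("beam.%s %d %d%n", counter.getName(), counter.getAttempted(), now);
          }
        }
      }
    }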
> >>
> >> Le 11/12/2017 à 17:33, Etienne Chauchot a écrit :
> >>> Hi all,
> >>>
> >>> I sketched a little doc [1] about this subject. It tries to sum up the
> >>> differences between the runners regarding metrics extraction and proposes some
> >>> possible designs for a runner-agnostic extraction of the metrics.
> >>>
> >>> It is a two-page doc; can you please comment on it, and correct it if needed?
> >>>
> >>> Thanks
> >>>
> >>> Etienne
> >>>
> >>> [1]: *https://s.apache.org/runner_independent_metrics_extraction*
> >>>
> >>> Le 27/11/2017 à 18:17, Ben Chambers a écrit :
> >>>> I think discussing a runner-agnostic way of configuring how metrics are
> >>>> extracted is a great idea -- thanks for bringing it up, Etienne!
> >>>>
> >>>> Using a thread that polls the pipeline result relies on the program that
> >>>> created and submitted the pipeline continuing to run (e.g., no machine
> >>>> faults, network problems, etc.). For many applications, this isn't a good
> >>>> model (a streaming pipeline may run for weeks, a batch pipeline may be
> >>>> automatically run every hour, etc.).
> >>>>
> >>>> Etienne's proposal of having something the runner pushes metrics to has
> >>>> the benefit of running in the same cluster as the pipeline, thus having the
> >>>> same reliability benefits.
> >>>>
> >>>> As noted, it would require runners to ensure that metrics were pushed into
> >>>> the extractor, but from there it would allow a general configuration of how
> >>>> metrics are extracted from the pipeline and exposed to external services.
> >>>>
> >>>> Providing something that the runners could push metrics into and have them
> >>>> automatically exported seems like it would have several benefits:
> >>>> 1. It would provide a single way to configure how metrics are actually exported.
> >>>> 2. It would allow the runners to ensure it was reliably executed.
> >>>> 3. It would allow the runner to report system metrics directly (e.g., if a
> >>>> runner wanted to report the watermark, it could push that in directly).
> >>>>
> >>>> -- Ben
> >>>>
> >>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <[email protected]> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Etienne forgot to mention that we started a PoC about that.
> >>>>>
> >>>>> What I started is to wrap the Pipeline creation to include a thread that
> >>>>> periodically polls the metrics in the pipeline result (it's what I proposed
> >>>>> when I compared with Karaf Decanter some time ago).
> >>>>> Then, this thread marshals the collected metrics and sends them to a sink.
> >>>>> At the end, it means that the harvested metrics data will be stored in a
> >>>>> backend (for instance Elasticsearch).
> >>>>>
> >>>>> The pro of this approach is that it doesn't require any change in the core;
> >>>>> it's up to the user to use the PipelineWithMetric wrapper.
> >>>>>
> >>>>> The con is that the user needs to explicitly use the PipelineWithMetric
> >>>>> wrapper.
> >>>>>
> >>>>> IMHO, it's good enough, as the user can decide to poll metrics for some
> >>>>> pipelines and not for others.
> >>>>>
> >>>>> Regards
> >>>>> JB
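For readers who want to picture the PoC, a wrapper along the following lines could fork the polling thread JB describes. PipelineWithMetrics is a placeholder name (not the actual PoC class), the MetricsSink type is the hypothetical interface sketched earlier, and only PipelineResult.metrics() and MetricsFilter are standard Beam APIs. As Ben notes above, this only keeps working while the submitting process stays alive:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    // Placeholder illustration of the polling-wrapper idea, not the PoC code.
    public class PipelineWithMetrics {

      private final Pipeline pipeline;
      private final MetricsSink sink;        // hypothetical interface from the earlier sketch
      private final long periodSeconds;

      public PipelineWithMetrics(Pipeline pipeline, MetricsSink sink, long periodSeconds) {
        this.pipeline = pipeline;
        this.sink = sink;
        this.periodSeconds = periodSeconds;
      }

      public PipelineResult run() {
        PipelineResult result = pipeline.run();
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(
            () -> {
              try {
                // Query all metrics the runner currently reports and push them to the sink.
                MetricQueryResults metrics =
                    result.metrics().queryMetrics(MetricsFilter.builder().build());
                sink.writeMetrics(metrics);
              } catch (Exception e) {
                // Keep polling even if one push fails.
                e.printStackTrace();
              }
            },
            0, periodSeconds, TimeUnit.SECONDS);
        return result;
      }
    }

Usage would then be roughly new PipelineWithMetrics(pipeline, new GraphiteMetricsSink(), 30).run(), leaving pipelines that don't opt in untouched.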
> >>>>>
> >>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I came across this ticket: https://issues.apache.org/jira/browse/BEAM-2456. I
> >>>>>> know that the metrics subject has already been discussed a lot, but I would
> >>>>>> like to revive the discussion.
> >>>>>>
> >>>>>> The aim in this ticket is to avoid relying on the runner to provide the
> >>>>>> metrics, because runners do not all have the same capabilities regarding
> >>>>>> metrics. The idea in the ticket is to still use the Beam metrics API (and not
> >>>>>> others like Codahale, as was discussed some time ago) and to provide a way to
> >>>>>> extract the metrics with a polling thread that would be forked by a
> >>>>>> PipelineWithMetrics (so, almost invisible to the end user) and then to push
> >>>>>> them to a sink (such as an HTTP REST sink, a Graphite sink, or anything
> >>>>>> else...). Nevertheless, a polling thread might not work for all the runners
> >>>>>> because some might not make the metrics available before the end of the
> >>>>>> pipeline. Also, forking a thread would be a bit unconventional, so it could be
> >>>>>> provided as a Beam SDK extension.
> >>>>>>
> >>>>>> Another way, to avoid polling, would be to push metrics values to a sink when
> >>>>>> they are updated, but I don't know if that is feasible in a runner-independent
> >>>>>> way.
> >>>>>>
> >>>>>> WDYT about the ideas in this ticket?
> >>>>>>
> >>>>>> Best,
> >>>>>> Etienne
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> [email protected]
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
