Hi Etienne,

As we might want to keep the runners consistent on such a feature, I think it makes sense to have this in the Dataflow runner.
In particular, if it's not used by end users, there's no impact on the runner. So, +1 to add the MetricsPusher in the Dataflow runner.

My $0.01

Regards
JB

On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> Hi guys,
>
> As part of the work below I need the help of the Google Dataflow engine maintainers.
>
> AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different from the others: it just submits a job to the cloud-hosted engine, so there is no access to the metrics containers etc. from the runner. So I think that the MetricsPusher (the component responsible for merging metrics and pushing them to a sink backend) must not be instantiated in DataflowRunner; otherwise it would be more of a client (driver) piece of code and we would lose all the benefit of being close to the execution engine (among other things, instrumentation of the execution of the pipelines).
> *I think that the MetricsPusher needs to be instantiated in the actual Dataflow engine.*
>
> As a side note, a Java implementation of the MetricsPusher is available in the PR, in runner-core.
>
> We need a serialized (language-agnostic) version of the MetricsPusher that the SDKs will generate out of the user pipelineOptions and that will be sent to the runners for instantiation.
>
> *Are you guys willing to instantiate it in the Dataflow engine? I think it is important that Dataflow is wired up in addition to Flink and Spark for this new feature to ramp up (adding it to other runners) in Beam.*
> *If you agree, can someone from Google help on that?*
>
> Thanks
> Etienne
>
>
> On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
>>
>> Hi all,
>>
>> Just to let you know that I have just submitted the PR [1].
>>
>> This PR adds the MetricsPusher discussed in document [2], scenario 3.b. It merges and pushes Beam metrics at a configurable (via pipelineOptions) frequency to a configurable sink. By default the sink is a DummySink, which is also useful for tests. There is also an HttpMetricsSink available that pushes the JSON-serialized metrics results to an HTTP backend. We could imagine many Beam sinks in the future to write to Ganglia, Graphite, ... Please note that these are not IOs, because the feature needs to be transparent to the user.
>>
>> The pusher only supports attempted metrics for now.
>>
>> This feature is hosted in the runner-core module (as discussed in the doc) so that it can be used by all the runners. I have wired it up with the Spark and Flink runners (this part was quite painful :) )
>>
>> If the PR is merged, I hope that runner experts will wire this feature up in the other runners. Your help is needed, guys :)
>>
>> Besides, there is nothing related to portability in this PR for now.
>>
>> Best!
>>
>> Etienne
>>
>> [1] https://github.com/apache/beam/pull/4548
>>
>> [2] https://s.apache.org/runner_independent_metrics_extraction
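>>
>> To make the configuration side concrete, here is a rough sketch of what the user-facing wiring could look like. The interface, option names, defaults and URL below are illustrative only, not the exact API of the PR:
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.options.Default;
>> import org.apache.beam.sdk.options.Description;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>
>> public class MetricsPusherConfigDemo {
>>
>>   /** Illustrative options; the real interface is defined in the PR. */
>>   public interface PushMetricsOptions extends PipelineOptions {
>>     @Description("Period, in seconds, between two pushes of the merged metrics")
>>     @Default.Long(5L)
>>     Long getMetricsPushPeriod();
>>     void setMetricsPushPeriod(Long period);
>>
>>     @Description("URL of the HTTP backend that receives the JSON-serialized metrics")
>>     @Default.String("")
>>     String getMetricsHttpSinkUrl();
>>     void setMetricsHttpSinkUrl(String url);
>>   }
>>
>>   public static void main(String[] args) {
>>     PushMetricsOptions options = PipelineOptionsFactory.as(PushMetricsOptions.class);
>>     options.setMetricsPushPeriod(10L);
>>     options.setMetricsHttpSinkUrl("http://metrics-backend:8080/beam");
>>
>>     Pipeline pipeline = Pipeline.create(options);
>>     // ... apply transforms here; when the pipeline runs, the runner reads these
>>     // options, instantiates the MetricsPusher and pushes the merged metrics to
>>     // the configured sink at the configured period.
>>     pipeline.run();
>>   }
>> }
>>
>> The pusher itself lives in runner-core, so the same options are meant to drive any runner that wires the feature up.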
>>
>> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
>>>
>>> Hi all,
>>>
>>> I sketched a little doc [1] about this subject. It tries to sum up the differences between the runners regarding metrics extraction and proposes some possible designs for a runner-agnostic extraction of the metrics.
>>>
>>> It is a two-page doc; can you please comment on it, and correct it if needed?
>>>
>>> Thanks
>>>
>>> Etienne
>>>
>>> [1]: https://s.apache.org/runner_independent_metrics_extraction
>>>
>>>
>>> On 27/11/2017 at 18:17, Ben Chambers wrote:
>>>> I think discussing a runner-agnostic way of configuring how metrics are extracted is a great idea -- thanks for bringing it up, Etienne!
>>>>
>>>> Using a thread that polls the pipeline result relies on the program that created and submitted the pipeline continuing to run (e.g., no machine faults, network problems, etc.). For many applications, this isn't a good model (a streaming pipeline may run for weeks, a batch pipeline may be run automatically every hour, etc.).
>>>>
>>>> Etienne's proposal of having something the runner pushes metrics into has the benefit of running in the same cluster as the pipeline, and thus the same reliability benefits.
>>>>
>>>> As noted, it would require runners to ensure that metrics are pushed into the extractor, but from there it would allow a general configuration of how metrics are extracted from the pipeline and exposed to external services.
>>>>
>>>> Providing something that the runners could push metrics into and have them automatically exported seems like it would have several benefits:
>>>> 1. It would provide a single way to configure how metrics are actually exported.
>>>> 2. It would allow the runners to ensure it is reliably executed.
>>>> 3. It would allow the runner to report system metrics directly (e.g., if a runner wanted to report the watermark, it could push that in directly).
>>>>
>>>> -- Ben
>>>>
>>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Etienne forgot to mention that we started a PoC about that.
>>>>>
>>>>> What I started is to wrap the Pipeline creation to include a thread that periodically polls the metrics in the pipeline result (it's what I proposed when I compared with Karaf Decanter some time ago).
>>>>> Then, this thread marshalls the collected metrics and sends them to a sink. In the end, it means that the harvested metrics data will be stored in a backend (for instance, Elasticsearch).
>>>>>
>>>>> The pro of this approach is that it doesn't require any change in the core; it's up to the user to use the PipelineWithMetric wrapper.
>>>>>
>>>>> The con is that the user needs to explicitly use the PipelineWithMetric wrapper.
>>>>>
>>>>> IMHO, it's good enough, as users can decide to poll metrics for some pipelines and not for others.
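>>>>>
>>>>> To give an idea of the shape of such a wrapper, here is a very rough sketch (this is not the actual PoC code; the class name, the scheduling choices and the sendToBackend placeholder are illustrative only):
>>>>>
>>>>> import java.util.concurrent.Executors;
>>>>> import java.util.concurrent.ScheduledExecutorService;
>>>>> import java.util.concurrent.TimeUnit;
>>>>> import org.apache.beam.sdk.Pipeline;
>>>>> import org.apache.beam.sdk.PipelineResult;
>>>>> import org.apache.beam.sdk.metrics.MetricQueryResults;
>>>>> import org.apache.beam.sdk.metrics.MetricsFilter;
>>>>>
>>>>> /** Wraps a pipeline and periodically polls its metrics once it is running. */
>>>>> public class PipelineWithMetricSketch {
>>>>>
>>>>>   private final Pipeline pipeline;
>>>>>   private final long pollPeriodSeconds;
>>>>>
>>>>>   public PipelineWithMetricSketch(Pipeline pipeline, long pollPeriodSeconds) {
>>>>>     this.pipeline = pipeline;
>>>>>     this.pollPeriodSeconds = pollPeriodSeconds;
>>>>>   }
>>>>>
>>>>>   public PipelineResult run() {
>>>>>     final PipelineResult result = pipeline.run();
>>>>>     // Fork a thread that polls the metrics of the running pipeline at a fixed rate
>>>>>     // (a real implementation would also stop the scheduler when the pipeline finishes).
>>>>>     ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
>>>>>     scheduler.scheduleAtFixedRate(
>>>>>         () -> pushMetrics(result), 0, pollPeriodSeconds, TimeUnit.SECONDS);
>>>>>     return result;
>>>>>   }
>>>>>
>>>>>   private void pushMetrics(PipelineResult result) {
>>>>>     // Query all the (attempted) metrics from the pipeline result...
>>>>>     MetricQueryResults metrics =
>>>>>         result.metrics().queryMetrics(MetricsFilter.builder().build());
>>>>>     // ...then marshall them (JSON, for instance) and send them to the sink/backend.
>>>>>     sendToBackend(metrics);
>>>>>   }
>>>>>
>>>>>   private void sendToBackend(MetricQueryResults metrics) {
>>>>>     // Placeholder: just print; a sink would serialize and ship the results
>>>>>     // to a backend such as Elasticsearch.
>>>>>     System.out.println(metrics);
>>>>>   }
>>>>> }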
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I came across this ticket: https://issues.apache.org/jira/browse/BEAM-2456. I know that the metrics subject has already been discussed a lot, but I would like to revive the discussion.
>>>>>>
>>>>>> The aim in this ticket is to avoid relying on the runner to provide the metrics, because the runners don't all have the same capabilities regarding metrics. The idea in the ticket is to still use the Beam metrics API (and not others like Codahale, as was discussed some time ago) and to provide a way to extract the metrics with a polling thread that would be forked by a PipelineWithMetrics (so, almost invisible to the end user) and then push them to a sink (such as an HTTP REST sink, a Graphite sink, or anything else...). Nevertheless, a polling thread might not work for all the runners, because some might not make the metrics available before the end of the pipeline. Also, forking a thread would be a bit unconventional, so it could be provided as a Beam SDK extension.
>>>>>>
>>>>>> Another way, to avoid polling, would be to push metric values to a sink when they are updated, but I don't know if that is feasible in a runner-independent way.
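>>>>>>
>>>>>> Whatever the extraction mechanism (polling or push on update), the sink side could stay very small. A purely hypothetical sketch of such a contract (the interface and class names are made up for illustration):
>>>>>>
>>>>>> import org.apache.beam.sdk.metrics.MetricQueryResults;
>>>>>>
>>>>>> /** Hypothetical contract for a sink that metrics snapshots are pushed into. */
>>>>>> public interface MetricsPushSink {
>>>>>>   /** Called each time a fresh snapshot of the metrics is available. */
>>>>>>   void writeMetrics(MetricQueryResults metrics) throws Exception;
>>>>>> }
>>>>>>
>>>>>> /** Trivial example implementation that just dumps the snapshot to stdout. */
>>>>>> class ConsoleMetricsPushSink implements MetricsPushSink {
>>>>>>   @Override
>>>>>>   public void writeMetrics(MetricQueryResults metrics) {
>>>>>>     System.out.println(metrics);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Concrete implementations (HTTP REST, Graphite, ...) would then be selected and configured through the pipeline options.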
>>>>>>
>>>>>> WDYT about the ideas in this ticket?
>>>>>>
>>>>>> Best,
>>>>>> Etienne
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> [email protected]
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
>>
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com