Hi Etienne,

As we might want to keep the runners consistent on such a feature, I think it makes sense to have this in the Dataflow runner.
In particular, if it's not used by end users, there's no impact on the runner. So, +1 to add the MetricsPusher in the Dataflow runner.

My $0.01

Regards
JB

On 03/26/2018 05:29 PM, Etienne Chauchot wrote:
> Hi guys,
>
> As part of the work below I need the help of the Google Dataflow engine maintainers.
>
> AFAIK, Dataflow being a cloud-hosted engine, the related runner is very different from the others: it just submits a job to the cloud-hosted engine, so there is no access to the metrics containers etc. from the runner. So I think that the MetricsPusher (the component responsible for merging metrics and pushing them to a sink backend) must not be instantiated in DataflowRunner; otherwise it would be more of a client (driver) piece of code and we would lose all the benefit of being close to the execution engine (among other things, instrumentation of the execution of the pipelines).
> *I think that the MetricsPusher needs to be instantiated in the actual Dataflow engine.*
>
> As a side note, a Java implementation of the MetricsPusher is available in the PR, in runner-core.
>
> We need a serialized (language-agnostic) version of the MetricsPusher that the SDKs will generate out of the user pipelineOptions and that will be sent to the runners for instantiation.
>
> *Are you guys willing to instantiate it in the Dataflow engine? I think it is important that Dataflow is wired up in addition to Flink and Spark for this new feature to ramp up (adding it to other runners) in Beam.*
> *If you agree, can someone from Google help on that?*
>
> Thanks
> Etienne
>
>
> On Wednesday, January 31, 2018 at 14:01 +0100, Etienne Chauchot wrote:
>>
>> Hi all,
>>
>> Just to let you know that I have just submitted the PR [1].
>>
>> This PR adds the MetricsPusher discussed in document [2], scenario 3.b. It merges and pushes Beam metrics at a configurable (via pipelineOptions) frequency to a configurable sink. By default the sink is a DummySink, which is also useful for tests. There is also an HttpMetricsSink available that pushes the JSON-serialized metrics results to an HTTP backend. We could imagine many Beam sinks in the future to write to Ganglia, Graphite, ... Please note that these are not IOs, because the feature needs to be transparent to the user.
>>
>> The pusher only supports attempted metrics for now.
>>
>> This feature is hosted in the runner-core module (as discussed in the doc) so that it can be used by all the runners. I have wired it up with the Spark and Flink runners (this part was quite painful :) )
>>
>> If the PR is merged, I hope that runner experts will wire this feature up in the other runners. Your help is needed, guys :)
>>
>> Besides, there is nothing related to portability in this PR for now.
>>
>> Best!
>>
>> Etienne
>>
>> [1] https://github.com/apache/beam/pull/4548
>>
>> [2] https://s.apache.org/runner_independent_metrics_extraction
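>>
>> To make the configuration side concrete, here is a rough sketch of what the user-facing wiring could look like. The interface, option names, defaults and URL below are illustrative only, not the exact API of the PR:
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.options.Default;
>> import org.apache.beam.sdk.options.Description;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>
>> public class MetricsPusherConfigDemo {
>>
>>   /** Illustrative options; the real interface is defined in the PR. */
>>   public interface PushMetricsOptions extends PipelineOptions {
>>     @Description("Period, in seconds, between two pushes of the merged metrics")
>>     @Default.Long(5L)
>>     Long getMetricsPushPeriod();
>>     void setMetricsPushPeriod(Long period);
>>
>>     @Description("URL of the HTTP backend that receives the JSON-serialized metrics")
>>     @Default.String("")
>>     String getMetricsHttpSinkUrl();
>>     void setMetricsHttpSinkUrl(String url);
>>   }
>>
>>   public static void main(String[] args) {
>>     PushMetricsOptions options = PipelineOptionsFactory.as(PushMetricsOptions.class);
>>     options.setMetricsPushPeriod(10L);
>>     options.setMetricsHttpSinkUrl("http://metrics-backend:8080/beam");
>>
>>     Pipeline pipeline = Pipeline.create(options);
>>     // ... apply transforms here; when the pipeline runs, the runner reads these
>>     // options, instantiates the MetricsPusher and pushes the merged metrics to
>>     // the configured sink at the configured period.
>>     pipeline.run();
>>   }
>> }
>>
>> The pusher itself lives in runner-core, so the same options are meant to drive any runner that wires the feature up.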
>>
>> On 11/12/2017 at 17:33, Etienne Chauchot wrote:
>>>
>>> Hi all,
>>>
>>> I sketched a little doc [1] about this subject. It tries to sum up the differences between the runners regarding metrics extraction and proposes some possible designs for a runner-agnostic extraction of the metrics.
>>>
>>> It is a two-page doc; can you please comment on it, and correct it if needed?
>>>
>>> Thanks
>>>
>>> Etienne
>>>
>>> [1]: https://s.apache.org/runner_independent_metrics_extraction
>>>
>>>
>>> On 27/11/2017 at 18:17, Ben Chambers wrote:
>>>> I think discussing a runner-agnostic way of configuring how metrics are extracted is a great idea -- thanks for bringing it up, Etienne!
>>>>
>>>> Using a thread that polls the pipeline result relies on the program that created and submitted the pipeline continuing to run (e.g., no machine faults, network problems, etc.). For many applications, this isn't a good model (a streaming pipeline may run for weeks, a batch pipeline may be run automatically every hour, etc.).
>>>>
>>>> Etienne's proposal of having something the runner pushes metrics into has the benefit of running in the same cluster as the pipeline, and thus the same reliability benefits.
>>>>
>>>> As noted, it would require runners to ensure that metrics are pushed into the extractor, but from there it would allow a general configuration of how metrics are extracted from the pipeline and exposed to external services.
>>>>
>>>> Providing something that the runners could push metrics into and have them automatically exported seems like it would have several benefits:
>>>> 1. It would provide a single way to configure how metrics are actually exported.
>>>> 2. It would allow the runners to ensure it is reliably executed.
>>>> 3. It would allow the runner to report system metrics directly (e.g., if a runner wanted to report the watermark, it could push that in directly).
>>>>
>>>> -- Ben
>>>>
>>>> On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Etienne forgot to mention that we started a PoC about that.
>>>>>
>>>>> What I started is to wrap the Pipeline creation to include a thread that periodically polls the metrics in the pipeline result (it's what I proposed when I compared with Karaf Decanter some time ago).
>>>>> Then, this thread marshalls the collected metrics and sends them to a sink. In the end, it means that the harvested metrics data will be stored in a backend (for instance, Elasticsearch).
>>>>>
>>>>> The pro of this approach is that it doesn't require any change in the core; it's up to the user to use the PipelineWithMetric wrapper.
>>>>>
>>>>> The con is that the user needs to explicitly use the PipelineWithMetric wrapper.
>>>>>
>>>>> IMHO, it's good enough, as users can decide to poll metrics for some pipelines and not for others.
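>>>>>
>>>>> To give an idea of the shape of such a wrapper, here is a very rough sketch (this is not the actual PoC code; the class name, the scheduling choices and the sendToBackend placeholder are illustrative only):
>>>>>
>>>>> import java.util.concurrent.Executors;
>>>>> import java.util.concurrent.ScheduledExecutorService;
>>>>> import java.util.concurrent.TimeUnit;
>>>>> import org.apache.beam.sdk.Pipeline;
>>>>> import org.apache.beam.sdk.PipelineResult;
>>>>> import org.apache.beam.sdk.metrics.MetricQueryResults;
>>>>> import org.apache.beam.sdk.metrics.MetricsFilter;
>>>>>
>>>>> /** Wraps a pipeline and periodically polls its metrics once it is running. */
>>>>> public class PipelineWithMetricSketch {
>>>>>
>>>>>   private final Pipeline pipeline;
>>>>>   private final long pollPeriodSeconds;
>>>>>
>>>>>   public PipelineWithMetricSketch(Pipeline pipeline, long pollPeriodSeconds) {
>>>>>     this.pipeline = pipeline;
>>>>>     this.pollPeriodSeconds = pollPeriodSeconds;
>>>>>   }
>>>>>
>>>>>   public PipelineResult run() {
>>>>>     final PipelineResult result = pipeline.run();
>>>>>     // Fork a thread that polls the metrics of the running pipeline at a fixed rate
>>>>>     // (a real implementation would also stop the scheduler when the pipeline finishes).
>>>>>     ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
>>>>>     scheduler.scheduleAtFixedRate(
>>>>>         () -> pushMetrics(result), 0, pollPeriodSeconds, TimeUnit.SECONDS);
>>>>>     return result;
>>>>>   }
>>>>>
>>>>>   private void pushMetrics(PipelineResult result) {
>>>>>     // Query all the (attempted) metrics from the pipeline result...
>>>>>     MetricQueryResults metrics =
>>>>>         result.metrics().queryMetrics(MetricsFilter.builder().build());
>>>>>     // ...then marshall them (JSON, for instance) and send them to the sink/backend.
>>>>>     sendToBackend(metrics);
>>>>>   }
>>>>>
>>>>>   private void sendToBackend(MetricQueryResults metrics) {
>>>>>     // Placeholder: just print; a sink would serialize and ship the results
>>>>>     // to a backend such as Elasticsearch.
>>>>>     System.out.println(metrics);
>>>>>   }
>>>>> }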
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/27/2017 04:56 PM, Etienne Chauchot wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I came across this ticket: https://issues.apache.org/jira/browse/BEAM-2456. I know that the metrics subject has already been discussed a lot, but I would like to revive the discussion.
>>>>>>
>>>>>> The aim in this ticket is to avoid relying on the runner to provide the metrics, because the runners don't all have the same capabilities regarding metrics. The idea in the ticket is to still use the Beam metrics API (and not others like Codahale, as was discussed some time ago) and to provide a way to extract the metrics with a polling thread that would be forked by a PipelineWithMetrics (so, almost invisible to the end user) and then push them to a sink (such as an HTTP REST sink, a Graphite sink, or anything else...). Nevertheless, a polling thread might not work for all the runners, because some might not make the metrics available before the end of the pipeline. Also, forking a thread would be a bit unconventional, so it could be provided as a Beam SDK extension.
>>>>>>
>>>>>> Another way, to avoid polling, would be to push metric values to a sink when they are updated, but I don't know if that is feasible in a runner-independent way.
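>>>>>>
>>>>>> Whatever the extraction mechanism (polling or push on update), the sink side could stay very small. A purely hypothetical sketch of such a contract (the interface and class names are made up for illustration):
>>>>>>
>>>>>> import org.apache.beam.sdk.metrics.MetricQueryResults;
>>>>>>
>>>>>> /** Hypothetical contract for a sink that metrics snapshots are pushed into. */
>>>>>> public interface MetricsPushSink {
>>>>>>   /** Called each time a fresh snapshot of the metrics is available. */
>>>>>>   void writeMetrics(MetricQueryResults metrics) throws Exception;
>>>>>> }
>>>>>>
>>>>>> /** Trivial example implementation that just dumps the snapshot to stdout. */
>>>>>> class ConsoleMetricsPushSink implements MetricsPushSink {
>>>>>>   @Override
>>>>>>   public void writeMetrics(MetricQueryResults metrics) {
>>>>>>     System.out.println(metrics);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Concrete implementations (HTTP REST, Graphite, ...) would then be selected and configured through the pipeline options.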
>>>>>>
>>>>>> WDYT about the ideas in this ticket?
>>>>>>
>>>>>> Best,
>>>>>> Etienne
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> [email protected]
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
>>
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com