Part of the problem here is that whether attempted or committed is "what
matters" depends on the metric. If you're counting the number of RPCs to an
external service, you may want all the attempts (to predict your bill). If
you're tracking how long those RPCs took, you may want just the committed
values (e.g., what is the best-case time to execute your pipeline). This is
essentially Luke's case of wanting one or the other.
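
For concreteness, a rough sketch of how a caller might read both values
(method names follow the MetricResult API in the PR linked below; the
namespace/name and filter setup here are just illustrative):

    MetricQueryResults results = pipelineResult.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named("rpc", "calls"))
            .build());
    for (MetricResult<Long> counter : results.counters()) {
      long attempted = counter.attempted();  // every attempt, retries included
      long committed = counter.committed();  // only attempts in committed bundles
    }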

Regarding the step names -- the metrics are reported using the full step
name, which is made unique during graph construction. So "My
Composite/ParDo(MyDoFn)" or "My Composite/ParDo(MyDoFn)2" if there are
multiple instances within the same composite. Specifically -- the names are
made unique prior to recording metrics, so there are no double counts.
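
For example, a query pinned to one specific instance would use the full
unique name (a sketch, assuming MetricsFilter's step filter as in the PR):

    MetricsFilter filter = MetricsFilter.builder()
        .addStep("My Composite/ParDo(MyDoFn)2")  // second instance in the composite
        .build();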

On Thu, Jan 19, 2017 at 1:57 PM Amit Sela <amitsel...@gmail.com> wrote:

> I think Luke's example is interesting, but I wonder how common it is/would
> be? I'd expect failures to happen, but not at a rate so dramatic that it
> would be interesting to track at the application level (I'd expect the
> runner/cluster to properly monitor the uptime of processes/nodes
> separately). And even if it is useful, I can't think of other use cases.
>
> I thought the idea was to "declare" the Metrics guarantee level in the
> query API, but the more I think about it the more I tend to let it go for
> the following reasons:
>
>    - Setting aside Luke's example, I think users would prefer the best
>    guarantee a runner can provide. And on that note, I'd expect a
>    "getMetrics" API and not have to figure out guarantees.
>    - Programmatic querying would "break" (UnsupportedOperationException)
>    portability if a program that was running on a runner that supports
>    committed() tried to execute on a runner that only supports attempted()
>    (see the sketch after this list).
>    - I know that portability is for the Pipeline and this is post-execution,
>    but still, call it a 25% portability issue ;-) .
>    - According to the Capability Matrix, all runners fail to provide a
>    "commit" guarantee for Aggregators. I can only speak for Spark, where
>    supporting the Metrics API relies on the same underlying mechanism, so
>    nothing will change. I wonder about other runners -- does anyone plan to
>    support "commit" guarantees for Metrics soon? Having said that, I'm not
>    sure this is a good reason not to have this as a placeholder.
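>
> A minimal sketch of that hazard (assuming committed() throws
> UnsupportedOperationException on runners that only support attempted(), as
> discussed in the PR):
>
>    long value;
>    try {
>      value = counter.committed();
>    } catch (UnsupportedOperationException e) {
>      // Runner only tracks attempted values; fall back to those.
>      value = counter.attempted();
>    }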
>
> Another question for querying Metrics -- querying by step could be a bit
> tricky. A runner is expected to keep unique names/ids for steps, but users
> have to be aware of this; I'd suspect some won't be, and if they use the
> same ParDo in a couple of places and query it, it might be confusing for
> them to see "double counts" they didn't intend.
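>
> To illustrate (assuming, per the PR, that a step filter matches any step
> whose name contains the given sub-path):
>
>    MetricQueryResults results = result.metrics().queryMetrics(
>        MetricsFilter.builder().addStep("ParDo(MyDoFn)").build());
>    // One MetricResult per unique step name, e.g.:
>    //   "First Composite/ParDo(MyDoFn)"  -> 10
>    //   "Other Composite/ParDo(MyDoFn)"  -> 7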
>
> Amit.
>
> On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers <bchamb...@google.com.invalid>
> wrote:
>
> > Thanks for starting the discussion! I'm going to hold off saying what I
> > think and instead just provide some background and additional questions,
> > because I want to see where the discussion goes.
> >
> > When I first suggested the API for querying metrics I was adding it for
> > parity with aggregators. A good first question might be: does the pipeline
> > result even need query methods? Runners could add them as necessary based
> > on the levels of querying they support.
> >
> > The other desire was to make the accuracy clear. One implementation path
> > was reporting metrics directly from the workers while attempting work.
> > This can overcount when retrying, and may undercount the actual attempts
> > if a worker lost connectivity before reporting.
> >
> > Another implementation was something like a side output where the counts
> > are committed as part of each bundle's results, and then aggregated. This
> > committed value is more accurate and represents the value that occurred
> > along the success path of the pipeline.
> >
> > I suspect there are other possible implementations, so trying to make an
> > API that expresses all of them is difficult. So:
> >
> > 1. Does pipeline result need to support querying (which is useful for
> > programmatic consumption), or are metrics intended only to get values out
> > of a pipeline and into some metrics store?
> >
> > 2. How should pipeline results indicate the different kinds of metrics?
> > What if a runner supported multiple kinds (e.g., the runner reports both
> > attempted and committed results)? As Luke mentions, it may be useful to
> > look at both to understand how much retries affected the value.
> >
> > On Thu, Jan 19, 2017, 1:42 AM Aviem Zur <aviem...@gmail.com> wrote:
> >
> > Hi all,
> >
> > While working on the implementation of the metrics API in the Spark
> > runner, the question of committed vs. attempted results has come up,
> > sparking (no pun intended) an interesting conversation.
> > (See API: MetricResult
> > <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricResult.java>
> > and discussion: PR #1750 <https://github.com/apache/beam/pull/1750>)
> >
> > The separation of `attempted` and `committed` metric results seems a bit
> > unclear.
> >
> > Seeing that current implementations of aggregators in the different
> > runners do not guarantee correctness, one could assume that the metrics
> > API implementations will provide the same guarantees.
> >
> > If this is correct, then you could assume that only `attempted()` metric
> > results can be fulfilled. Would it then be better to just have a single
> > method such as `get()` in the API, and have the guarantees of each runner
> > explained in the capability matrix / documentation?
> >
>
