Sure. I've opened a ticket which tries to summarize these final thoughts: https://issues.apache.org/jira/browse/BEAM-1344
On Thu, Jan 26, 2017 at 6:47 PM Ben Chambers <bchamb...@google.com.invalid> wrote:

> I think relaxing the query so it is not an exact match is reasonable. I'm wondering whether it should be substring or regex. Either one preserves the existing behavior of returning only the metrics for a specific step when passed its full step path, but adds the ability to query when you only know the step name approximately.
>
> A) Util or not seems fine to me. If it seems likely to be reusable, let's do so. Preferably in a separate PR or commit.
>
> B) Yes, the direct runner should match whatever semantics we choose.

On Thu, Jan 26, 2017, 5:30 AM Aviem Zur <aviem...@gmail.com> wrote:

> Ben - yes, there is still some ambiguity regarding the querying of the metrics results.
>
> You've discussed in this thread the notion that metrics step names should be converted to unique names when aggregating metrics, so that each step aggregates its own metrics and does not join with other steps by mistake. The way you suggested to query this is that when querying by step, all steps which contain the query string as a substring show up in the results.
>
> This sounds fine; however, it is not how the direct runner is implemented right now. It is an exact match.
>
> In my PR I copied the MetricsResults filtering methods directly from the direct runner and suggested that, since I copied them verbatim, perhaps they should be pulled up to a more central module so that all runners could use them.
>
> So, my remaining questions are:
> A) Should we create a util for filtering MetricsResults based on a query, using the filtering implementation suggested in this thread (substring match)?
> B) Should the direct runner be changed to use such a util and conform with the suggested results filtering?
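To make question A above concrete, here is a rough sketch of what a shared substring-matching util could look like. The class name MetricResultsFiltering and the MetricResult.step() accessor are illustrative assumptions here, not the code from the PR:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.metrics.MetricResult;

    public class MetricResultsFiltering {
      /** Keeps only the results whose (unique) step name contains the query string. */
      public static <T> List<MetricResult<T>> matchStep(
          Iterable<MetricResult<T>> results, String stepQuery) {
        List<MetricResult<T>> matches = new ArrayList<>();
        for (MetricResult<T> result : results) {
          // Substring match: "ParDo(MyDoFn)" matches both "My Composite/ParDo(MyDoFn)"
          // and "My Composite/ParDo(MyDoFn)2"; a full step path still matches only itself.
          if (result.step().contains(stepQuery)) {
            matches.add(result);
          }
        }
        return matches;
      }
    }

Swapping contains() for a regex match would give the regex variant Ben mentions.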
On Tue, Jan 24, 2017 at 2:03 AM Ben Chambers <bchamb...@google.com.invalid> wrote:

> For the short term, it seems reasonable to stay with the existing Query API and allow runners to throw exceptions if a user issues a query that is not supported. It shouldn't affect the ability to run a Beam pipeline on other runners, since the Query API is only exposed *after* the pipeline is run.
>
> For the longer term, it seems like the Query API could merit some more thought, especially if people have good use cases for accessing the value of metrics programmatically from the same program that ran the original pipeline.
>
> Aviem -- is there anything specific that needs to be discussed/nailed down to help with your PR?

On Thu, Jan 19, 2017 at 3:57 PM Ben Chambers <bchamb...@google.com> wrote:

> On Thu, Jan 19, 2017 at 3:28 PM Amit Sela <amitsel...@gmail.com> wrote:
>
> > On Fri, Jan 20, 2017 at 1:18 AM Ben Chambers <bchamb...@google.com.invalid> wrote:
> >
> > > Part of the problem here is that whether attempted or committed is "what matters" depends on the metric. If you're counting the number of RPCs to an external service, you may want all the attempts (to predict your bill). If you're tracking how long those RPCs took, you may want just the committed value (e.g., what is the best-case time to execute your pipeline). This is essentially Luke's case of wanting one or the other.
> >
> > This sounds like Metrics should have a user-defined guarantee level, which might make more sense - Metrics.counter().attempted()/committed() - though this might prove more challenging for runners to implement.
>
> We went away from that because of cases like Luke's, where a user might want to compare the two, or might not even realize there is a difference up front, so declaring ahead of time is difficult. If both are available and can be looked at, and they're the same -- no problem. If they're different, that is a good reason to investigate and figure out the difference.
>
> > > Regarding the step names -- the metrics are reported using the full step name, which is also made unique during the Graph API. So "My Composite/ParDo(MyDoFn)", or "My Composite/ParDo(MyDoFn)2" if there are multiple instances within the same composite. Specifically -- the names are made unique prior to recording metrics, so there are no double counts.
> >
> > But how would the user know that? I'm afraid this could be confusing as a user-facing query API, and I think most users would simply name metrics differently.
>
> The query API can support querying based on a substring of the full name, and return metrics from all steps that match. That would allow the user to query metrics without knowing that. Having unique names for steps is important and useful for many other things (logging, associating time spent executing, etc.).
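As a concrete picture of the substring querying and the attempted/committed comparison just described, here is a sketch of the calling side. Method names such as queryMetrics(), addStep(), counters(), attempted() and committed() are assumptions based on the MetricsFilter and MetricResult classes referenced in this thread and may not match the final API exactly:

    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.MetricNameFilter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    public class QueryMetricsExample {
      static void reportRetries(PipelineResult result) {
        MetricQueryResults metrics = result.metrics().queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.named("myNamespace", "rpcCalls"))
                // Partial step name; with substring matching this returns results for
                // "My Composite/ParDo(MyDoFn)" as well as "My Composite/ParDo(MyDoFn)2".
                .addStep("ParDo(MyDoFn)")
                .build());

        for (MetricResult<Long> counter : metrics.counters()) {
          long attempted = counter.attempted();
          long committed = counter.committed(); // may throw on runners that only support attempted()
          if (attempted != committed) {
            System.out.println(counter.step() + ": " + (attempted - committed)
                + " increments came from retried bundles");
          }
        }
      }
    }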
On Thu, Jan 19, 2017 at 1:57 PM Amit Sela <amitsel...@gmail.com> wrote:

> I think Luke's example is interesting, but I wonder how common it is or would be. I'd expect failures to happen, but not at a rate so dramatic that it would be interesting to follow at the application level (I'd expect the runner/cluster to properly monitor uptime of processes/nodes separately). And even if it is useful, I can't think of other use cases.
>
> I thought the idea was to "declare" the Metrics guarantee level in the query API, but the more I think about it the more I tend to let it go, for the following reasons:
>
> - Setting aside Luke's example, I think users would prefer the best guarantee a runner can provide. And on that note, I'd expect a "getMetrics" API and not have to figure out guarantees.
> - Programmatic querying would "break" (UnsupportedOperationException) portability if a program that was running with a runner that supports committed() tried to execute on a runner that only supports attempted(). I know that portability is for the Pipeline and this is post-execution, but still, call it a 25% portability issue ;-).
> - According to the Capability Matrix, all runners fail to provide a "commit" guarantee for Aggregators. I can only speak for Spark, where supporting the Metrics API relies on the same underlying mechanism, so nothing will change. I wonder about other runners; does anyone plan to support "commit" guarantees for Metrics soon? Having said that, I'm not sure this is a good reason not to have it as a placeholder.
>
> Another question for querying Metrics: querying by step could be a bit tricky, since a runner is expected to keep unique names/ids for steps and users here are expected to be aware of that. I suspect many won't be, and if they use the same ParDo in a couple of places and query it, it might be confusing for them to see "double counts" they didn't intend.
>
> Amit.

On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers <bchamb...@google.com.invalid> wrote:

> Thanks for starting the discussion! I'm going to hold off saying what I think and instead just provide some background and additional questions, because I want to see where the discussion goes.
>
> When I first suggested the API for querying metrics, I was adding it for parity with aggregators. A good first question might be: does the pipeline result even need query methods? Runners could add them as necessary, based on the levels of querying they support.
>
> The other desire was to make the accuracy clear. One implementation path was reporting metrics directly from the workers while attempting work. This can overcount when retrying, and may be under the actual attempts if the worker lost connectivity before reporting.
>
> Another implementation was something like a side output, where the counts are committed as part of each bundle's results and then aggregated. This committed value is more accurate and represents the value that occurred along the success path of the pipeline.
>
> I suspect there are other possible implementations, so trying to make an API that expresses all of them is difficult. So:
>
> 1. Does the pipeline result need to support querying (which is useful for programmatic consumption), or are metrics intended only to get values out of a pipeline and into some metrics store?
>
> 2. How should pipeline results indicate the different kinds of metrics? What if a runner supported multiple kinds (e.g., the runner reports both attempted and committed results)? As Luke mentions, it may be useful to look at both to understand how much retries affected the value.
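For reference, the kind of user-defined metric these two implementation paths apply to might look roughly like the following; the namespace, counter name and callExternalService() are made up for the example:

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    public class RpcDoFn extends DoFn<String, String> {
      // Counter declared via the metrics API under discussion.
      private final Counter rpcCalls = Metrics.counter("myNamespace", "rpcCalls");

      @ProcessElement
      public void processElement(ProcessContext c) {
        rpcCalls.inc(); // incremented on every attempt, including retried bundles
        c.output(callExternalService(c.element()));
      }

      private String callExternalService(String input) {
        return input; // placeholder for the real external RPC
      }
    }

Reported from the workers while attempting work, this counter approximates the RPC bill; committed along with each bundle's results, it reflects only the successful path.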
On Thu, Jan 19, 2017, 1:42 AM Aviem Zur <aviem...@gmail.com> wrote:

> Hi all,
>
> While working on the implementation of the metrics API in the Spark runner, the question of committed vs. attempted results has come up, sparking (no pun intended) an interesting conversation.
> (See API: MetricResult <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricResult.java> and discussion: PR #1750 <https://github.com/apache/beam/pull/1750>)
>
> The separation of `attempted` and `committed` metric results seems a bit unclear.
>
> Seeing that the current implementations of aggregators in the different runners do not guarantee correctness, one could assume that the metrics API implementations will follow the same guarantees.
>
> If this is correct, then you could assume that only `attempted()` metric results can be fulfilled.
> Would it then be better to just have a single method such as `get()` in the API, and have the guarantees of each runner explained in the capability matrix / documentation?
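If committed() ends up throwing UnsupportedOperationException on runners that cannot fulfil it, as suggested earlier in the thread, a caller who just wants the best available value could wrap the lookup like this (a sketch only; method names follow the MetricResult class linked above):

    import org.apache.beam.sdk.metrics.MetricResult;

    public class MetricResultValues {
      /** Prefers the committed value and falls back to attempted() where unsupported. */
      static <T> T bestAvailable(MetricResult<T> result) {
        try {
          return result.committed();
        } catch (UnsupportedOperationException e) {
          return result.attempted();
        }
      }
    }

That is effectively the single `get()` Aviem asks about, with the guarantee decided by the runner rather than declared by the caller.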