I think v1 of the querying API should be just "give me *all* the metrics." Shortly thereafter, we should have a v2 that allows for requesting just a subset metrics, possibly pre-aggregated. (My preference would be a filter like {URN: regex, label: [label_name: regex]} and all matching counters would be returned. A second parameter could control how counters are aggregated across distinct labels (specifically some kind of many-to-one mapping of labels onto a smaller set, also could be regex-based, useful for things like "given me the msecs of all steps "FirstTransform/.*" or ".*/WriteToShuffle).)
On Wed, Jan 30, 2019 at 6:19 PM Alex Amato <ajam...@google.com> wrote: > > Okay, yeah I was tossing and turning last night thinking the same thing. > > The querying API needs to be relatively simple, not use a structure similar > to URNs/MonitoringInfo structure. But there there should be a way to pass > through metrics so that they can be queried out. I think that is missing from > the doc right now. I'll iterate on that a bit. For sum_int64 and > distribution_int_64 this will be possible. but we should document the > translation formal > > On Wed, Jan 30, 2019 at 5:33 AM Robert Bradshaw <rober...@google.com> wrote: >> >> Thanks for writing this up. I left some comments in the doc, but at a >> high level I am in favor of the "more deeply overhaul SDKs' >> metrics/querying structures to use MonitoringInfos / URNs" option, at >> least over the Jobs API, for consistency and completeness. The SDK can >> provide whatever convenience wrappers over these it wants. >> >> On Tue, Jan 29, 2019 at 6:30 PM Ryan Williams <r...@runsascoded.com> wrote: >> > >> > Alex and I have PRs out related to supporting metrics in portable-runner >> > code-paths: >> > >> > #7624 associates metrics in the SDK harness with the (pre-fusion) >> > PTransforms the user defined them in. >> > #7641 sends metrics over the "Job API" (between job server and portable >> > runner): >> > >> > Flink portable-VR metrics tests pass (Java) >> > metrics print()s work in portable wordcount (Python) >> > >> > Open Questions: >> > >> > What to do with type-specific protos (e.g. IntDistributionData vs. >> > DoubleDistributionData)? >> > >> > I think Alex and I were leaning toward only supporting the "int"-cases for >> > now >> > That's what Java does in its existing metrics >> > >> > "MetricKey" and "MetricName" semantics: >> > >> > These exist in Java and Python, and I added proto versions in #7641. >> > MetricName wraps "namespace" and "name" strings, and MetricKey wraps a >> > "step (ptransform) name" and a MetricName. >> > PCollection-scoped metrics (e.g. element count) are identified by a null >> > "step name" in #7624 and #7641. >> > Alex and I discussed using URNs as the source of this information instead: >> > >> > "step name" can instead come from a MonitoringInfo's PTRANSFORM label, >> > while "namespace" and "name" can be parsed from its URN. >> > URNs could encode these over the wire, then SDKs could convert to existing >> > MetricKey/MetricNames for use in querying / MetricResults >> > or: we could more deeply overhaul SDKs' metrics/querying structures to use >> > MonitoringInfos / URNs. >> > >> > at the least, SDKs should get helpers for querying for Alex's new "system >> > metrics" (e.g. element count, various timings) that are associated with >> > specific URNs >> > >> > Gauges: the protos have a nod to sending gauges over the wire as counters >> > >> > are there problems with that? >> > #7641 should support this, for now. >> > >> > ExtremaData: the protos contain these, but SDKs don't support them (afaik). >> > >> > Alex likely has more to add, and we plan to make a doc about these >> > changes, but I wanted to post here first in case others have thoughts or >> > we are overlooking anything. >> > >> > Thanks!