I think v1 of the querying API should be just "give me *all* the
metrics." Shortly thereafter, we should have a v2 that allows for
requesting just a subset metrics, possibly pre-aggregated. (My
preference would be a filter like {URN: regex, label: [label_name:
regex]} and all matching counters would be returned. A second
parameter could control how counters are aggregated across distinct
labels (specifically some kind of many-to-one mapping of labels onto a
smaller set, also could be regex-based, useful for things like "given
me the msecs of all steps "FirstTransform/.*" or ".*/WriteToShuffle).)

On Wed, Jan 30, 2019 at 6:19 PM Alex Amato <ajam...@google.com> wrote:
>
> Okay, yeah I was tossing and turning last night thinking the same thing.
>
> The querying API needs to be relatively simple, not use a structure similar 
> to URNs/MonitoringInfo structure. But there there should be a way to pass 
> through metrics so that they can be queried out. I think that is missing from 
> the doc right now. I'll iterate on that a bit. For sum_int64 and 
> distribution_int_64 this will be possible. but we should document the 
> translation formal
>
> On Wed, Jan 30, 2019 at 5:33 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> Thanks for writing this up. I left some comments in the doc, but at a
>> high level I am in favor of the "more deeply overhaul SDKs'
>> metrics/querying structures to use MonitoringInfos / URNs" option, at
>> least over the Jobs API, for consistency and completeness. The SDK can
>> provide whatever convenience wrappers over these it wants.
>>
>> On Tue, Jan 29, 2019 at 6:30 PM Ryan Williams <r...@runsascoded.com> wrote:
>> >
>> > Alex and I have PRs out related to supporting metrics in portable-runner 
>> > code-paths:
>> >
>> > #7624 associates metrics in the SDK harness with the (pre-fusion) 
>> > PTransforms the user defined them in.
>> > #7641 sends metrics over the "Job API" (between job server and portable 
>> > runner):
>> >
>> > Flink portable-VR metrics tests pass (Java)
>> > metrics print()s work in portable wordcount (Python)
>> >
>> > Open Questions:
>> >
>> > What to do with type-specific protos (e.g. IntDistributionData vs. 
>> > DoubleDistributionData)?
>> >
>> > I think Alex and I were leaning toward only supporting the "int"-cases for 
>> > now
>> > That's what Java does in its existing metrics
>> >
>> > "MetricKey" and "MetricName" semantics:
>> >
>> > These exist in Java and Python, and I added proto versions in #7641.
>> > MetricName wraps "namespace" and "name" strings, and MetricKey wraps a 
>> > "step (ptransform) name" and a MetricName.
>> > PCollection-scoped metrics (e.g. element count) are identified by a null 
>> > "step name" in #7624 and #7641.
>> > Alex and I discussed using URNs as the source of this information instead:
>> >
>> > "step name" can instead come from a MonitoringInfo's PTRANSFORM label, 
>> > while "namespace" and "name" can be parsed from its URN.
>> > URNs could encode these over the wire, then SDKs could convert to existing 
>> > MetricKey/MetricNames for use in querying / MetricResults
>> > or: we could more deeply overhaul SDKs' metrics/querying structures to use 
>> > MonitoringInfos / URNs.
>> >
>> > at the least, SDKs should get helpers for querying for Alex's new "system 
>> > metrics" (e.g. element count, various timings) that are associated with 
>> > specific URNs
>> >
>> > Gauges: the protos have a nod to sending gauges over the wire as counters
>> >
>> > are there problems with that?
>> > #7641 should support this, for now.
>> >
>> > ExtremaData: the protos contain these, but SDKs don't support them (afaik).
>> >
>> > Alex likely has more to add, and we plan to make a doc about these 
>> > changes, but I wanted to post here first in case others have thoughts or 
>> > we are overlooking anything.
>> >
>> > Thanks!

Reply via email to