Re: Percentile metrics in Beam

Gleb Kanterov Tue, 18 Aug 2020 09:19:23 -0700

Hi Alex,

I'm not sure about restoring histogram, because the use-case I had in the
past used percentiles. As I understand it, you can approximate histogram if
you know percentiles and total count. E.g. 5% of values fall into
[P95, +INF) bucket, other 5% [P90, P95), etc. I don't understand the paper
well enough to say how it's going to work if given bucket boundaries happen
to include a small number of values. I guess it's a similar kind of
trade-off when we need to choose boundaries if we want to get percentiles
from histogram buckets. I see primarily moment sketch as a method intended
to approximate percentiles, not histogram buckets.


/Gleb

On Tue, Aug 18, 2020 at 2:13 AM Alex Amato <[email protected]> wrote:

> Hi Gleb, and Luke
>
> I was reading through the paper, blog and github you linked to. One thing
> I can't figure out is if it's possible to use the Moment Sketch to restore
> an original histogram.
> Given bucket boundaries: b0, b1, b2, b3, ...
> Can we obtain the counts for the number of values inserted each of the
> ranges: [-INF, B0), … [Bi, Bi+1), …
> (This is a requirement I need)
>
> Not be confused with the percentile/threshold based queries discussed in
> the blog.
>
> Luke, were you suggesting collecting both and sending both over the FN API
> wire? I.e. collecting both
>
>    - the variables to represent the Histogram as suggested in
>    https://s.apache.org/beam-histogram-metrics:
>    - In addition to the moment sketch variables
>    
> <https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/>
>    .
>
> I believe that would be feasible, as we would still retain the Histogram
> data. I don't think we can restore the Histograms with just the Sketch, if
> that was the suggestion. Please let me know if I misunderstood.
>
> If that's correct, I can write up the benefits and drawbacks I see for
> both approaches.
>
>
> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik <[email protected]> wrote:
>
>> That is an interesting suggestion to change to use a sketch.
>>
>> I believe having one metric URN that represents all this information
>> grouped together would make sense instead of attempting to aggregate
>> several metrics together. The underlying implementation of using
>> sum/count/max/min would stay the same but we would want a single object
>> that abstracts this complexity away for users as well.
>>
>> On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov <[email protected]> wrote:
>>
>>> Didn't see proposal by Alex before today. I want to add a few more cents
>>> from my side.
>>>
>>> There is a paper Moment-based quantile sketches for efficient high
>>> cardinality aggregation queries [1], a TL;DR that for some N (around 10-20
>>> depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
>>> COUNT(X), SUM(X), SUM(X^2)... SUM(X^N), MAX(X), MIN(X). Given aggregated
>>> numbers, it uses solver for Chebyshev polynomials to get quantile number,
>>> and there is already Java implementation for it on GitHub [2].
>>>
>>> This way we can express quantiles using existing metric types in Beam,
>>> that can be already done without SDK or runner changes. It can fit nicely
>>> into existing runners and can be abstracted over if needed. I think this is
>>> also one of the best implementations, it has < 1% error rate for 200 bytes
>>> of storage, and quite efficient to compute. Did we consider using that?
>>>
>>> [1]:
>>> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
>>> [2]: https://github.com/stanford-futuredata/msketch
>>>
>>> On Sat, Aug 15, 2020 at 6:15 AM Alex Amato <[email protected]> wrote:
>>>
>>>> The distinction here is that even though these metrics come from user
>>>> space, we still gave them specific URNs, which imply they have a specific
>>>> format, with specific labels, etc.
>>>>
>>>> That is, we won't be packaging them into a USER_HISTOGRAM urn. That URN
>>>> would have less expectation for its format. Today the USER_COUNTER just
>>>> expects like labels (TRANSFORM, NAME, NAMESPACE).
>>>>
>>>> We didn't decide on making a private API. But rather an API
>>>> available to user code for populating metrics with specific labels, and
>>>> specific URNs. The same API could pretty much be used for user
>>>> USER_HISTOGRAM. with a default URN chosen.
>>>> Thats how I see it in my head at the moment.
>>>>
>>>>
>>>> On Fri, Aug 14, 2020 at 8:52 PM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> On Fri, Aug 14, 2020 at 7:35 PM Alex Amato <[email protected]> wrote:
>>>>> >
>>>>> > I am only tackling the specific metrics covered in (for the python
>>>>> SDK first, then Java). To collect latency of IO API RPCS, and store it in 
>>>>> a
>>>>> histogram.
>>>>> > https://s.apache.org/beam-gcp-debuggability
>>>>> >
>>>>> > User histogram metrics are unfunded, as far as I know. But you
>>>>> should be able to extend what I do for that project to the user metric use
>>>>> case. I agree, it won't be much more work to support that. I designed the
>>>>> histogram with the user histogram case in mind.
>>>>>
>>>>> From the portability point of view, all metrics generated in users
>>>>> code (and SDK-side IOs are "user code") are user metrics. But
>>>>> regardless of how things are named, once we have histogram metrics
>>>>> crossing the FnAPI boundary all the infrastructure will be in place.
>>>>> (At least the plan as I understand it shouldn't use private APIs
>>>>> accessible only by the various IOs but not other SDK-level code.)
>>>>>
>>>>> > On Fri, Aug 14, 2020 at 5:47 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> Once histograms are implemented in the SDK(s) (Alex, you're tackling
>>>>> >> this, right?) it shoudn't be much work to update the Samza worker
>>>>> code
>>>>> >> to publish these via the Samza runner APIs (in parallel with Alex's
>>>>> >> work to do the same on Dataflow).
>>>>> >>
>>>>> >> On Fri, Aug 14, 2020 at 5:35 PM Alex Amato <[email protected]>
>>>>> wrote:
>>>>> >> >
>>>>> >> > Noone has any plans currently to work on adding a generic
>>>>> histogram metric, at the moment.
>>>>> >> >
>>>>> >> > But I will be actively working on adding it for a specific set of
>>>>> metrics in the next quarter or so
>>>>> >> > https://s.apache.org/beam-gcp-debuggability
>>>>> >> >
>>>>> >> > After that work, one could take a look at my PRs for reference to
>>>>> create new metrics using the same histogram. One may wish to implement the
>>>>> UserHistogram use case and use that in the Samza Runner
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > On Fri, Aug 14, 2020 at 5:25 PM Ke Wu <[email protected]> wrote:
>>>>> >> >>
>>>>> >> >> Thank you Robert and Alex. I am not running a Beam job in Google
>>>>> Cloud but with Samza Runner, so I am wondering if there is any ETA to add
>>>>> the Histogram metrics in Metrics class so it can be mapped to the
>>>>> SamzaHistogram metric to the actual emitting.
>>>>> >> >>
>>>>> >> >> Best,
>>>>> >> >> Ke
>>>>> >> >>
>>>>> >> >> On Aug 14, 2020, at 4:44 PM, Alex Amato <[email protected]>
>>>>> wrote:
>>>>> >> >>
>>>>> >> >> One of the plans to use the histogram data is to send it to
>>>>> Google Monitoring to compute estimates of percentiles. This is done using
>>>>> the bucket counts and bucket boundaries.
>>>>> >> >>
>>>>> >> >> Here is a describing of roughly how its calculated.
>>>>> >> >>
>>>>> https://stackoverflow.com/questions/59635115/gcp-console-how-are-percentile-charts-calculated
>>>>> >> >> This is a non exact estimate. But plotting the estimated
>>>>> percentiles over time is often easier to understand and sufficient.
>>>>> >> >> (An alternative is a heatmap chart representing histograms over
>>>>> time. I.e. a histogram for each window of time).
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Fri, Aug 14, 2020 at 4:16 PM Robert Bradshaw <
>>>>> [email protected]> wrote:
>>>>> >> >>>
>>>>> >> >>> You may be interested in the propose histogram metrics:
>>>>> >> >>>
>>>>> https://docs.google.com/document/d/1kiNG2BAR-51pRdBCK4-XFmc0WuIkSuBzeb__Zv8owbU/edit
>>>>> >> >>>
>>>>> >> >>> I think it'd be reasonable to add percentiles as its own metric
>>>>> type
>>>>> >> >>> as well. The tricky bit (though there are lots of resources on
>>>>> this)
>>>>> >> >>> is that one would have to publish more than just the
>>>>> percentiles from
>>>>> >> >>> each worker to be able to compute the final percentiles across
>>>>> all
>>>>> >> >>> workers.
>>>>> >> >>>
>>>>> >> >>> On Fri, Aug 14, 2020 at 4:05 PM Ke Wu <[email protected]>
>>>>> wrote:
>>>>> >> >>> >
>>>>> >> >>> > Hi everyone,
>>>>> >> >>> >
>>>>> >> >>> > I am looking to add percentile metrics (p50, p90 etc) to my
>>>>> beam job but I only find Counter, Gauge and Distribution metrics. I
>>>>> understand that I can calculate percentile metrics in my job itself and 
>>>>> use
>>>>> Gauge to emit, however this is not an easy approach. On the other hand,
>>>>> Distribution metrics sounds like the one to go to according to its
>>>>> documentation: "A metric that reports information about the distribution 
>>>>> of
>>>>> reported values.”, however it seems that it is intended for SUM, COUNT,
>>>>> MIN, MAX.
>>>>> >> >>> >
>>>>> >> >>> > The question(s) are:
>>>>> >> >>> >
>>>>> >> >>> > 1. is Distribution metric only intended for sum, count, min,
>>>>> max?
>>>>> >> >>> > 2. If Yes, can the documentation be updated to be more
>>>>> specific?
>>>>> >> >>> > 3. Can we add percentiles metric support, such as Histogram,
>>>>> with configurable list of percentiles to emit?
>>>>> >> >>> >
>>>>> >> >>> > Best,
>>>>> >> >>> > Ke
>>>>> >> >>
>>>>> >> >>
>>>>>
>>>>

Re: Percentile metrics in Beam

Reply via email to