Re: Percentile metrics in Beam

2021-10-01 Thread Ajo Thomas
Thanks, encoding sketch data using a new URN makes sense. - Ajo On Fri, Oct 1, 2021 at 11:22 AM Luke Cwik wrote: > Yes, you could encode the sketch information but would need to use new URNs > because the encoding for the existing ones is already fixed. The point of > adding new URNs is to

Re: Percentile metrics in Beam

2021-10-01 Thread Ajo Thomas
Thanks for the pointers, Luke, and sorry for replying late on this thread. The Distribution metric's *void update(long sum, long count, long min, long max)* method certainly seems like dead code and should be okay to remove. I can reach out to the original author to confirm. As for the approach for

Re: Percentile metrics in Beam

2021-09-20 Thread Ajo Thomas
I can definitely share a PR with my changes for the Distribution metric soon. But there is something that I wanted to discuss first. As per the original design doc for the Metrics API, http://s.apache.org/beam-metrics-api, it seems like the Distribution metric interface was only intended to have a

Re: Percentile metrics in Beam

2021-09-17 Thread Ke Wu
+1, would love to see a PR/proposal out. This is a highly demanded feature that our users at LinkedIn are asking for as well. > On Sep 17, 2021, at 10:56, Pablo Estrada wrote: > > Thanks for working on this! > In the past, we have avoided adding complex metrics because metrics tend to > be

Re: Percentile metrics in Beam

2021-09-17 Thread Ajo Thomas
Thanks for the link to the doc. I think it should be okay to include percentiles in Distribution given that it was intended to be extensible. As for the user-facing Metrics API, there will be no changes unless we want to allow the user to specify custom percentiles aside from a set of defaults.

Re: Percentile metrics in Beam

2021-09-17 Thread Kenneth Knowles
If I recall from when the metrics were introduced (http://s.apache.org/beam-metrics-api), the intention of the Distribution metric was to allow the representation to be more flexible. The name was chosen to be more abstract, so a runner could track the data in its own way. Specifically

Re: Percentile metrics in Beam

2021-09-15 Thread Ajo Thomas
Thanks for the response, Alexey and Ke. I agree with your point about introducing a new metric type (say, Percentiles) instead of altering the Distribution metric type, to ensure compatibility across runners and SDKs. I am currently working on a prototype to add this new metric type to the Metrics API and

Re: Percentile metrics in Beam

2021-09-15 Thread Alexey Romanenko
I agree with Ke Wu that we need to keep compatibility across all runners and the same metrics. So, it seems that it would be better to create another metric type in this case. Also, to discuss it in detail, I’d recommend creating a design document with possible solutions and

Re: Percentile metrics in Beam

2021-09-14 Thread Ke Wu
I prefer adding a new metric type instead of enhancing the existing Distribution [1] to support percentiles, etc., in order to ensure better compatibility. @Luke @Kyle what are your thoughts on this? Best, Ke [1]

Percentile metrics in Beam

2021-09-07 Thread Ajo Thomas
Hi All, I am working on adding support for some additional distribution metrics, such as standard deviation and percentiles, to the Metrics API. The runner of interest here is the Samza runner. I wanted to get the opinion of fellow Beam devs on this. One way to do this would be to make changes to the existing

Re: Percentile metrics in Beam

2020-09-08 Thread Alex Amato
I've updated the Histogram Style Metrics design for the FN API based, with a section exploring the Moment Sketch. PTAL at the “Collect Moment Sketch Variables Instead of Bucket Counts” section, and see the assessment. LMK what you think :) Date

Re: Percentile metrics in Beam

2020-08-18 Thread Luke Cwik
getPMForCDF[1] seems to return a CDF and you can choose the split points (b0, b1, b2, ...). 1: https://github.com/stanford-futuredata/msketch/blob/cf4e49e860761f48ebdeb00f650ce997c46073e2/javamsketch/quantilebench/src/main/java/yahoo/DoublesPmfCdfImpl.java#L16 On Tue, Aug 18, 2020 at 11:20 AM

Re: Percentile metrics in Beam

2020-08-18 Thread Alex Amato
I'm a bit confused: are you sure that it is possible to derive the CDF using the moment variables? The linked implementation on GitHub seems not to use a derived CDF equation, but instead uses some sampling technique (which I can't fully grasp yet) to estimate how many elements are in each

Re: Percentile metrics in Beam

2020-08-18 Thread Ke Wu
Hi Alex, It is great to know you are working on the metrics. Do you have any concern if we add a Histogram type metrics in Samza Runner itself for now so we can start using it before a generic histogram metrics can be introduced in the Metrics class? Best, Ke > On Aug 18, 2020, at 12:57 AM,

Re: Percentile metrics in Beam

2020-08-18 Thread Gleb Kanterov
Hi Alex, I'm not sure about restoring a histogram, because the use case I had in the past used percentiles. As I understand it, you can approximate a histogram if you know the percentiles and the total count. E.g. 5% of values fall into the [P95, +INF) bucket, another 5% into [P90, P95), etc. I don't understand the
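The approximation described in the message above can be sketched in a few lines of Python. This is only an illustration, not code from Beam or from the thread: the function name `histogram_from_percentiles` and the sample percentile values are hypothetical, and the sketch assumes the percentile ranks are given as integers between 0 and 100.

```python
import math


def histogram_from_percentiles(percentiles, total_count):
    """Approximate bucket counts from known percentiles and a total count.

    percentiles: dict mapping an integer percentile rank (e.g. 90) to its value.
    Returns a list of (lower_bound, upper_bound, estimated_count) tuples.
    """
    buckets = []
    prev_rank, prev_value = 0, -math.inf
    for rank in sorted(percentiles):
        value = percentiles[rank]
        # The share of values between two consecutive percentiles is the
        # difference of their ranks, expressed as a fraction of the total.
        estimated = (rank - prev_rank) * total_count / 100
        buckets.append((prev_value, value, estimated))
        prev_rank, prev_value = rank, value
    # Everything above the highest known percentile goes in the last bucket.
    buckets.append((prev_value, math.inf, (100 - prev_rank) * total_count / 100))
    return buckets


# Hypothetical inputs: P50 = 10, P90 = 40, P95 = 55 over 1000 recorded values.
buckets = histogram_from_percentiles({50: 10.0, 90: 40.0, 95: 55.0}, 1000)
```

With these inputs the last two buckets, [P90, P95) and [P95, +INF), each receive 5% of the total count, matching the example in the message. The bucket boundaries are dictated by the percentiles that happen to be known, so they cannot be chosen freely.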

Re: Percentile metrics in Beam

2020-08-18 Thread Luke Cwik
You can use a cumulative distribution function over the sketch at b0, b1, b2, b3, ... which will tell you the probability that any given value is <= X. You multiply that probability against the total count (which is also recorded as part of the sketch) to get an estimate for the number of values
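As a rough illustration of the idea in the message above, the following Python sketch multiplies a CDF by the total count at each boundary and then takes differences of consecutive cumulative estimates to obtain per-bucket counts. The differencing step, the function name `estimate_bucket_counts`, and the uniform stand-in CDF are assumptions made for this example; a real moment sketch would supply its own estimated CDF.

```python
def estimate_bucket_counts(cdf, total_count, boundaries):
    """Estimate how many values fall between consecutive boundaries.

    cdf: callable returning the probability that a value is <= x.
    """
    # Estimated number of values <= b for every boundary b.
    cumulative = [cdf(b) * total_count for b in boundaries]
    # The count in [b_i, b_{i+1}) is the difference of cumulative estimates.
    return [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]


def uniform_cdf(x):
    """Stand-in CDF (assumption): values uniformly distributed on [0, 100]."""
    return min(max(x / 100.0, 0.0), 1.0)


counts = estimate_bucket_counts(uniform_cdf, 1000, [0, 25, 50, 100])
```

The accuracy of the resulting histogram depends entirely on how well the sketch's CDF approximates the true distribution.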

Re: Percentile metrics in Beam

2020-08-17 Thread Alex Amato
Hi Gleb and Luke, I was reading through the paper, blog and GitHub repo you linked to. One thing I can't figure out is whether it's possible to use the Moment Sketch to restore an original histogram. Given bucket boundaries: b0, b1, b2, b3, ... Can we obtain the counts for the number of values inserted

Re: Percentile metrics in Beam

2020-08-17 Thread Luke Cwik
That is an interesting suggestion to change to use a sketch. I believe having one metric URN that represents all this information grouped together would make sense instead of attempting to aggregate several metrics together. The underlying implementation of using sum/count/max/min would stay the

Re: Percentile metrics in Beam

2020-08-17 Thread Gleb Kanterov
Didn't see the proposal by Alex before today. I want to add a few more cents from my side. There is a paper, "Moment-based quantile sketches for efficient high cardinality aggregation queries" [1]; the TL;DR is that for some N (around 10-20 depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
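To make the data collected by such a sketch concrete, here is a minimal Python sketch of the accumulation and merge steps only. It assumes the sketch stores a count, min, max, and the sums of x^i and log(x)^i for i = 1..k; the class name `MomentSketch` and its methods are hypothetical, and the step that turns these sums into quantile estimates is deliberately left out.

```python
import math


class MomentSketch:
    """Hypothetical accumulator for count, min, max, power sums and log sums."""

    def __init__(self, k=4):
        self.k = k
        self.count = 0
        self.min = math.inf
        self.max = -math.inf
        self.power_sums = [0.0] * k  # power_sums[i] holds SUM(x^(i+1))
        self.log_sums = [0.0] * k    # log_sums[i] holds SUM(log(x)^(i+1))

    def add(self, x):
        if x <= 0:
            raise ValueError("log moments require x > 0")
        self.count += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        log_x = math.log(x)
        for i in range(self.k):
            self.power_sums[i] += x ** (i + 1)
            self.log_sums[i] += log_x ** (i + 1)

    def merge(self, other):
        """Combine two sketches by adding their sums element-wise."""
        if self.k != other.k:
            raise ValueError("sketches must track the same number of moments")
        merged = MomentSketch(self.k)
        merged.count = self.count + other.count
        merged.min = min(self.min, other.min)
        merged.max = max(self.max, other.max)
        merged.power_sums = [a + b for a, b in zip(self.power_sums, other.power_sums)]
        merged.log_sums = [a + b for a, b in zip(self.log_sums, other.log_sums)]
        return merged


a = MomentSketch()
a.add(1.0)
a.add(2.0)
b = MomentSketch()
b.add(4.0)
merged = a.merge(b)
```

Because merging is plain addition, the state has a fixed size and combines associatively, which is what makes this kind of summary a candidate for a distributed metric.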

Re: Percentile metrics in Beam

2020-08-14 Thread Alex Amato
The distinction here is that even though these metrics come from user space, we still gave them specific URNs, which imply they have a specific format, with specific labels, etc. That is, we won't be packaging them into a USER_HISTOGRAM urn. That URN would have less expectation for its format.

Re: Percentile metrics in Beam

2020-08-14 Thread Robert Bradshaw
On Fri, Aug 14, 2020 at 7:35 PM Alex Amato wrote: > > I am only tackling the specific metrics covered in (for the Python SDK first, > then Java). To collect latency of IO API RPCs, and store it in a histogram. > https://s.apache.org/beam-gcp-debuggability > > User histogram metrics are unfunded,

Re: Percentile metrics in Beam

2020-08-14 Thread Alex Amato
I am only tackling the specific metrics covered in (for the Python SDK first, then Java). To collect latency of IO API RPCs, and store it in a histogram. https://s.apache.org/beam-gcp-debuggability User histogram metrics are unfunded, as far as I know. But you should be able to extend what I do

Re: Percentile metrics in Beam

2020-08-14 Thread Robert Bradshaw
Once histograms are implemented in the SDK(s) (Alex, you're tackling this, right?) it shouldn't be much work to update the Samza worker code to publish these via the Samza runner APIs (in parallel with Alex's work to do the same on Dataflow). On Fri, Aug 14, 2020 at 5:35 PM Alex Amato wrote: > >

Re: Percentile metrics in Beam

2020-08-14 Thread Alex Amato
No one currently has any plans to work on adding a generic histogram metric. But I will be actively working on adding it for a specific set of metrics in the next quarter or so: https://s.apache.org/beam-gcp-debuggability After that work, one could take a look at my PRs for

Re: Percentile metrics in Beam

2020-08-14 Thread Ke Wu
Thank you Robert and Alex. I am not running a Beam job in Google Cloud but with Samza Runner, so I am wondering if there is any ETA to add the Histogram metrics in Metrics class so it can be mapped to the SamzaHistogram

Re: Percentile metrics in Beam

2020-08-14 Thread Alex Amato
One of the plans for using the histogram data is to send it to Google Monitoring to compute estimates of percentiles. This is done using the bucket counts and bucket boundaries. Here is a description of roughly how it's calculated.
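One common way to estimate a percentile from bucket counts and boundaries is linear interpolation inside the bucket that contains the target rank. The Python sketch below shows that approach under the assumption that values are spread evenly within each bucket; it is not confirmed to be the calculation Google Monitoring performs, and the function name `estimate_percentile` and the sample data are hypothetical.

```python
def estimate_percentile(boundaries, counts, p):
    """Estimate the p-th percentile from histogram buckets.

    boundaries: sorted bucket edges, len(boundaries) == len(counts) + 1.
    counts: number of values recorded in each bucket.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = p * total / 100  # rank of the value we are looking for
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= target:
            lower, upper = boundaries[i], boundaries[i + 1]
            # Assume values are evenly spread inside the bucket.
            fraction = (target - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    return boundaries[-1]


# Hypothetical histogram: 50 values in [0, 10), 30 in [10, 20), 20 in [20, 40).
p50 = estimate_percentile([0, 10, 20, 40], [50, 30, 20], 50)
p90 = estimate_percentile([0, 10, 20, 40], [50, 30, 20], 90)
```

The estimate can be no more precise than the bucket widths, so the choice of boundaries determines the achievable accuracy.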

Re: Percentile metrics in Beam

2020-08-14 Thread Robert Bradshaw
You may be interested in the proposed histogram metrics: https://docs.google.com/document/d/1kiNG2BAR-51pRdBCK4-XFmc0WuIkSuBzeb__Zv8owbU/edit I think it'd be reasonable to add percentiles as its own metric type as well. The tricky bit (though there are lots of resources on this) is that one would

Percentile metrics in Beam

2020-08-14 Thread Ke Wu
Hi everyone, I am looking to add percentile metrics (p50, p90, etc.) to my Beam job but I only find Counter, Gauge