Thanks, encoding sketch data using a new URN makes sense. - Ajo
On Fri, Oct 1, 2021 at 11:22 AM Luke Cwik <lc...@google.com> wrote: > Yes you could encode the sketch information but would need to use new URNs > because the encoding for the existing ones are already fixed. The point of > adding new URNs is to specifically be able to add new formats and > specifications over time. > > On Fri, Oct 1, 2021 at 10:34 AM Ajo Thomas <ajo.thoma...@gmail.com> wrote: > >> Thanks for the pointers, Luke and sorry for replying late on this thread. >> >> Distribution metric's - *void update(long sum, long count, long min, >> long max)* certainly seems like dead code and should be okay to remove. >> I can reach out to the original author to confirm. >> >> As for the approach for computing quantiles, I took a pass at quantile >> sketches using moments [1] which you were referring to and DD sketches from >> Datadog [2][3]. Both seem to be mergeable. >> Moments based approach seems to have a smaller memory footprint compared >> to DDSketch. The memory usage for moments sketch is fixed- based on k >> moments and doesn't depend on the input data. Whereas for DDSketch, it >> could run into a few KBs per [4] for an unbounded sketch. Bounded sketches >> seem to be inaccurate for tail ends of the distribution. These are just >> some preliminary thoughts that I have gathered from a quick scan of the >> literature/code so far. Need to spend more time on it. >> >> Regarding representing these metrics in a portable way, can we encode the >> information to reconstruct the sketch in DistributionData using >> USER_DISTRIBUTION_DOUBLE/USER_DISTRIBUTION_INT urn [5]? I have seen >> MonitoringInfo being used in the following ways right now in the runners- >> Certain runners are currently using the monitoring info to update the >> DistributionData and others {Metric}Data(s) in the metric containers. >> MonitoringInfo is also being used as the output type for portable metrics >> results. [6] >> If I understand correctly, the only way to have a mergeable quantile >> metric is if we keep the sketch information in DistributionData and encode >> it to MonitoringInfo's payload. But then is it okay to have the sketch >> information in the portable pipeline metrics results? [6] For non-portable >> pipelines, it's different since DistributionData is being converted to >> DistributionResult. The DistributionResult class can only have the >> quantiles and other metrics it currently reports and not the sketch. >> >> [1] >> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/ >> [2] https://arxiv.org/pdf/1908.10693.pdf >> [3] >> https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/ >> [4] https://github.com/DataDog/sketches-java >> [5] >> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/metrics.proto#L110 >> [6] >> https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L158 >> [7] >> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Distribution.java >> >> - Ajo >> >> >> On Mon, Sep 20, 2021 at 1:23 PM Luke Cwik <lc...@google.com> wrote: >> >>> That change stems from PR#7183 [1] and looks like dead code now. Might >>> be worthwhile to talk to the original author about removing the method >>> update(sum, count, min, max). >>> >>> Also, the issue with distributions has always been about: >>> A) How to specify them in a portable fashion (describing the encoding >>> via the URN) [2] >>> B) How to merge the buckets. There are techniques to merge various >>> buckets dynamically using a quantile sketch[3] or you need users to specify >>> the ranges upfront and it would be an error for a user to define the same >>> metric multiple times with different ranges. I did really like the quantile >>> sketch approach since it didn't require users to "guess" what the range was >>> and it would adapt over time if the buckets changed (good properties to >>> have with long running streaming pipelines). >>> >>> 1: https://github.com/apache/beam/pull/7183 >>> 2: >>> https://github.com/apache/beam/blob/a7b706cb9d1e84709f89fe98d1dda94d4eb1243b/model/pipeline/src/main/proto/metrics.proto#L110 >>> 3: >>> https://lists.apache.org/thread.html/rfc1ff850ed2eaa9057ba9fb34286c19a802bc2720424afc0dffa3b1b%40%3Cdev.beam.apache.org%3E >>> >>> On Mon, Sep 20, 2021 at 8:55 AM Ajo Thomas <ajo.thoma...@gmail.com> >>> wrote: >>> >>>> I can definitely share a PR with my changes for the Distribution metric >>>> soon. >>>> >>>> But there is something that I wanted to discuss first. As per the >>>> original design doc for metrics api, >>>> http://s.apache.org/beam-metrics-api, it seems like the Distribution >>>> metric interface was only intended to have a single method to update the >>>> distribution- *update(long value)*. This made the Distribution Metric >>>> easily extensible. >>>> The current Distribution metric interface [1], however, exposes another >>>> method- *void update(long sum, long count, long min, long max).* This >>>> call in turn calls the update method in the DistributionCell[2] which >>>> directly sets the values in the DistributionData[3] class. >>>> DistributionData would contain the logic to compute percentiles but the >>>> other *update *method in the user facing Distribution api makes it >>>> hard to add anything new. Our user's mostly use the *update(value)* method >>>> to update the distribution over updating the individual values- sum, min, >>>> max... >>>> >>>> 1. >>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Distribution.java >>>> 2. >>>> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/DistributionCell.java#L65 >>>> 3. >>>> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/DistributionData.java#L46 >>>> >>>> On Fri, Sep 17, 2021 at 11:33 AM Ke Wu <ke.wu...@gmail.com> wrote: >>>> >>>>> +1 would love to see a PR/Proposal out. >>>>> >>>>> This is a highly demanding feature our users at LinkedIn are asking >>>>> for as well. >>>>> >>>>> On Sep 17, 2021, at 10:56, Pablo Estrada <pabl...@google.com> wrote: >>>>> >>>>> >>>>> Thanks for working on this! >>>>> In the past, we have avoided adding complex metrics because metrics >>>>> tend to be aggregated in the control path rather than the data path - and >>>>> we worried about overwhelming the metrics backends - however users have in >>>>> the past asked for more information in the distribution metric itself. I >>>>> think it makes sense to provide more info, while allowing runners to >>>>> report >>>>> as much as they see fit. I'd love to see a proposal / PR for this. >>>>> >>>>> fyi @Robert Bradshaw <rober...@google.com> >>>>> >>>>> On Wed, Sep 15, 2021 at 10:37 AM Ajo Thomas <ajo.thoma...@gmail.com> >>>>> wrote: >>>>> >>>>>> Thanks for the response, Alexey and Ke. >>>>>> Agree with your point to introduce a new metric type (say >>>>>> Percentiles) instead of altering the Distribution metric type to ensure >>>>>> compatibility across runners and sdks. >>>>>> I am currently working on a prototype to add this new metric type to >>>>>> the metrics API and testing it with samza runner. I can share a design >>>>>> doc >>>>>> with the community with possible solutions very soon. >>>>>> >>>>>> Thanks >>>>>> Ajo >>>>>> >>>>>> On Wed, Sep 15, 2021 at 9:26 AM Alexey Romanenko < >>>>>> aromanenko....@gmail.com> wrote: >>>>>> >>>>>>> I agree with Ke Wu in the way that we need to keep compatibility >>>>>>> across all runners and the same metrics. So, it seems that it would be >>>>>>> better to create another metric type in this case. >>>>>>> >>>>>>> Also, to discuss it in details, I’d recommend to create a design >>>>>>> document with possible solutions and examples. >>>>>>> >>>>>>> — >>>>>>> Alexey >>>>>>> >>>>>>> On 14 Sep 2021, at 19:04, Ke Wu <ke.wu...@gmail.com> wrote: >>>>>>> >>>>>>> I prefer adding a new metrics type instead of enhancing the existing >>>>>>> Distribution [1] to support percentiles etc in order to ensure better >>>>>>> compatibility. >>>>>>> >>>>>>> @Luke @Kyle what are your thoughts on this? >>>>>>> >>>>>>> Best, >>>>>>> Ke >>>>>>> >>>>>>> [1] >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Distribution.java >>>>>>> >>>>>>> >>>>>>> On Sep 7, 2021, at 1:28 PM, Ajo Thomas <ajo.thoma...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> I am working on adding support for some additional distribution >>>>>>> metrics like std dev, percentiles to the Metrics API. The runner of >>>>>>> interest here is Samza runner. I wanted to get the opinion of fellow >>>>>>> beam >>>>>>> devs on this. >>>>>>> >>>>>>> One way to do this would be to make changes to the existing >>>>>>> Distribution metric: >>>>>>> - Add additional metrics to Distribution metric- custom percentiles, >>>>>>> std dev, mean. Use Dropwizard Histogram under the hood in >>>>>>> DistributionData >>>>>>> to track the distribution of the data. >>>>>>> - This also means changes to accompanying classes like >>>>>>> DistributionData, DistributionResult which might involve runner specific >>>>>>> changes. >>>>>>> >>>>>>> Is this an acceptable change or would you suggest something else? Is >>>>>>> the Distribution metric only intended to track the metrics that it is >>>>>>> currently tracking- sum, min, max, count? >>>>>>> >>>>>>> Thanks >>>>>>> Ajo >>>>>>> >>>>>>> >>>>>>> >>>>>>>