Re: Percentile metrics in Beam

Ajo Thomas Fri, 01 Oct 2021 15:35:24 -0700

Thanks, encoding sketch data using a new URN makes sense.

- Ajo


On Fri, Oct 1, 2021 at 11:22 AM Luke Cwik <[email protected]> wrote:

> Yes you could encode the sketch information but would need to use new URNs
> because the encoding for the existing ones are already fixed. The point of
> adding new URNs is to specifically be able to add new formats and
> specifications over time.
>
> On Fri, Oct 1, 2021 at 10:34 AM Ajo Thomas <[email protected]> wrote:
>
>> Thanks for the pointers, Luke and sorry for replying late on this thread.
>>
>> Distribution metric's - *void update(long sum, long count, long min,
>> long max)* certainly seems like dead code and should be okay to remove.
>> I can reach out to the original author to confirm.
>>
>> As for the approach for computing quantiles, I took a pass at quantile
>> sketches using moments [1] which you were referring to and DD sketches from
>> Datadog [2][3]. Both seem to be mergeable.
>> Moments based approach seems to have a smaller memory footprint compared
>> to DDSketch. The memory usage for moments sketch is fixed- based on k
>> moments and doesn't depend on the input data. Whereas for DDSketch, it
>> could run into a few KBs per [4] for an unbounded sketch. Bounded sketches
>> seem to be inaccurate for tail ends of the distribution. These are just
>> some preliminary thoughts that I have gathered from a quick scan of the
>> literature/code so far. Need to spend more time on it.
>>
>> Regarding representing these metrics in a portable way, can we encode the
>> information to reconstruct the sketch in DistributionData using
>> USER_DISTRIBUTION_DOUBLE/USER_DISTRIBUTION_INT urn [5]? I have seen
>> MonitoringInfo being used in the following ways right now in the runners-
>> Certain runners are currently using the monitoring info to update the
>> DistributionData and others {Metric}Data(s) in the metric containers.
>> MonitoringInfo is also being used as the output type for portable metrics
>> results. [6]
>> If I understand correctly, the only way to have a mergeable quantile
>> metric is if we keep the sketch information in DistributionData and encode
>> it to MonitoringInfo's payload. But then is it okay to have the sketch
>> information in the portable pipeline metrics results? [6] For non-portable
>> pipelines, it's different since DistributionData is being converted to
>> DistributionResult. The DistributionResult class can only have the
>> quantiles and other metrics it currently reports and not the sketch.
>>
>> [1]
>> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
>> [2] https://arxiv.org/pdf/1908.10693.pdf
>> [3]
>> https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/
>> [4] https://github.com/DataDog/sketches-java
>> [5]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/metrics.proto#L110
>> [6]
>> https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L158
>> [7]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Distribution.java
>>
>> - Ajo
>>
>>
>> On Mon, Sep 20, 2021 at 1:23 PM Luke Cwik <[email protected]> wrote:
>>
>>> That change stems from PR#7183 [1] and looks like dead code now. Might
>>> be worthwhile to talk to the original author about removing the method
>>> update(sum, count, min, max).
>>>
>>> Also, the issue with distributions has always been about:
>>> A) How to specify them in a portable fashion (describing the encoding
>>> via the URN) [2]
>>> B) How to merge the buckets. There are techniques to merge various
>>> buckets dynamically using a quantile sketch[3] or you need users to specify
>>> the ranges upfront and it would be an error for a user to define the same
>>> metric multiple times with different ranges. I did really like the quantile
>>> sketch approach since it didn't require users to "guess" what the range was
>>> and it would adapt over time if the buckets changed (good properties to
>>> have with long running streaming pipelines).
>>>
>>> 1: https://github.com/apache/beam/pull/7183
>>> 2:
>>> https://github.com/apache/beam/blob/a7b706cb9d1e84709f89fe98d1dda94d4eb1243b/model/pipeline/src/main/proto/metrics.proto#L110
>>> 3:
>>> https://lists.apache.org/thread.html/rfc1ff850ed2eaa9057ba9fb34286c19a802bc2720424afc0dffa3b1b%40%3Cdev.beam.apache.org%3E
>>>
>>> On Mon, Sep 20, 2021 at 8:55 AM Ajo Thomas <[email protected]>
>>> wrote:
>>>
>>>> I can definitely share a PR with my changes for the Distribution metric
>>>> soon.
>>>>
>>>> But there is something that I wanted to discuss first. As per the
>>>> original design doc for metrics api,
>>>> http://s.apache.org/beam-metrics-api, it seems like the Distribution
>>>> metric interface was only intended to have a single method to update the
>>>> distribution- *update(long value)*. This made the Distribution Metric
>>>> easily extensible.
>>>> The current Distribution metric interface [1], however, exposes another
>>>> method- *void update(long sum, long count, long min, long max).* This
>>>> call in turn calls the update method in the DistributionCell[2] which
>>>> directly sets the values in the DistributionData[3] class.
>>>> DistributionData would contain the logic to compute percentiles but the
>>>> other *update *method in the user facing Distribution api makes it
>>>> hard to add anything new. Our user's mostly use the *update(value)* method
>>>> to update the distribution over updating the individual values- sum, min,
>>>> max...
>>>>
>>>> 1.
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Distribution.java
>>>> 2.
>>>> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/DistributionCell.java#L65
>>>> 3.
>>>> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/DistributionData.java#L46
>>>>
>>>> On Fri, Sep 17, 2021 at 11:33 AM Ke Wu <[email protected]> wrote:
>>>>
>>>>> +1 would love to see a PR/Proposal out.
>>>>>
>>>>> This is a highly demanding feature our users at LinkedIn are asking
>>>>> for as well.
>>>>>
>>>>> On Sep 17, 2021, at 10:56, Pablo Estrada <[email protected]> wrote:
>>>>>
>>>>> 
>>>>> Thanks for working on this!
>>>>> In the past, we have avoided adding complex metrics because metrics
>>>>> tend to be aggregated in the control path rather than the data path - and
>>>>> we worried about overwhelming the metrics backends - however users have in
>>>>> the past asked for more information in the distribution metric itself. I
>>>>> think it makes sense to provide more info, while allowing runners to 
>>>>> report
>>>>> as much as they see fit. I'd love to see a proposal / PR for this.
>>>>>
>>>>> fyi @Robert Bradshaw <[email protected]>
>>>>>
>>>>> On Wed, Sep 15, 2021 at 10:37 AM Ajo Thomas <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the response, Alexey and Ke.
>>>>>> Agree with your point to introduce a new metric type (say
>>>>>> Percentiles) instead of altering the Distribution metric type to ensure
>>>>>> compatibility across runners and sdks.
>>>>>> I am currently working on a prototype to add this new metric type to
>>>>>> the metrics API and testing it with samza runner. I can share a design 
>>>>>> doc
>>>>>> with the community with possible solutions very soon.
>>>>>>
>>>>>> Thanks
>>>>>> Ajo
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 9:26 AM Alexey Romanenko <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I agree with Ke Wu in the way that we need to keep compatibility
>>>>>>> across all runners and the same metrics. So, it seems that it would be
>>>>>>> better to create another metric type in this case.
>>>>>>>
>>>>>>> Also, to discuss it in details, I’d recommend to create a design
>>>>>>> document with possible solutions and examples.
>>>>>>>
>>>>>>> —
>>>>>>> Alexey
>>>>>>>
>>>>>>> On 14 Sep 2021, at 19:04, Ke Wu <[email protected]> wrote:
>>>>>>>
>>>>>>> I prefer adding a new metrics type instead of enhancing the existing
>>>>>>> Distribution [1] to support percentiles etc in order to ensure better
>>>>>>> compatibility.
>>>>>>>
>>>>>>> @Luke @Kyle what are your thoughts on this?
>>>>>>>
>>>>>>> Best,
>>>>>>> Ke
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Distribution.java
>>>>>>>
>>>>>>>
>>>>>>> On Sep 7, 2021, at 1:28 PM, Ajo Thomas <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am working on adding support for some additional distribution
>>>>>>> metrics like std dev, percentiles to the Metrics API. The runner of
>>>>>>> interest here is Samza runner. I wanted to get the opinion of fellow 
>>>>>>> beam
>>>>>>> devs on this.
>>>>>>>
>>>>>>> One way to do this would be to make changes to the existing
>>>>>>> Distribution metric:
>>>>>>> - Add additional metrics to Distribution metric- custom percentiles,
>>>>>>> std dev, mean. Use Dropwizard Histogram under the hood in 
>>>>>>> DistributionData
>>>>>>> to track the distribution of the data.
>>>>>>> - This also means changes to accompanying classes like
>>>>>>> DistributionData, DistributionResult which might involve runner specific
>>>>>>> changes.
>>>>>>>
>>>>>>> Is this an acceptable change or would you suggest something else? Is
>>>>>>> the Distribution metric only intended to track the metrics that it is
>>>>>>> currently tracking- sum, min, max, count?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ajo
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

Re: Percentile metrics in Beam

Reply via email to