> But it is easy to come up with contrived examples where this notion of
> error makes no sense. E.g., if all of the data items fall in some narrow
> range but the numbers themselves are huge (for concreteness, say
> [10^{20}, 10^{20}+i] for some small value of i), then their definition
> permits additive error that is vastly bigger than the actual spread of
> the data (in my contrived example, the permitted error can grow like
> 10^{20}, even though error at most i can be achieved just by storing the
> max and min).
To clarify the point I'm making above:
I am not saying that, on this contrived example, their algorithm would
actually return an item \tilde{x}_q outside of the data range [10^{20},
10^{20}+i] (it wouldn't). But their problem definition would allow an
algorithm to do this, and still be called a "small-relative-error"
algorithm according to the definition.
So I am highlighting a setting where the problem definition itself seems
inappropriate, rather than a setting where the algorithm actually would
behave badly.
(As the paper itself acknowledges, one can easily come up with examples
where the algorithm behaves badly, at least if one tries to keep the space
constant...).
On Tue, Oct 22, 2019 at 8:56 AM Justin Thaler <[email protected]>
wrote:
> This paper does not solve the relative error problem we've discussed
> previously. Which isn't to say it's not interesting, but it's not the same
> problem (it's unfortunate that the paper picked the same name that we've
> been using, when the problems are actually very different).
>
> Here's the situation.
>
> When we've said "relative-error quantiles", we've meant what the paper
> calls "relative-error rank-error". That means, roughly, that if someone
> specifies a query percentile q, the algorithm should return an item whose
> true percentile is (1 \pm epsilon)*q for some small epsilon.
>
> When the paper says "relative-error quantiles", what it means is that if
> someone specifies a query percentile q, and the true
> q'th-percentile item in the dataset is x_q, then the algorithm should
> return an item \tilde{x}_q such that |x_q - \tilde{x}_q| <= epsilon*x_q.
>
> So in their problem, error is measured with respect to the _magnitude of
> the data items_.
>
> This can certainly make sense for numerical data falling in some
> understood range, e.g., they focus on latencies of web requests, which
> apparently typically range from milliseconds to minutes.
>
> But it is easy to come up with contrived examples where this notion of
> error makes no sense. E.g., if all of the data items fall in some narrow
> range but the numbers themselves are huge (for concreteness, say
> [10^{20}, 10^{20}+i] for some small value of i), then their definition
> permits additive error that is vastly bigger than the actual spread of
> the data (in my contrived example, the permitted error can grow like
> 10^{20}, even though error at most i can be achieved just by storing the
> max and min).
>
> In our problem, the actual data items are never referred to, only the
> ranks---in fact, our problem makes sense even if data is not numerical, so
> long as it is totally ordered.
>
> Their algorithm is very simple, and it does have provable worst-case error
> guarantees if one lets the space cost grow logarithmically with the
> universe size. In the paper, they seem to want the space cost to be
> constant, independent of the universe size, and they develop some pruning
> heuristics to that end, but these heuristics don't (and can't) come
> with worst-case guarantees. All of their experiments and analyses in
> this setting use highly "well-behaved" data (e.g., iid samples from very
> well-understood distributions).
>
> They seem to be the first work (other than HDRHistogram, which I haven't
> looked at but which apparently doesn't come with provable error guarantees)
> to study this notion of error for quantiles. I am guessing this is at least
> in part because solving the problem they define is not super difficult from
> an algorithmic perspective.
>
> But, to be clear, if their notion of error actually makes sense for an
> application, and one is happy letting the space grow logarithmically with
> the universe size, then simplicity is a virtue, not a flaw.
>
> BTW I knew one of the authors when I was starting grad school. We've lost
> touch, but I'm happy to see a new paper by him!
>
> On Sun, Oct 20, 2019 at 2:42 PM Jon Malkin <[email protected]> wrote:
>
>> A cursory glance suggests that it's O(N) in size if you want guarantees
>> against an adversary. The bucket-collapsing method reduces accuracy in
>> collapsed buckets. If so, the easiest attack would be, in a distributed
>> setting, to manipulate the data splits so that significant blocks of
>> data get collapsed together when merging.
>>
>> I'll see if I have a chance to read things more carefully this coming
>> week.
>>
>> jon
>>
>> On Sat, Oct 19, 2019, 10:32 AM leerho <[email protected]> wrote:
>>
>> > Science team,
>> >
>> > Would you please take a look at this sketch? It appears to address the
>> > "relative error" for quantiles issue we have been discussing recently.
>> >
>> > https://blog.acolyer.org/2019/09/06/ddsketch/
>> >
>> > Thanks,
>> >
>> > Lee.
>> >
>>
>