tdunning commented on pull request #2432: URL: https://github.com/apache/drill/pull/2432#issuecomment-1023413118
Oops, @James Turton, good catch. My language was very ambiguous.

What I meant was estimation of the value of the median of the original data. The issue is that t-digest is focused on estimating the values of quantiles (which are just generalized percentiles) that are near the tails of the distribution. Because of this focus, the algorithm is designed to estimate things like the 0.1%-ile or 99.9%-ile very accurately. Estimating the 50th %-ile is a lower priority, so t-digest allocates less of its internal information budget to that estimate.

That prioritization can be adjusted by something called the scale function. The default is to use a scale function called K_2, which puts a very large emphasis on the tails. There is an alternative called K_0, which puts even emphasis on all estimates. The effect of K_2 is that estimates of the 50th %-ile might be ±1% while the estimate of the 99.99th %-ile would be ±0.002%. That makes sense some days for things like latencies. My impression is that K_0 might be more appropriate because we could get something like ±0.2% from the 0%-ile to the 100%-ile. The specific values here can all be adjusted by setting the compression parameter.

I can help with any of this if somebody tells me the purpose of the t-digest.

On Thu, Jan 27, 2022 at 5:21 AM James Turton wrote:

> @tdunning <https://github.com/tdunning> when you say "median accuracy"
> there, do you mean median estimate error across the entire distribution, or
> do you mean estimate error specifically near the distribution's median? Do
> you think that in the case of Drill we'd make no assumptions about user
> data distributions, and opt for a t-digest scale factor that optimises an
> *overall* accuracy statistic rather than one that targets any specific
> area (e.g. tails, median)? PS I'm waving my hands a bit here, I don't know
> if what I've said is well defined...
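
The trade-off between K_0 and K_2 described in the comment can be illustrated with a small, self-contained sketch. It does not use the t-digest library. It only evaluates two scale functions in the forms I recall from the t-digest paper (the normalization constant in `k2` is an assumption, as are the values chosen for `delta` and `n`) and estimates how wide a cluster is allowed to be near a given quantile. A cluster may span at most one unit of the scale function, so its width in quantile units is roughly the reciprocal of the slope of the scale function there. The widths printed are a measure of resolution, not the error figures quoted above.

```python
import math

def k0(q, delta):
    """Uniform scale function: the same resolution at every quantile."""
    return delta / 2.0 * q

def k2(q, delta, n):
    """Tail-weighted scale function in log-odds form.

    The normalization constant z is an assumption taken from memory of
    the t-digest paper; only the shape matters for this illustration.
    """
    z = 4.0 * math.log(n / delta) + 24.0
    return delta / z * math.log(q / (1.0 - q))

def max_cluster_width(k, q, h=1e-6):
    """Approximate the largest allowed cluster width (quantile units) near q.

    Uses a central difference to estimate the slope k'(q); a cluster can
    cover at most one unit of k, hence a width of about 1 / k'(q).
    """
    slope = (k(q + h) - k(q - h)) / (2.0 * h)
    return 1.0 / slope

delta = 100.0    # compression parameter (hypothetical value)
n = 1_000_000    # number of samples (hypothetical value)

for q in (0.5, 0.99, 0.9999):
    w0 = max_cluster_width(lambda x: k0(x, delta), q)
    w2 = max_cluster_width(lambda x: k2(x, delta, n), q)
    print(f"q={q}: K_0 width ~ {w0:.2e}, K_2 width ~ {w2:.2e}")
```

Under these assumptions K_0 gives the same width at every quantile, while K_2 gives much narrower clusters near the tails at the cost of wider ones around the median. Raising `delta` shrinks all widths for both functions.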
