tdunning commented on pull request #2432:
URL: https://github.com/apache/drill/pull/2432#issuecomment-1023413118


   Oops, @James Turton
   
   Good catch. My language was very ambiguous.
   
   What I meant was estimation of the value of the median of the original data.
   
   The issue is that t-digest focuses on estimating the values of
   quantiles (which are just generalized percentiles) near the tails
   of the distribution. Because of this focus, the algorithm is designed to
   estimate things like the 0.1%-ile or 99.9%-ile very accurately, while
   estimating the 50th %-ile is a lower priority, so t-digest allocates less
   of its internal information budget to that estimate.
   
   That prioritization can be adjusted by something called the scale function.
   The default is to use a scale function called K_2 which puts a very large
   emphasis on the tails. There is an alternative called K_0 which puts even
   emphasis on all estimates.
   
   The effect of K_2 is that estimates of the 50th %-ile might be ±1% while
   the estimate of the 99.99th %-ile would be ±0.002%. That makes sense some
   days for things like latencies.
   
   My impression is that K_0 might be more appropriate because we could get
   something like ±0.2% from 0%-ile to 100%-ile.
   
   The specific values here can all be adjusted by setting the compression
   parameter.
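
   As a rough illustration of the trade-off described above, the sketch
   below compares how wide a single cluster is allowed to be (as a fraction
   of the data) at different quantiles under the two scale functions. It
   assumes the functional forms given in the t-digest paper: K_0 linear in
   q, and K_2 proportional to log(q / (1 - q)) with a normalizer that
   depends on the compression `delta` and the sample count `n`. The helper
   names `k0_width` and `k2_width` are hypothetical, and the widths are
   cluster-size limits, not the error figures quoted above.

```python
import math


def k0_width(q, delta):
    """Approximate maximum cluster width at quantile q under K_0.

    Assumes k(q) = (delta / 2) * q. A cluster may span at most one unit
    of k, so the allowed width 1 / k'(q) is the same at every q.
    """
    return 2.0 / delta


def k2_width(q, delta, n):
    """Approximate maximum cluster width at quantile q under K_2.

    Assumes k(q) = c * log(q / (1 - q)), so 1 / k'(q) = q * (1 - q) / c.
    The normalizer c follows the form in the t-digest paper and is an
    assumption of this sketch, not a statement about any specific release.
    """
    c = delta / (4.0 * math.log(n / delta) + 24.0)
    return q * (1.0 - q) / c


if __name__ == "__main__":
    delta, n = 100.0, 1_000_000  # illustrative values only
    for q in (0.5, 0.99, 0.9999):
        print(
            f"q={q}: K_0 width={k0_width(q, delta):.2e}, "
            f"K_2 width={k2_width(q, delta, n):.2e}"
        )
```

   Under these assumptions K_0 allows the same cluster width everywhere,
   while K_2 allows wide clusters near the median and very narrow ones near
   the tails. Raising `delta` shrinks the widths in both cases.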
   
   I can help with any of this if somebody tells me the purpose of the
   t-digest.
   
   
   
   
   On Thu, Jan 27, 2022 at 5:21 AM James Turton wrote:
   
   > @tdunning when you say "median accuracy"
   > there, do you mean median estimate error across the entire distribution, or
   > do you mean estimate error specifically near the distribution's median? Do
   > you think that in the case of Drill we'd make no assumptions about user
   > data distributions, and opt for a t-digest scale factor that optimises an
   > *overall* accuracy statistic rather than one that targets any specific
   > area (e.g. tails, median)? PS I'm waving my hands a bit here, I don't know
   > if what I've said is well defined...
   

