Erm, just https://spark.apache.org/docs/2.3.0/api/sql/index.html#approx_percentile ?
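To make that concrete, here is a minimal sketch. It assumes a DataFrame `df` with a numeric column named `value` (both names are placeholders, as is the input path); it needs a running Spark session, so it won't execute outside a Spark environment. Both `approx_percentile` (Spark SQL) and `approxQuantile` (Dataset stat functions) do a single pass and bound the error, which also answers the "how to measure an error" part: `approxQuantile` takes an explicit `relativeError`, and the result is guaranteed to be within that relative rank error of the exact percentile.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("percentiles").getOrCreate()

// Hypothetical input: a DataFrame with one numeric column "value".
val df = spark.read.parquet("/path/to/observations")

// Option 1: SQL function. The third argument (accuracy) trades memory
// for precision; larger is more accurate (default 10000).
df.createOrReplaceTempView("obs")
val pcts = spark.sql(
  """SELECT approx_percentile(value, array(0.01, 0.25, 0.5, 0.75, 0.95), 10000)
     AS percentiles
     FROM obs""")

// Option 2: Dataset API with an explicit relative-error bound.
// 0.001 means each returned value is within 0.1% rank error of exact.
val quantiles: Array[Double] =
  df.stat.approxQuantile("value", Array(0.01, 0.25, 0.5, 0.75, 0.95), 0.001)
```

With either option there is no need to round the observations first; the approximation algorithm already compresses the value distribution, so pre-rounding mostly just loses precision without a meaningful speedup.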
On Tue, Apr 27, 2021 at 3:52 AM Ivan Petrov <capacyt...@gmail.com> wrote:
> Hi, I have billions, potentially dozens of billions of observations. Each
> observation is a decimal number.
> I need to calculate percentiles 1, 25, 50, 75, 95 for these observations
> using Scala Spark. I can use either the RDD or the Dataset API, whichever
> works better.
>
> What I can do in terms of perf optimisation:
> - I can round decimal observations to long
> - I can even round each observation to the nearest 5, for example: 2.6 can
> be rounded to 5, or 11.3123123 can be rounded to 10, to reduce the number
> of unique observation values (if that helps on the maths side)
> - I'm fine with an approximation approach, losing some precision (how do I
> measure the error, BTW?) to get percentile results faster.
>
> What can I try?
> Thanks!