Have you looked at the built-in approx_percentile function?
https://spark.apache.org/docs/2.3.0/api/sql/index.html#approx_percentile
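
The Dataset API exposes the same functionality via `DataFrameStatFunctions.approxQuantile`. A minimal sketch, assuming the observations sit in a numeric column named `obs` of an existing DataFrame `df` (both names are placeholders):

```scala
// Sketch, assuming `df` is a DataFrame with a numeric column "obs".
// relativeError = 0.001 bounds the rank error of each result to 0.1% of N;
// smaller values cost more memory/time, and 0.0 forces an exact computation.
val quantiles: Array[Double] =
  df.stat.approxQuantile("obs", Array(0.01, 0.25, 0.50, 0.75, 0.95), 0.001)
```

Both this method and the SQL approx_percentile function are backed by the same quantile-sketch implementation (a Greenwald-Khanna variant), so the accuracy/cost trade-off is controlled the same way in either API.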

On Tue, Apr 27, 2021 at 3:52 AM Ivan Petrov <capacyt...@gmail.com> wrote:

> Hi, I have billions, potentially dozens of billions of observations. Each
> observation is a decimal number.
> I need to calculate percentiles 1, 25, 50, 75, 95 for these observations
> using Scala Spark. I can use both RDD and Dataset API. Whatever would work
> better.
>
> What I can do in terms of perf optimisation:
> - I can round decimal observations to long
> - I can even round each observation to the nearest 5, for example: 2.6 can be
> rounded to 5, or 11.3123123 can be rounded to 10, to reduce the number of unique
> values of observations (if it helps on the Math side)
> - I’m fine with an approximation approach and losing some precision (how do I
> measure the error, BTW?) if I get percentile results faster.
>
>
> What can I try?
> Thanks!
>
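
On measuring the error: the natural metric for a percentile approximation is the rank error, i.e. how far the returned value's rank sits from the target rank p*N, normalized by N. This is exactly the quantity that approx_percentile's accuracy parameter bounds. A small self-contained sketch (the helper name and the sample numbers are made up for illustration):

```scala
// Sketch: measuring the rank error of an approximate percentile.
// rank error = |(# values <= approx) - p*n| / n
object PercentileError {
  def rankError(sorted: Vector[Double], p: Double, approx: Double): Double = {
    val n = sorted.length
    val rank = sorted.count(_ <= approx) // rank of the approximate value
    math.abs(rank - p * n) / n           // deviation from the target rank, as a fraction of n
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 1000).map(_.toDouble).toVector // already sorted: 1.0 .. 1000.0
    // The exact 95th percentile is ~950; suppose the sketch returned 945.
    val err = rankError(data, 0.95, 945.0)
    println(f"rank error: $err%.3f") // 5 ranks off out of 1000 → 0.005
  }
}
```

With a rank-error guarantee of eps, rounding observations (to longs, or to the nearest 5) only adds value error, not rank error, so it is safe as long as the coarser values are still meaningful for your use case.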
