Re: distributed computation of median

2017-04-18 Thread pavan adukuri

Do you know of any Python implementation of the same?

thanks
pavan
On 4/17/17, 9:54 AM, svjk24 wrote:

Hello,
Is there any interest in an efficient algorithm for the distributed
computation of the median?
A Google search turns up some Stack Overflow discussion, but it would be
good to have one provided.


I have an implementation (that could be improved)
from the paper "Fast Computation of the Median by Successive Binning":

https://github.com/4d55397500/medianbinning

Thanks-


Re: distributed computation of median

2017-04-17 Thread Koert Kuipers
Also, QTree is implemented in Algebird; it is not hard to get it going in Spark.
That is another probabilistic data structure that is useful for this.

On Apr 17, 2017 11:27, "Jason White" <jason.wh...@shopify.com> wrote:

> Have you looked at t-digests?
>
> Calculating percentiles (including medians) is something that is inherently
> difficult/inefficient to do in a distributed system. T-digests provide a
> useful probabilistic structure to allow you to compute any percentile with a
> known (and tunable) margin of error.
>
> https://github.com/tdunning/t-digest
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/distributed-computation-of-median-tp21356p21357.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: distributed computation of median

2017-04-17 Thread Reynold Xin
The DataFrame API includes an approximate quantile implementation. If you
ask for quantile 0.5, you will get the approximate median.
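
In PySpark this corresponds to DataFrame.approxQuantile(col, probabilities,
relativeError). A minimal sketch, assuming a DataFrame with a numeric column;
the helper name approx_median and the default relative error of 0.01 are
illustrative choices, not part of Spark:

```python
def approx_median(df, column, relative_error=0.01):
    """Approximate median of a numeric column of a Spark DataFrame.

    relative_error trades accuracy for memory; 0.0 requests the exact
    quantile, which can be very expensive on large data.
    """
    # approxQuantile returns one value per requested probability.
    return df.approxQuantile(column, [0.5], relative_error)[0]


# Usage, assuming pyspark is installed and a SparkSession can be created:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.range(1, 1001).withColumnRenamed("id", "x")
#   print(approx_median(df, "x"))
```

The helper only relies on the approxQuantile method, so it needs no Spark
import of its own.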


On Sun, Apr 16, 2017 at 9:24 PM svjk24 <svj...@gmail.com> wrote:

> Hello,
>   Is there any interest in an efficient algorithm for the distributed
> computation of the median?
> A Google search turns up some Stack Overflow discussion, but it would be good
> to have one provided.
>
> I have an implementation (that could be improved)
> from the paper "Fast Computation of the Median by Successive Binning":
>
> https://github.com/4d55397500/medianbinning
>
> Thanks-


Re: distributed computation of median

2017-04-17 Thread Jason White
Have you looked at t-digests?

Calculating percentiles (including medians) is something that is inherently
difficult/inefficient to do in a distributed system. T-digests provide a
useful probabilistic structure to allow you to compute any percentile with a
known (and tunable) margin of error.

https://github.com/tdunning/t-digest


distributed computation of median

2017-04-16 Thread svjk24

Hello,
Is there any interest in an efficient algorithm for the distributed
computation of the median?
A Google search turns up some Stack Overflow discussion, but it would be good
to have one provided.


I have an implementation (that could be improved)
from the paper "Fast Computation of the Median by Successive Binning":

https://github.com/4d55397500/medianbinning

Thanks-
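
For the Python question raised in this thread, here is a pure-Python sketch of
the successive-binning idea. It is not the code from the linked repository, and
it simplifies the paper: the first interval is [min, max] of the data rather
than one standard deviation around the mean, and plain lists stand in for
distributed partitions. All names (kth_smallest, median, bins, max_exact) are
hypothetical, and the values are assumed to be finite numbers.

```python
def _in_bin(part, lo, hi, bins, b):
    """Yield the values of one partition that fall in bin b of [lo, hi]."""
    width = (hi - lo) / bins
    for x in part:
        # min() clamps x == hi into the last bin.
        if lo <= x <= hi and min(int((x - lo) / width), bins - 1) == b:
            yield x


def _histogram(part, lo, hi, bins):
    """Per-partition step: count the values in each bin of [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in part:
        if lo <= x <= hi:
            counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts


def kth_smallest(partitions, k, bins=64, max_exact=1000):
    """Return the k-th smallest value (0-based) across all partitions."""
    lo = min(min(p) for p in partitions if p)
    hi = max(max(p) for p in partitions if p)
    while lo < hi:
        # Histograms are additive, so each partition can be counted
        # independently and the results summed bin by bin.
        hists = [_histogram(p, lo, hi, bins) for p in partitions]
        counts = [sum(col) for col in zip(*hists)]
        # Walk the bins until the one holding rank k is found;
        # k becomes the rank inside that bin.
        b = 0
        while k >= counts[b]:
            k -= counts[b]
            b += 1
        if counts[b] <= max_exact:
            # Few enough candidates: collect and sort them directly.
            return sorted(
                x for p in partitions for x in _in_bin(p, lo, hi, bins, b)
            )[k]
        # Otherwise shrink the interval to the values seen in bin b.
        new_lo = min(x for p in partitions for x in _in_bin(p, lo, hi, bins, b))
        new_hi = max(x for p in partitions for x in _in_bin(p, lo, hi, bins, b))
        lo, hi = new_lo, new_hi
    return lo  # every remaining candidate has the same value


def median(partitions, bins=64, max_exact=1000):
    """Exact median of the values spread over the given partitions."""
    n = sum(len(p) for p in partitions)
    if n == 0:
        raise ValueError("median of empty data")
    upper = kth_smallest(partitions, n // 2, bins, max_exact)
    if n % 2 == 1:
        return upper
    lower = kth_smallest(partitions, n // 2 - 1, bins, max_exact)
    return (lower + upper) / 2


if __name__ == "__main__":
    parts = [[5.0, 1.0, 9.0], [3.0, 7.0], [2.0, 8.0, 4.0, 6.0]]
    print(median(parts, bins=4, max_exact=2))
```

Each round needs only per-partition counts (plus a min and max for the chosen
bin), which is what makes the approach attractive in a distributed setting:
the data never has to be sorted or gathered in one place until the candidate
set is small.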