Re: distributed computation of median
Do you know of any python implementation for the same? thanks pavan On 4/17/17, 9:54 AM, svjk24 wrote: Hello, Is there any interest in an efficient distributed computation of the median algorithm? A google search pulls some stackoverflow discussion but it would be good to have one provided. I have an implementation (that could be improved) from the paper " Fast Computation of the Median by Successive Binning": https://github.com/4d55397500/medianbinning Thanks-
Re: distributed computation of median
Also q-tree is implemented in algebird, not hard to get it going in spark. That is another probabilistic data structure that is useful for this. On Apr 17, 2017 11:27, "Jason White" <jason.wh...@shopify.com> wrote: > Have you looked at t-digests? > > Calculating percentiles (including medians) is something that is inherently > difficult/inefficient to do in a distributed system. T-digests provide a > useful probabilistic structure to allow you to compute any percentile with > a > known (and tunable) margin of error. > > https://github.com/tdunning/t-digest > > > > > -- > View this message in context: http://apache-spark- > developers-list.1001551.n3.nabble.com/distributed-computation-of-median- > tp21356p21357.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: distributed computation of median
The DataFrame API includes an approximate quartile implementation. If you ask for quantile 0.5, you will get approximate median. On Sun, Apr 16, 2017 at 9:24 PM svjk24 <svj...@gmail.com> wrote: > Hello, > Is there any interest in an efficient distributed computation of the > median algorithm? > A google search pulls some stackoverflow discussion but it would be good > to have one provided. > > I have an implementation (that could be improved) > from the paper " Fast Computation of the Median by Successive Binning": > > https://github.com/4d55397500/medianbinning > > Thanks- > > > > >
Re: distributed computation of median
Have you looked at t-digests? Calculating percentiles (including medians) is something that is inherently difficult/inefficient to do in a distributed system. T-digests provide a useful probabilistic structure to allow you to compute any percentile with a known (and tunable) margin of error. https://github.com/tdunning/t-digest -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/distributed-computation-of-median-tp21356p21357.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
distributed computation of median
Hello, Is there any interest in an efficient distributed computation of the median algorithm? A google search pulls some stackoverflow discussion but it would be good to have one provided. I have an implementation (that could be improved) from the paper " Fast Computation of the Median by Successive Binning": https://github.com/4d55397500/medianbinning Thanks-