Thanks Imran for very detailed explanations and options. I think for now T-Digest is what I want.
From: iras...@cloudera.com Date: Tue, 17 Feb 2015 08:39:48 -0600 Subject: Re: Percentile example To: myl...@hotmail.com CC: user@spark.apache.org (trying to repost to the list w/out URLs -- rejected as spam earlier) Hi, Using take() is not a good idea, as you have noted it will pull a lot of data down to the driver so its not scalable. Here are some more scalable alternatives: 1. Approximate solutions 1a. Sample the data. Just sample some of the data to the driver, sort that data in memory, and take the 66th percentile of that sample. 1b. Make a histogram with pre-determined buckets. Eg., if you know your data ranges from 0 to 1 and is uniform-ish, you could make buckets every 0.01. Then count how many data points go into each bucket. Or if you only care about relative error and you have integers (often the case if your data is counts), then you can span the full range of integers with a relatively small number of buckets. Eg., you only need 200 buckets for 5% error. See the Histogram class in twitter's Ostrich library The problem is, if you have no idea what the distribution of your data is, its very hard to come up with good buckets; you could have an arbitrary amount of data going to one bucket, and thus tons of error. 1c. Use a TDigest , a compact & scalable data structure for approximating distributions, and performs reasonably across a wide range of distributions. You would make one TDigest for each partition (with mapPartitions), and then merge all of the TDigests together. I wrote up a little more detail on this earlier, you can search the spark-user on nabble for "tdigest" 2. Exact solutions. There are also a few options here, but I'll give one that is a variant of what you suggested. Start out by doing a sortByKey. Then figure out how many records you have in each partitions (with mapPartitions). Figure out which partition the 66th percentile would be in. Then just read the one partition you want, and go down to the Nth record in that partition. To read the one partition you want, you can either (a) use mapPartitionsWithIndex, and just ignore every partition that isnt' the one you want or (b) use PartitionPruningRDD. PartitionPruningRDD will avoid launching empty tasks on the other partitions, so it will be slightly more efficient, but its also a developer api, so perhaps not worth going to that level of detail. Note that internally, sortByKey will sample your data to get an approximate distribution, to figure out what data to put in each partition. However, your still getting an exact answer this way -- the approximation is only important for distributing work among all executors. Even if the approximation is inaccurate, you'll still correct for it, you will just have unequal partitions. Imran On Sun, Feb 15, 2015 at 9:37 AM, SiMaYunRui <myl...@hotmail.com> wrote: hello, I am a newbie to spark and trying to figure out how to get percentile against a big data set. Actually, I googled this topic but not find any very useful code example and explanation. Seems that I can use transformer SortBykey to get my data set in order, but not pretty sure how can I get value of , for example, percentile 66. Should I use take() to pick up the value of percentile 66? I don't believe any machine can load my data set in memory. I believe there must be more efficient approaches. Can anyone shed some light on this problem? Regards