Hi Franco,

As a fast approximate way to get probability distributions, you might be interested in t-digests:
https://github.com/tdunning/t-digest

In one pass, you could build a t-digest for each variable to get its distribution. After that, you could make a second pass to map each data point to its percentile in that distribution. To create the t-digests, you would do something like this:

    val myDataRDD = ...
    myDataRDD.mapPartitions { itr =>
      val xDistribution = TDigest.createArrayDigest(32, 100)
      val yDistribution = TDigest.createArrayDigest(32, 100)
      ...
      itr.foreach { data =>
        xDistribution.add(data.x)
        yDistribution.add(data.y)
        ...
      }
      Seq(
        "x" -> xDistribution,
        "y" -> yDistribution
      ).toIterator.map { case (k, v) =>
        val arr = new Array[Byte](v.byteSize)
        v.asBytes(ByteBuffer.wrap(arr))
        k -> arr
      }
    }.reduceByKey { case (t1Arr, t2Arr) =>
      val merged = ArrayDigest.fromBytes(ByteBuffer.wrap(t1Arr))
      merged.add(ArrayDigest.fromBytes(ByteBuffer.wrap(t2Arr)))
      val arr = new Array[Byte](merged.byteSize)
      merged.asBytes(ByteBuffer.wrap(arr))
      arr
    }

(The complication there is just that t-digests are not directly serializable, so I need to do the manual work of converting them to and from an array of bytes.)

On Thu, Nov 27, 2014 at 9:28 AM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

> Hi folks!
>
> Does anyone know how I can calculate, for each element of a variable in an RDD, its percentile? I tried to calculate it through Spark SQL with subqueries, but I think that is impossible in Spark SQL. Any ideas are welcome.
>
> Thanks in advance,
>
> Franco Barrientos
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
> www.exalitica.com
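P.S. For the second pass (mapping each record to its percentiles), one option is to collect the merged digests to the driver, broadcast them, and call `cdf` on each value. This is only an untested sketch: it assumes the serialized digests from the snippet above have been collected into a `digests: Map[String, Array[Byte]]` (e.g. via `collectAsMap()`), that `sc` is the SparkContext, and it uses t-digest's `cdf(x)` method, which returns the approximate fraction of the distribution at or below x.

```scala
import java.nio.ByteBuffer
import com.tdunning.math.stats.ArrayDigest

// Broadcast the serialized digests so each executor gets one copy.
val digestBytes = sc.broadcast(digests)

val withPercentiles = myDataRDD.mapPartitions { itr =>
  // Deserialize once per partition, not once per record.
  val xDist = ArrayDigest.fromBytes(ByteBuffer.wrap(digestBytes.value("x")))
  val yDist = ArrayDigest.fromBytes(ByteBuffer.wrap(digestBytes.value("y")))
  itr.map { data =>
    // cdf(v) ~= fraction of values <= v, i.e. the percentile as a fraction.
    (data, xDist.cdf(data.x), yDist.cdf(data.y))
  }
}
```

Since the digests are serialized as plain byte arrays, they broadcast without any custom serialization work.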