Hi Franco,

As a fast approximate way to get probability distributions, you might be
interested in t-digests:

https://github.com/tdunning/t-digest

In one pass, you could make a t-digest for each variable, to get its
distribution.  And after that, you could make another pass to map each data
point to its percentile in the distribution.

To create the t-digests, you would do something like this:

import java.nio.ByteBuffer
import com.tdunning.math.stats.{ArrayDigest, TDigest}

val myDataRDD = ...

myDataRDD.mapPartitions { itr =>
  // one t-digest per variable (page size 32, compression 100)
  val xDistribution = TDigest.createArrayDigest(32, 100)
  val yDistribution = TDigest.createArrayDigest(32, 100)
  ...
  itr.foreach { data =>
    xDistribution.add(data.x)
    yDistribution.add(data.y)
    ...
  }

  Seq(
    "x" -> xDistribution,
    "y" -> yDistribution
  ).toIterator.map { case (k, v) =>
    // serialize each digest to a byte array, since t-digests
    // are not directly serializable
    val arr = new Array[Byte](v.byteSize)
    v.asBytes(ByteBuffer.wrap(arr))
    k -> arr
  }
}.reduceByKey { case (t1Arr, t2Arr) =>
  // deserialize both digests, merge the second into the first,
  // and re-serialize the result
  val merged = ArrayDigest.fromBytes(ByteBuffer.wrap(t1Arr))
  merged.add(ArrayDigest.fromBytes(ByteBuffer.wrap(t2Arr)))
  val arr = new Array[Byte](merged.byteSize)
  merged.asBytes(ByteBuffer.wrap(arr))
  arr
}


(The complication there is just that t-digests are not directly
serializable, so you need to do the manual work of converting them to and
from an array of bytes.)
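
For the second pass, one way to do it (just a sketch, assuming the merged
digests are small enough to collect to the driver, and using names like
digestRDD that stand in for whatever your first pass produced) is to
broadcast the serialized digests and use TDigest.cdf, which returns the
fraction of the distribution at or below a value, i.e. its percentile as a
number in [0, 1]:

```scala
import java.nio.ByteBuffer
import com.tdunning.math.stats.ArrayDigest

// digestRDD is the (name -> serialized digest) RDD from the first pass
val digestBytes = digestRDD.collectAsMap()   // Map[String, Array[Byte]]
val bcDigests = sc.broadcast(digestBytes)

val percentiles = myDataRDD.mapPartitions { itr =>
  // deserialize once per partition, not once per record
  val xDist = ArrayDigest.fromBytes(ByteBuffer.wrap(bcDigests.value("x")))
  val yDist = ArrayDigest.fromBytes(ByteBuffer.wrap(bcDigests.value("y")))
  itr.map { data =>
    // cdf(x) maps each value to its approximate percentile
    (xDist.cdf(data.x), yDist.cdf(data.y))
  }
}
```

Broadcasting the byte arrays (rather than the digests themselves) sidesteps
the same serialization issue as above.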


On Thu, Nov 27, 2014 at 9:28 AM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:

> Hi folks!,
>
>
>
> Does anyone know how I can calculate the percentile of each element of a
> variable in an RDD? I tried to calculate it through Spark SQL with
> subqueries, but I think that is impossible in Spark SQL. Any ideas are
> welcome.
>
>
>
> Thanks in advance,
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
>
>
>
>
