GitHub user mengxr opened a pull request:
https://github.com/apache/incubator-spark/pull/578
Adding assignRanks and assignUniqueIds to RDD
Assigning ranks to an ordered or unordered data set is a common operation.
This could be done by first counting the records in each partition and then
assigning ranks in parallel.
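
The two-pass idea above can be sketched in plain Python, with a list of lists standing in for the partitions of an RDD. This is only an illustration under that assumption, not the code in this pull request; the name `assign_ranks` and the sample data are hypothetical.

```python
from itertools import accumulate

def assign_ranks(partitions):
    """Pair every record with its global rank (hypothetical helper)."""
    # Pass 1: count the records in each partition. In Spark this would be
    # a separate job over the data.
    counts = [len(p) for p in partitions]
    # The starting rank of a partition is the total count of all earlier ones.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: every partition only needs its own offset, so the partitions
    # could be processed independently (in parallel).
    return [[(record, offset + i) for i, record in enumerate(p)]
            for p, offset in zip(partitions, offsets)]

# Three simulated partitions, one of them empty.
partitions = [["a", "b"], [], ["c", "d", "e"]]
ranked = assign_ranks(partitions)
```

The ranks follow partition order and then the order within each partition, so they are consecutive and start at 0.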
The purpose of assigning ranks to an unordered set is usually to get a
unique id for each item, e.g., to map feature names to feature indices. In such
cases, the assignment could be done without counting records, saving one Spark
job.
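
One way to get unique ids without the counting pass is to interleave ids across partitions: with n partitions, the i-th record of partition k gets id i * n + k. The description above does not say which scheme the patch uses, so treat this as an assumed scheme for illustration; `assign_unique_ids` is a hypothetical name and the partitions are again simulated with lists.

```python
def assign_unique_ids(partitions):
    """Pair every record with a unique, non-consecutive id (hypothetical helper)."""
    n = len(partitions)
    # Partition k hands out the ids k, n + k, 2n + k, ... so no two partitions
    # can collide, and no partition needs to know the size of any other.
    return [[(record, i * n + k) for i, record in enumerate(p)]
            for k, p in enumerate(partitions)]

# Same simulated partitions as a data set with three partitions, one empty.
partitions = [["a", "b"], [], ["c", "d", "e"]]
with_ids = assign_unique_ids(partitions)
```

The trade-off is that the ids are unique but may have gaps, which is fine for mapping feature names to indices only if the consumer tolerates unused index values.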
https://spark-project.atlassian.net/browse/SPARK-1076
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-spark rank
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-spark/pull/578.patch
----
commit 21b434b77f1a7ffd75ba2d1ad4ab2296f1914971
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T23:18:41Z
add assignRanks and assignUniqueIds to RDD
commit 630868c88f14ea955991acfd3d68caa8be6dedec
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T23:20:21Z
newline
----