I'm not sure the problem is quite as bad as you state. Both sampleByKey and
sampleByKeyExact are implemented using a function from
StratifiedSamplingUtils which does one of two things depending on whether
the exact implementation is needed. The exact version needs twice as many
lines of code (17) as the non-exact one and has to make extra passes over
the data to get, for example, the counts per key.
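
To make the difference concrete, here is a rough sketch of the two strategies in plain Python. It is not Spark's code: the function names, the in-memory list of pairs and the seed handling are my own assumptions, and only the without-replacement case is shown. The one thing taken from the real API is that the exact variant returns ceil(fraction * count) records per key.

```python
import math
import random
from collections import Counter, defaultdict


def sample_by_key(pairs, fractions, seed=0):
    """One pass: keep each record independently with probability
    fractions[key], so the sample size per key is only approximate."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


def sample_by_key_exact(pairs, fractions, seed=0):
    """Two passes over `pairs` (which must be re-iterable, e.g. a list):
    count the records per key, then draw exactly ceil(fraction * count)
    records for each key, without replacement."""
    rng = random.Random(seed)
    # Pass 1: the per-key counts that the one-pass version never needs.
    counts = Counter(k for k, _ in pairs)
    # Pass 2: group the values, then take a fixed-size sample per key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    sample = []
    for k, values in groups.items():
        n = math.ceil(fractions[k] * counts[k])
        sample.extend((k, v) for v in rng.sample(values, n))
    return sample
```

The extra code in the exact version is almost entirely the bookkeeping for the counts, which is the part a more general API could take over.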

As far as I can see your problem 2 and sampleByKeyExact are very similar and
could be solved the same way. It has been decided that sampleByKeyExact is a
widely useful function and so is provided out of the box as part of the
PairRDD API. I don't see any reason why your problem 2 couldn't be provided
in the same way as part of the API if there were demand for it.

An alternative design would perhaps be something like an extension to
PairRDD, let's call it TwoPassPairRDD, where certain per-key information,
e.g. the counts for the key, could be provided along with an Iterable of the
values. Both sampleByKeyExact and your problem 2 could then be implemented
in fewer lines of code.
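
Very roughly, and again in plain Python rather than against the real RDD API, the idea could look like the sketch below. Everything here is hypothetical: two_pass_map_groups stands in for the TwoPassPairRDD operation, and exact_sampler is just one possible client of it. The helper does the counting pass, so the caller only writes the per-key logic.

```python
import math
import random
from collections import Counter, defaultdict


def two_pass_map_groups(pairs, func):
    """Hypothetical TwoPassPairRDD-style helper.

    Pass 1 computes the count for every key; pass 2 groups the values and
    calls func(key, count, values_iterator) once per key. `pairs` must be
    re-iterable (e.g. a list).
    """
    counts = Counter(k for k, _ in pairs)      # pass 1: per-key counts
    groups = defaultdict(list)
    for k, v in pairs:                         # pass 2: group the values
        groups[k].append(v)
    out = []
    for k, values in groups.items():
        out.extend((k, v) for v in func(k, counts[k], iter(values)))
    return out


def exact_sampler(fractions, seed=0):
    """Build a per-key function that keeps exactly ceil(fraction * count)
    values, streaming over the iterator (selection sampling)."""
    rng = random.Random(seed)

    def sample(key, count, values):
        need = math.ceil(fractions[key] * count)
        remaining = count
        for v in values:
            # Keep v with probability need / remaining; knowing the count
            # up front is what makes the exact sample size achievable
            # without materializing the values.
            if rng.random() * remaining < need:
                yield v
                need -= 1
            remaining -= 1

    return sample
```

With something like this, exact sampling becomes two_pass_map_groups(pairs, exact_sampler(fractions)), and any other per-key computation that needs the counts could reuse the same helper.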



