I'm not sure the problem is quite as bad as you state. Both sampleByKey and sampleByKeyExact are implemented using a function from StratifiedSamplingUtils, which does one of two things depending on whether the exact implementation is needed. The exact version takes about twice as many lines of code (17) as the non-exact one and has to make extra passes over the data to get, for example, the counts per key.
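To make the distinction concrete, here is a rough sketch in plain Python. It is not Spark's code and does not use StratifiedSamplingUtils; the function names and the in-memory list of pairs are my own assumptions, chosen only to show why the exact variant needs the per-key counts before it can sample.

```python
import math
import random
from collections import Counter, defaultdict


def sample_by_key_approx(pairs, fractions, rng):
    """One pass: keep each record independently with probability fractions[key].

    The sample size per key is only correct on average.
    """
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


def sample_by_key_exact(pairs, fractions, rng):
    """Two passes: count per key, then draw exactly ceil(fraction * count) per key."""
    pairs = list(pairs)
    counts = Counter(k for k, _ in pairs)  # extra pass: counts per key
    groups = defaultdict(list)
    for k, v in pairs:  # second pass: collect the values for each key
        groups[k].append(v)
    sampled = []
    for k, values in groups.items():
        target = math.ceil(fractions[k] * counts[k])
        sampled.extend((k, v) for v in rng.sample(values, target))
    return sampled
```

The approximate version never needs to know how many records a key has, which is why it gets away with a single pass.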
As far as I can see, your problem 2 and sampleByKeyExact are very similar and could be solved the same way. It has been decided that sampleByKeyExact is a widely useful function, so it is provided out of the box as part of the PairRDD API. I don't see any reason why your problem 2 couldn't be provided in the same way as part of the API if there were demand for it. An alternative design would perhaps be something like an extension to PairRDD, let's call it TwoPassPairRDD, where certain information for the key, e.g. the counts for the key, could be provided along with an Iterable. Both sampleByKeyExact and your problem 2 could then be implemented in a few fewer lines of code.
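Roughly what I have in mind for TwoPassPairRDD, again sketched in plain Python rather than against the real RDD API. Everything here is hypothetical: two_pass_map_groups and take_exactly are names I made up, and the sort merely stands in for a shuffle by key. The point is only that a per-key function handed the count together with a streamed iterable can select an exact number of items without buffering the group.

```python
import random
from collections import Counter
from itertools import groupby
from operator import itemgetter


def two_pass_map_groups(pairs, func):
    """Hypothetical helper: call func(key, count, values) once per key."""
    pairs = sorted(pairs, key=itemgetter(0))  # stand-in for a shuffle by key
    counts = Counter(k for k, _ in pairs)  # pass 1: counts per key
    results = {}
    for key, group in groupby(pairs, key=itemgetter(0)):  # pass 2
        values = (v for _, v in group)  # streamed, not materialised
        results[key] = func(key, counts[key], values)
    return results


def take_exactly(n_wanted, rng):
    """Build a per-key function that picks exactly min(n_wanted, count) values.

    Uses selection sampling, which needs the stream length up front.
    """
    def func(key, count, values):
        picked = []
        needed = min(n_wanted, count)
        for seen, value in enumerate(values):
            remaining = count - seen
            # Select with probability needed / remaining.
            if rng.random() * remaining < needed:
                picked.append(value)
                needed -= 1
        return picked
    return func


# Usage: exactly 4 values for key "a", and all 3 for key "b".
pairs = [("a", i) for i in range(10)] + [("b", i) for i in range(3)]
result = two_pass_map_groups(pairs, take_exactly(4, random.Random(1)))
```

With the count supplied by the framework, the user-side function is just the short loop in take_exactly.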