RDD nesting can lead to recursive nesting... I would like to know the use case and why join can't support it. You can always expose an API over one RDD and access it from another RDD's mapPartitions, using an external data store such as HBase, Cassandra, or Redis to back the API.
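For example, something along these lines (a rough sketch only: KvClient and its connect/get/close calls stand in for whatever HBase/Cassandra/Redis client you actually use, and leftRdd is assumed to be a pair RDD):

  val enriched = leftRdd.mapPartitions { iter =>
    // one client per partition, instead of nesting a second RDD
    val client = KvClient.connect("kv-host:6379")   // hypothetical client and endpoint
    val out = iter.map { case (key, value) =>
      (key, (value, client.get(key)))                // lookup served by the external store
    }.toList                                         // materialise before closing the client
    client.close()
    out.iterator
  }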
For your case, group by key and then apply the logic per group: collect each group's sample into a Seq and look it up if you are processing one key at a time; if you are processing them all, try joining instead. The pattern is common when every key is i.i.d. and you are cross-validating a model for each key on an 80% train / 20% test split. We are looking to fit this into a pipeline flow; with minor mods it will fit (a rough sketch follows below the quoted message).

On Sep 16, 2015 6:39 AM, "robineast" <robin.e...@xense.co.uk> wrote:

> I'm not sure the problem is quite as bad as you state. Both sampleByKey and
> sampleByKeyExact are implemented using a function from
> StratifiedSamplingUtils which does one of two things depending on whether
> the exact implementation is needed. The exact version requires double the
> number of lines of code (17) compared to the non-exact and has to do extra
> passes over the data to get, for example, the counts per key.
>
> As far as I can see, your problem 2 and sampleByKeyExact are very similar
> and could be solved the same way. It has been decided that sampleByKeyExact
> is a widely useful function and so is provided out of the box as part of
> the PairRDD API. I don't see any reason why your problem 2 couldn't be
> provided in the same way as part of the API if there was the demand for it.
>
> An alternative design would perhaps be something like an extension to
> PairRDD, let's call it TwoPassPairRDD, where certain information for the
> key could be provided along with an Iterable, e.g. the counts for the key.
> Both sampleByKeyExact and your problem 2 could be implemented in a few
> fewer lines of code.
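As mentioned above, for the per-key 80/20 cross-validation case, a minimal sketch without nesting RDDs could look roughly like the following (assuming each key's group fits in memory on a single executor; fitModel and evaluate are hypothetical placeholders for your own per-key training and validation logic):

  import scala.util.Random

  val results = pairs                          // RDD[(K, V)]
    .groupByKey()                              // one group per key
    .mapValues { values =>
      val shuffled = Random.shuffle(values.toSeq)
      val cut = (shuffled.size * 0.8).toInt    // 80% train / 20% test
      val (train, test) = shuffled.splitAt(cut)
      val model = fitModel(train)              // hypothetical per-key training
      evaluate(model, test)                    // hypothetical per-key validation
    }

If the groups are too large to collect per key, the stratified sampleByKey route discussed in the quoted mail is the alternative.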