RDD nesting can lead to recursive nesting... I would like to know the use
case and why join can't support it. You can always expose an API over an
RDD and access that API from another RDD's mapPartitions, using an external
data source like HBase, Cassandra, or Redis to back the API.
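For example, a minimal sketch of that pattern, assuming String keys and
values, a Redis instance on localhost:6379, and the Jedis client
(referenceRdd and otherRdd are hypothetical pair RDDs):

    import redis.clients.jedis.Jedis

    // Publish one RDD's contents to Redis, one connection per partition...
    referenceRdd.foreachPartition { part =>
      val jedis = new Jedis("localhost", 6379)
      part.foreach { case (k, v) => jedis.set(k, v) }
      jedis.close()
    }

    // ...then look the values up from inside another RDD's mapPartitions,
    // instead of nesting one RDD inside the other.
    val looked = otherRdd.mapPartitions { part =>
      val jedis = new Jedis("localhost", 6379)
      val out = part.map { case (k, v) => (k, (v, Option(jedis.get(k)))) }.toList
      jedis.close()
      out.iterator
    }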

For your case, group by key and then apply the logic: collect each group's
sample into a Seq and do lookups if you are processing one key at a time;
if you are processing all keys at once, try joining instead. The pattern is
common when each key's data is i.i.d. and you are cross-validating a model
for each key on an 80% train / 20% test split, as sketched below.
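A minimal sketch of the per-key split, assuming a hypothetical
data: RDD[(K, V)] (note groupByKey requires each key's group to fit in
memory):

    import scala.util.Random

    val byKey = data.groupByKey()

    // For each key, shuffle its observations and split 80% train / 20% test.
    val splits = byKey.map { case (k, obs) =>
      val rnd = new Random(k.hashCode)        // seed per key for reproducibility
      val shuffled = rnd.shuffle(obs.toSeq)
      val cut = (shuffled.size * 0.8).toInt
      (k, (shuffled.take(cut), shuffled.drop(cut)))   // (train, test)
    }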

We are looking to fit it into the pipeline flow; with minor modifications
it will fit.
On Sep 16, 2015 6:39 AM, "robineast" <robin.e...@xense.co.uk> wrote:

> I'm not sure the problem is quite as bad as you state. Both sampleByKey and
> sampleByKeyExact are implemented using a function from
> StratifiedSamplingUtils, which does one of two things depending on whether
> the exact implementation is needed. The exact version requires twice as
> many lines of code (17) as the non-exact one and has to do extra passes
> over the data to get, for example, the counts per key.
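>
> Both take a map of per-key sampling fractions; for reference, a minimal
> usage sketch, assuming a hypothetical pairs: RDD[(String, V)]:
>
>     val fractions = Map("a" -> 0.8, "b" -> 0.5)  // fraction to keep per key
>
>     // One pass over the data; per-key sample sizes are only approximate.
>     val approx = pairs.sampleByKey(withReplacement = false, fractions, seed = 42L)
>
>     // Extra pass(es) over the data so each key's sample size is hit exactly.
>     val exact = pairs.sampleByKeyExact(withReplacement = false, fractions, seed = 42L)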
>
> As far as I can see, your problem 2 and sampleByKeyExact are very similar
> and could be solved the same way. It has been decided that
> sampleByKeyExact is a widely useful function, and so it is provided out of
> the box as part of the PairRDD API. I don't see any reason why your
> problem 2 couldn't be provided in the same way as part of the API if there
> was demand for it.
>
> An alternative design would perhaps be something like an extension to
> PairRDD, let's call it TwoPassPairRDD, where certain information for the
> key could be provided along with an Iterable, e.g. the counts for the key.
> Both sampleByKeyExact and your problem 2 could be implemented in a few
> fewer lines of code.
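>
> A rough, hypothetical sketch of that idea (the name and API are invented
> here, not part of Spark):
>
>     import org.apache.spark.rdd.RDD
>     import scala.reflect.ClassTag
>
>     // Pass 1 computes per-key information (here, counts); pass 2 hands
>     // each key its values together with that information.
>     class TwoPassPairRDD[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) {
>       def countsWithValues: RDD[(K, (Long, Iterable[V]))] = {
>         val counts = rdd.sparkContext.broadcast(rdd.countByKey().toMap)    // pass 1
>         rdd.groupByKey().map { case (k, vs) => (k, (counts.value(k), vs)) } // pass 2
>       }
>     }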