[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394147#comment-14394147 ]
Florian Verhein commented on SPARK-6664:
----------------------------------------

I guess the other thing is - we can union RDDs, so why not be able to 'undo' that?

> Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6664
>                 URL: https://issues.apache.org/jira/browse/SPARK-6664
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Florian Verhein
>
> I can't find this functionality (if I missed something, apologies!), but it
> would be very useful for evaluating ML models.
> *Use case example*
> Suppose you have pre-processed web logs for a few months and now want to
> split them into a training set (where you train a model to predict some
> aspect of site accesses, perhaps per user) and an out-of-time test set
> (where you evaluate how well the model performs in the future). This example
> has just a single split, but in general you could want more, e.g. for
> cross-validation. You may also want multiple overlapping intervals.
> *Specification*
> 1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys),
> return n+1 RDDs such that the values in the ith RDD lie between the (i-1)th
> and ith boundary.
> 2. More complex alternative (but similar under the hood): provide a sequence
> of possibly overlapping intervals (ordered by the start key of the interval),
> and return the RDDs containing the values within those intervals.
> *Implementation ideas / notes for 1*
> - The ordered RDDs are likely RangePartitioned (or there should be a simple
> way to find the key ranges of the partitions in an ordered RDD)
> - Find the partitions containing the boundaries and split each in two.
> - Construct the new RDDs from the original partitions (and any split ones).
> I suspect this could be done by launching only a few jobs to split the
> partitions containing the boundaries.
> Alternatively, it might be possible to decorate these partitions and use them
> in more than one RDD. I.e. let one of these partitions (for boundary i) be p.
> Apply two decorators p' and p'', where p' masks out values above the ith
> boundary and p'' masks out values below the ith boundary. Any operations on
> these partitions apply only to values that are not masked out. Then assign p'
> to the ith output RDD and p'' to the (i+1)th output RDD.
> If I understand Spark correctly, this should not require any jobs. Not sure
> whether it's worth trying this optimisation.
> *Implementation ideas / notes for 2*
> This is very similar, except that we have to handle entire partitions (or
> parts of them) belonging to more than one output RDD, since the intervals are
> no longer mutually exclusive. But since RDDs are immutable(??), the decorator
> idea should still work?
> Thoughts?
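For illustration only (not part of the original report): a minimal sketch of what spec 1 could look like from the caller's side. The helper name splitByBoundaries and its naive filter-based body are assumptions, not a proposed implementation - every slice re-scans all partitions, which is exactly what the partition-splitting / decorator ideas above would avoid - but it pins down the intended semantics: n boundary keys produce n+1 RDDs.

{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical helper sketching spec 1: given a key-ordered pair RDD and n
// ordered boundary keys, return n+1 RDDs where the ith RDD holds the keys
// between the (i-1)th and ith boundary. Naive version: one filter per slice.
def splitByBoundaries[K: Ordering, V](
    rdd: RDD[(K, V)],
    boundaries: Seq[K]): Seq[RDD[(K, V)]] = {
  val ord = implicitly[Ordering[K]]
  // Slice i covers [boundaries(i-1), boundaries(i)); the first slice has no
  // lower bound and the last slice has no upper bound.
  val lowers = None +: boundaries.map(Option(_))
  val uppers = boundaries.map(Option(_)) :+ None
  lowers.zip(uppers).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(ord.lteq(_, k)) && hi.forall(ord.gt(_, k))
    }
  }
}

// Usage for the web-log example (names are made up): one cutoff timestamp
// yields a training set and an out-of-time test set.
// val Seq(train, test) = splitByBoundaries(logsByTimestamp, Seq(cutoffTs))
{code}

If I remember correctly, OrderedRDDFunctions.filterByRange(lower, upper) already gives the partition-pruned version of a single slice when the RDD has a RangePartitioner, so spec 1 is roughly "filterByRange applied n+1 times without re-scanning the untouched partitions".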