[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391950#comment-14391950
 ] 

Florian Verhein commented on SPARK-6664:
----------------------------------------

The closest approach I've found that should achieve the same result is calling 
OrderedRDDFunctions.filterByRange n+1 times. I assume this approach will be 
much slower, but it may not be if the evaluation is completely lazy (??). I 
don't know Spark well enough yet to be anywhere near sure of this.
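
For concreteness, a minimal sketch of that workaround, assuming Long keys 
(e.g. epoch-millis timestamps from the web-log use case below); the helper 
name splitByBoundaries is my own invention:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Split a key-sorted RDD at n boundaries into n+1 RDDs by calling
    // filterByRange once per interval. When the RDD is RangePartitioned,
    // filterByRange prunes partitions outside the interval, so each call
    // should only scan the partitions that overlap it.
    def splitByBoundaries[V: ClassTag](rdd: RDD[(Long, V)],
                                       boundaries: Seq[Long]): Seq[RDD[(Long, V)]] = {
      val edges = Long.MinValue +: boundaries :+ Long.MaxValue
      edges.sliding(2).toSeq.zipWithIndex.map { case (Seq(lo, hi), i) =>
        val piece = rdd.filterByRange(lo, hi)
        // filterByRange is inclusive at both ends, so drop the shared
        // lower boundary key on all but the first split to keep them disjoint.
        if (i == 0) piece else piece.filter(_._1 > lo)
      }
    }

Caching the parent RDD first would avoid recomputing it once per split, since 
each of the n+1 results re-reads it.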

> Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6664
>                 URL: https://issues.apache.org/jira/browse/SPARK-6664
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Florian Verhein
>
> I can't find this functionality (if I missed something, apologies!), but it 
> would be very useful for evaluating ML models.  
> *Use case example* 
> Suppose you have pre-processed web logs for a few months, and now want to 
> split them into a training set (where you train a model to predict some 
> aspect of site accesses, perhaps per user) and an out-of-time test set 
> (where you evaluate how well the model performs in the future). This example 
> has just a single split, but in general you could want more, e.g. for cross 
> validation. You may also want multiple overlapping intervals.   
> *Specification* 
> 1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
> return n+1 RDDs such that the values in the ith RDD fall between the 
> (i-1)th and ith boundaries.
> 2. More complex alternative (but similar under the hood): provide a sequence 
> of possibly overlapping intervals (ordered by the start key of the interval), 
> and return the RDDs containing the values within those intervals (a possible 
> API shape for both is sketched after this list). 
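> A possible API shape, just to make the spec concrete (the method names, and 
> hanging them off OrderedRDDFunctions, are my guesses, not an actual proposal 
> for Spark's API):
>
>     import org.apache.spark.rdd.RDD
>
>     // Hypothetical methods, imagined as additions to OrderedRDDFunctions[K, V]:
>     trait SplitSupport[K, V] {
>       // 1. n ordered boundaries -> n+1 RDDs; keys in the ith output fall
>       //    between the (i-1)th and ith boundaries (endpoint convention TBD).
>       def splitByKeys(boundaries: Seq[K]): Seq[RDD[(K, V)]]
>
>       // 2. Possibly overlapping [start, end] intervals, ordered by start
>       //    key; one output RDD per interval.
>       def sliceByRanges(intervals: Seq[(K, K)]): Seq[RDD[(K, V)]]
>     }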
> *Implementation ideas / notes for 1*
> - The ordered RDDs are likely RangePartitioned (or there should be a simple 
> way to find ranges from partitions in an ordered RDD)
> - Find the partitions containing the boundaries, and split each of them in 
> two (see the lookup sketch below).  
> - Construct the new RDDs from the original partitions (and any split ones).
> I suspect this could be done by launching only a few jobs to split the 
> partitions containing the boundaries. 
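> For the partition lookup, something along these lines might do (a sketch, 
> again assuming Long keys and that the RDD already carries a RangePartitioner):
>
>     import org.apache.spark.RangePartitioner
>     import org.apache.spark.rdd.RDD
>
>     // Index of the partition holding each boundary key; these are the
>     // only partitions that would need to be split in two.
>     def boundaryPartitions[V](rdd: RDD[(Long, V)],
>                               boundaries: Seq[Long]): Seq[Int] =
>       rdd.partitioner match {
>         case Some(rp: RangePartitioner[_, _]) => boundaries.map(b => rp.getPartition(b))
>         case _ => sys.error("expected a range-partitioned RDD")
>       }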
> Alternatively, it might be possible to decorate these partitions and use them 
> in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
> Apply two decorators p' and p'', where p' masks out values above the ith 
> boundary, and p'' masks out values below the ith boundary. Any operations on 
> these partitions apply only to values not masked out. Then assign p' to the 
> ith output RDD and p'' to the (i+1)th output RDD.
> If I understand Spark correctly, this should not require any jobs. Not sure 
> whether it's worth trying this optimisation.
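> To illustrate the masking idea: the partitions of a sorted RDD are internally 
> sorted, so p' and p'' can be takeWhile/dropWhile wrappers over the boundary 
> partition, using Spark's PartitionPruningRDD to select it; both operations 
> are lazy, so no job runs. A rough sketch (again with Long keys; the helper 
> names are mine):
>
>     import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
>
>     // p' for boundary b: partition `part` with values above b masked out.
>     def maskAbove[V](rdd: RDD[(Long, V)], part: Int, b: Long): RDD[(Long, V)] =
>       PartitionPruningRDD.create(rdd, _ == part)
>         .mapPartitions(_.takeWhile(_._1 <= b))
>
>     // p'' for boundary b: the same partition with values at or below b
>     // masked out.
>     def maskBelow[V](rdd: RDD[(Long, V)], part: Int, b: Long): RDD[(Long, V)] =
>       PartitionPruningRDD.create(rdd, _ == part)
>         .mapPartitions(_.dropWhile(_._1 <= b))
>
> The ith output RDD would then be maskBelow of boundary (i-1)'s partition, 
> union the untouched partitions in between, union maskAbove of boundary i's 
> partition.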
> *Implementation ideas / notes for 2*
> This is very similar, except that we have to handle entire partitions (or 
> parts of them) belonging to more than one output RDD, since the intervals 
> are no longer mutually exclusive. But since RDDs are immutable, the same 
> partition can safely back multiple output RDDs, so the decorator idea 
> should still work?
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
