[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6664:
-----------------------------------
    Description: 
I can't find this functionality (apologies if I missed something!), but it 
would be very useful for evaluating ML models.  

*Use case example* 
suppose you have pre-processed web logs for a few months and now want to split 
them into a training set (where you train a model to predict some aspect of site 
accesses, perhaps per user) and an out-of-time test set (where you evaluate how 
well the model performs in the future). This example has just a single split, 
but in general you may want more for cross-validation. You may also want 
multiple overlapping intervals.   

*Specification* 

1. Given an Ordered RDD and a sorted sequence of n boundaries (i.e. keys), 
return n+1 RDDs such that the values in the ith RDD fall between the (i-1)th 
and ith boundaries.
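A minimal filter-based sketch of alternative 1 (the helper name and signature 
are my own, not an existing Spark API). It is correct but scans every partition 
once per output, which is exactly the cost the implementation ideas below try 
to avoid:

{code:scala}
import org.apache.spark.rdd.RDD

// Given a key-sorted pair RDD and n sorted boundaries, return n+1 RDDs;
// the ith output holds keys in [boundaries(i-1), boundaries(i)), with the
// first and last outputs unbounded below and above respectively.
def splitByBoundaries[K: Ordering, V](rdd: RDD[(K, V)],
                                      boundaries: Seq[K]): Seq[RDD[(K, V)]] = {
  val ord = implicitly[Ordering[K]]
  val lowers = None +: boundaries.map(Option(_))  // first range: unbounded below
  val uppers = boundaries.map(Option(_)) :+ None  // last range: unbounded above
  lowers.zip(uppers).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(ord.lteq(_, k)) && hi.forall(ord.gt(_, k))
    }
  }
}
{code}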

2. A more complex alternative (but similar under the hood): provide a sequence 
of possibly overlapping intervals (ordered by the start key of each interval), 
and return the RDDs containing the values that fall within those intervals. 
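A matching sketch for alternative 2 (again a hypothetical helper, not an 
existing API). Because each output is an independent, lazily-filtered view of 
the same immutable parent, overlapping intervals need no special handling at 
this naive level:

{code:scala}
import org.apache.spark.rdd.RDD

// One lazily-filtered view per closed interval [start, end];
// the intervals may overlap.
def splitByIntervals[K: Ordering, V](rdd: RDD[(K, V)],
                                     intervals: Seq[(K, K)]): Seq[RDD[(K, V)]] = {
  val ord = implicitly[Ordering[K]]
  intervals.map { case (start, end) =>
    rdd.filter { case (k, _) => ord.lteq(start, k) && ord.lteq(k, end) }
  }
}
{code}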

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way 
to determine the key ranges of the partitions in an ordered RDD)
- Find the partitions containing the boundaries, and split each of them in two  
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the 
partitions containing the boundaries. 
Alternatively, it might be possible to decorate these partitions and use them 
in more than one RDD. That is, let p be the partition containing boundary i. 
Apply two decorators p' and p'', where p' masks out values above the ith 
boundary and p'' masks out values below it. Any operation on these partitions 
applies only to values that are not masked out. Then assign p' to the ith 
output RDD and p'' to the (i+1)th output RDD.
If I understand Spark correctly, this should not require any jobs. I'm not sure 
whether this optimisation is worth attempting.
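For what it's worth, recent Spark versions ship 
OrderedRDDFunctions.filterByRange(lower, upper), which already behaves like 
this mask: when the RDD is RangePartitioned it prunes to the partitions that 
can intersect the range and filters only those, and building the result 
launches no job. A hedged sketch on top of it (the helper name is mine; note 
that filterByRange is inclusive on both ends, so each boundary key lands in 
two adjacent outputs):

{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Boundaries must include the overall min and max key, since filterByRange
// needs both ends of each range. Each output is a lazy, narrow view of the
// parent, so constructing the splits runs no job.
def splitWithPruning[K: Ordering: ClassTag, V: ClassTag](
    sorted: RDD[(K, V)], boundaries: Seq[K]): Seq[RDD[(K, V)]] =
  boundaries.sliding(2).map { case Seq(lo, hi) =>
    sorted.filterByRange(lo, hi)  // prunes partitions when RangePartitioned
  }.toSeq
{code}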

*Implementation ideas / notes for 2*
This is very similar, except that we have to handle entire partitions (or parts 
of partitions) belonging to more than one output RDD, since the intervals are 
no longer mutually exclusive. But since RDDs are immutable, the decorator idea 
should still work?
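A tiny runnable check of the overlap case (all names and values are 
illustrative): both views draw from the same immutable parent, so overlapping 
intervals are safe and independent.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object OverlapDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("overlap-demo").setMaster("local[2]"))
    // sortByKey() yields a RangePartitioned, key-ordered RDD.
    val sorted = sc.parallelize((1L to 10L).map(k => (k, s"v$k"))).sortByKey()
    val a = sorted.filterByRange(1L, 6L)   // overlaps b on keys 4..6
    val b = sorted.filterByRange(4L, 10L)
    assert(a.keys.collect().toSet == (1L to 6L).toSet)
    assert(b.keys.collect().toSet == (4L to 10L).toSet)
    sc.stop()
  }
}
{code}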

Thoughts?





> Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6664
>                 URL: https://issues.apache.org/jira/browse/SPARK-6664
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Florian Verhein
>


