Re: Creating time-sequential pairs

2014-05-10 Thread Sean Owen
How about ...

val data = sc.parallelize(Array((1,0.05),(2,0.10),(3,0.15)))
val pairs = data.join(data.map(t => (t._1 + 1, t._2)))

It's a self-join, but one copy has its ID incremented by 1. I don't
know how performant it is, but it works, although the output is more like:

(2,(0.1,0.05))
(3,(0.15,0.1))
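
If you need the exact ((t1, v1), (t2, v2)) shape from the original post, a
small map after the join can restore it. A rough sketch (untested; assumes
consecutive integer timestamps and a Spark shell with sc in scope):

```scala
val data = sc.parallelize(Array((1, 0.05), (2, 0.10), (3, 0.15)))
// Re-key each element by t + 1 so it lines up with its successor.
val shifted = data.map { case (t, v) => (t + 1, v) }
// After the join, key t2 carries (laterValue, earlierValue); unpack
// it back into ((t1, v1), (t2, v2)).
val pairs = data.join(shifted).map { case (t2, (v2, v1)) => ((t2 - 1, v1), (t2, v2)) }
// yields ((1,0.05),(2,0.10)) and ((2,0.10),(3,0.15)), in no particular order
```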

On Thu, May 8, 2014 at 11:04 PM, Nicholas Pritchard wrote:
> Hi Spark community,
>
> I have a design/algorithm question that I assume is common enough for
> someone else to have tackled before. I have an RDD of time-series data
> formatted as time-value tuples, RDD[(Double, Double)], and am trying to
> extract threshold crossings. In order to do so, I first want to transform
> the RDD into pairs of time-sequential values.
>
> For example:
> Input: The time-series data:
> (1, 0.05)
> (2, 0.10)
> (3, 0.15)
> Output: Transformed into time-sequential pairs:
> ((1, 0.05), (2, 0.10))
> ((2, 0.10), (3, 0.15))
>
> My initial thought was to try to use a custom partitioner. This
> partitioner could ensure sequential data was kept together. Then I could use
> "mapPartitions" to transform these runs of sequential data. Finally, I
> would need some logic for creating sequential pairs across the boundaries of
> each partition.
>
> However, I was hoping to get some feedback and ideas from the community.
> Does anyone have thoughts on a simpler solution?
>
> Thanks,
> Nick


Creating time-sequential pairs

2014-05-10 Thread Nicholas Pritchard
Hi Spark community,

I have a design/algorithm question that I assume is common enough for
someone else to have tackled before. I have an RDD of time-series data
formatted as time-value tuples, RDD[(Double, Double)], and am trying to
extract threshold crossings. In order to do so, I first want to transform
the RDD into pairs of time-sequential values.

For example:
Input: The time-series data:
(1, 0.05)
(2, 0.10)
(3, 0.15)
Output: Transformed into time-sequential pairs:
((1, 0.05), (2, 0.10))
((2, 0.10), (3, 0.15))
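
Once the data is in this paired form, the threshold crossings I mentioned
fall out of a simple filter: a crossing occurs wherever two consecutive
values straddle the threshold. A hypothetical sketch (untested; `pairs` is
an RDD of the pairs shown above and `threshold` is an assumed name):

```scala
val threshold = 0.12 // assumed example threshold
// Keep only adjacent pairs whose values lie on opposite sides of the threshold.
val crossings = pairs.filter { case ((_, v1), (_, v2)) =>
  (v1 < threshold) != (v2 < threshold)
}
```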

My initial thought was to try to use a custom partitioner. This
partitioner could ensure sequential data was kept together. Then I could
use "mapPartitions" to transform these runs of sequential data. Finally, I
would need some logic for creating sequential pairs across the boundaries
of each partition.
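
The partitioner/mapPartitions plan above could also be sidestepped with a
sort plus an index-based self-join, which pairs each element with its
successor and does not assume consecutive timestamps. A rough sketch
(untested; assumes a Spark shell with sc in scope and that RDD.zipWithIndex
is available, which was added around Spark 1.0):

```scala
val data = sc.parallelize(Array((1.0, 0.05), (2.0, 0.10), (3.0, 0.15)))
// Sort by time, then attach each element's global position.
val indexed = data.sortByKey().zipWithIndex().map { case (tv, i) => (i, tv) }
// Shift indices down by 1 so position i joins with position i + 1.
val pairs = indexed.join(indexed.map { case (i, tv) => (i - 1, tv) })
  .map { case (_, (first, second)) => (first, second) }
// yields ((1.0,0.05),(2.0,0.10)) and ((2.0,0.10),(3.0,0.15))
```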

However, I was hoping to get some feedback and ideas from the community.
Does anyone have thoughts on a simpler solution?

Thanks,
Nick