How about ... val data = sc.parallelize(Array((1,0.05),(2,0.10),(3,0.15))) val pairs = data.join(data.map(t => (t._1 + 1, t._2)))
It's a self-join, but one copy has its ID incremented by 1. I don't know if it's performant but works, although output is more like: (2,(0.1,0.05)) (3,(0.15,0.1)) On Thu, May 8, 2014 at 11:04 PM, Nicholas Pritchard <nicholas.pritch...@falkonry.com> wrote: > Hi Spark community, > > I have a design/algorithm question that I assume is common enough for > someone else to have tackled before. I have an RDD of time-series data > formatted as time-value tuples, RDD[(Double, Double)], and am trying to > extract threshold crossings. In order to do so, I first want to transform > the RDD into pairs of time-sequential values. > > For example: > Input: The time-series data: > (1, 0.05) > (2, 0.10) > (3, 0.15) > Output: Transformed into time-sequential pairs: > ((1, 0.05), (2, 0.10)) > ((2, 0.10), (3, 0.15)) > > My initial thought was to try and utilize a custom partitioner. This > partitioner could ensure sequential data was kept together. Then I could use > "mapPartitions" to transform these lists of sequential data. Finally, I > would need some logic for creating sequential pairs across the boundaries of > each partition. > > However I was hoping to get some feedback and ideas from the community. > Anyone have thoughts on a simpler solution? > > Thanks, > Nick