Creating time-sequential pairs
Hi Spark community,

I have a design/algorithm question that I assume is common enough for someone else to have tackled before. I have an RDD of time-series data formatted as time-value tuples, RDD[(Double, Double)], and am trying to extract threshold crossings. In order to do so, I first want to transform the RDD into pairs of time-sequential values. For example:

Input: the time-series data:
(1, 0.05)
(2, 0.10)
(3, 0.15)

Output: transformed into time-sequential pairs:
((1, 0.05), (2, 0.10))
((2, 0.10), (3, 0.15))

My initial thought was to use a custom partitioner. This partitioner could ensure sequential data was kept together. Then I could use mapPartitions to transform these lists of sequential data. Finally, I would need some logic for creating sequential pairs across the boundaries of each partition. However, I was hoping to get some feedback and ideas from the community. Does anyone have thoughts on a simpler solution?

Thanks,
Nick
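The target transformation can be sketched outside Spark. Below is a minimal plain-Python illustration (not Spark code; the helper names are hypothetical) of pairing each sample of a time-sorted series with its successor, and then flagging threshold crossings between consecutive samples, which is the stated end goal:

```python
# Plain-Python sketch of the transformation the question asks for:
# pair each (time, value) sample with the next one in time order.
def sequential_pairs(series):
    """Return ((t1, v1), (t2, v2)) tuples for consecutive samples."""
    return list(zip(series, series[1:]))

def threshold_crossings(series, threshold):
    """Consecutive pairs whose values straddle the threshold."""
    return [(a, b) for a, b in sequential_pairs(series)
            if (a[1] < threshold) != (b[1] < threshold)]

data = [(1, 0.05), (2, 0.10), (3, 0.15)]
print(sequential_pairs(data))
# [((1, 0.05), (2, 0.10)), ((2, 0.10), (3, 0.15))]
print(threshold_crossings(data, 0.12))
# [((2, 0.1), (3, 0.15))]
```

In Spark the hard part is exactly what the question identifies: `zip(series, series[1:])` is trivial on a local list but requires either a shuffle (join) or careful partition-boundary handling on a distributed RDD.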
Re: Creating time-sequential pairs
How about:

val data = sc.parallelize(Array((1, 0.05), (2, 0.10), (3, 0.15)))
val pairs = data.join(data.map(t => (t._1 + 1, t._2)))

It's a self-join, but one copy has its key incremented by 1. I don't know how performant it is, but it works, although the output is more like:

(2,(0.1,0.05))
(3,(0.15,0.1))

On Thu, May 8, 2014 at 11:04 PM, Nicholas Pritchard nicholas.pritch...@falkonry.com wrote:
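The shifted self-join can be checked on a local example. Here is a plain-Python sketch (dicts standing in for keyed RDDs; not Spark code) of the same trick: shift one copy's key by +1, then join on the key, so each value is paired with its predecessor:

```python
# Plain-Python sketch of the shifted self-join suggested above.
data = [(1, 0.05), (2, 0.10), (3, 0.15)]

# One copy with each key incremented: data.map(t => (t._1 + 1, t._2))
shifted = {t + 1: v for t, v in data}
original = dict(data)

# Inner join on the key: only keys present in both copies survive,
# and the joined value is (current value, previous value).
pairs = sorted((k, (original[k], shifted[k]))
               for k in original.keys() & shifted.keys())
print(pairs)
# [(2, (0.1, 0.05)), (3, (0.15, 0.1))]
```

Note this relies on the time keys being consecutive integers; with irregular timestamps the `+ 1` shift would not line up, and an index-based approach (e.g. pairing on `zipWithIndex` positions) would be needed instead.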
Re: Schema view of HadoopRDD
Hi,

For each line that we read as a text line from HDFS, we have a schema. An API that takes the schema as a List[Symbol] and maps each token to its Symbol would be helpful.

One solution is to keep the data on HDFS as Avro/protobuf-serialized objects, but I'm not sure whether that works for HBase input. We are testing with HDFS right now, but eventually we will read from a persistent store like HBase, so the immutableBytes would need to be converted to a schema view as well, in case we don't want to write the whole row as a protobuf.

Do RDDs provide a schema view of a dataset on HDFS/HBase?

Thanks.
Deb
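What such an API would do per record can be sketched in a few lines. This is a hypothetical plain-Python illustration (the function and field names are assumptions, not an existing Spark API) of mapping the tokens of a delimited text line onto a schema given as a list of field names, analogous to a List[Symbol]:

```python
# Hypothetical sketch: apply a positional schema (list of field names)
# to the tokens of one delimited input line.
def apply_schema(line, schema, sep=","):
    tokens = line.strip().split(sep)
    if len(tokens) != len(schema):
        raise ValueError("token count does not match schema length")
    return dict(zip(schema, tokens))

schema = ["id", "timestamp", "value"]  # assumed field names
row = apply_schema("42,1399590000,0.15", schema)
print(row)
# {'id': '42', 'timestamp': '1399590000', 'value': '0.15'}
```

In Spark this per-line function would simply be passed to `map` over the text RDD; the open question in the thread is whether such a schema view exists as a built-in, including for byte-oriented sources like HBase.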
Re: os buffer cache does not cache shuffle output file
Yes, it seems broken. I got only a few emails in the last few days.

On Fri, May 9, 2014 at 7:24 AM, wxhsdp wxh...@gmail.com wrote: is there something wrong with the mailing list? very few people see my thread
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478p5521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: problem about broadcast variable in iteration
I'm running Spark 1.0.0, the newest under-development version.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5480.html