Creating time-sequential pairs

2014-05-10 Thread Nicholas Pritchard
Hi Spark community,

I have a design/algorithm question that I assume is common enough for
someone else to have tackled before. I have an RDD of time-series data
formatted as time-value tuples, RDD[(Double, Double)], and am trying to
extract threshold crossings. In order to do so, I first want to transform
the RDD into pairs of time-sequential values.

For example:
Input: The time-series data:
(1, 0.05)
(2, 0.10)
(3, 0.15)
Output: Transformed into time-sequential pairs:
((1, 0.05), (2, 0.10))
((2, 0.10), (3, 0.15))

My initial thought was to use a custom partitioner that keeps sequential
data together. Then I could use mapPartitions to transform each partition's
run of sequential values, and finally add some logic to create the pairs
that straddle partition boundaries. I've put a rough sketch of an
alternative I've been considering below.
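
For what it's worth, a rough sketch (untested; assumes a SparkContext named
sc, and that sorting by time and then indexing makes neighbors consecutive):

    // Sort by time, index each element, then join every element with its
    // successor by index. In older Spark shells the pair-RDD functions
    // come from importing org.apache.spark.SparkContext._
    val series = sc.parallelize(Seq((1.0, 0.05), (2.0, 0.10), (3.0, 0.15)))
    val indexed = series.sortByKey().zipWithIndex().map { case (tv, i) => (i, tv) }
    val pairs = indexed
      .join(indexed.map { case (i, tv) => (i - 1, tv) })
      .map { case (_, (first, second)) => (first, second) }
    // pairs: ((1.0,0.05),(2.0,0.10)), ((2.0,0.10),(3.0,0.15))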

However, I was hoping to get some feedback and ideas from the community.
Does anyone have thoughts on a simpler solution?

Thanks,
Nick


Re: Creating time-sequential pairs

2014-05-10 Thread Sean Owen
How about ...

val data = sc.parallelize(Array((1, 0.05), (2, 0.10), (3, 0.15)))
val pairs = data.join(data.map(t => (t._1 + 1, t._2)))

It's a self-join, but one copy has its ID incremented by 1. I don't
know how performant it is, but it works, although the output looks more like:

(2,(0.1,0.05))
(3,(0.15,0.1))
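
Mapping that back to your ((t1, v1), (t2, v2)) shape should only take one
more step (a sketch; like the self-join itself, it assumes timestamps
increase by exactly 1):

    // Recover ((prevTime, prevValue), (time, value)) from the join output.
    val sequential = pairs.map { case (t, (cur, prev)) => ((t - 1, prev), (t, cur)) }
    // sequential: ((1,0.05),(2,0.1)), ((2,0.1),(3,0.15))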

On Thu, May 8, 2014 at 11:04 PM, Nicholas Pritchard
nicholas.pritch...@falkonry.com wrote:


Re: Schema view of HadoopRDD

2014-05-10 Thread Debasish Das
Hi,

For each line that we read as a text line from HDFS, we have a schema. An
API that takes the schema as a List[Symbol] and maps each token to its
Symbol would be helpful...
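
Something like this rough sketch is what I have in mind (hypothetical field
names and delimiter; assumes a SparkContext named sc):

    // Pair each comma-delimited token with its schema Symbol, giving a
    // Map[Symbol, String] view of every line.
    val schema = List('id, 'time, 'value)
    val lines = sc.textFile("hdfs:///path/to/data")  // path is a placeholder
    val records = lines.map(line => schema.zip(line.split(",")).toMap)
    // e.g. records.first()('time) is the raw time token of the first line.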

One solution is to keep the data on HDFS as Avro/Protobuf serialized
objects, but I'm not sure that works with HBase input. We are testing
against HDFS right now, but eventually we will read from a persistent store
like HBase, so the immutableBytes would need to be converted to a schema
view as well, in case we don't want to write the whole row as a Protobuf...

Do RDDs provide a schema view of the dataset on HDFS / HBase?

Thanks.
Deb


Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Koert Kuipers
Yes, it seems broken. I have gotten only a few emails in the last few days.


On Fri, May 9, 2014 at 7:24 AM, wxhsdp wxh...@gmail.com wrote:

 Is there something wrong with the mailing list? Very few people are seeing
 my thread.






Re: problem about broadcast variable in iteration

2014-05-10 Thread randylu
I'm running Spark 1.0.0, the newest under-development version.


