So you want windows covering the same length of time, some of which will be fuller than others? You could, for example, simply bucket the data by minute to get this kind of effect. If you an RDD[Ticker], where Ticker has a timestamp in ms, you could:
tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)) ... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment at the start of each minute, and the values are the Tickers within the following minute. You can try variations on this to bucket in different ways. Just be careful because a minute with a huge number of values might cause you to run out of memory. If you're just doing aggregations of some kind there are more efficient methods than this most generic method, like the aggregate methods. On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis <asimja...@gmail.com> wrote: > Thanks. Another question. I have event data with timestamps. I want to > create a sliding window using timestamps. Some windows will have a lot of > events in them others won’t. Is there a way to get an RDD made of this kind > of a variable length window? > > > On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen <so...@cloudera.com> wrote: > >> First you'd need to sort the RDD to give it a meaningful order, but I >> assume you have some kind of timestamp in your data you can sort on. >> >> I think you might be after the sliding() function, a developer API in >> MLlib: >> >> >> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43 >> >> On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis <asimja...@gmail.com> wrote: >> >>> Is there an easy way to do a moving average across a single RDD (in a >>> non-streaming app). Here is the use case. I have an RDD made up of stock >>> prices. I want to calculate a moving average using a window size of N. >>> >>> Thanks. >>> >>> Asim >>> >> >> >