Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API.
On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <cjno...@gmail.com> wrote: > I want to use ARIMA for a predictive model so that I can take time series > data (metrics) and perform a light anomaly detection. The time series data > is going to be bucketed to different time units (several minutes within > several hours, several hours within several days, several days within > several years. > > I want to do the algorithm in Spark Streaming. I'm used to "tuple at a > time" streaming and I'm having a tad bit of trouble gaining insight into > how exactly the windows are managed inside of DStreams. > > Let's say I have a simple dataset that is marked by a key/value tuple > where the key is the name of the component who's metrics I want to run the > algorithm against and the value is a metric (a value representing a sum for > the time bucket. I want to create histograms of the time series data for > each key in the windows in which they reside so I can use that histogram > vector to generate my ARIMA prediction (actually, it seems like this > doesn't just apply to ARIMA but could apply to any sliding average). > > I *think* my prediction code may look something like this: > > val predictionAverages = dstream > .groupByKeyAndWindow(60*60*24, 60*60*24) > .mapValues(applyARIMAFunction) > > That is, keep 24 hours worth of metrics in each window and use that for > the ARIMA prediction. The part I'm struggling with is how to join together > the actual values so that i can do my comparison against the prediction > model. > > Let's say dstream contains the actual values. For any time window, I > should be able to take a previous set of windows and use model to compare > against the current values. > > >