Sean, I do agree about the "inside out" parallelization, but my curiosity is mostly about what kind of performance I can expect when piping out to R. I'm playing with Twitter's new Anomaly Detection library, btw; it could be a solution if the calls out to R can stand up to the massive dataset I have.
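For the piping piece, this is roughly the shape I had in mind. It's just a sketch: detect_anomalies.R is a made-up name for a wrapper script around the R package that would read one value per line from stdin and write its results to stdout, and the paths are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-to-r"))

// one metric value per line; input path is just an example
val metrics = sc.textFile("hdfs:///metrics/input")

// Each partition is streamed through its own copy of the R process, so
// parallelism is bounded by the number of partitions. The script has to be
// installed on (or shipped to, e.g. with spark-submit --files) every worker.
val flagged = metrics.pipe("./detect_anomalies.R")

flagged.saveAsTextFile("hdfs:///metrics/flagged")

The part I can't predict is the cost of pushing every record through R's stdin/stdout, which is exactly what I want to measure.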
I'll report back my findings.

On Thu, Apr 2, 2015 at 3:46 AM, Sean Owen <so...@cloudera.com> wrote:
> This "inside out" parallelization has been a way people have used R
> with MapReduce for a long time. Run N copies of an R script on the
> cluster, on different subsets of the data, babysat by Mappers. You
> just need R installed on the cluster. Hadoop Streaming makes this easy
> and things like RDD.pipe in Spark make it easier.
>
> So it may be just that simple and so there's not much to say about it.
> I haven't tried this with Spark Streaming but imagine it would also
> work. Have you tried this?
>
> Within a window you would probably take the first x% as training and
> the rest as test. I don't think there's a question of looking across
> windows.
>
> On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet <cjno...@gmail.com> wrote:
> > Surprised I haven't gotten any responses about this. Has anyone tried using
> > rJava or FastR w/ Spark? I've seen the SparkR project but that goes the
> > other way- what I'd like to do is use R for model calculation and Spark to
> > distribute the load across the cluster.
> >
> > Also, has anyone used Scalation for ARIMA models?
> >
> > On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet <cjno...@gmail.com> wrote:
> >>
> >> Taking out the complexity of the ARIMA models to simplify things- I can't
> >> seem to find a good way to represent even standard moving averages in Spark
> >> Streaming. Perhaps it's my ignorance with the micro-batched style of the
> >> DStreams API.
> >>
> >> On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <cjno...@gmail.com> wrote:
> >>>
> >>> I want to use ARIMA for a predictive model so that I can take time series
> >>> data (metrics) and perform a light anomaly detection. The time series data
> >>> is going to be bucketed to different time units (several minutes within
> >>> several hours, several hours within several days, several days within
> >>> several years).
> >>>
> >>> I want to do the algorithm in Spark Streaming. I'm used to "tuple at a
> >>> time" streaming and I'm having a tad bit of trouble gaining insight into how
> >>> exactly the windows are managed inside of DStreams.
> >>>
> >>> Let's say I have a simple dataset that is marked by a key/value tuple,
> >>> where the key is the name of the component whose metrics I want to run the
> >>> algorithm against and the value is a metric (a value representing a sum for
> >>> the time bucket). I want to create histograms of the time series data for
> >>> each key in the windows in which they reside so I can use that histogram
> >>> vector to generate my ARIMA prediction (actually, it seems like this doesn't
> >>> just apply to ARIMA but could apply to any sliding average).
> >>>
> >>> I *think* my prediction code may look something like this:
> >>>
> >>> val predictionAverages = dstream
> >>>   .groupByKeyAndWindow(60*60*24, 60*60*24)
> >>>   .mapValues(applyARIMAFunction)
> >>>
> >>> That is, keep 24 hours' worth of metrics in each window and use that for
> >>> the ARIMA prediction. The part I'm struggling with is how to join together
> >>> the actual values so that I can do my comparison against the prediction
> >>> model.
> >>>
> >>> Let's say dstream contains the actual values. For any time window, I
> >>> should be able to take a previous set of windows and use the model to compare
> >>> against the current values.
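P.S. To make the "first x% as training, the rest as test" idea concrete for myself, here is the rough shape I've been sketching on top of the window code from the thread below (with the durations wrapped in Seconds(...), which I believe groupByKeyAndWindow actually expects). It's only a sketch: the mean of the training slice stands in for the real ARIMA fit, and it assumes values land in the window in roughly time order.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// metrics: (componentName, value) pairs
def scoreWindows(metrics: DStream[(String, Double)]): DStream[(String, Double)] = {
  metrics
    // keep 24 hours of values per key, tumbling (window == slide, so no overlap)
    .groupByKeyAndWindow(Seconds(60 * 60 * 24), Seconds(60 * 60 * 24))
    .mapValues { values =>
      val ordered = values.toSeq
      val (train, test) = ordered.splitAt((ordered.size * 0.8).toInt)

      // stand-in for the real ARIMA fit over the training slice
      val predicted = if (train.nonEmpty) train.sum / train.size else 0.0

      // score: mean absolute deviation of the held-out slice from the prediction
      if (test.nonEmpty) test.map(v => math.abs(v - predicted)).sum / test.size else 0.0
    }
}

That at least keeps the prediction and the actuals inside the same window; comparing a window's actuals against a model fit on the previous window is the part I still don't have a clean answer for.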