Re: Streaming anomaly detection using ARIMA

2015-04-10 Thread Corey Nolet
Sean, I do agree about the inside-out parallelization, but my curiosity is mostly about what kind of performance I can expect by piping out to R. I'm playing with Twitter's new AnomalyDetection library, btw; this could be a solution if I can get the calls to R to stand up to the massive ...
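
A minimal sketch of what that piping could look like on the Spark side, assuming a hypothetical R script (detect_anomalies.R, deployed on every worker) that wraps Twitter's AnomalyDetection package, reads one value per line from stdin, and prints the anomalous points to stdout:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pipe-to-R"))

    // one metric value per element; the partition count decides how many
    // R processes run at once, which is where the performance question lives
    val metrics = sc.textFile("hdfs:///metrics/input").map(_.toDouble)

    // each partition's elements are written line-by-line to the script's
    // stdin; every line the script prints comes back as an element here
    val flagged = metrics
      .map(_.toString)
      .pipe("Rscript --vanilla detect_anomalies.R")

    flagged.saveAsTextFile("hdfs:///metrics/anomalies")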

Re: Streaming anomaly detection using ARIMA

2015-04-02 Thread Sean Owen
This inside-out parallelization has been a way people have used R with MapReduce for a long time: run N copies of an R script on the cluster, on different subsets of the data, babysat by Mappers. You just need R installed on the cluster. Hadoop Streaming makes this easy, and things like RDD.pipe in ...
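
A rough sketch of that pattern in Spark terms, assuming a hypothetical fit_model.R available on each node; R is never cluster-aware here, Spark just babysits N independent copies of it:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("inside-out-R"))

    // N subsets of the data -> N copies of the R script, one per partition
    val results = sc.textFile("hdfs:///metrics/input")
      .repartition(16)
      .pipe("Rscript --vanilla fit_model.R")   // R must be installed on every node

    results.saveAsTextFile("hdfs:///metrics/model-output")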

RE: Streaming anomaly detection using ARIMA

2015-04-01 Thread Felix Cheung
R with JRI, or through rdd.foreachPartition(pass_data_to_R), or rdd.pipe. From: cjno...@gmail.com Date: Wed, 1 Apr 2015 19:31:48 -0400 Subject: Re: Streaming anomaly detection using ARIMA To: user@spark.apache.org Surprised I haven't gotten any responses about this. Has anyone tried using rJava ...
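
A sketch of the foreachPartition route via JRI, with the caveat that JRI allows only one embedded R engine per JVM, so it has to be created lazily and shared; `series` is a hypothetical RDD[(String, Array[Double])] holding one time series per key:

    import org.rosuda.JRI.Rengine

    // JRI permits a single embedded R instance per process, so keep one
    // engine per executor JVM and reuse it across partitions
    object REngineHolder {
      lazy val engine: Rengine = new Rengine(Array("--vanilla"), false, null)
    }

    series.foreachPartition { iter =>
      val r = REngineHolder.engine
      iter.foreach { case (key, values) =>
        r.assign("x", values)   // push the series into the embedded R session
        // fit a simple AR(1) in base R and pull the residuals back out;
        // a real job would score these against a threshold and emit alerts
        val resid = r.eval("residuals(arima(x, order = c(1, 0, 0)))").asDoubleArray()
      }
    }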

Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
Surprised I haven't gotten any responses about this. Has anyone tried using rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the other way: what I'd like to do is use R for model calculation and Spark to distribute the load across the cluster. Also, has anyone used Scalation ...

Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things: I can't seem to find a good way to represent even standard moving averages in Spark Streaming. Perhaps it's my ignorance of the micro-batched style of the DStreams API. On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet ...
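
For what it's worth, a plain moving average does map onto the micro-batch model as a windowed reduce. A minimal sketch, assuming one numeric value per line arriving on a socket:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("moving-average")
    val ssc = new StreamingContext(conf, Seconds(10))      // 10s micro-batches

    // hypothetical source: one metric value per line
    val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

    // keep a running (sum, count) over a 5-minute window that slides every
    // batch, then divide to get the moving average for that window
    val movingAvg = values
      .map(v => (v, 1L))
      .reduceByWindow(
        { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) },
        Seconds(300), Seconds(10))
      .map { case (sum, n) => sum / n }

    movingAvg.print()
    ssc.start()
    ssc.awaitTermination()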

Streaming anomaly detection using ARIMA

2015-03-27 Thread Corey Nolet
I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform light anomaly detection. The time series data is going to be bucketed into different time units (several minutes within several hours, several hours within several days, several days within several ...
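
A minimal sketch of that bucketing on the batch side, assuming a hypothetical `observations: RDD[Obs]` of raw metric readings; coarser granularities (hourly, daily) are just different divisors:

    // hypothetical shape of a raw metric reading
    case class Obs(metric: String, ts: Long, value: Double)

    val fiveMinutes = 5 * 60 * 1000L

    // key each reading by (metric, bucket) so every bucket can later be
    // handed to a model as one series
    val bucketed = observations
      .map(o => ((o.metric, o.ts / fiveMinutes), o.value))
      .groupByKey()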