Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
If your rows may have NAs in them, I would process each column individually by first projecting the column ( map(x => x.nameOfColumn) ), filtering out the NAs, then running a summarizer over each column. Even if you have many rows, after summarizing you will only have a vector of length #columns.
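
The project / filter / summarize steps described above can be sketched with plain Scala collections (the same map and filter calls exist on an RDD). This is only an illustration under stated assumptions: DeviceRow and its fields are hypothetical, Option stands in for NA, and a simple mean stands in for the summarizer.

```scala
// Hypothetical record type; the field names are assumptions for illustration.
final case class DeviceRow(bytes: Option[Double], packets: Option[Double])

object ColumnSummary {
  // Mean of one projected column, skipping missing values (None stands in for NA).
  // Returns None when the column has no usable values at all.
  def columnMean(rows: Seq[DeviceRow])(project: DeviceRow => Option[Double]): Option[Double] = {
    val present = rows.flatMap(r => project(r)) // project the column, drop the NAs
    if (present.isEmpty) None else Some(present.sum / present.size)
  }
}
```

Calling columnMean(rows)(_.bytes) and columnMean(rows)(_.packets) gives one value per column, so the result stays small no matter how many rows there are.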

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for the lead. I was able to modify the code using clues from the RegressionMetrics example. Here is what I have now:

val deviceAggregateLogs = sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()
// Calculate statistics based on bytes-transferred
val

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
"Dimensions mismatch when adding new sample. Expecting 8 but got 14." Make sure all the vectors you are summarizing over have the same dimension. Why would you want to write a MultivariateOnlineSummarizer object (which can be represented with a couple of Doubles) into a distributed filesystem like

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Hello Feynman, Actually in my case, the vectors I am summarizing over will not have the same dimension since many devices will be inactive on some days. This is at best a sparse matrix where we take only the active days and attempt to fit a moving average over it. The reason I would like to

Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language:
- Read the dataset from HDFS. A few sample lines look like this:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
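
A minimal sketch of how the sample lines above could be parsed and turned into a per-device moving average, using plain Scala collections rather than Spark. The Record type, the parse helper, and the window size are assumptions made for illustration; the original post is truncated and does not specify them.

```scala
object MovingAverage {
  // Hypothetical record type matching the "deviceid,bytes,eventdate" columns.
  final case class Record(deviceId: Long, bytes: Long, eventDate: Int)

  // Parse one CSV line; returns None for the header row or any malformed line.
  def parse(line: String): Option[Record] = line.split(",").map(_.trim) match {
    case Array(d, b, e) => scala.util.Try(Record(d.toLong, b.toLong, e.toInt)).toOption
    case _              => None
  }

  // Simple moving average of bytes per device over `window` consecutive active days.
  // eventDate is a yyyyMMdd integer, so numeric order is chronological order.
  // Devices with fewer than `window` records get an empty list.
  def movingAverages(records: Seq[Record], window: Int): Map[Long, List[Double]] =
    records.groupBy(_.deviceId).map { case (id, rs) =>
      val bytes = rs.sortBy(_.eventDate).map(_.bytes.toDouble)
      id -> bytes.sliding(window).filter(_.size == window).map(w => w.sum / window).toList
    }
}
```

Note that the window here runs over active days only, so inactive days are skipped rather than counted as zero.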

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
"The call to Sorting.quicksort is not working. Perhaps I am calling it the wrong way." allaggregates.toArray allocates a new array, separate from allaggregates, and that new array is what Sorting.quickSort sorts; allaggregates itself is left unchanged. Try: val sortedAggregates = allaggregates.toArray
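
The point about toArray can be shown in a small standalone sketch: keep a reference to the array, sort that reference in place, and use it afterwards. The Aggregate type and the choice to order by bytes are assumptions for illustration, not the thread's actual classes.

```scala
import scala.util.Sorting

object SortDemo {
  // Hypothetical stand-in for the thread's aggregate records.
  final case class Aggregate(deviceId: Long, bytes: Long)

  def sortedByBytes(all: Seq[Aggregate]): Array[Aggregate] = {
    val arr = all.toArray // allocates a new array; `all` is not modified
    // quickSort sorts `arr` in place and returns Unit, so keep using `arr`.
    Sorting.quickSort(arr)(Ordering.by[Aggregate, Long](_.bytes))
    arr
  }
}
```

Writing Sorting.quickSort(all.toArray) instead would sort a temporary array that is then discarded, which is why the original collection appears unsorted.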

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
A good example is RegressionMetrics (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48), which uses MultivariateOnlineSummarizer to aggregate statistics across labels and residuals; take a look at how aggregateByKey is used
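
To illustrate the add-then-merge pattern behind this kind of per-key aggregation, here is a local sketch in plain Scala. It is not MLlib's summarizer: Summary is a hypothetical one-dimensional stand-in (Welford-style running mean and variance), and summarizeByKey only mimics the two-function shape of aggregateByKey, with nested Seqs standing in for partitions.

```scala
object OnlineSummary {
  // Running count, mean, and sum of squared deviations from the mean.
  final case class Summary(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
    // Fold one value into the summary (the "seqOp" role).
    def add(x: Double): Summary = {
      val n1 = n + 1
      val delta = x - mean
      val newMean = mean + delta / n1
      Summary(n1, newMean, m2 + delta * (x - newMean))
    }
    // Combine two partial summaries (the "combOp" role).
    def merge(o: Summary): Summary =
      if (n == 0) o
      else if (o.n == 0) this
      else {
        val total = n + o.n
        val delta = o.mean - mean
        Summary(total, mean + delta * o.n / total, m2 + o.m2 + delta * delta * n * o.n / total)
      }
    // Sample variance; defined as 0.0 when fewer than two values were seen.
    def variance: Double = if (n > 1) m2 / (n - 1) else 0.0
  }

  // Summarize values per key: add within each partition, then merge across partitions.
  def summarizeByKey(partitions: Seq[Seq[(Long, Double)]]): Map[Long, Summary] = {
    val partials = partitions.map { part =>
      part.foldLeft(Map.empty[Long, Summary]) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, Summary()).add(v))
      }
    }
    partials.foldLeft(Map.empty[Long, Summary]) { (a, b) =>
      b.foldLeft(a) { case (acc, (k, s)) => acc.updated(k, acc.getOrElse(k, Summary()).merge(s)) }
    }
  }
}
```

Because each Summary holds only three numbers, the per-key result stays small regardless of how many values were folded in.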

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for your response. Since I am very new to Scala I may need a bit more hand-holding at this stage. I have been able to incorporate your suggestion about sorting - and it now works perfectly. Thanks again for that. I tried to use your suggestion of using

Moving average using Spark and Scala

2015-07-12 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: Read the dataset from HDFS. A few sample lines look like this:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626