If your rows may have NAs in them, I would process each column individually
by first projecting the column ( map(x => x.nameOfColumn) ), filtering out
the NAs, then running a summarizer over each column.
Even if you have many rows, after summarizing you will only have a vector
of length #columns.
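A minimal sketch of that pattern for a single column, assuming an
RDD[LogRow] named rows, with a hypothetical Option[Double] field standing in
for a column that may contain NAs:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// Hypothetical row type; bytes is None where the value is NA.
case class LogRow(deviceId: Long, bytes: Option[Double])

// Project the column, drop the NAs, then fold a summarizer over the rest.
val bytesSummary = rows
  .flatMap(_.bytes)                  // project + drop NAs in one pass
  .map(v => Vectors.dense(v))        // the summarizer consumes Vectors
  .aggregate(new MultivariateOnlineSummarizer)(
    (s, v) => s.add(v),
    (a, b) => a.merge(b))

println(bytesSummary.mean)           // one scalar per summarized column

Repeating this per column leaves you with exactly the vector of length
#columns mentioned above.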
Thank you Feynman for the lead.
I was able to modify the code using clues from the RegressionMetrics example.
Here is what I have now.
val deviceAggregateLogs = sc.textFile(logFile)
  .map(DailyDeviceAggregates.parseLogLine)
  .cache()

// Calculate statistics based on bytes-transferred
val
Dimensions mismatch when adding new sample. Expecting 8 but got 14.
Make sure all the vectors you are summarizing over have the same dimension.
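For context, the summarizer fixes its dimension on the first add, and every
later sample must match it. A standalone illustration (not from your code):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

val summarizer = new MultivariateOnlineSummarizer
summarizer.add(Vectors.dense(1.0, 2.0))         // dimension is now fixed at 2
// summarizer.add(Vectors.dense(1.0, 2.0, 3.0)) // would throw a dimension-mismatch error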
Why would you want to write a MultivariateOnlineSummarizer object (which can
be represented with a couple of Doubles) into a distributed filesystem like
HDFS?
Hello Feynman,
Actually, in my case the vectors I am summarizing over will not have the same
dimension, since many devices will be inactive on some days. This is at best a
sparse matrix, where we take only the active days and attempt to fit a moving
average over them.
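Roughly, the per-device computation I have in mind looks like the sketch
below; the keyed RDD shape and the 3-day window are my assumptions for
illustration:

// Assumed shape: keyedLogs is an RDD[(Long, (Int, Long))]
// of (deviceid, (eventdate, bytes)).
val movingAverages = keyedLogs
  .groupByKey()
  .mapValues { readings =>
    val byDate = readings.toArray.sortBy(_._1)  // order the active days
    byDate.map(_._2.toDouble)
      .sliding(3)                               // example window size
      .map(w => w.sum / w.length)
      .toArray
  }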
The reason I would like to
I have to do the following tasks on a dataset using Apache Spark with Scala as
the programming language:
- Read the dataset from HDFS. A few sample lines look like this (a parsing
sketch follows the sample):
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
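A minimal sketch of what parseLogLine (used in the code above) might look
like for this format; the record type and field types are my assumptions:

object DailyDeviceAggregates {
  // Hypothetical record for one line of the dataset.
  case class DeviceLog(deviceId: Long, bytes: Long, eventDate: Int)

  // Split a line like "15590657,246620,20150630" into a record.
  // (Assumes the header line has already been filtered out.)
  def parseLogLine(line: String): DeviceLog = {
    val fields = line.split(",")
    DeviceLog(fields(0).toLong, fields(1).toLong, fields(2).toInt)
  }
}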
The call to Sorting.quickSort is not working. Perhaps I am calling it the
wrong way.
allaggregates.toArray allocates and creates a new array, separate from
allaggregates, and it is that copy which Sorting.quickSort sorts in place;
allaggregates itself is never modified. Try:

val sortedAggregates = allaggregates.toArray
Sorting.quickSort(sortedAggregates)
A good example is RegressionMetrics
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48)
and its use of MultivariateOnlineSummarizer to aggregate statistics across
labels and residuals; take a look at how aggregateByKey is used.
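Adapted to your data, the same pattern might look like this sketch; the RDD
and field names are assumptions based on the code you posted:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// One summarizer per deviceid, each fed 1-dimensional vectors of bytes.
val statsByDevice = deviceAggregateLogs
  .map(log => (log.deviceId, Vectors.dense(log.bytes.toDouble)))
  .aggregateByKey(new MultivariateOnlineSummarizer)(
    (s, v) => s.add(v),
    (a, b) => a.merge(b))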
Thank you Feynman for your response. Since I am very new to Scala I may need a
bit more hand-holding at this stage.
I have been able to incorporate your suggestion about sorting, and it now
works perfectly. Thanks again for that.
I tried to use your suggestion of using