The reason we are not using MLlib or Breeze is the lack of control over the
data and performance. After computing the covariance matrix, there isn't
much more we can do with it, since many of the methods are private. For now,
we need the max value and the corresponding pair of columns; later we may
add other algorithms. The MLlib covariance computes the means and the
Gramian matrix in parallel, but after that I believe it falls back to
single-node computation. We have to bring everything back to a single node
to get the max, and making it parallel again hasn't worked well either.
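To make the "max value and corresponding pair of columns" step concrete, here is a minimal single-node sketch (not our actual code; the class and method names are made up for illustration): scan the lower triangle of a covariance matrix for the largest entry and the column pair that produced it.

```java
// Hypothetical sketch: find the maximum strictly-lower-triangle entry
// of a covariance matrix and the (row, col) pair it came from.
public class MaxCovPair {

    // cov is assumed square; only i > j entries are examined, since
    // cov(i, j) == cov(j, i) and the diagonal holds variances.
    // Returns {maxValue, colA, colB}.
    public static double[] maxOffDiagonal(double[][] cov) {
        double best = Double.NEGATIVE_INFINITY;
        int bi = -1, bj = -1;
        for (int i = 1; i < cov.length; i++) {
            for (int j = 0; j < i; j++) {
                if (cov[i][j] > best) {
                    best = cov[i][j];
                    bi = i;
                    bj = j;
                }
            }
        }
        return new double[] { best, bi, bj };
    }
}
```

This is exactly the step that is cheap locally but awkward to express as a distributed reduction over an RDD.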

The reason we are using Spark is that we want a simple way to distribute
data and work in parallel. I would prefer a SIMD/MPI type of approach, but I
have to work within this framework which is more of a MapReduce style. 

I'm looking into getting the code you sent working. It won't allow me to
reduce by key.

RE: cartesian: I agree that it is generating many copies of the data. That
was a last resort. It would be a huge benefit to everyone if we could access
RDDs like a list, array or hash map. 

Here is the covariance that works fast for us. We get the averages first in
O(N^2), then the differences (Vi - Avgi) in O(N^2), and then compute the
covariance in O(N^3) without having to redo those steps inside the inner
loop. You can see that I'm using Java code to compute the covariance
efficiently; the Scala code was very slow in comparison. We can next use JNI
to add HW acceleration. Matrix is a HashMap here. Also note that I am only
computing the lower triangle. I'm sure MLlib/Breeze make optimizations too.

This covariance is based on the two-pass algorithm, but we may change to a
one-pass approximation.
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.correlation/Covariance.java.html
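Since the actual code isn't reproduced here, the steps above can be sketched roughly as follows. This is an assumption-laden illustration, not our implementation: the class name, the "i,j" string keys into the HashMap, and the dense double[][] input are all invented for the example.

```java
// Hypothetical sketch of the two-pass covariance described above:
// pass 1 computes column means, pass 2 centers each value once
// (Vi - Avgi), and the final step accumulates only the lower
// triangle, since cov(a, b) == cov(b, a).
import java.util.HashMap;
import java.util.Map;

public class TwoPassCovariance {

    // data[row][col]; returns sample covariances keyed "a,b" with a >= b.
    public static Map<String, Double> covariance(double[][] data) {
        int rows = data.length, cols = data[0].length;

        // Pass 1: column means.
        double[] mean = new double[cols];
        for (double[] row : data)
            for (int j = 0; j < cols; j++) mean[j] += row[j];
        for (int j = 0; j < cols; j++) mean[j] /= rows;

        // Pass 2: center each value exactly once.
        double[][] centered = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                centered[i][j] = data[i][j] - mean[j];

        // Final step: lower triangle only.
        Map<String, Double> cov = new HashMap<>();
        for (int a = 0; a < cols; a++) {
            for (int b = 0; b <= a; b++) {
                double s = 0.0;
                for (int i = 0; i < rows; i++)
                    s += centered[i][a] * centered[i][b];
                cov.put(a + "," + b, s / (rows - 1));
            }
        }
        return cov;
    }
}
```

The point of centering up front is that the O(N^3) accumulation loop then only multiplies and adds, with no repeated mean subtraction inside it.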

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071p12140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
