The reason we are not using MLlib and Breeze is the lack of control over the data and over performance. Once the covariance matrix is computed, there isn't much more we can do with it, since many of the methods are private. For now we need the maximum value and the corresponding pair of columns; later we may run other algorithms. MLlib's covariance computes the means and the Gramian matrix in parallel, but after that I believe it is back to single-node computation. We have to bring everything back to a single node to find the max, and making it parallel again hasn't worked well either.
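For the "find the max and its column pair" step, here is a minimal single-node sketch; the `"i,j"` key format, the class name, and the method name are my own assumptions, matching the lower-triangle HashMap layout described below:

```java
import java.util.HashMap;
import java.util.Map;

public class MaxCovariance {
    // Scan lower-triangle covariance entries (keyed by a hypothetical
    // "i,j" column-pair string) and return the entry with the largest value.
    static Map.Entry<String, Double> maxPair(Map<String, Double> cov) {
        Map.Entry<String, Double> best = null;
        for (Map.Entry<String, Double> e : cov.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> cov = new HashMap<>();
        cov.put("1,0", 2.5);
        cov.put("2,0", -0.5);
        cov.put("2,1", 4.0);
        Map.Entry<String, Double> best = maxPair(cov);
        System.out.println(best.getKey() + " -> " + best.getValue());
        // prints "2,1 -> 4.0"
    }
}
```

This is exactly the step that currently forces a collect to the driver; the scan itself is trivial once the entries are on one node.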
The reason we are using Spark is that we want a simple way to distribute data and work in parallel. I would prefer a SIMD/MPI style of approach, but I have to work within this framework, which is more of a MapReduce style. I'm looking into getting the code you sent working; it won't allow me to reduce by key.

RE: cartesian: I agree that it generates many copies of the data. That was a last resort. It would be a huge benefit to everyone if we could access RDDs like a list, array, or hash map.

Here is the covariance that works fast for us. We get the averages first in O(N^2), then the differences (Vi - Avgi) in O(N^2), and then compute the covariance, without repeating the above steps, in O(N^3). You can see that I'm using Java code to compute the covariance efficiently; the Scala code was very slow in comparison. We can next use JNI to add hardware acceleration. Matrix is a HashMap here, and note that I am only filling in the lower triangle. I'm sure MLlib/Breeze make optimizations too. This covariance is based on the two-pass algorithm, but we may change to a one-pass approximation.

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.correlation/Covariance.java.html

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071p12140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
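The two-pass scheme described above (means first, then centered differences, then the lower-triangle products) can be sketched in plain Java as follows. This is a simplified single-node illustration, not the actual distributed code; the class name, method name, and `"i,j"` HashMap key format are my own assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class TwoPassCovariance {
    // Two-pass covariance: pass 1 computes column means, pass 2 centers
    // each column once (Vi - Avgi), and the final loop fills only the
    // lower triangle of the matrix into a HashMap keyed by "i,j".
    static Map<String, Double> covariance(double[][] cols) {
        int n = cols.length;        // number of columns
        int rows = cols[0].length;  // observations per column

        // Pass 1: column averages, O(N^2) in total entries.
        double[] mean = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (double v : cols[i]) sum += v;
            mean[i] = sum / rows;
        }

        // Pass 2: center each column once, so the differences are not
        // recomputed inside the O(N^3) step below.
        double[][] centered = new double[n][rows];
        for (int i = 0; i < n; i++)
            for (int r = 0; r < rows; r++)
                centered[i][r] = cols[i][r] - mean[i];

        // Lower-triangle covariance from the centered columns.
        Map<String, Double> cov = new HashMap<>();
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double dot = 0.0;
                for (int r = 0; r < rows; r++)
                    dot += centered[i][r] * centered[j][r];
                cov.put(i + "," + j, dot / (rows - 1));
            }
        }
        return cov;
    }
}
```

Centering the columns once up front is the point of the two-pass approach: the cubic final loop only does multiply-adds, which is also what makes it a good candidate for JNI/hardware acceleration later.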