Could you try to map it to row-major first? Your approach may generate multiple copies of the data. The code should look like this:
~~~
val rows = rdd.flatMap { case (j, values) =>
  values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) }
}.groupByKey().map { case (i, entries) =>
  Vectors.dense(entries.toSeq.sortBy(_._1).map(_._2.toDouble).toArray)
}
val mat = new RowMatrix(rows)
val cov = mat.computeCovariance()
~~~

On Wed, Aug 13, 2014 at 3:56 PM, ldmtwo <ldm...@gmail.com> wrote:
> Need help getting around these errors.
>
> I have this program that runs fine on smaller input sizes. As the input gets
> larger, Spark has increasing difficulty being efficient and functioning
> without errors. We have about 46GB free on each node. The workers and
> executors are configured to use this up (the only way to avoid Heap Space
> or GC overhead errors). On the driver, the data uses only 1.2GB of RAM and
> is in the form of matrix: RDD[(Integer, Array[Float])]. It's a column-major
> matrix with dimensions 15k x 20k (columns). Each column takes about
> 4*15k = 60KB, and 60KB*20k = 1.2GB. The data is not even that large.
> Eventually, I want to test 60k x 70k.
>
> The covariance matrix algorithm we are using is basically O(N^3). At
> minimum, the outer loop needs to be parallelized:
>
>   for each column i in matrix
>     for each column j in matrix
>       get the covariance between columns i and j
>
> Covariance is practically this (no need to parallelize, since we have
> enough work to do and this part is small):
>
>   for the two columns, get the sum of squares. O(N)
>
> Since I can't figure out any other way to do a permutation or nested for
> loop over an RDD, I had to call matrix.cartesian(matrix).map { pair => ... }.
> I could do 5k x 5k (1/4th of the work) using a HashMap instead of an RDD
> and finish in 10 seconds. If I partition with 3k, it takes 18 hours; 300
> takes 12 hours; 200 fails (error #1); 16 would be ideal (error #2). Note
> that I set the Akka frame size (spark-defaults.conf) to 15 to address some
> of the other errors with Akka.
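For reference, here is a minimal, self-contained sketch of the column-major-to-row-major conversion and covariance computation. It assumes Spark MLlib is on the classpath and that `sc` is an existing SparkContext; the helper name `covarianceFromColumns` is mine, not from the thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object CovarianceExample {
  // cols: one (columnIndex, columnValues) pair per column, as in the
  // original RDD[(Integer, Array[Float])].
  def covarianceFromColumns(cols: RDD[(Int, Array[Float])]): Matrix = {
    // Emit one (rowIndex, (colIndex, value)) triple per matrix entry,
    // then regroup by row index so each row becomes a dense vector.
    val rows = cols.flatMap { case (j, values) =>
      values.zipWithIndex.map { case (v, i) => (i, (j, v)) }
    }.groupByKey().map { case (_, entries) =>
      // Sort entries by column index so vector positions line up,
      // and widen Float to Double for Vectors.dense.
      Vectors.dense(entries.toSeq.sortBy(_._1).map(_._2.toDouble).toArray)
    }
    new RowMatrix(rows).computeCovariance()
  }

  def run(sc: SparkContext): Unit = {
    // Tiny 3x2 example matrix, stored column-major as in the question.
    val cols = sc.parallelize(Seq(
      (0, Array(1.0f, 2.0f, 3.0f)),
      (1, Array(2.0f, 4.0f, 6.0f))
    ))
    println(covarianceFromColumns(cols))
  }
}
```

Each matrix entry is shuffled once, so the cost is one pass over the data plus a single groupByKey, rather than the O(N^2) pairs materialized by `matrix.cartesian(matrix)`.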
>
> This is error #1
>
> [error log omitted]
>
> This is error #2
>
> [error log omitted]
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------