Could you try mapping it to row-major format first? Your approach may
generate multiple copies of the data. The code should look like this:

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// flatMap (not map), so groupByKey sees an RDD[(Int, (Int, Float))]
val rows = rdd.flatMap { case (j, values) =>
  values.view.zipWithIndex.map { case (v, i) =>
    (i, (j, v))
  }
}.groupByKey().map { case (i, entries) =>
  // order each row's entries by column index; Vectors.dense takes Double
  Vectors.dense(entries.toSeq.sortBy(_._1).map(_._2.toDouble).toArray)
}

val mat = new RowMatrix(rows)
val cov = mat.computeCovariance()
~~~
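The regrouping above can be checked without a cluster. Here is a minimal sketch of the same column-major to row-major transpose using plain Scala collections (the function name `toRows` is just for illustration): column j contributes its i-th value to row i, rows are grouped by i, and each row is sorted by column index.

~~~
// Hedged sketch: plain-collections version of the Spark transpose above.
def toRows(cols: Seq[(Int, Array[Float])]): Seq[Array[Float]] = {
  cols.flatMap { case (j, values) =>
    values.zipWithIndex.map { case (v, i) => (i, (j, v)) }
  }
  .groupBy(_._1)      // group entries by row index i
  .toSeq.sortBy(_._1) // emit rows in order
  .map { case (_, entries) =>
    // within a row, order values by column index j
    entries.map(_._2).sortBy(_._1).map(_._2).toArray
  }
}
~~~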

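For reference, the per-pair covariance step described in the quoted message below can be sketched like this. It uses the one-pass population formula E[xy] - E[x]E[y]; note that RowMatrix.computeCovariance applies the (n - 1) sample correction instead, so the two will differ slightly.

~~~
// Hedged sketch: covariance of two columns, the O(N) inner step of the
// O(N^2)-pairs loop described in the quoted message.
def covariance(x: Array[Float], y: Array[Float]): Double = {
  require(x.length == y.length && x.length > 0)
  val n = x.length.toDouble
  var sx = 0.0; var sy = 0.0; var sxy = 0.0
  var i = 0
  while (i < x.length) {
    sx += x(i); sy += y(i); sxy += x(i).toDouble * y(i)
    i += 1
  }
  sxy / n - (sx / n) * (sy / n) // E[xy] - E[x]E[y]
}
~~~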
On Wed, Aug 13, 2014 at 3:56 PM, ldmtwo <ldm...@gmail.com> wrote:
> Need help getting around these errors.
>
> I have a program that runs fine on smaller input sizes. As the input gets
> larger, Spark has increasing difficulty running efficiently and without
> errors. We have about 46GB free on each node. The workers and
> executors are configured to use this up (the only way not to have Heap Space
> or GC overhead errors). On the driver, the data only uses 1.2GB RAM and is
> in the form of /matrix: RDD[(Integer, Array[Float])]/. It's a matrix that is
> column major with dimensions of 15k x 20k (columns). Each column takes about
> 4*15k = 60KB. 60KB*20k = 1.2GB. The data is not even that large. Eventually,
> I want to test 60k x 70k.
>
> The covariance matrix algorithm we are using is basically O(N^3). At
> minimum, the outer loop needs to be parallelized.
>   for each column i in matrix
>      for each column j in matrix
>           get the covariance between columns i and j
>
> Covariance is essentially this (no need to parallelize it, since we already
> have enough work to do and this step is small):
> for the two columns, get the sum of squares. O(N)
>
>
> Since I can't figure out any other way to do a permutation or nested for
> loop on an RDD, I had to call matrix.cartesian(matrix).map{ pair => ... }. I
> could do 5kx5k (1/4th of the work) using HashMap instead of RDD and finish
> in 10 sec. If I partition with 3k, it takes 18 hours. 300 takes 12 hours.
> 200 fails (error #1). 16 would be ideal (error #2). Note that I set the Akka
> frame size (spark-defaults.conf) to 15 to address some of the other errors
> with Akka.
>
>
> This is error #1
>
> This is error 2
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
