Could you try mapping it to row-major first? Your approach may
generate multiple copies of the data. The code should look like this:
~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = rdd.flatMap { case (j, values) =>
  // tag each value in column j with its row index i, keyed by i for regrouping
  values.view.zipWithIndex.map { case (v, i) =>
    (i, (j, v))
  }
}.groupByKey().map { case (i, entries) =>
  // reassemble row i by sorting its entries by column index j
  Vectors.dense(entries.toSeq.sortBy(_._1).map(_._2.toDouble).toArray)
}
val mat = new RowMatrix(rows)
val cov = mat.computeCovariance()
~~~
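The column-major-to-row-major regrouping can be sanity-checked on plain Scala collections, without a Spark cluster. This is a sketch only: the `Transpose` object, the `toRowMajor` helper, and the 2x3 matrix are made up for illustration, and `groupBy` on a local `Seq` stands in for the RDD's `groupByKey`.

```scala
object Transpose {
  // Regroup column-major (colIndex, colValues) pairs into row-major arrays,
  // mirroring the flatMap + groupByKey + sortBy pipeline above.
  def toRowMajor(cols: Seq[(Int, Array[Double])]): Seq[Array[Double]] =
    cols
      .flatMap { case (j, values) =>
        // tag each value with its row index i, keyed by i
        values.zipWithIndex.map { case (v, i) => (i, (j, v)) }
      }
      .groupBy(_._1)          // local stand-in for groupByKey
      .toSeq
      .sortBy(_._1)           // rows in order
      .map { case (_, entries) =>
        // sort each row's entries by column index j, keep the values
        entries.map(_._2).sortBy(_._1).map(_._2).toArray
      }

  def main(args: Array[String]): Unit = {
    // illustration data: a 2x3 matrix stored as three columns
    val cols = Seq((0, Array(1.0, 2.0)), (1, Array(3.0, 4.0)), (2, Array(5.0, 6.0)))
    val rows = toRowMajor(cols)
    println(rows.map(_.mkString("[", ", ", "]")).mkString(" "))
  }
}
```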
On Wed, Aug 13, 2014 at 3:56 PM, ldmtwo ldm...@gmail.com wrote:
Need help getting around these errors.
I have this program that runs fine on smaller input sizes. As the input
grows, Spark has increasing difficulty running efficiently and without
errors. We have about 46GB free on each node, and the workers and executors
are configured to use all of it (the only way to avoid Heap Space or GC
overhead errors). On the driver, the data uses only 1.2GB of RAM and has
the form matrix: RDD[(Integer, Array[Float])]. It is a column-major matrix
with dimensions 15k x 20k (columns). Each column takes about
4 * 15k = 60KB, and 60KB * 20k = 1.2GB. The data is not even that large.
Eventually, I want to test 60k x 70k.
The covariance matrix algorithm we are using is basically O(N^3). At minimum,
the outer loop needs to be parallelized:
for each column i in matrix
  for each column j in matrix
    get the covariance between columns i and j
Computing the covariance of a single pair is small (no need to parallelize
it, since we already have enough work to distribute): for the two columns,
get the sum of squares, O(N).
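The per-pair O(N) step can be sketched as a single pass over the two columns. This is an illustrative sketch, not the poster's actual code: the `Cov` object and `covariance` function are hypothetical names, and it computes the population covariance (dividing by n; sample covariance would divide by n - 1).

```scala
object Cov {
  // Population covariance of two equal-length columns in one O(N) pass:
  // cov(x, y) = E[xy] - E[x] * E[y]
  def covariance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 0, "columns must match and be non-empty")
    val n = x.length.toDouble
    var sx = 0.0; var sy = 0.0; var sxy = 0.0
    var i = 0
    while (i < x.length) {
      sx += x(i); sy += y(i); sxy += x(i) * y(i)
      i += 1
    }
    sxy / n - (sx / n) * (sy / n)
  }

  def main(args: Array[String]): Unit =
    // y = 2x, so cov(x, y) = 2 * var(x)
    println(covariance(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0)))
}
```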
Since I can't figure out any other way to do a permutation or nested for
loop over an RDD, I had to call matrix.cartesian(matrix).map { pair => ... }.
I could do 5k x 5k (1/4th of the work) using a HashMap instead of an RDD and
finish in 10 seconds. If I partition with 3k partitions, it takes 18 hours;
300 takes 12 hours; 200 fails (error #1); 16 would be ideal (error #2). Note
that I set the Akka frame size (spark-defaults.conf) to 15 to address some
of the other Akka errors.
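The cartesian-style nested loop described above can be sketched on plain Scala collections. This is an illustration of the pattern only, not the poster's code: `PairwiseCov`, `covMatrix`, and the tiny input columns are made-up names and data, and a local for-comprehension stands in for `matrix.cartesian(matrix)`. Note it materializes all n^2 pairs, which is exactly what makes the RDD version expensive at 20k columns.

```scala
object PairwiseCov {
  // Emit every (i, j) column pair, then compute the covariance per pair,
  // mimicking matrix.cartesian(matrix).map { pair => ... } on collections.
  def covMatrix(cols: Seq[(Int, Array[Double])]): Map[(Int, Int), Double] = {
    def cov(x: Array[Double], y: Array[Double]): Double = {
      val n = x.length.toDouble
      val mx = x.sum / n
      val my = y.sum / n
      x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / n
    }
    // cartesian product: n^2 pairs, including both (i, j) and (j, i)
    (for {
      (i, ci) <- cols
      (j, cj) <- cols
    } yield ((i, j), cov(ci, cj))).toMap
  }
}
```

Since the covariance matrix is symmetric, roughly half of these pairs are redundant, which is one reason the row-major `RowMatrix.computeCovariance` approach suggested in the reply scales better.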
This is error #1:
[error log not preserved in the archive]
This is error #2:
[error log not preserved in the archive]
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org