The PCA.fit function calls the RowMatrix PCA routine, which attempts to
construct the covariance matrix locally on the driver, and then computes
the SVD of that to get the PCs. I'm not sure what's causing the memory
error: the allocation at RowMatrix.scala:124 should only need about 3.5 GB of
memory (n*(n+1)/2 doubles with n = 29604).
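For reference, a minimal sketch of that memory arithmetic (n = 29604 is taken from above; 8 bytes per Double is standard JVM sizing):

```scala
object CovarianceSize {
  def main(args: Array[String]): Unit = {
    val n = 29604L
    // The upper triangle of the n x n covariance matrix, stored packed:
    val entries = n * (n + 1) / 2
    val bytes = entries * 8L // 8 bytes per Double
    println(f"${bytes / 1e9}%.2f GB") // about 3.5 GB
  }
}
```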
I'm using Spark 1.5.1
When I turned on DEBUG, I don't see anything that looks useful. Other than
the INFO outputs, there is a ton of RPC message related logs, and this bit:
16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.rdd.RDD$$anonfun$count$1) +++
16/01/13 05:53
BTW, yes the referenced s3 bucket does exist, and
hdfs dfs -ls s3n://agittens/CFSRArawtars
does list the entries, although it first prints the same warnings:
2015-12-10 00:26:53,815 WARN httpclient.RestS3Service (RestS3Service.java:performRequest(393)) - Response '/CFSRArawtars' - Unexpected res
Thanks, the issue was indeed the dfs replication factor. To fix it without
entirely clearing out HDFS and rebooting, I first ran
hdfs dfs -setrep -R -w 1 /
to reduce all the current files' replication factor to 1 recursively from
the root, then I changed the dfs.replication factor in
ephemeral-hdfs
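For anyone following along, the persistent setting lives in hdfs-site.xml under the ephemeral-hdfs configuration directory (the exact file location varies by spark-ec2 version, so treat the path as an assumption); the relevant property looks like this:

```xml
<!-- hdfs-site.xml: default replication factor for newly written blocks -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Note that this only affects files written after the change, which is why the existing files needed the hdfs dfs -setrep pass above.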
It sounds like you've already computed the covariance matrix. You can
convert it to a Breeze matrix and then use breeze.linalg.det:

val determinant = breeze.linalg.det(
  mat.toBreeze.asInstanceOf[breeze.linalg.DenseMatrix[Double]])
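As a sanity check, here is a standalone Breeze sketch of the same det call on a small matrix (Breeze ships with Spark's MLlib; the 2x2 values are just for illustration):

```scala
import breeze.linalg.{DenseMatrix, det}

object DetSketch {
  def main(args: Array[String]): Unit = {
    // det([[4, 2], [1, 3]]) = 4*3 - 2*1 = 10
    val m = DenseMatrix((4.0, 2.0), (1.0, 3.0))
    println(det(m))
  }
}
```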
On Mon, Aug 24, 2015 at 4:10 AM, Naveen wrote:
> Hi,
>
> Is ther
Thanks. Repartitioning to a smaller number of partitions seems to fix my
issue, but I'll keep broadcasting in mind (droprows is an integer array
with about 4 million entries).
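A minimal sketch of the explicit-broadcast pattern being suggested, assuming droprows is used for membership tests inside a closure (the variable names and the tiny stand-in data are illustrative, not from the original thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // Stand-in for the ~4-million-entry integer array; a Set makes
    // the per-element membership test O(1).
    val droprows = Set(1, 5, 9)

    // Broadcast ships the value to each executor once, instead of
    // serializing it into every task closure.
    val droprowsB = sc.broadcast(droprows)

    val kept = sc.parallelize(0 until 20)
      .filter(i => !droprowsB.value.contains(i))
      .count()
    println(kept) // 20 rows minus the 3 dropped ones = 17

    sc.stop()
  }
}
```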
On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver
wrote:
> How big is droprows?
>
> Try explicitly broadcasting it like thi
I have a similar use case, so I wrote a Python script to fix the cluster
configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster
with enough machines that HDFS can hold 1 TB (so use instance
types that have SSDs), then follow the instructions at
http://thousandfold.net/cz
I think the issue was NOT with Spark. I was running a Spark program that
dumped output to a binary file and then calling a Scala program to read it
and write out Matrix Market format files. The issue seems to have been with
the classpath of the Scala program, and went away when I added the spark
ja