The PCA.fit function calls the RowMatrix PCA routine, which attempts to
construct the covariance matrix locally on the driver, and then computes
the SVD of that to get the PCs. I'm not sure what's causing the memory
error: the allocation at RowMatrix.scala:124 should only need about 3.5 GB of
memory (n*(n+1)/2 doubles with n = 29604).
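For reference, a minimal sketch of that memory arithmetic (n = 29604 is taken from above; 8 bytes per Double is standard JVM sizing):

```scala
object CovarianceSize {
  def main(args: Array[String]): Unit = {
    val n = 29604L
    // The upper triangle of the n x n covariance matrix, stored packed:
    val entries = n * (n + 1) / 2
    val bytes = entries * 8L // 8 bytes per Double
    println(f"${bytes / 1e9}%.2f GB") // about 3.5 GB
  }
}
```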
I'm using Spark 1.5.1
When I turned on DEBUG, I don't see anything that looks useful. Other than
the INFO outputs, there is a ton of RPC message related logs, and this bit:
16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.rdd.RDD$$anonfun$count$1) +++
16/01/13 05:53
BTW, yes the referenced s3 bucket does exist, and
hdfs dfs -ls s3n://agittens/CFSRArawtars
does list the entries, although it first prints the same warnings:
2015-12-10 00:26:53,815 WARN httpclient.RestS3Service (RestS3Service.java:performRequest(393)) - Response '/CFSRArawtars' - Unexpected res
Thanks, the issue was indeed the dfs replication factor. To fix it without
entirely clearing out HDFS and rebooting, I first ran
hdfs dfs -setrep -R -w 1 /
to reduce all the current files' replication factor to 1 recursively from
the root, then I changed the dfs.replication factor in
ephemeral-hdfs
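For anyone following along, the persistent setting lives in hdfs-site.xml under the ephemeral-hdfs configuration directory (the exact file location varies by spark-ec2 version, so treat the path as an assumption); the relevant property looks like this:

```xml
<!-- hdfs-site.xml: default replication factor for newly written blocks -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Note that this only affects files written after the change, which is why the existing files needed the hdfs dfs -setrep pass above.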
It sounds like you've already computed the covariance matrix. You can
convert it to a Breeze matrix and then use breeze.linalg.det:

val determinant = breeze.linalg.det(
  mat.toBreeze.asInstanceOf[breeze.linalg.DenseMatrix[Double]])
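As a sanity check, here is a standalone Breeze sketch of the same det call on a small matrix (Breeze ships with Spark's MLlib; the 2x2 values are just for illustration):

```scala
import breeze.linalg.{DenseMatrix, det}

object DetSketch {
  def main(args: Array[String]): Unit = {
    // det([[4, 2], [1, 3]]) = 4*3 - 2*1 = 10
    val m = DenseMatrix((4.0, 2.0), (1.0, 3.0))
    println(det(m))
  }
}
```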
On Mon, Aug 24, 2015 at 4:10 AM, Naveen wrote:
> Hi,
>
> Is ther
Thanks. Repartitioning to a smaller number of partitions seems to fix my
issue, but I'll keep broadcasting in mind (droprows is an integer array
with about 4 million entries).
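A minimal sketch of the explicit-broadcast pattern being suggested, assuming droprows is used for membership tests inside a closure (the variable names and the tiny stand-in data are illustrative, not from the original thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // Stand-in for the ~4-million-entry integer array; a Set makes
    // the per-element membership test O(1).
    val droprows = Set(1, 5, 9)

    // Broadcast ships the value to each executor once, instead of
    // serializing it into every task closure.
    val droprowsB = sc.broadcast(droprows)

    val kept = sc.parallelize(0 until 20)
      .filter(i => !droprowsB.value.contains(i))
      .count()
    println(kept) // 20 rows minus the 3 dropped ones = 17

    sc.stop()
  }
}
```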
On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver
wrote:
> How big is droprows?
>
> Try explicitly broadcasting it like thi
I have a similar use case, so I wrote a Python script to fix the cluster
configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster
with enough machines that HDFS can hold 1 TB (so use instance
types that have SSDs), then follow the instructions at
http://thousandfold.net/cz
I think the issue was NOT with Spark. I was running a Spark program that
dumped output to a binary file and then calling a Scala program to read it
and write out Matrix Market format files. The issue seems to have been with
the classpath of the Scala program, and went away when I added the spark
ja