Re: PCA OutOfMemoryError

2016-01-13 Thread Alex Gittens
The PCA.fit function calls the RowMatrix PCA routine, which attempts to construct the covariance matrix locally on the driver, and then computes the SVD of that to get the PCs. I'm not sure what's causing the memory error: RowMatrix.scala:124 is only using 3.5 GB of memory (n*(n+1)/2 with n=29604
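The 3.5 GB figure can be sanity-checked with quick arithmetic; a minimal sketch (plain Python, not the Spark code path itself) of the n*(n+1)/2-doubles estimate:

```python
# Rough driver-memory estimate for materializing the upper triangle of an
# n x n covariance (Gramian) matrix as 8-byte doubles: n*(n+1)/2 entries.
def gramian_bytes(n: int) -> int:
    return n * (n + 1) // 2 * 8

n = 29604
size = gramian_bytes(n)
print(size)         # 3505705680 bytes
print(size / 1e9)   # ~3.5 GB, matching the figure above
```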

Re: failure to parallelize an RDD

2016-01-12 Thread Alex Gittens
I'm using Spark 1.5.1. When I turned on DEBUG, I didn't see anything that looked useful. Other than the INFO outputs, there are a ton of RPC-message-related logs, and this bit: 16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.rdd.RDD$$anonfun$count$1) +++ 16/01/13
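For reference, the usual way to turn on DEBUG output in Spark 1.x is through conf/log4j.properties on the driver and executors; a minimal sketch, using the stock Log4j 1.x console appender (adjust the pattern to taste):

```properties
# conf/log4j.properties - raise the root logger to DEBUG
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```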

Re: distcp suddenly broken with spark-ec2 script setup

2015-12-09 Thread Alex Gittens
BTW, yes, the referenced S3 bucket does exist, and hdfs dfs -ls s3n://agittens/CFSRArawtars does list the entries, although it first prints the same warnings: 2015-12-10 00:26:53,815 WARN httpclient.RestS3Service (RestS3Service.java:performRequest(393)) - Response '/CFSRArawtars' - Unexpected

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Alex Gittens
Thanks, the issue was indeed the dfs replication factor. To fix it without entirely clearing out HDFS and rebooting, I first ran hdfs dfs -setrep -R -w 1 / to reduce all the current files' replication factor to 1 recursively from the root, then I changed the dfs.replication factor in
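For reference, the replication default for newly written files is the dfs.replication property in hdfs-site.xml; a minimal sketch of the setting (note that setrep only rewrites the factor on existing files, while this property governs files created afterwards):

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```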

Re: spark hangs at broadcasting during a filter

2015-08-06 Thread Alex Gittens
Thanks. Repartitioning to a smaller number of partitions seems to fix my issue, but I'll keep broadcasting in mind (droprows is an integer array with about 4 million entries). On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver philip.wea...@gmail.com wrote: How big is droprows? Try explicitly
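The cost being weighed here is worth quantifying: a closure-captured array is serialized into every task, whereas a broadcast variable ships once per executor. A rough sketch (plain Python with pickle, only an approximation of Spark's actual closure serializer) of the per-task payload for a 4-million-entry integer array:

```python
import pickle

# Stand-in for the ~4-million-entry droprows array mentioned above.
droprows = list(range(4_000_000))

payload = pickle.dumps(droprows)
print(f"serialized size: {len(payload) / 1e6:.1f} MB")  # roughly 20 MB

# Shipping ~20 MB with every task adds up across thousands of tasks;
# a broadcast variable would send it once per executor instead.
```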

Re: breeze.linalg.DenseMatrix not found

2015-07-01 Thread Alex Gittens
I think the issue was NOT with Spark. I was running a Spark program that dumped output to a binary file, then calling a Scala program to read it and write out Matrix Market format files. The issue seems to have been with the classpath of the Scala program, and it went away when I added the Spark

Re: Need clarification on spark on cluster set up instruction

2015-07-01 Thread Alex Gittens
I have a similar use case, so I wrote a Python script to fix the cluster configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster with enough machines that HDFS can hold 1 TB (so use instance types that have SSDs), then follow the instructions at