Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
All, I'm using the Spark shell to interact with a small test deployment of Spark, built from the current master branch. I'm processing a dataset comprising a few thousand objects on Google Cloud Storage, split into a half dozen directories. My code constructs an object--let me call it the Dataset

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS. On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com wrote: All, I'm using the Spark

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Denny, No, gsutil scans through the listing of the bucket quickly. See the following.

alex@hadoop-m:~/split$ time bash -c gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l
6860

real 0m6.971s
user 0m1.052s
sys 0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
Oh, it makes sense that gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in scans that are just as fast? On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta alexbare...@gmail.com wrote: Denny, No, gsutil scans through the listing of the bucket

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Well, what do you suggest I run to test this? But more importantly, what information would this give me? On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote: Oh, it makes sense that gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
To connect to GCS, Spark uses the Hadoop and GCS connector jars. I'm wondering if it's those connection points that are ultimately slowing down the connection between Spark and GCS. I asked whether you could run bdutil because it would be basically Hadoop
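
A minimal sketch of how the comparison discussed above might be run from a bash prompt: time the same listing once through gsutil and once through the Hadoop filesystem layer, which goes through the GCS connector. The helper name count_listing is hypothetical, and the sketch assumes that the hadoop command is on the PATH with the GCS connector configured; the bucket path is the one from Alessandro's gsutil example. The two tools format their output differently, so the counts may not match exactly; the elapsed time is the figure of interest.

```shell
# Hypothetical helper: run a listing command and print how many lines it emits.
count_listing() {
  "$@" | wc -l
}

# Assumed usage from a bash prompt (not executed here; both commands need
# credentials, and the second needs the GCS connector on Hadoop's classpath):
#   time count_listing gsutil ls 'gs://my-bucket/20141205/csv/*/*/*'
#   time count_listing hadoop fs -ls 'gs://my-bucket/20141205/csv/*/*/*'
```

If the Hadoop listing is much slower than the gsutil one, that would point at the connector path rather than at Spark itself.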

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct Google Cloud Storage directories. So the question becomes the following: what computation happens when calling the union method on two RDDs? On