All,
I'm using the Spark shell to interact with a small test deployment of
Spark, built from the current master branch. I'm processing a dataset
comprising a few thousand objects on Google Cloud Storage, split into a
half dozen directories. My code constructs an object, let me call it the
Dataset
I'm curious if you're seeing the same thing when using bdutil against GCS?
I'm wondering if this may be an issue with the transfer rate along the path
Spark -> Hadoop -> GCS connector -> GCS.
On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com
wrote:
Denny,
No, gsutil scans through the listing of the bucket quickly. See the
following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s
Alex
On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would result in scans that are
just as fast?
On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta alexbare...@gmail.com
wrote:
Well, what do you suggest I run to test this? But more importantly, what
information would this give me?
On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:
Spark connects to GCS through the Hadoop and GCS connector jars. I'm
wondering if it's those connection points that are ultimately slowing down
the transfer between Spark and GCS.
The reason I was asking if you could run bdutil is that it would be
basically Hadoop
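One hedged way to isolate the Hadoop + GCS connector layer without Spark at all, assuming the bdutil-deployed cluster has `hadoop` on the PATH and the connector configured (the bucket path below is the one from the gsutil test; adjust to your layout):

```shell
# Time the same recursive listing, but through the Hadoop FileSystem shell,
# which goes via the GCS connector rather than gsutil's own API client.
# Compare the wall-clock time against the ~7s gsutil baseline above.
time hadoop fs -ls 'gs://my-bucket/20141205/csv/*/*/*' | wc -l
```

If this is dramatically slower than gsutil, the bottleneck is in the Hadoop/connector path rather than in Spark itself.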
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes: what
computation happens when calling the union method on two RDDs?
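For reference, the pattern in question looks roughly like this (a minimal Spark-shell sketch; the bucket path and directory names are illustrative, not the actual ones):

```scala
// Build one RDD as the union of several textFile RDDs, one per GCS directory.
// Both textFile and union are lazy transformations: union just returns a
// UnionRDD wrapping its parents without touching any data. The file listing
// through the Hadoop/GCS connector happens when the input splits (partitions)
// are computed, which requires enumerating every file in each directory.
val dirs = Seq(
  "gs://my-bucket/20141205/csv/a/*/*",
  "gs://my-bucket/20141205/csv/b/*/*"
)
val rdds = dirs.map(d => sc.textFile(d))
val combined = rdds.reduce(_ union _) // equivalently: sc.union(rdds)
combined.count()                      // action: triggers listing and reads
```

So if wall-clock time is being spent around the union, a plausible suspect is the per-directory file enumeration against GCS rather than the union operation itself.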