Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS?
I'm wondering if this may be an issue concerning the transfer rate of Spark
-> Hadoop -> GCS Connector -> GCS.

On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com
wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex
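
[For illustration, a minimal Scala sketch of the kind of constructor
described above; the Dataset class, dirs parameter, and directory layout
are hypothetical, while the bucket path mirrors the gsutil command later
in the thread. Defining RDDs this way is lazy, which is consistent with
the constructor returning immediately:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical sketch: one RDD per GCS directory. sc.textFile only
// records the path; nothing is listed or read at construction time.
class Dataset(sc: SparkContext, dirs: Seq[String]) {
  val rdds: Seq[RDD[String]] =
    dirs.map(d => sc.textFile(s"gs://my-bucket/20141205/csv/$d/*"))
  println("Dataset constructed")  // prints immediately, as observed
}
]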



Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Denny,

No, gsutil scans through the listing of the bucket quickly. See the
following.

alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"

6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex
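
[As a point of comparison, the same glob can be timed through the Hadoop
FileSystem API, which is the layer Spark actually uses to reach GCS. A
sketch, assuming the GCS connector is configured on the cluster and sc is
the shell's SparkContext:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Time the listing through the Hadoop/GCS-connector layer, mirroring
// the gsutil test above. globStatus returns null when nothing matches.
val fs = FileSystem.get(new URI("gs://my-bucket"), sc.hadoopConfiguration)
val t0 = System.nanoTime
val matches = fs.globStatus(new Path("gs://my-bucket/20141205/csv/*/*/*"))
val n = if (matches == null) 0 else matches.length
println(f"$n%d objects in ${(System.nanoTime - t0) / 1e9}%.1fs")
]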





Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would result in just as fast
scans?





Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Well, what do you suggest I run to test this? But more importantly, what
information would this give me?





Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
For Spark to connect to GCS, it utilizes the Hadoop and GCS connector jars
for connectivity. I'm wondering if it's those connection points that are
ultimately slowing down the connection between Spark and GCS.

The reason I was asking if you could run bdutil is that it would basically
be Hadoop connecting to GCS. If that's just as slow, it would point to the
root cause: the Hadoop connection is what's slowing things down, rather
than something specific to Spark per se.
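
[Concretely, the "Total input paths to process" lines in the original log
are printed by Hadoop's FileInputFormat, not by Spark itself: when a
Hadoop-backed RDD's partitions are first computed, Spark calls getSplits,
which lists the directory through the GCS connector. A sketch of the
trigger, with a hypothetical directory name:

// sc.textFile does no I/O by itself; the GCS listing happens on the
// first computation of the RDD's partitions, via FileInputFormat.
val rdd = sc.textFile("gs://my-bucket/20141205/csv/dir1/*")  // lazy
rdd.partitions.length  // forces getSplits -> "Total input paths to process : N"
]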




Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes: what
computation happens when calling the union method on two RDDs?
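
[For what it's worth, union itself should be cheap: it just builds a
UnionRDD over its parents without touching the data. The directory
listings are triggered when something first asks for the unioned RDD's
partitions, which forces getPartitions on every parent, one
FileInputFormat listing per directory. That would also explain why the
log lines earlier in the thread appear minutes apart, one directory at a
time. A sketch, with hypothetical directory names and the bucket path as
in the thread:

// union is lazy, but computing the union's partitions forces a
// listing of every underlying directory.
val dirs = Seq("dir1", "dir2", "dir3")
val perDir = dirs.map(d => sc.textFile(s"gs://my-bucket/20141205/csv/$d/*"))
val all = sc.union(perDir)  // still lazy: nothing listed yet
all.partitions.length       // one GCS listing per directory
]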
