Re: Spark Shell slowness on Google Cloud
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue with the transfer rate along the Spark -> Hadoop -> GCS connector -> GCS path.

On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com wrote:

All,

I'm using the Spark shell to interact with a small test deployment of Spark, built from the current master branch. I'm processing a dataset comprising a few thousand objects on Google Cloud Storage, split into a half dozen directories. My code constructs an object--let me call it the Dataset object--that defines a distinct RDD for each directory. The constructor of the object only defines the RDDs; it does not actually evaluate them, so I would expect it to return very quickly. Indeed, the logging code in the constructor prints a line signaling the completion of the code almost immediately after invocation, but the Spark shell does not show the prompt right away. Instead, it spends a few minutes seemingly frozen, eventually producing the following output:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156

This stage is inexplicably slow. What could be happening?

Thanks.

Alex
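For reference, a minimal sketch of the kind of constructor described above, assuming the Dataset object simply defines one textFile RDD per directory (the class name, directory names, and glob pattern here are illustrative, not the actual code):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Illustrative sketch only: defines one RDD per directory without
    // evaluating anything, so the constructor itself should return quickly.
    class Dataset(sc: SparkContext, directories: Seq[String]) {
      val rdds: Map[String, RDD[String]] =
        directories
          .map(dir => dir -> sc.textFile(s"gs://my-bucket/20141205/csv/$dir/*/*"))
          .toMap
      println(s"Dataset defined with ${rdds.size} RDDs")  // prints almost immediately
    }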
Re: Spark Shell slowness on Google Cloud
Denny,

No, gsutil scans through the listing of the bucket quickly. See the following:

    alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
    6860

    real    0m6.971s
    user    0m1.052s
    sys     0m0.096s

Alex
Re: Spark Shell slowness on Google Cloud
Oh, it makes sense that gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast a scan?
Re: Spark Shell slowness on Google Cloud
Well, what do you suggest I run to test this? But more importantly, what information would this give me?
Re: Spark Shell slowness on Google Cloud
For Spark to connect to GCS, it uses the Hadoop and GCS connector jars. I'm wondering if those connection points are what is ultimately slowing down the connection between Spark and GCS. The reason I was asking if you could run bdutil is that it would basically be Hadoop connecting to GCS: if that is just as slow, it would point to the root cause being the Hadoop connection rather than something specific to Spark per se.
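One way to run that comparison (a rough sketch, assuming the spark-shell already has the GCS connector on its classpath; the bucket path is the one from the earlier gsutil test) is to time the same glob listing through the Hadoop FileSystem API, which is essentially what FileInputFormat does under the hood:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Time the directory listing through the Hadoop / GCS-connector path,
    // for comparison with the ~7 s gsutil listing above.
    val fs = FileSystem.get(new java.net.URI("gs://my-bucket"), sc.hadoopConfiguration)
    val t0 = System.nanoTime()
    val statuses = fs.globStatus(new Path("gs://my-bucket/20141205/csv/*/*/*"))
    println(s"${statuses.length} paths listed in ${(System.nanoTime() - t0) / 1e9} s")

If this listing takes minutes while gsutil takes seconds, the bottleneck is the Hadoop/connector layer rather than Spark itself.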
Re: Spark Shell slowness on Google Cloud
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing the data from several distinct Google Cloud Storage directories. So the question becomes: what computation happens when the union method is called on two RDDs?
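A quick probe along these lines (a sketch; the directory names are placeholders) would separate the cost of defining the RDDs, calling union, and first computing partitions. The "Total input paths to process" lines in the log come from FileInputFormat.getSplits, which a HadoopRDD runs when its partitions are first computed, so whichever step below turns out to be slow would show where the GCS listing is actually being triggered:

    // Sketch: time RDD definition, union, and partition computation separately.
    def timed[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.3f s")
      result
    }

    val dirs = Seq("dir1", "dir2", "dir3")  // placeholder directory names
    val rdds = timed("define RDDs") {
      dirs.map(d => sc.textFile(s"gs://my-bucket/20141205/csv/$d/*/*"))
    }
    val all = timed("union") { sc.union(rdds) }
    timed("compute partitions") { all.partitions.length }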