Denny,

No, gsutil scans through the listing of the bucket quickly. See the following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g....@gmail.com> wrote:
>
> I'm curious if you're seeing the same thing when using bdutil against
> GCS? I'm wondering if this may be an issue concerning the transfer rate of
> Spark -> Hadoop -> GCS Connector -> GCS.
>
> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> All,
>>
>> I'm using the Spark shell to interact with a small test deployment of
>> Spark, built from the current master branch. I'm processing a dataset
>> comprising a few thousand objects on Google Cloud Storage, split into a
>> half dozen directories. My code constructs an object--let me call it the
>> Dataset object--that defines a distinct RDD for each directory. The
>> constructor of the object only defines the RDDs; it does not actually
>> evaluate them, so I would expect it to return very quickly. Indeed, the
>> logging code in the constructor prints a line signaling the completion of
>> the code almost immediately after invocation, but the Spark shell does not
>> show the prompt right away. Instead, it spends a few minutes seemingly
>> frozen, eventually producing the following output:
>>
>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>
>> This stage is inexplicably slow. What could be happening?
>>
>> Thanks.
>>
>> Alex
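
For reference, a minimal sketch of the "Dataset object" pattern Alessandro describes (class and path names here are hypothetical, not from his code; assumes a running Spark shell where `sc` is bound to the SparkContext):

```scala
// Hypothetical reconstruction of the pattern described above.
// sc.textFile is lazy: it only records the path pattern; no data is read here.
class Dataset(sc: org.apache.spark.SparkContext, dirs: Seq[String]) {
  val rdds: Map[String, org.apache.spark.rdd.RDD[String]] =
    dirs.map(d => d -> sc.textFile(s"gs://my-bucket/20141205/csv/$d/*/*")).toMap
  println("Dataset constructed")  // prints immediately, as observed
}
```

Note that the "Total input paths to process" lines come from Hadoop's FileInputFormat, which lists and stats every matching object when input splits are computed. If anything evaluated before the prompt returns touches an RDD's partitions (e.g. a count, a partitioner, or an eager cache), that listing runs at that point rather than at first action.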