Well, what do you suggest I run to test this? But more importantly, what information would this give me?
On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <denny.g....@gmail.com> wrote:
>
> Oh, it makes sense if gsutil scans through this quickly, but I was
> wondering if running a Hadoop job / bdutil would result in just as fast
> scans?
>
>
> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Denny,
>>
>> No, gsutil scans through the listing of the bucket quickly. See the
>> following.
>>
>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>
>> 6860
>>
>> real 0m6.971s
>> user 0m1.052s
>> sys 0m0.096s
>>
>> Alex
>>
>>
>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g....@gmail.com>
>> wrote:
>>>
>>> I'm curious if you're seeing the same thing when using bdutil against
>>> GCS? I'm wondering if this may be an issue concerning the transfer rate of
>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>
>>>
>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> All,
>>>>
>>>> I'm using the Spark shell to interact with a small test deployment of
>>>> Spark, built from the current master branch. I'm processing a dataset
>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>> half dozen directories. My code constructs an object--let me call it the
>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>> constructor of the object only defines the RDDs; it does not actually
>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>> logging code in the constructor prints a line signaling the completion of
>>>> the code almost immediately after invocation, but the Spark shell does not
>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>> frozen, eventually producing the following output:
>>>>
>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 9
>>>>
>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 759
>>>>
>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 228
>>>>
>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 3076
>>>>
>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 1013
>>>>
>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 156
>>>>
>>>> This stage is inexplicably slow. What could be happening?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Alex
>>>>
>>>
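
One check that needs nothing beyond the log quoted above is to read the timestamps themselves. What follows is only a minimal sketch in Python. It assumes that each "Total input paths to process" line is printed when the listing of one directory finishes, and that the listings run one after another, so the gap between two consecutive lines approximates the time spent on the later directory. Neither assumption is confirmed in the thread. The names LOG, parse and listing_rates are made up for this illustration.

```python
import re
from datetime import datetime

# Log lines copied from the message above, with the mail wrapping undone.
LOG = """\
14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
"""

LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO mapred\.FileInputFormat: "
    r"Total input paths to process : (\d+)$"
)


def parse(log):
    """Return a list of (timestamp, path_count) pairs, one per matching line."""
    entries = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Timestamps in the log are written as yy/MM/dd HH:mm:ss.
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            entries.append((stamp, int(match.group(2))))
    return entries


def listing_rates(entries):
    """Return (paths, seconds, paths_per_second) for each line after the first.

    Assumption: the gap since the previous line is the time spent listing
    the directory reported on the current line.
    """
    rates = []
    for (prev_stamp, _), (stamp, paths) in zip(entries, entries[1:]):
        seconds = (stamp - prev_stamp).total_seconds()
        if seconds > 0:  # skip lines that share a timestamp
            rates.append((paths, seconds, paths / seconds))
    return rates


if __name__ == "__main__":
    for paths, seconds, rate in listing_rates(parse(LOG)):
        print(f"{paths:5d} paths in {seconds:4.0f} s ({rate:.1f} paths/s)")
    # Figures from the gsutil timing quoted above: 6860 entries in 6.971 s.
    print(f"gsutil ls: {6860 / 6.971:.0f} paths/s")
```

Under those assumptions every directory lists at roughly 9 paths per second, against roughly 980 per second for the gsutil command quoted above. A rate that stays about the same for small and large directories would be consistent with a fixed cost per file rather than per directory, though the log alone cannot confirm the cause.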