Denny,

No, gsutil scans through the listing of the bucket quickly. See the following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g....@gmail.com> wrote:
>
> I'm curious if you're seeing the same thing when using bdutil against
> GCS? I'm wondering if this may be an issue concerning the transfer rate of
> Spark -> Hadoop -> GCS Connector -> GCS.
>
> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> All,
>>
>> I'm using the Spark shell to interact with a small test deployment of
>> Spark, built from the current master branch. I'm processing a dataset
>> comprising a few thousand objects on Google Cloud Storage, split into a
>> half dozen directories. My code constructs an object--let me call it the
>> Dataset object--that defines a distinct RDD for each directory. The
>> constructor of the object only defines the RDDs; it does not actually
>> evaluate them, so I would expect it to return very quickly. Indeed, the
>> logging code in the constructor prints a line signaling the completion of
>> the code almost immediately after invocation, but the Spark shell does not
>> show the prompt right away. Instead, it spends a few minutes seemingly
>> frozen, eventually producing the following output:
>>
>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>
>> This stage is inexplicably slow. What could be happening?
>>
>> Thanks.
>>
>> Alex
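
For reference, a minimal sketch of the "Dataset object" pattern Alessandro describes (class and path names here are hypothetical, not from his code; assumes a running Spark shell where `sc` is bound to the SparkContext):

```scala
// Hypothetical reconstruction of the pattern described above.
// sc.textFile is lazy: it only records the path pattern; no data is read here.
class Dataset(sc: org.apache.spark.SparkContext, dirs: Seq[String]) {
  val rdds: Map[String, org.apache.spark.rdd.RDD[String]] =
    dirs.map(d => d -> sc.textFile(s"gs://my-bucket/20141205/csv/$d/*/*")).toMap
  println("Dataset constructed")  // prints immediately, as observed
}
```

Note that the "Total input paths to process" lines come from Hadoop's FileInputFormat, which lists and stats every matching object when input splits are computed. If anything evaluated before the prompt returns touches an RDD's partitions (e.g. a count, a partitioner, or an eager cache), that listing runs at that point rather than at first action.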