Well, what do you suggest I run to test this? But more importantly, what
information would this give me?

On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <denny.g....@gmail.com> wrote:
>
> Oh, it makes sense if gsutil scans through this quickly, but I was
> wondering if running a Hadoop job / bdutil would result in just as fast
> scans?
>
>
> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Denny,
>>
>> No, gsutil scans through the listing of the bucket quickly. See the
>> following.
>>
>> alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>
>> 6860
>>
>> real    0m6.971s
>> user    0m1.052s
>> sys     0m0.096s
>>
>> Alex
>>
>>
>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g....@gmail.com>
>> wrote:
>>>
>>> I'm curious if you're seeing the same thing when using bdutil against
>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>
>>>
>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> All,
>>>>
>>>> I'm using the Spark shell to interact with a small test deployment of
>>>> Spark, built from the current master branch. I'm processing a dataset
>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>> half dozen directories. My code constructs an object--let me call it the
>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>> constructor of the object only defines the RDDs; it does not actually
>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>> logging code in the constructor prints a line signaling the completion of
>>>> the code almost immediately after invocation, but the Spark shell does not
>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>> frozen, eventually producing the following output:
>>>>
>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>>>>
>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>>>>
>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>>>>
>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>>>>
>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>>>>
>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>>>
>>>> This stage is inexplicably slow. What could be happening?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Alex
>>>>
>>>
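
One way to put numbers on the slowness reported in the original message is
to compute the gap between consecutive "Total input paths to process" log
lines. The Python sketch below is not part of the thread. It assumes the six
listings ran one after another and that each log line was written when its
listing finished, so the gap before a line approximates the time spent on
that directory. The names LOG, parse and gaps are made up for illustration.

```python
from datetime import datetime

# Log lines copied from the original message, with wrapped lines rejoined.
LOG = """\
14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
"""


def parse(log):
    """Return a list of (timestamp, path_count) tuples, one per log line."""
    rows = []
    for line in log.strip().splitlines():
        # The first 17 characters hold the timestamp, e.g. "14/12/18 05:52:49".
        stamp = datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
        # The path count follows the last colon on the line.
        count = int(line.rsplit(":", 1)[1])
        rows.append((stamp, count))
    return rows


def gaps(rows):
    """Return (seconds_since_previous_line, path_count) for lines 2..n.

    Assumption: the listings are sequential, so each gap is attributed to
    the listing whose log line ends it. The first line has no predecessor
    and is therefore skipped.
    """
    out = []
    for (prev, _), (cur, count) in zip(rows, rows[1:]):
        out.append(((cur - prev).total_seconds(), count))
    return out


for seconds, count in gaps(parse(LOG)):
    print(f"{count:5d} paths in {seconds:5.0f} s = {count / seconds:5.1f} paths/s")
```

Under those assumptions the per-directory rates can be compared with the
gsutil run quoted above, which listed 6860 objects in about 7 seconds.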
