For Spark to talk to GCS, it goes through the Hadoop FileSystem layer and
the GCS connector jars. I'm wondering if it's those connection points that
are ultimately slowing down the path between Spark and GCS.
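
For reference, the wiring I have in mind is the usual GCS connector setup
(a sketch, assuming the standard connector properties; the project id is a
placeholder):

// In the Spark shell: route the gs:// scheme through the GCS connector.
sc.hadoopConfiguration.set("fs.gs.impl",
  "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc.hadoopConfiguration.set("fs.gs.project.id", "<your-project-id>")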

The reason I asked if you could run bdutil is that it would be, in essence,
Hadoop connecting to GCS. If that is just as slow, it would point to the
root cause: it's the "Hadoop" connection that is slowing things down,
rather than something in Spark per se.
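
If bdutil isn't handy, even timing the Hadoop FileSystem glob directly from
the Spark shell would isolate the Hadoop layer (a sketch; the path is the
one from your gsutil test):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Essentially the same listing FileInputFormat performs, minus Spark.
val fs = FileSystem.get(new URI("gs://my-bucket/"), sc.hadoopConfiguration)
val t0 = System.nanoTime
val n = fs.globStatus(new Path("gs://my-bucket/20141205/csv/*/*/*")).length
println(s"$n paths in ${(System.nanoTime - t0) / 1e9} s")
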
On Wed, Dec 17, 2014 at 23:25 Alessandro Baretta <alexbare...@gmail.com>
wrote:

> Well, what do you suggest I run to test this? But more importantly, what
> information would this give me?
>
> On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <denny.g....@gmail.com> wrote:
>>
>> Oh, it makes sense that gsutil scans through this quickly, but I was
>> wondering whether running a Hadoop job / bdutil would produce scans that
>> are just as fast.
>>
>>
>> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
>> alexbare...@gmail.com> wrote:
>>
>>> Denny,
>>>
>>> No, gsutil scans through the listing of the bucket quickly. See the
>>> following.
>>>
>>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>>
>>> 6860
>>>
>>> real    0m6.971s
>>> user    0m1.052s
>>> sys     0m0.096s
>>>
>>> Alex
>>>
>>>
>>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g....@gmail.com>
>>> wrote:
>>>>
>>>> I'm curious whether you're seeing the same thing when using bdutil
>>>> against GCS. I'm wondering if this may be an issue with the transfer
>>>> rate along the path Spark -> Hadoop -> GCS Connector -> GCS.
>>>>
>>>>
>>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>>> alexbare...@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> I'm using the Spark shell to interact with a small test deployment of
>>>>> Spark, built from the current master branch. I'm processing a dataset
>>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>>> half dozen directories. My code constructs an object--let me call it the
>>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>>> constructor of the object only defines the RDDs; it does not actually
>>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>>> logging code in the constructor prints a line signaling the completion of
>>>>> the code almost immediately after invocation, but the Spark shell does not
>>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>>> frozen, eventually producing the following output:
>>>>>
>>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 9
>>>>>
>>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 759
>>>>>
>>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 228
>>>>>
>>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 3076
>>>>>
>>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 1013
>>>>>
>>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 156
>>>>>
>>>>> This stage is inexplicably slow. What could be happening?
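>>>>>
>>>>> For reference, the construction pattern is roughly the following (a
>>>>> simplified sketch; the class name and paths stand in for my actual
>>>>> code):
>>>>>
>>>>> import org.apache.spark.SparkContext
>>>>>
>>>>> // One RDD per directory; nothing here should force evaluation.
>>>>> class Dataset(sc: SparkContext, dirs: Seq[String]) {
>>>>>   val rdds = dirs.map(d => sc.textFile(s"gs://my-bucket/20141205/csv/$d/*/*"))
>>>>>   println("Dataset constructed") // this prints almost immediately
>>>>> }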
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>