Re: Spark S3 Performance

Daniil Osipov Mon, 24 Nov 2014 14:30:40 -0800

Can you verify that it's reading the entire file on each worker using
network monitoring stats? If it is, that would be a bug in my opinion.

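A rough way to frame that check is to estimate how much traffic each worker
should see under the two hypotheses. This is only a sketch: the exact file
size is not given in the thread, so FILE_BYTES is an assumed value, and the
estimate assumes the 21 tasks are spread evenly over the 4 workers.

```python
# Back-of-the-envelope estimate of bytes received per worker.
# FILE_BYTES is an assumption (the thread only says "about 1.2GB").
FILE_BYTES = 1_350_000_000
NUM_TASKS = 21      # partitions reported in the thread
NUM_WORKERS = 4     # workers reported in the thread


def expected_bytes_per_worker(file_bytes, num_tasks, num_workers, full_read_per_task):
    """Average bytes a worker would receive, assuming evenly spread tasks."""
    total = file_bytes  # every split is read once across the cluster
    if full_read_per_task:
        # Hypothesis under test: each task also downloads the whole file.
        total += num_tasks * file_bytes
    return total / num_workers


splits_only = expected_bytes_per_worker(FILE_BYTES, NUM_TASKS, NUM_WORKERS, False)
with_full_reads = expected_bytes_per_worker(FILE_BYTES, NUM_TASKS, NUM_WORKERS, True)
print(f"splits only:     {splits_only / 1e9:.2f} GB per worker")
print(f"full file reads: {with_full_reads / 1e9:.2f} GB per worker")
```

The two figures differ by a factor of 22, so even coarse per-worker network
counters should be enough to tell the two cases apart.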

On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe <ni...@actioniq.co> wrote:

> Andrei, Ashish,
>
> To be clear, I don't think it's *counting* the entire file. It just seems
> from the logging and the timing that it is doing a GET of the entire file,
> then figuring out that it only needs certain blocks, and then doing another
> GET of only the specific block.
>
> Regarding # partitions - I think I see now that it has to do with Hadoop's
> block size being set to 64MB. This is not a big deal to me; the main issue
> is the first one: why is every worker making a call to get the entire file,
> followed by the *real* call to get only the specific partitions it needs?
>
> Best,
>
> - Nitay
> Founder & CTO
>
>
> On Sat, Nov 22, 2014 at 8:28 PM, Andrei <faithlessfri...@gmail.com> wrote:
>
>> Concerning your second question, I believe you are trying to set the number
>> of partitions with something like this:
>>
>>     rdd = sc.textFile(..., 8)
>>
>> but methods like `textFile()` don't take a fixed number of partitions.
>> Instead, they expect a *minimum* number of partitions. Since your file has
>> 21 blocks of data, it creates exactly 21 partitions (which is greater than
>> 8, as expected). To set an exact number of partitions, use `repartition()`
>> or its more general version, `coalesce()` (see example [1]).
>>
>> [1]:
>> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce
>>
>>
>>
>> On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole <arang...@gmail.com>
>> wrote:
>>
>>> What makes you think that each executor is reading the whole file? If
>>> that were the case, the count returned to the driver would be the actual
>>> line count multiplied by NumOfExecutors. Is that what you see when you
>>> compare it with the actual number of lines in the input file? If the
>>> count returned is the same as the actual count, then you probably don't
>>> have an extra read problem.
>>>
>>> I also see this in your logs, which indicates a read that starts from an
>>> offset and reads one split size (64MB) worth of data:
>>>
>>> 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
>>> split: s3n://mybucket/myfile:335544320+67108864
>>>
>>> On Nov 22, 2014 7:23 AM, "Nitay Joffe" <ni...@actioniq.co> wrote:
>>>
>>>> Err I meant #1 :)
>>>>
>>>> - Nitay
>>>> Founder & CTO
>>>>
>>>>
>>>> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <ni...@actioniq.co>
>>>> wrote:
>>>>
>>>>> Anyone have any thoughts on this? Trying to understand especially #2
>>>>> if it's a legit bug or something I'm doing wrong.
>>>>>
>>>>> - Nitay
>>>>> Founder & CTO
>>>>>
>>>>>
>>>>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <ni...@actioniq.co>
>>>>> wrote:
>>>>>
>>>>>> I have a simple S3 job to read a text file and do a line count.
>>>>>> Specifically, I'm doing *sc.textFile("s3n://mybucket/myfile").count*.
>>>>>> The file is about 1.2GB. My setup is a standalone Spark cluster with 4
>>>>>> workers, each with 2 cores / 16GB RAM. I'm using branch-1.2 code built
>>>>>> against Hadoop 2.4 (though I'm not actually using HDFS, just straight
>>>>>> S3 => Spark).
>>>>>>
>>>>>> The whole count is taking on the order of a couple of minutes, which
>>>>>> seems extremely slow.
>>>>>> I've been looking into it and so far have noticed two things, hoping
>>>>>> the community has seen this before and knows what to do...
>>>>>>
>>>>>> 1) Every executor seems to make an S3 call to read the *entire file*
>>>>>> before making another call to read just its split. Here's a paste I've
>>>>>> cleaned up to show just one task: http://goo.gl/XCfyZA. I've verified
>>>>>> this happens in every task. It takes a long time (40-50 seconds), and I
>>>>>> don't see why it is doing this.
>>>>>> 2) I've tried a few numPartitions parameters. When I make the
>>>>>> parameter anything below 21 it seems to get ignored. Under the hood
>>>>>> FileInputFormat is doing something that always ends up with at least 21
>>>>>> partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and
>>>>>> have seen that the performance only gets worse as I increase it beyond
>>>>>> 21. I would like to try 8 just to see, but again I don't see how to force
>>>>>> it to go below 21.
>>>>>>
>>>>>> Thanks for the help,
>>>>>> - Nitay
>>>>>> Founder & CTO
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

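The partition counts discussed in this thread (21 when asking for 8, more
when asking for 40 or 100) can be reproduced with a small sketch of the split
arithmetic. This approximates the logic of Hadoop's old-API FileInputFormat
as I understand it (split size = max(minSize, min(totalSize / requested,
blockSize)), with a 10% slop on the last split); it is not the actual Hadoop
code. The file size is an assumption chosen to be near the "about 1.2GB"
mentioned above, since the exact size is not given.

```python
SPLIT_SLOP = 1.1  # the last split may run up to 10% over the split size
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB block size mentioned in the thread
FILE_BYTES = 1_350_000_000     # assumed file size; not stated exactly in the thread


def compute_splits(total_size, min_partitions, block_size=BLOCK_SIZE, min_size=1):
    """Return (offset, length) pairs approximating FileInputFormat splits for one file."""
    goal_size = total_size // max(min_partitions, 1)
    # The block size caps the split size, so asking for fewer partitions
    # than total_size / block_size has no effect.
    split_size = max(min_size, min(goal_size, block_size))
    splits = []
    offset = 0
    remaining = total_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


for requested in (8, 21, 40, 100):
    print(requested, "requested ->", len(compute_splits(FILE_BYTES, requested)), "splits")
```

Under these assumptions a request for 8 still yields 21 splits, while 40 and
100 are honored because the goal size drops below the block size. The sixth
split produced by the sketch starts at offset 335544320 with length 67108864,
which matches the "Input split" line in the quoted log. Getting below 21
partitions needs a `coalesce()` on the resulting RDD, as suggested above.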