Andrei, Ashish,

To be clear, I don't think it's *counting* the entire file. It just seems,
from the logging and the timing, that it does a get of the entire file,
then figures out it only needs certain blocks and does another get of just
the specific block.

Regarding # partitions - I think I see now that it has to do with Hadoop's
block size being set at 64MB. This is not a big deal to me; the main issue
is the first one: why is every worker making a call to get the entire file,
followed by the *real* call to get only the specific partitions it needs?
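
For what it's worth, if I do end up wanting fewer partitions, here's the
rough sketch I plan to try, on the assumption that the 64MB split size for
s3n comes from fs.s3n.block.size (untested, so very much a sketch):

    // Assumption: FileInputFormat derives the s3n split size from fs.s3n.block.size.
    // Bumping it should produce fewer, larger splits (e.g. 256MB => ~5 splits
    // for a ~1.2GB file) instead of the 21 we see at 64MB.
    sc.hadoopConfiguration.set("fs.s3n.block.size", (256 * 1024 * 1024).toString)
    val lines = sc.textFile("s3n://mybucket/myfile").count()
    println(lines)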

Best,

- Nitay
Founder & CTO


On Sat, Nov 22, 2014 at 8:28 PM, Andrei <faithlessfri...@gmail.com> wrote:

> Concerning your second question, I believe you are trying to set the number
> of partitions with something like this:
>
>     rdd = sc.textFile(..., 8)
>
> but methods like `textFile()` don't actually take a fixed number of
> partitions; instead, they treat it as a *minimum* number of partitions.
> Since your file has 21 blocks of data, it creates exactly 21 partitions
> (which is greater than 8, as expected). To set an exact number of
> partitions, use `repartition()` or its more general form, `coalesce()` (see
> example [1]).
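>
> For instance, a minimal sketch (in Scala, using the path from your job;
> untested):
>
>     val rdd = sc.textFile("s3n://mybucket/myfile")  // 21 partitions, one per 64MB split
>     val eight = rdd.coalesce(8)                     // merged down to exactly 8 partitions, no shuffle
>     println(eight.partitions.length)                // 8
>     eight.count()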
>
> [1]:
> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce
>
>
>
> On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole <arang...@gmail.com>
> wrote:
>
>> What makes you think that each executor is reading the whole file? If
>> that were the case, the count value returned to the driver would be the
>> actual count multiplied by the number of executors. Is that what you see
>> when you compare against the actual number of lines in the input file? If
>> the count returned matches the actual line count, then you probably don't
>> have an extra-read problem.
>>
>> I also see this in your logs, which indicates a read that starts at an
>> offset and reads one split's worth (64MB) of data:
>>
>> 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
>> split: s3n://mybucket/myfile:335544320+67108864
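>>
>> If you want a quick check (rough sketch, Scala): print the per-partition
>> line counts and confirm they sum to the real line count rather than a
>> multiple of it:
>>
>>     val rdd = sc.textFile("s3n://mybucket/myfile")
>>     val perPartition = rdd.mapPartitions(it => Iterator(it.size)).collect()
>>     println(perPartition.mkString(", "))            // lines in each of the ~21 partitions
>>     println("total = " + perPartition.map(_.toLong).sum)
>>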
>> On Nov 22, 2014 7:23 AM, "Nitay Joffe" <ni...@actioniq.co> wrote:
>>
>>> Err I meant #1 :)
>>>
>>> - Nitay
>>> Founder & CTO
>>>
>>>
>>> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <ni...@actioniq.co> wrote:
>>>
>>>> Anyone have any thoughts on this? Trying to understand especially #2 if
>>>> it's a legit bug or something I'm doing wrong.
>>>>
>>>> - Nitay
>>>> Founder & CTO
>>>>
>>>>
>>>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <ni...@actioniq.co>
>>>> wrote:
>>>>
>>>>> I have a simple S3 job to read a text file and do a line count.
>>>>> Specifically, I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
>>>>> file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
>>>>> each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
>>>>> Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 =>
>>>>> Spark).
>>>>>
>>>>> The whole count is taking on the order of a couple of minutes, which
>>>>> seems extremely slow.
>>>>> I've been looking into it and so far have noticed two things, hoping
>>>>> the community has seen this before and knows what to do...
>>>>>
>>>>> 1) Every executor seems to make an S3 call to read the *entire file* before
>>>>> making another call to read just its split. Here's a paste I've cleaned up
>>>>> to show just one task: http://goo.gl/XCfyZA. I've verified this happens in
>>>>> every task. It takes a long time (40-50 seconds), and I don't see why it is
>>>>> doing this.
>>>>> 2) I've tried a few values for the numPartitions parameter. When I set it
>>>>> to anything below 21 it seems to get ignored (see the quick check pasted
>>>>> below this list). Under the hood, FileInputFormat is doing something that
>>>>> always ends up with at least 21 partitions of ~64MB or so. I've also tried
>>>>> 40, 60, and 100 partitions and have seen that performance only gets worse
>>>>> as I increase it beyond 21. I would like to try 8 just to see, but again I
>>>>> don't see how to force it below 21.
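>>>>>
>>>>> Here's roughly the quick check I mean (just to illustrate the parameter
>>>>> being ignored):
>>>>>
>>>>>     val rdd = sc.textFile("s3n://mybucket/myfile", 8)  // asking for 8 partitions...
>>>>>     println(rdd.partitions.length)                     // ...but this still prints 21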
>>>>>
>>>>> Thanks for the help,
>>>>> - Nitay
>>>>> Founder & CTO
>>>>>
>>>>>
>>>>
>>>
>
