Can you verify that its reading the entire file on each worker using network monitoring stats? If it does, that would be a bug in my opinion.
On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe <ni...@actioniq.co> wrote: > Andrei, Ashish, > > To be clear, I don't think it's *counting* the entire file. It just seems > from the logging and the timing that it is doing a get of the entire file, > then figures out it only needs some certain blocks, does another get of > only the specific block. > > Regarding # partitions - I think I see now it has to do with Hadoop's > block size being set at 64MB. This is not a big deal to me, the main issue > is the first one, why is every worker doing a call to get the entire file > followed by the *real* call to get only the specific partitions it needs. > > Best, > > - Nitay > Founder & CTO > > > On Sat, Nov 22, 2014 at 8:28 PM, Andrei <faithlessfri...@gmail.com> wrote: > >> Concerning your second question, I believe you try to set number of >> partitions with something like this: >> >> rdd = sc.textFile(..., 8) >> >> but things like `textFile()` don't actually take fixed number of >> partitions. Instead, they expect *minimal* number of partitions. Since >> in your file you have 21 blocks of data, it creates exactly 21 worker >> (which is greater than 8, as expected). To set exact number of partitions, >> use `repartition()` or its full version - `coalesce()` (see example [1]) >> >> [1]: >> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce >> >> >> >> On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole <arang...@gmail.com> >> wrote: >> >>> What makes you think that each executor is reading the whole file? If >>> that is the case then the count value returned to the driver will be actual >>> X NumOfExecutors. Is that the case when compared with actual lines in the >>> input file? If the count returned is same as actual then you probably don't >>> have an extra read problem. >>> >>> I also see this in your logs which indicates that it is a read that >>> starts from an offset and reading one split size (64MB) worth of data: >>> >>> 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input >>> split: s3n://mybucket/myfile:335544320+67108864 >>> On Nov 22, 2014 7:23 AM, "Nitay Joffe" <ni...@actioniq.co> wrote: >>> >>>> Err I meant #1 :) >>>> >>>> - Nitay >>>> Founder & CTO >>>> >>>> >>>> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <ni...@actioniq.co> >>>> wrote: >>>> >>>>> Anyone have any thoughts on this? Trying to understand especially #2 >>>>> if it's a legit bug or something I'm doing wrong. >>>>> >>>>> - Nitay >>>>> Founder & CTO >>>>> >>>>> >>>>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <ni...@actioniq.co> >>>>> wrote: >>>>> >>>>>> I have a simple S3 job to read a text file and do a line count. >>>>>> Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*.The >>>>>> file is about 1.2GB. My setup is standalone spark cluster with 4 workers >>>>>> each with 2 cores / 16GB ram. I'm using branch-1.2 code built against >>>>>> hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => >>>>>> Spark). >>>>>> >>>>>> The whole count is taking on the order of a couple of minutes, which >>>>>> seems extremely slow. >>>>>> I've been looking into it and so far have noticed two things, hoping >>>>>> the community has seen this before and knows what to do... >>>>>> >>>>>> 1) Every executor seems to make an S3 call to read the *entire file* >>>>>> before >>>>>> making another call to read just it's split. Here's a paste I've cleaned >>>>>> up >>>>>> to show just one task: http://goo.gl/XCfyZA. I've verified this >>>>>> happens in every task. It is taking a long time (40-50 seconds), I don't >>>>>> see why it is doing this? >>>>>> 2) I've tried a few numPartitions parameters. When I make the >>>>>> parameter anything below 21 it seems to get ignored. Under the hood >>>>>> FileInputFormat is doing something that always ends up with at least 21 >>>>>> partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and >>>>>> have seen that the performance only gets worse as I increase it beyond >>>>>> 21. >>>>>> I would like to try 8 just to see, but again I don't see how to force it >>>>>> to >>>>>> go below 21. >>>>>> >>>>>> Thanks for the help, >>>>>> - Nitay >>>>>> Founder & CTO >>>>>> >>>>>> >>>>> >>>> >> >