Thanks Xiayun Sun and Robin East for your inputs. That makes sense to me.

Thanks & Regards,
Gokula Krishnan (Gokul)
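P.S. For the archives: since the minPartitions argument to textFile() is only a lower-bound hint, the way to actually end up with fewer partitions than HDFS blocks is coalesce(). A minimal sketch (Spark 2.x shell; the path is a placeholder and the partition counts are the ones from this thread, so treat it as illustrative rather than tested):

val inputFile = "hdfs:///path/to/input"  // placeholder path
val inputRdd = sc.textFile(inputFile)    // ~1 partition per 128 MB block
println(inputRdd.getNumPartitions)       // ~2290 for the file discussed here

// coalesce() may go below the block count: it merges existing
// partitions without a shuffle, so each task reads several blocks.
val coalesced = inputRdd.coalesce(500)
println(coalesced.getNumPartitions)      // 500
coalesced.count()

repartition(500) would reach the same count but forces a full shuffle, which is usually not worth it for a plain count().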
On Tue, Jul 25, 2017 at 9:55 AM, Xiayun Sun <xiayun...@gmail.com> wrote:

> I'm guessing that by "part files" you mean files like part-r-00000. These
> are actually different from the HDFS "block size", which is the value
> actually used for partitioning.
>
> It looks like your HDFS block size is the default 128 MB: 258.2 GB in 500
> part files -> around 528 MB per part file -> each part file takes a
> little more than 4 blocks -> in total that would be around 2000 blocks.
>
> You cannot set fewer partitions than blocks; that's why 500 does not work
> (Spark docs here
> <http://spark.apache.org/docs/latest/rdd-programming-guide.html>):
>
>> The textFile method also takes an optional second argument for
>> controlling the number of partitions of the file. By default, Spark
>> creates one partition for each block of the file (blocks being 128MB by
>> default in HDFS), but you can also ask for a higher number of partitions
>> by passing a larger value. Note that you cannot have fewer partitions
>> than blocks.
>
> Now, as to why 3000 gives you 3070 partitions: Spark uses Hadoop's
> InputFormat.getSplits
> <https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>,
> and the desired partition count is at most a hint, so the final result
> can be a bit different.
>
> On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote:
>
>> Apologies for the many mails on this post.
>>
>> I found a similar issue:
>> https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
>>
>> On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com>
>> wrote:
>>
>>> In addition to that, I tried to read the same file with 3000
>>> partitions, but it used 3070 partitions and took more time than before;
>>> please refer to the attachment.
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan (Gokul)
>>>
>>> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com>
>>> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I have an HDFS file with approx. 1.5 billion records in 500 part files
>>>> (258.2 GB total). When I executed the following, it used 2290 tasks,
>>>> but shouldn't that be 500, matching the HDFS file?
>>>>
>>>> val inputFile = <HDFS File>
>>>> val inputRdd = sc.textFile(inputFile)
>>>> inputRdd.count()
>>>>
>>>> I was hoping I could do the same with fewer partitions, so I tried the
>>>> following:
>>>>
>>>> val inputFile = <HDFS File>
>>>> val inputRddNew = sc.textFile(inputFile, 500)
>>>> inputRddNew.count()
>>>>
>>>> But it still used 2290 tasks.
>>>>
>>>> As per the Scala docs, it is supposed to use the same number of
>>>> partitions as the HDFS file, i.e. 500.
>>>>
>>>> It would be great if you could shed some light on this.
>>>>
>>>> Thanks & Regards,
>>>> Gokula Krishnan (Gokul)
>>>
>
> --
> Xiayun Sun
>
> Home is behind, the world ahead,
> and there are many paths to tread
> through shadows to the edge of night,
> until the stars are all alight.
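For completeness: the task counts in this thread line up with the split arithmetic in Hadoop's old-API FileInputFormat.getSplits(), which sc.textFile() goes through. The split size is computeSplitSize(goalSize, minSize, blockSize) = max(minSize, min(goalSize, blockSize)), with goalSize = totalSize / requestedPartitions. A back-of-the-envelope sketch (plain Scala; it assumes 500 equally sized part files and the default 1-byte minSize, so the outputs are estimates rather than exact task counts):

// Rough reconstruction of the split arithmetic in Hadoop's old-API
// org.apache.hadoop.mapred.FileInputFormat.getSplits(); real results
// also depend on the exact size of each part file.
val blockSize = 128L * 1024 * 1024                   // 128 MB HDFS block
val totalSize = (258.2 * 1024 * 1024 * 1024).toLong  // ~258.2 GB
val fileSize  = totalSize / 500                      // ~528 MB per part file

// Hadoop emits split-sized chunks per file, allowing the final chunk
// to run up to 10% over splitSize (SPLIT_SLOP = 1.1).
def splitsPerFile(splitSize: Long): Long = {
  var remaining = fileSize
  var n = 0L
  while (remaining / splitSize.toDouble > 1.1) { n += 1; remaining -= splitSize }
  if (remaining > 0) n += 1
  n
}

def totalSplits(requested: Long): Long = {
  val goalSize  = totalSize / requested
  // computeSplitSize: max(minSize, min(goalSize, blockSize))
  val splitSize = math.max(1L, math.min(goalSize, blockSize))
  500 * splitsPerFile(splitSize)
}

println(totalSplits(500))   // goalSize clamps to 128 MB -> ~2500 (2290 observed)
println(totalSplits(3000))  // goalSize ~88 MB wins      -> ~3000 (3070 observed)

With 500 requested, the ~528 MB goal size clamps to the 128 MB block size, so the count cannot drop below roughly one task per block; with 3000 requested, the ~88 MB goal size is below the block size and wins, which is why the hint is honored only upward.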