I'm guessing by "part files" you mean files like part-r-00000. These are
actually different from the HDFS "block size", which is what actually
determines the number of partitions.

Looks like your HDFS block size is the default 128MB: 258.2GB across 500 part
files -> around 528MB per part file -> each part file takes a little more
than 4 blocks -> in total that's roughly 2000 blocks, which is in the same
ballpark as the 2290 tasks you saw.
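
Rough arithmetic, as a sketch in Scala (assuming the 500 part files are all
about the same size, which they won't be exactly):

// Back-of-the-envelope estimate; assumes 500 roughly equal part files.
val totalMB       = 258.2 * 1024              // ~264,397 MB
val partFiles     = 500
val blockMB       = 128.0                     // default HDFS block size
val mbPerFile     = totalMB / partFiles       // ~529 MB per part file
val blocksPerFile = mbPerFile / blockMB       // ~4.1 blocks per file
val totalBlocks   = blocksPerFile * partFiles // ~2066 blocks overall
// Splits are computed per file, so each file's small trailing piece tends to
// become its own split, which pushes the real task count up toward the 2290
// you observed.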

You cannot set fewer partitions than blocks, which is why 500 does not take
effect (Spark docs here
<http://spark.apache.org/docs/latest/rdd-programming-guide.html>; see also
the sketch after the quote below):

> The textFile method also takes an optional second argument for controlling
> the number of partitions of the file. By default, Spark creates one
> partition for each block of the file (blocks being 128MB by default in
> HDFS), but you can also ask for a higher number of partitions by passing a
> larger value. Note that you cannot have fewer partitions than blocks.
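
So the 500 you pass is treated as a minimum and silently bumped up to the
block count. If you actually want ~500 partitions for the downstream work,
the usual way is to reduce them after reading, e.g. with coalesce (just a
sketch, with <HDFS File> standing in for your real path):

val inputFile = "<HDFS File>"              // placeholder for your real path

// The second argument to textFile is only a minimum hint; asking for 500
// here still gives one partition per block (~2290 in your case).
val byBlocks = sc.textFile(inputFile, 500)

// To really get fewer partitions than blocks, shrink them after reading.
// coalesce avoids a full shuffle when it only decreases the partition count.
val fewer = byBlocks.coalesce(500)
fewer.count()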



Now as to why 3000 gives you 3070 partitions: Spark uses Hadoop's
InputFormat.getSplits
<https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>,
and the number of partitions you ask for is at most a hint, so the final
result can be a bit different.
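
Roughly, the old mapred FileInputFormat turns the hint into a target split
size like this (a simplified sketch of the getSplits logic, not the exact
Hadoop code):

// Simplified sketch of how org.apache.hadoop.mapred.FileInputFormat derives
// a split size from the numSplits hint (not the exact Hadoop code).
def approxSplitSize(totalBytes: Long, numSplitsHint: Int,
                    blockSize: Long, minSplitSize: Long = 1L): Long = {
  val goalSize = totalBytes / math.max(numSplitsHint, 1)
  math.max(minSplitSize, math.min(goalSize, blockSize))
}

// 258.2 GB with a hint of 3000 gives a goal size of ~88 MB, which is below
// the 128 MB block size, so splits come out at ~88 MB each. Splitting then
// happens file by file, and the per-file rounding is what lands you at 3070
// instead of exactly 3000.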

On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote:

> Excuse for the too many mails on this post.
>
> found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>
> On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com>
> wrote:
>
>> In addition to that,
>>
>> tried to read the same file with 3000 partitions but it used 3070
>> partitions, and it took more time than before; please refer to the attachment.
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
>>
>> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I have an HDFS file with approx. 1.5 billion records in 500 part
>>> files (258.2GB in size), and when I tried to execute the following I
>>> could see that it used 2290 tasks, but it was supposed to be 500, like
>>> the HDFS file, wasn't it?
>>>
>>> val inputFile = <HDFS File>
>>> val inputRdd = sc.textFile(inputFile)
>>> inputRdd.count()
>>>
>>> I was hoping that I could do the same with fewer partitions, so I tried
>>> the following:
>>>
>>> val inputFile = <HDFS File>
>>> val inputRddNew = sc.textFile(inputFile, 500)
>>> inputRddNew.count()
>>>
>>> But it still used 2290 tasks.
>>>
>>> As per the Scala doc, it was supposed to use the same number of
>>> partitions as the HDFS file, i.e. 500.
>>>
>>> It would be great if you could share some insight on this.
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan (Gokul)
>>>
>>
>>
>


-- 
Xiayun Sun

Home is behind, the world ahead,
and there are many paths to tread
through shadows to the edge of night,
until the stars are all alight.
