I'm guessing by "part files" you mean files like part-r-00000. These are
actually different from the HDFS "block size", which is what actually
determines the number of partitions.

Looks like your HDFS block size is the default 128MB: 258.2GB across 500 part
files -> around 528MB per part file -> each part file takes a little more
than 4 blocks -> in total that's roughly 2000 blocks, which is in the same
ballpark as the 2290 tasks you saw.
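
Rough arithmetic, as a sketch in Scala (assuming the 500 part files are all
about the same size, which they won't be exactly):

// Back-of-the-envelope estimate; assumes 500 roughly equal part files.
val totalMB       = 258.2 * 1024              // ~264,397 MB
val partFiles     = 500
val blockMB       = 128.0                     // default HDFS block size
val mbPerFile     = totalMB / partFiles       // ~529 MB per part file
val blocksPerFile = mbPerFile / blockMB       // ~4.1 blocks per file
val totalBlocks   = blocksPerFile * partFiles // ~2066 blocks overall
// Splits are computed per file, so each file's small trailing piece tends to
// become its own split, which pushes the real task count up toward the 2290
// you observed.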

You cannot set fewer partitions than blocks, which is why 500 does not take
effect (Spark docs here
<http://spark.apache.org/docs/latest/rdd-programming-guide.html>; see also
the sketch after the quote below):

> The textFile method also takes an optional second argument for controlling
> the number of partitions of the file. By default, Spark creates one
> partition for each block of the file (blocks being 128MB by default in
> HDFS), but you can also ask for a higher number of partitions by passing a
> larger value. Note that you cannot have fewer partitions than blocks.
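
So the 500 you pass is treated as a minimum and silently bumped up to the
block count. If you actually want ~500 partitions for the downstream work,
the usual way is to reduce them after reading, e.g. with coalesce (just a
sketch, with <HDFS File> standing in for your real path):

val inputFile = "<HDFS File>"              // placeholder for your real path

// The second argument to textFile is only a minimum hint; asking for 500
// here still gives one partition per block (~2290 in your case).
val byBlocks = sc.textFile(inputFile, 500)

// To really get fewer partitions than blocks, shrink them after reading.
// coalesce avoids a full shuffle when it only decreases the partition count.
val fewer = byBlocks.coalesce(500)
fewer.count()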



Now as to why 3000 gives you 3070 partitions: Spark uses Hadoop's
InputFormat.getSplits
<https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>,
and the number of partitions you ask for is at most a hint, so the final
result can be a bit different.
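
Roughly, the old mapred FileInputFormat turns the hint into a target split
size like this (a simplified sketch of the getSplits logic, not the exact
Hadoop code):

// Simplified sketch of how org.apache.hadoop.mapred.FileInputFormat derives
// a split size from the numSplits hint (not the exact Hadoop code).
def approxSplitSize(totalBytes: Long, numSplitsHint: Int,
                    blockSize: Long, minSplitSize: Long = 1L): Long = {
  val goalSize = totalBytes / math.max(numSplitsHint, 1)
  math.max(minSplitSize, math.min(goalSize, blockSize))
}

// 258.2 GB with a hint of 3000 gives a goal size of ~88 MB, which is below
// the 128 MB block size, so splits come out at ~88 MB each. Splitting then
// happens file by file, and the per-file rounding is what lands you at 3070
// instead of exactly 3000.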

On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote:

> Excuse for the too many mails on this post.
>
> found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>
> On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com>
> wrote:
>
>> In addition to that,
>>
>> tried to read the same file with 3000 partitions but it used 3070
>> partitions, and it took more time than before; please refer to the attachment.
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
>>
>> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I have an HDFS file with approx. 1.5 billion records in 500 part
>>> files (258.2GB in size), and when I tried to execute the following I
>>> could see that it used 2290 tasks, but it was supposed to be 500, like
>>> the HDFS file, wasn't it?
>>>
>>> val inputFile = <HDFS File>
>>> val inputRdd = sc.textFile(inputFile)
>>> inputRdd.count()
>>>
>>> I was hoping that I could do the same with fewer partitions, so I tried
>>> the following:
>>>
>>> val inputFile = <HDFS File>
>>> val inputRddNew = sc.textFile(inputFile, 500)
>>> inputRddNew.count()
>>>
>>> But it still used 2290 tasks.
>>>
>>> As per the Scala doc, it was supposed to use the same number of
>>> partitions as the HDFS file, i.e. 500.
>>>
>>> It would be great if you could share some insight on this.
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan (Gokul)
>>>
>>
>>
>


-- 
Xiayun Sun

Home is behind, the world ahead,
and there are many paths to tread
through shadows to the edge of night,
until the stars are all alight.
