I'm guessing by "part files" you mean files like part-r-00000. These are actually different from hadoop "block size", which is the value actually used in partitions.
Looks like your hdfs block size is the default 128mb: 258.2GB in 500 part files -> around 528mb per part file -> each part file would take a little more than 4 blocks -> total that would be around 2000 blocks. You cannot set partitions fewer than blocks, that's why 500 does not work (spark doc here <http://spark.apache.org/docs/latest/rdd-programming-guide.html>) The textFile method also takes an optional second argument for controlling > the number of partitions of the file. By default, Spark creates one > partition for each block of the file (blocks being 128MB by default in > HDFS), but you can also ask for a higher number of partitions by passing a > larger value. Note that you cannot have fewer partitions than blocks. Now as to why 3000 gives you 3070 partitions, spark use hadoop's InputFormat.getSplits <https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>, and the partition desired is at most a hint, so the final result could be a bit different. On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote: > Excuse for the too many mails on this post. > > found a similar issue https://stackoverflow.com/questions/24671755/how-to- > partition-a-rdd > > Thanks & Regards, > Gokula Krishnan* (Gokul)* > > On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com> > wrote: > >> In addition to that, >> >> tried to read the same file with 3000 partitions but it used 3070 >> partitions. And took more time than previous please refer the attachment. >> >> Thanks & Regards, >> Gokula Krishnan* (Gokul)* >> >> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com> >> wrote: >> >>> Hello All, >>> >>> I have a HDFS file with approx. *1.5 Billion records* with 500 Part >>> files (258.2GB Size) and when I tried to execute the following I could see >>> that it used 2290 tasks but it supposed to be 500 as like HDFS File, isn't >>> it? >>> >>> val inputFile = <HDFS File> >>> val inputRdd = sc.textFile(inputFile) >>> inputRdd.count() >>> >>> I was hoping that I can do the same with the fewer partitions so tried >>> the following >>> >>> val inputFile = <HDFS File> >>> val inputrddnqew = sc.textFile(inputFile,500) >>> inputRddNew.count() >>> >>> But still it used 2290 tasks. >>> >>> As per scala doc, it supposed use as like the HDFS file i.e 500. >>> >>> It would be great if you could throw some insight on this. >>> >>> Thanks & Regards, >>> Gokula Krishnan* (Gokul)* >>> >> >> > -- Xiayun Sun Home is behind, the world ahead, and there are many paths to tread through shadows to the edge of night, until the stars are all alight.