Thanks Xiayun Sun and Robin East for your inputs. That makes sense to me.

Thanks & Regards,
Gokula Krishnan (Gokul)
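P.S. For the archives: since the minPartitions argument to textFile() is only a lower-bound hint, the way to actually end up with fewer partitions than HDFS blocks is coalesce(). A minimal sketch (Spark 2.x shell; the path is a placeholder and the partition counts are the ones from this thread, so treat it as illustrative rather than tested):

val inputFile = "hdfs:///path/to/input"  // placeholder path
val inputRdd = sc.textFile(inputFile)    // ~1 partition per 128 MB block
println(inputRdd.getNumPartitions)       // ~2290 for the file discussed here

// coalesce() may go below the block count: it merges existing
// partitions without a shuffle, so each task reads several blocks.
val coalesced = inputRdd.coalesce(500)
println(coalesced.getNumPartitions)      // 500
coalesced.count()

repartition(500) would reach the same count but forces a full shuffle, which is usually not worth it for a plain count().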
On Tue, Jul 25, 2017 at 9:55 AM, Xiayun Sun <xiayun...@gmail.com> wrote:

> I'm guessing that by "part files" you mean files like part-r-00000. These
> are actually different from the HDFS "block size", which is the value
> actually used for partitioning.
>
> It looks like your HDFS block size is the default 128 MB: 258.2 GB in 500
> part files -> around 528 MB per part file -> each part file takes a
> little more than 4 blocks -> in total that would be around 2000 blocks.
>
> You cannot set fewer partitions than blocks; that's why 500 does not work
> (Spark docs here
> <http://spark.apache.org/docs/latest/rdd-programming-guide.html>):
>
>> The textFile method also takes an optional second argument for
>> controlling the number of partitions of the file. By default, Spark
>> creates one partition for each block of the file (blocks being 128MB by
>> default in HDFS), but you can also ask for a higher number of partitions
>> by passing a larger value. Note that you cannot have fewer partitions
>> than blocks.
>
> Now, as to why 3000 gives you 3070 partitions: Spark uses Hadoop's
> InputFormat.getSplits
> <https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>,
> and the desired partition count is at most a hint, so the final result
> can be a bit different.
>
> On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote:
>
>> Apologies for the many mails on this post.
>>
>> I found a similar issue:
>> https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
>>
>> On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com>
>> wrote:
>>
>>> In addition to that, I tried to read the same file with 3000
>>> partitions, but it used 3070 partitions and took more time than before;
>>> please refer to the attachment.
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan (Gokul)
>>>
>>> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com>
>>> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I have an HDFS file with approx. 1.5 billion records in 500 part files
>>>> (258.2 GB total). When I executed the following, it used 2290 tasks,
>>>> but shouldn't that be 500, matching the HDFS file?
>>>>
>>>> val inputFile = <HDFS File>
>>>> val inputRdd = sc.textFile(inputFile)
>>>> inputRdd.count()
>>>>
>>>> I was hoping I could do the same with fewer partitions, so I tried the
>>>> following:
>>>>
>>>> val inputFile = <HDFS File>
>>>> val inputRddNew = sc.textFile(inputFile, 500)
>>>> inputRddNew.count()
>>>>
>>>> But it still used 2290 tasks.
>>>>
>>>> As per the Scala docs, it is supposed to use the same number of
>>>> partitions as the HDFS file, i.e. 500.
>>>>
>>>> It would be great if you could shed some light on this.
>>>>
>>>> Thanks & Regards,
>>>> Gokula Krishnan (Gokul)
>>>
>
> --
> Xiayun Sun
>
> Home is behind, the world ahead,
> and there are many paths to tread
> through shadows to the edge of night,
> until the stars are all alight.
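For completeness: the task counts in this thread line up with the split arithmetic in Hadoop's old-API FileInputFormat.getSplits(), which sc.textFile() goes through. The split size is computeSplitSize(goalSize, minSize, blockSize) = max(minSize, min(goalSize, blockSize)), with goalSize = totalSize / requestedPartitions. A back-of-the-envelope sketch (plain Scala; it assumes 500 equally sized part files and the default 1-byte minSize, so the outputs are estimates rather than exact task counts):

// Rough reconstruction of the split arithmetic in Hadoop's old-API
// org.apache.hadoop.mapred.FileInputFormat.getSplits(); real results
// also depend on the exact size of each part file.
val blockSize = 128L * 1024 * 1024                   // 128 MB HDFS block
val totalSize = (258.2 * 1024 * 1024 * 1024).toLong  // ~258.2 GB
val fileSize  = totalSize / 500                      // ~528 MB per part file

// Hadoop emits split-sized chunks per file, allowing the final chunk
// to run up to 10% over splitSize (SPLIT_SLOP = 1.1).
def splitsPerFile(splitSize: Long): Long = {
  var remaining = fileSize
  var n = 0L
  while (remaining / splitSize.toDouble > 1.1) { n += 1; remaining -= splitSize }
  if (remaining > 0) n += 1
  n
}

def totalSplits(requested: Long): Long = {
  val goalSize  = totalSize / requested
  // computeSplitSize: max(minSize, min(goalSize, blockSize))
  val splitSize = math.max(1L, math.min(goalSize, blockSize))
  500 * splitsPerFile(splitSize)
}

println(totalSplits(500))   // goalSize clamps to 128 MB -> ~2500 (2290 observed)
println(totalSplits(3000))  // goalSize ~88 MB wins      -> ~3000 (3070 observed)

With 500 requested, the ~528 MB goal size clamps to the 128 MB block size, so the count cannot drop below roughly one task per block; with 3000 requested, the ~88 MB goal size is below the block size and wins, which is why the hint is honored only upward.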