Excuse me for the many mails on this thread. I found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
Thanks & Regards,
Gokula Krishnan (Gokul)

On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com> wrote:

> In addition to that, I tried to read the same file with 3000 partitions,
> but it used 3070 partitions and took more time than before. Please refer
> to the attachment.
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>
> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com> wrote:
>
>> Hello All,
>>
>> I have an HDFS file with approx. 1.5 billion records in 500 part files
>> (258.2 GB total). When I executed the following, I could see that it used
>> 2290 tasks, but shouldn't it have been 500, matching the HDFS file?
>>
>> val inputFile = <HDFS File>
>> val inputRdd = sc.textFile(inputFile)
>> inputRdd.count()
>>
>> I was hoping I could do the same with fewer partitions, so I tried the
>> following:
>>
>> val inputFile = <HDFS File>
>> val inputRddNew = sc.textFile(inputFile, 500)
>> inputRddNew.count()
>>
>> But it still used 2290 tasks.
>>
>> As per the Scala doc, it is supposed to use the same number of partitions
>> as the HDFS file, i.e. 500.
>>
>> It would be great if you could throw some insight on this.
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
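For what it's worth, the task count likely comes from HDFS input splits rather than the part-file count: the second argument to textFile is only a *minimum* number of partitions, and Hadoop's FileInputFormat splits each part file separately by split size (normally the HDFS block size). A minimal sketch of the split arithmetic, assuming a 128 MB block size and evenly sized part files (both assumptions, not values confirmed in the thread):

```scala
object SplitCount {
  // Rough model of FileInputFormat: each file is split independently,
  // giving about ceil(fileSize / splitSize) splits per file, where
  // splitSize is typically the HDFS block size (assumed 128 MB here).
  def splitsFor(fileSizeBytes: Long, splitSizeBytes: Long): Long =
    (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes

  def main(args: Array[String]): Unit = {
    val blockSize = 128L * 1024 * 1024
    // 258.2 GB over 500 part files => ~516 MB per file (illustrative only)
    val perFileBytes = (258.2 * 1024 * 1024 * 1024 / 500).toLong
    val totalTasks = 500 * splitsFor(perFileBytes, blockSize)
    // Several splits per part file, so the job sees thousands of tasks,
    // not 500 -- which is why minPartitions=500 changes nothing.
    println(s"approx. tasks: $totalTasks")
  }
}
```

Under this model a minPartitions of 500 is already below the natural split count, so Spark ignores it; to actually get fewer partitions after reading, something like inputRdd.coalesce(500) would be the usual route.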