The files sound too small to span two blocks in HDFS.
Did you set defaultParallelism to 3 in your Spark configuration?
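For what it's worth, you can check both values from the shell (just a quick sketch; the path below is a placeholder):

scala> sc.defaultParallelism                             // backed by spark.default.parallelism when that is set
scala> sc.defaultMinPartitions                           // math.min(defaultParallelism, 2), used by textFile
scala> sc.textFile("path/to/dir", 2).partitions.length   // or request a minimum number of partitions explicitly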
Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +0000

Hi Rares,

The number of partitions is controlled by the HDFS input format, and one file may 
have multiple partitions if it consists of multiple blocks. In your case, I think 
one of the files is read as 2 splits.
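As a rough sketch (the directory name is a placeholder, and the comments are only what I would expect for small files, not results from a real run), you can see how the requested minimum number of splits interacts with the files:

scala> sc.textFile("outdir").partitions.length      // default minPartitions = sc.defaultMinPartitions
scala> sc.textFile("outdir", 1).partitions.length   // small files typically end up as one split per file
scala> sc.textFile("outdir", 4).partitions.length   // with a larger hint, FileInputFormat may cut each file into more splits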

Thanks.

Zhan Zhang


On Mar 27, 2015, at 3:12 PM, Rares Vernica <rvern...@gmail.com> wrote:


Hello,

I am using the Spark shell in Scala on localhost. I am using 
sc.textFile to read a directory. The directory looks like this (generated by 
another Spark script):

part-00000
part-00001
_SUCCESS

The part-00000 file has four short lines of text, while
part-00001 has two short lines of text. The
_SUCCESS file is empty. When I check the number of partitions on the RDD, I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3

I wonder why the two input files generate three partitions. Does Spark check 
the number of lines in each file and try to generate three balanced partitions?
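One way I could look at how the lines actually ended up distributed (a hypothetical follow-up in the same shell session, just printing per-partition line counts):

scala> foo.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect
// one (partition index, line count) pair per partition, so three pairs here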

Thanks!
Rares