Here is the body of StreamFileInputFormat#setMinPartitions:

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
    val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    super.setMaxSplitSize(maxSplitSize)
  }
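To make the arithmetic concrete, here is a rough sketch (with made-up file sizes and counts, not your actual data) of what that computation does when a few files dominate the total size:

  // Hypothetical layout: 60 files of ~100 MB and 40 files of ~1 MB,
  // loaded with minPartitions = 100.
  val sizes = Seq.fill(60)(100L * 1024 * 1024) ++ Seq.fill(40)(1L * 1024 * 1024)
  val minPartitions = 100

  // Same formula as setMinPartitions above.
  val totalLen = sizes.sum
  val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong

  println(maxSplitSize / (1024 * 1024))  // ~60 MB

  // A ~60 MB max split size is far bigger than the 1 MB files, so
  // CombineFileInputFormat is free to pack many of them into a single
  // split, while each ~100 MB file stays in its own split -- hence far
  // fewer than 100 partitions.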
I guess what happened was that among the 100 files you had, there were ~60 files whose sizes were much bigger than the rest. Given the way the max split size is computed above, you ended up with fewer partitions than files.

I just performed a test using a local directory where 3 files were significantly larger than the rest and reproduced what you observed (a rough sketch of that test is below the quoted message).

Cheers

On Tue, Apr 26, 2016 at 11:10 AM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

> Dear Spark developers,
>
> I have 100 binary files in the local file system that I want to load into
> a Spark RDD. I need the data from each file to be in a separate partition.
> However, I cannot make it happen:
>
> scala> sc.binaryFiles("/data/subset").partitions.size
> res5: Int = 66
>
> The “minPartitions” parameter does not seem to help:
>
> scala> sc.binaryFiles("/data/subset", minPartitions = 100).partitions.size
> res8: Int = 66
>
> At the same time, Spark produces the required number of partitions with
> sc.textFile (though I cannot use it because my files are binary):
>
> scala> sc.textFile("/data/subset").partitions.size
> res9: Int = 100
>
> Could you suggest how to force Spark to load binary files each in a
> separate partition?
>
> Best regards,
> Alexander
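For reference, this is roughly the kind of local test I mean (paths, sizes and file counts are made up; run it from spark-shell so that sc is available):

  import java.nio.file.Files

  // A temp directory with 3 "large" files (~10 MB) and 7 "small" files (~10 KB).
  val dir = Files.createTempDirectory("binary-files-test")
  (1 to 3).foreach(i => Files.write(dir.resolve(s"large-$i.bin"), new Array[Byte](10 * 1024 * 1024)))
  (1 to 7).foreach(i => Files.write(dir.resolve(s"small-$i.bin"), new Array[Byte](10 * 1024)))

  // Even when asking for 10 partitions, the result comes out lower,
  // because the large files dominate totalLen and push maxSplitSize
  // well above the size of the small files.
  println(sc.binaryFiles(dir.toString, minPartitions = 10).partitions.size)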