Here is the body of StreamFileInputFormat#setMinPartitions:

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen =
      listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
    val maxSplitSize =
      math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    super.setMaxSplitSize(maxSplitSize)
  }

I guess what happened is that, among the 100 files you had, ~60 were much
bigger than the rest. Because the max split size above is the total length
divided by minPartitions, those large files inflate the limit, and the
underlying CombineFileInputFormat then packs the many small files together
into a few combined splits, which is why you ended up with fewer partitions
than files.
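
As a rough illustration (the file counts and sizes below are made up, not
taken from your data set), here is the arithmetic with 60 large files and
40 small ones:

    // Hypothetical sizes, only to illustrate the split-size computation above.
    val bigFiles   = Seq.fill(60)(1024L * 1024 * 1024)  // 60 files of 1 GB
    val smallFiles = Seq.fill(40)(1024L * 1024)         // 40 files of 1 MB
    val totalLen   = (bigFiles ++ smallFiles).sum
    val minPartitions = 100
    val maxSplitSize  = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    // maxSplitSize comes out around 645 MB. binaryFiles never splits an
    // individual file, so each 1 GB file becomes its own partition, while all
    // 40 small files fit into a single combined split: on the order of 61
    // partitions instead of the requested 100.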

I just performed a test using a local directory where 3 files were
significantly larger than the rest and reproduced what you observed.
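
For reference, the test was along these lines (the directory, file counts
and sizes here are made up; adjust as needed). Run it from spark-shell so
that sc is available:

    import java.nio.file.{Files, Paths}

    // Build a skewed data set: 3 large files and 17 small ones.
    val dir = Paths.get("/tmp/binary-files-test")
    Files.createDirectories(dir)
    (1 to 3).foreach { i =>
      Files.write(dir.resolve(s"big-$i.bin"),
        Array.fill[Byte](50 * 1024 * 1024)(0.toByte))
    }
    (1 to 17).foreach { i =>
      Files.write(dir.resolve(s"small-$i.bin"),
        Array.fill[Byte](64 * 1024)(0.toByte))
    }

    // The three big files dominate the total length, so the computed max
    // split size is large and the 17 small files get combined into very few
    // splits; the partition count should come out well below 20.
    sc.binaryFiles("/tmp/binary-files-test", minPartitions = 20).partitions.size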

Cheers

On Tue, Apr 26, 2016 at 11:10 AM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

> Dear Spark developers,
>
>
>
> I have 100 binary files in local file system that I want to load into
> Spark RDD. I need the data from each file to be in a separate partition.
> However, I cannot make it happen:
>
>
>
> scala> sc.binaryFiles("/data/subset").partitions.size
>
> res5: Int = 66
>
>
>
> The “minPartitions” parameter does not seem to help:
>
> scala> sc.binaryFiles("/data/subset", minPartitions = 100).partitions.size
>
> res8: Int = 66
>
>
>
> At the same time, Spark produces the required number of partitions with
> sc.textFile (though I cannot use it because my files are binary):
>
> scala> sc.textFile("/data/subset").partitions.size
>
> res9: Int = 100
>
>
>
> Could you suggest how to force Spark to load binary files each in a
> separate partition?
>
>
>
> Best regards, Alexander
>
