RE: Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
it will involve shuffling.

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, April 26, 2016 2:44 PM
To: Ulanov, Alexander <alexander.ula...@hpe.com>
Cc: dev@spark.apache.org
Subject: Re: Number of partitions for binaryFiles

From what I understand, Spark code was written this way becau…
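A hedged sketch of the shuffling point above; the truncated preview does not name the operation, so the assumption here is that it refers to RDD.repartition, with the path and file count taken from the original question at the bottom of this thread:

    // Assumes a spark-shell where `sc` is the SparkContext, as in the
    // original question below.
    val files = sc.binaryFiles("/data/subset")
    // repartition(n) is coalesce(n, shuffle = true): it can raise the
    // partition count to 100, but only by shuffling the file contents.
    val oneFilePerPartition = files.repartition(100)
    // coalesce(100) without shuffle cannot help here, since a shuffle-free
    // coalesce can only reduce the number of partitions.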

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: Tuesday, April 26, 2016 1:22 PM
> To: Ulanov, Alexander <alexander.ula...@hpe.com>
> Cc: dev@spark.apache.org
> Subject: Re: Number of partitions for binaryFiles
>
> Here is the body of StreamFileInputFormat#setMinPartitions :

RE: Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Number of partitions for binaryFiles

Here is the body of StreamFileInputFormat#setMinPartitions :

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
Here is the body of StreamFileInputFormat#setMinPartitions :

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
    val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
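To make the arithmetic above concrete, a small worked sketch; the numbers are hypothetical, and the last comment is an assumption about the truncated remainder of the method (CombineFileInputFormat subclasses normally pass the computed value to setMaxSplitSize):

    // Hypothetical numbers: 100 files of 1 MB each, and minPartitions = 2
    // (sc.binaryFiles defaults to min(defaultParallelism, 2), so often 2).
    val totalLen = 100L * 1024 * 1024
    val minPartitions = 2
    val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    // maxSplitSize = 52428800 (50 MB), so one split may combine many 1 MB
    // files; the exact partition count then depends on how
    // CombineFileInputFormat groups blocks by locality, which is how 100
    // files can surface as 66 partitions.
    // Assumption about the truncated tail: super.setMaxSplitSize(maxSplitSize)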

Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
Dear Spark developers,

I have 100 binary files in a local file system that I want to load into a Spark RDD. I need the data from each file to be in a separate partition. However, I cannot make that happen:

scala> sc.binaryFiles("/data/subset").partitions.size
res5: Int = 66

The "minPartitions"…
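A hedged sketch of the minPartitions hint the question is heading toward; the argument exists on sc.binaryFiles, but per the setMinPartitions body quoted earlier in this thread it only bounds the split size, so 100 is a hint rather than a guarantee:

    // minPartitions feeds the split-size heuristic: maxSplitSize becomes
    // roughly totalLen / 100, i.e. about one average file per split here.
    val rdd = sc.binaryFiles("/data/subset", minPartitions = 100)
    rdd.partitions.size  // may still differ from 100 if file sizes are uneven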