That would be awesome, but it doesn't seem to have any effect. In PySpark, I created a dict with that key and a numeric value, then passed it to newAPIHadoopFile as the "conf" keyword argument. The returned RDD still has a single partition.
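Concretely, the call looked roughly like this (a minimal sketch: the input-format and key/value class names are illustrative stand-ins for my actual AvroParquet setup, and the path is a placeholder):

    # Sketch of the attempt described above. Hadoop Configuration
    # values are strings, so the split size is quoted here.
    conf = {"mapreduce.input.fileinputformat.split.maxsize": "33554432"}  # 32 MB

    rdd = sc.newAPIHadoopFile(
        "hdfs:///path/to/data",                 # placeholder path
        "parquet.avro.AvroParquetInputFormat",  # illustrative class names
        "java.lang.Void",
        "org.apache.avro.generic.GenericRecord",
        conf=conf)

    print(rdd.getNumPartitions())  # still reports 1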
On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:
> I think the reason is simply that there is no longer an explicit
> min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
> At least, I didn't see it when I glanced just now.
>
> However, you should be able to get the same effect by setting a
> Configuration property, and you can do so through the newAPIHadoopFile
> method. You set it as a suggested maximum split size rather than a
> suggested minimum number of splits.
>
> Although I think the old config property mapred.max.split.size is
> still respected, you may try
> mapreduce.input.fileinputformat.split.maxsize instead, which appears
> to be the intended replacement in the new APIs.
>
> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
> > sc.textFile takes a minimum # of partitions to use.
> >
> > Is there a way to get sc.newAPIHadoopFile to do the same?
> >
> > I know I can repartition() and get a shuffle. I'm wondering if there's a
> > way to tell the underlying InputFormat (AvroParquet, in my case) how many
> > partitions to use at the outset.
> >
> > What I'd really prefer is to get the partitions automatically defined
> > based on the number of blocks.
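For reference, the two workarounds from my original question look like this in PySpark (minimal sketches assuming an active SparkContext sc; the path and partition count are placeholders):

    # textFile exposes a minimum-partitions hint directly:
    lines = sc.textFile("hdfs:///path/to/text", minPartitions=64)

    # repartition() works on any RDD, including one returned by
    # newAPIHadoopFile, but it forces a full shuffle of the data:
    reshuffled = lines.repartition(64)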