That would be awesome, but it doesn't seem to have any effect. In PySpark, I created a dict with that key and a numeric value, then passed it to newAPIHadoopFile as the "conf" keyword argument. The returned RDD still has a single partition.
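Concretely, the call looked roughly like this (a minimal sketch: the input-format and key/value class names are illustrative stand-ins for my actual AvroParquet setup, and the path is a placeholder):

    # Sketch of the attempt described above. Hadoop Configuration
    # values are strings, so the split size is quoted here.
    conf = {"mapreduce.input.fileinputformat.split.maxsize": "33554432"}  # 32 MB

    rdd = sc.newAPIHadoopFile(
        "hdfs:///path/to/data",                 # placeholder path
        "parquet.avro.AvroParquetInputFormat",  # illustrative class names
        "java.lang.Void",
        "org.apache.avro.generic.GenericRecord",
        conf=conf)

    print(rdd.getNumPartitions())  # still reports 1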
On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:
> I think the reason is simply that there is no longer an explicit
> min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
> At least, I didn't see it when I glanced just now.
>
> However, you should be able to get the same effect by setting a
> Configuration property, and you can do so through the newAPIHadoopFile
> method. You set it as a suggested maximum split size rather than a
> suggested minimum number of splits.
>
> Although I think the old config property mapred.max.split.size is
> still respected, you may try
> mapreduce.input.fileinputformat.split.maxsize instead, which appears
> to be the intended replacement in the new APIs.
>
> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
> > sc.textFile takes a minimum # of partitions to use.
> >
> > Is there a way to get sc.newAPIHadoopFile to do the same?
> >
> > I know I can repartition() and get a shuffle. I'm wondering if there's a
> > way to tell the underlying InputFormat (AvroParquet, in my case) how many
> > partitions to use at the outset.
> >
> > What I'd really prefer is to get the partitions automatically defined
> > based on the number of blocks.
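For reference, the two workarounds from my original question look like this in PySpark (minimal sketches assuming an active SparkContext sc; the path and partition count are placeholders):

    # textFile exposes a minimum-partitions hint directly:
    lines = sc.textFile("hdfs:///path/to/text", minPartitions=64)

    # repartition() works on any RDD, including one returned by
    # newAPIHadoopFile, but it forces a full shuffle of the data:
    reshuffled = lines.repartition(64)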