Heh, it's still just a suggestion to Hadoop, I guess, not a guarantee.

Is it a splittable format? For example, some compressed formats are
not splittable, so Hadoop has to process whole files at a time.

I'm also not sure whether this is something to do with PySpark, since
the underlying Scala API takes a Configuration object rather than a
dictionary.
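
For what it's worth, here's roughly what I would try from PySpark. It's an
untested sketch: the InputFormat and key/value class names are guesses for
an AvroParquet setup, and the path and split size are placeholders, so
adjust them for your job.

target_split_size = str(64 * 1024 * 1024)   # 64MB cap; Hadoop conf values are strings

conf = {
    # new-API property; the old mapred.max.split.size may still be honored too
    "mapreduce.input.fileinputformat.split.maxsize": target_split_size,
    "mapred.max.split.size": target_split_size,
}

rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/parquet",                # placeholder path
    "parquet.avro.AvroParquetInputFormat",    # guess; depends on your Parquet version
    "java.lang.Void",                         # Parquet keys are Void
    "org.apache.avro.generic.IndexedRecord",  # guess; depends on your records/converters
    conf=conf)

print(rdd.getNumPartitions())   # check whether you actually got more splits

Even then, if the data is a single block (or a single Parquet row group),
you may still end up with one partition no matter what you set.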

On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman
<eric.d.fried...@gmail.com> wrote:
> That would be awesome, but doesn't seem to have any effect.
>
> In PySpark, I created a dict with that key and a numeric value, then passed
> it into newAPIHadoopFile as a value for the "conf" keyword.  The returned
> RDD still has a single partition.
>
> On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think the reason is simply that there is no longer an explicit
>> min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
>> At least, I didn't see it when I glanced just now.
>>
>> However, you should be able to get the same effect by setting a
>> Configuration property, and you can do so through the newAPIHadoopFile
>> method. You set it as a suggested maximum split size rather than a
>> suggested minimum number of splits.
>>
>> Although I think the old config property mapred.max.split.size is
>> still respected, you may try
>> mapreduce.input.fileinputformat.split.maxsize instead, which appears
>> to be the intended replacement in the new APIs.
>>
>> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
>> <eric.d.fried...@gmail.com> wrote:
>> > sc.textFile takes a minimum # of partitions to use.
>> >
>> > Is there a way to get sc.newAPIHadoopFile to do the same?
>> >
>> > I know I can repartition() and get a shuffle.  I'm wondering if there's a
>> > way to tell the underlying InputFormat (AvroParquet, in my case) how many
>> > partitions to use at the outset.
>> >
>> > What I'd really prefer is to get the partitions automatically defined based
>> > on the number of blocks.
>
>
