sc.textFile takes a minimum number of partitions to use. Is there a way to
get sc.newAPIHadoopFile to do the same? I know I can repartition() and
incur a shuffle, but I'm wondering if there's a way to tell the underlying
InputFormat (AvroParquet, in my case) how many partitions to use at the
outset.
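For concreteness, here's a minimal PySpark sketch of the two behaviors I
mean (the paths and class names are just illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-demo")

    # textFile accepts an explicit minimum partition count up front.
    text_rdd = sc.textFile("hdfs:///data/events.txt", minPartitions=64)

    # newAPIHadoopFile has no such argument; the only obvious knob
    # after the fact is repartition(), which forces a shuffle.
    parquet_rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.parquet",
        "parquet.avro.AvroParquetInputFormat",
        "java.lang.Void",
        "org.apache.avro.generic.GenericRecord",
    ).repartition(64)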
I think the reason is simply that there is no longer an explicit
min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
At least, I didn't see it when I glanced just now.
However, you should be able to get the same effect by setting a
Configuration property, and you can do so via the conf argument to
newAPIHadoopFile.
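Something like the following sketch; the property name here is an
assumption on my part (the standard new-API FileInputFormat cap on split
size), and the value is arbitrary:

    # Assumed key for the new Hadoop API: capping the split size
    # should force FileInputFormat to emit more splits, which in
    # turn become more RDD partitions.
    conf = {"mapreduce.input.fileinputformat.split.maxsize":
            str(32 * 1024 * 1024)}  # 32 MB cap, illustrative

    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.parquet",
        "parquet.avro.AvroParquetInputFormat",
        "java.lang.Void",
        "org.apache.avro.generic.GenericRecord",
        conf=conf,
    )
    print(rdd.getNumPartitions())  # check how many splits resulted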
That would be awesome, but doesn't seem to have any effect.
In PySpark, I created a dict with that key and a numeric value, then passed
it into newAPIHadoopFile as a value for the conf keyword. The returned
RDD still has a single partition.
On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen wrote:
Heh, it's still just a suggestion to Hadoop I guess, not guaranteed.
Is it a splittable format? For example, some compressed formats are not
splittable, and Hadoop has to process whole files at a time. I'm also not
sure if this is something to do with PySpark, since the underlying Scala
API takes a Configuration object directly.
Yes, it's AvroParquetInputFormat, which is splittable. If I force a
repartitioning, it works. If I don't, Spark chokes on my
not-terribly-large 250 MB files.
PySpark's documentation says that the dictionary is turned into a
Configuration object:

    @param conf: Hadoop configuration, passed in as a dict
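For what it's worth, another angle to try: set the property globally,
since Spark copies any spark.hadoop.* properties into the Hadoop
Configuration it passes to input formats. A sketch, using the same
assumed key as above:

    from pyspark import SparkConf, SparkContext

    # Assumed key; spark.hadoop.* entries are copied into the
    # Hadoop Configuration used by newAPIHadoopFile.
    conf = (SparkConf()
            .set("spark.hadoop.mapreduce.input.fileinputformat"
                 ".split.maxsize", str(32 * 1024 * 1024)))
    sc = SparkContext(conf=conf)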