Heh, it's still just a suggestion to Hadoop I guess, not guaranteed. Is it a splittable format? For example, some compressed formats are not splittable, so Hadoop has to process whole files at a time.
I'm also not sure if this is something to do with PySpark, since the
underlying Scala API takes a Configuration object rather than a dictionary.

On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman <eric.d.fried...@gmail.com> wrote:
> That would be awesome, but doesn't seem to have any effect.
>
> In PySpark, I created a dict with that key and a numeric value, then passed
> it into newAPIHadoopFile as a value for the "conf" keyword. The returned
> RDD still has a single partition.
>
> On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think the reason is simply that there is no longer an explicit
>> min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
>> At least, I didn't see it when I glanced just now.
>>
>> However, you should be able to get the same effect by setting a
>> Configuration property, and you can do so through the newAPIHadoopFile
>> method. You set it as a suggested maximum split size rather than a
>> suggested minimum number of splits.
>>
>> Although I think the old config property mapred.max.split.size is
>> still respected, you may try
>> mapreduce.input.fileinputformat.split.maxsize instead, which appears
>> to be the intended replacement in the new APIs.
>>
>> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
>> <eric.d.fried...@gmail.com> wrote:
>> > sc.textFile takes a minimum # of partitions to use.
>> >
>> > Is there a way to get sc.newAPIHadoopFile to do the same?
>> >
>> > I know I can repartition() and get a shuffle. I'm wondering if there's
>> > a way to tell the underlying InputFormat (AvroParquet, in my case) how
>> > many partitions to use at the outset.
>> >
>> > What I'd really prefer is to get the partitions automatically defined
>> > based on the number of blocks.
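
For reference, a minimal PySpark sketch of the approach discussed above: passing
the split-size property through the "conf" dict of newAPIHadoopFile. The path
and the InputFormat/key/value class names below are placeholders, not the
poster's actual classes (he was using an AvroParquet InputFormat); whether the
setting takes effect also depends on the format being splittable.

    from pyspark import SparkContext

    sc = SparkContext(appName="split-size-example")

    # Suggest a maximum split size of 64 MB; smaller splits should mean
    # more input partitions, if the InputFormat honors the property.
    conf = {"mapreduce.input.fileinputformat.split.maxsize": str(64 * 1024 * 1024)}

    # Substitute the InputFormat and key/value classes that match your data.
    rdd = sc.newAPIHadoopFile(
        "hdfs:///path/to/data",                                   # placeholder path
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",  # any new-API InputFormat
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf)

    print(rdd.getNumPartitions())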