minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
sc.textFile takes a minimum # of partitions to use. Is there a way to get sc.newAPIHadoopFile to do the same? I know I can repartition() and get a shuffle. I'm wondering if there's a way to tell the underlying InputFormat (AvroParquet, in my case) how many partitions to use at the outset.
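
For readers following along, a minimal PySpark sketch of the asymmetry being described (path and the TextInputFormat classes are illustrative stand-ins, not from the thread; Eric's actual format is AvroParquet):

    from pyspark import SparkContext

    sc = SparkContext(appName="min-partitions-question")

    # textFile accepts a minimum partition count directly.
    text_rdd = sc.textFile("hdfs:///data/sample.txt", minPartitions=32)

    # newAPIHadoopFile has no equivalent argument; the partition count
    # falls out of the InputFormat's own split calculation.
    hadoop_rdd = sc.newAPIHadoopFile(
        "hdfs:///data/sample.txt",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
    )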

Re: minPartitions for non-text files?

2014-09-15 Thread Sean Owen
I think the reason is simply that there is no longer an explicit min-partitions argument for Hadoop InputSplits in the new Hadoop APIs. At least, I didn't see one when I glanced just now. However, you should be able to get the same effect by setting a Configuration property, and you can do so via the conf argument.
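
The message is cut off before naming a property, but the standard knobs for new-API FileInputFormat splits are the split-size settings; a hedged sketch of what such a configuration might look like (property names assumed from Hadoop 2's FileInputFormat; values must be strings, since PySpark converts the dict entries to Configuration strings):

    # Assumed property names: Hadoop 2's FileInputFormat split-size settings.
    # Capping the split size forces more, smaller splits, and hence more partitions.
    split_conf = {
        "mapreduce.input.fileinputformat.split.maxsize": str(32 * 1024 * 1024),  # 32 MB per split
        "mapreduce.input.fileinputformat.split.minsize": "1",
    }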

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
That would be awesome, but it doesn't seem to have any effect. In PySpark, I created a dict with that key and a numeric value, then passed it into newAPIHadoopFile as the value for the conf keyword. The returned RDD still has a single partition.
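
What Eric describes would look roughly like this, reusing split_conf from the sketch above (the path and the Avro/Parquet class names are assumptions for illustration, not taken from the thread):

    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.parquet",            # placeholder path
        "parquet.avro.AvroParquetInputFormat",    # Eric's format; class name assumed
        "java.lang.Void",                         # Parquet exposes no key (assumed binding)
        "org.apache.avro.generic.IndexedRecord",  # assumed value class
        conf=split_conf,                          # dict converted to a Hadoop Configuration
    )
    print(rdd.getNumPartitions())  # Eric reports this still comes back as 1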

Re: minPartitions for non-text files?

2014-09-15 Thread Sean Owen
Heh, it's still just a suggestion to Hadoop I guess, not guaranteed. Is it a splittable format? For example, some compressed formats are not splittable, and Hadoop has to process whole files at a time. I'm also not sure whether this is something to do with PySpark, since the underlying Scala API takes the Configuration directly.

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
Yes, it's AvroParquetInputFormat, which is splittable. If I force a repartitioning, it works. If I don't, Spark chokes on my not-terribly-large 250 MB files. PySpark's documentation says that the dictionary is turned into a Configuration object: "@param conf: Hadoop configuration, passed in as a dict".
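
The workaround Eric mentions, sketched on top of the previous snippet (the target partition count is an arbitrary example):

    # Forcing a repartition incurs a shuffle, but spreads the 250 MB of data
    # across the cluster instead of leaving it in a single partition.
    spread = rdd.repartition(16)
    print(spread.getNumPartitions())  # 16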