Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data.

[1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
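A minimal PySpark sketch of this, against the 0.9-era API where the parameter is still called minSplits (the file path is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "small-data")

    # Default: minSplits=2, so even a tiny file is split into two partitions.
    default_rdd = sc.textFile("data/small.txt")

    # Pass minSplits=1 to keep the whole file in a single partition.
    single_rdd = sc.textFile("data/small.txt", minSplits=1)

    # glom() wraps each partition in a list, so its length is the
    # partition count; expect 1 here.
    print(len(single_rdd.glom().collect()))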

Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 is actually to give people a similar experience to working with large files, in case their code doesn’t deal well with the file being partitioned.

Matei

Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile() [1], shouldn't it be *max*(self.defaultParallelism, 2)? If the default parallelism is, say, 4, wouldn't we want to use that for minSplits instead of 2?

[1] http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile
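For reference, a sketch of the default being questioned, assuming the PySpark source of that era falls back to min(self.defaultParallelism, 2) and a default parallelism of 4 (both values illustrative):

    default_parallelism = 4

    min_splits_current  = min(default_parallelism, 2)  # -> 2, the existing default
    min_splits_proposed = max(default_parallelism, 2)  # -> 4, what max() would give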