Take a look at the minSplits argument for SparkContext#textFile [1] -- the
default value is 2. You can simply set this to 1 if you'd prefer not to
split your data.
[1]
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
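To make the suggestion concrete, here is a minimal sketch of the split-selection behavior being discussed. The helper name `effective_min_splits` is hypothetical (the real logic lives inside `SparkContext.textFile`); it just shows that the default caps out at 2 and that passing `minSplits=1` keeps the file unsplit:

```python
# Hypothetical stand-in for how textFile() picks its split count.
# Assumption (per this thread): the default is min(defaultParallelism, 2),
# and an explicit minSplits argument overrides it.
def effective_min_splits(default_parallelism, min_splits=None):
    """Return the split count textFile() would use (sketch, not the real API)."""
    if min_splits is not None:
        return min_splits  # explicit value wins, e.g. minSplits=1
    return min(default_parallelism, 2)  # default described above
```

With the default, a 4-core context still gets 2 splits (`effective_min_splits(4)` is 2), while `effective_min_splits(4, min_splits=1)` is 1, i.e. the data stays in a single partition.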
On Tue, Apr 15, 2014 at 8:44 AM, Diana wrote:
Yup, actually one reason it's 2 is to give people a similar experience to
working with large files, in case their code doesn't deal well with the file
being partitioned.
Matei
On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote:
Take a look at the minSplits argument for
Looking at the Python version of textFile() [2], shouldn't it be
*max*(self.defaultParallelism, 2)?
[2]
http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile
If the default parallelism is, say 4, wouldn't we want to use that for
minSplits instead of 2?
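The min-vs-max question can be made concrete with plain Python. With a default parallelism of 4, the current `min(defaultParallelism, 2)` caps the default split count at 2, whereas `max(defaultParallelism, 2)` would use all 4:

```python
default_parallelism = 4

# Current default per the thread: capped at 2 even on a 4-way context.
current = min(default_parallelism, 2)   # 2

# The behavior Aaron is asking about: use the full parallelism when it
# exceeds 2, with 2 as a floor.
proposed = max(default_parallelism, 2)  # 4

print(current, proposed)

# The two only agree when defaultParallelism == 2; on a single-core local
# context, min gives 1 and max gives 2.
print(min(1, 2), max(1, 2))
```

So under `min`, raising the context's parallelism above 2 has no effect on the default split count, which is the apparent surprise here.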
On Tue,