Looking at the Python version of textFile() <http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile>, shouldn't it be "*max*(self.defaultParallelism, 2)"?

If the default parallelism is, say, 4, wouldn't we want to use that for minSplits instead of 2?

On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yup, one reason it’s 2 actually is to give people a similar experience to
> working with large files, in case their code doesn’t deal well with the
> file being partitioned.
>
> Matei
>
> On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
> Take a look at the minSplits argument for SparkContext#textFile [1] -- the
> default value is 2. You can simply set this to 1 if you'd prefer not to
> split your data.
>
> [1]
> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.
>>
>> Given the size, and that it is a single file, I assumed it would only be
>> in a single partition. But when I cache it, I can see in the Spark App UI
>> that it actually splits it into two partitions:
>>
>> <sparkdev_2014-04-11.png>
>>
>> Is this correct behavior? How does Spark decide how big a partition
>> should be, or how many partitions to create for an RDD?
>>
>> If it matters, I have only a single worker in my "cluster", so both
>> partitions are stored on the same worker.
>>
>> The file was on HDFS and was only a single block.
>>
>> Thanks for any insight.
>>
>> Diana
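For reference, the line in question in pyspark's context.py (0.9-era) reads, roughly, "minSplits = minSplits or min(self.defaultParallelism, 2)", which is what caps the default at 2. Below is a minimal sketch, not from the thread, of the two behaviors discussed: the default split count versus Aaron's minSplits=1 workaround. The path is a placeholder, the master is assumed to be local[4] so defaultParallelism exceeds 2, and the parameter is the era-appropriate minSplits (later Spark releases renamed it minPartitions):

    # Sketch only: assumes a small single-block file at the placeholder
    # path below, and a 0.9-era PySpark API (minSplits, not minPartitions).
    from pyspark import SparkContext

    # local[4] gives defaultParallelism = 4, so min(4, 2) = 2 by default.
    sc = SparkContext("local[4]", "partition-demo")
    path = "hdfs:///tmp/tiny.txt"  # placeholder; any small single-block file

    # Default: minSplits falls back to min(sc.defaultParallelism, 2), so
    # even a tiny file is read as 2 partitions -- matching Diana's UI
    # screenshot. glom() groups each partition into a list, so the length
    # of the collected result is the partition count.
    rdd_default = sc.textFile(path)
    print(len(rdd_default.glom().collect()))  # expect 2

    # Aaron's suggestion: pass minSplits=1 to keep the file whole.
    rdd_single = sc.textFile(path, minSplits=1)
    print(len(rdd_single.glom().collect()))   # expect 1

Note that under the *max* proposal above, the first call would instead default to 4 splits on this configuration, since max(4, 2) = 4.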