Yup, one reason the default is 2 is to give people an experience similar to working with large files, in case their code doesn't deal well with the file being partitioned. You can pass 1 as shown in the sketch below if you want a single partition.
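For example, something like this (a minimal sketch, assuming a local master; the app name and file path are placeholders, not taken from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object SinglePartitionExample {
      def main(args: Array[String]): Unit = {
        // Placeholder app name and local master, just for illustration.
        val conf = new SparkConf().setAppName("SinglePartitionExample").setMaster("local")
        val sc = new SparkContext(conf)

        // textFile's second argument (minSplits) defaults to 2;
        // passing 1 asks Spark to read the file as a single partition.
        // The path below is a placeholder.
        val lines = sc.textFile("hdfs:///path/to/tiny-file.txt", 1)

        // For a single-block file this should print 1.
        println(lines.partitions.length)

        sc.stop()
      }
    }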
Matei

On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Take a look at the minSplits argument for SparkContext#textFile [1] -- the
> default value is 2. You can simply set this to 1 if you'd prefer not to split
> your data.
>
> [1]
> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>
>
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.
>
> Given the size, and that it is a single file, I assumed it would only be in a
> single partition. But when I cache it, I can see in the Spark App UI that
> it is actually split into two partitions:
>
> <sparkdev_2014-04-11.png>
>
> Is this correct behavior? How does Spark decide how big a partition should
> be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my "cluster", so both
> partitions are stored on the same worker.
>
> The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana