Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data.
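To see why even a tiny file gets cut in two, here is a rough sketch (plain Python, not Spark itself) of the split computation used by Hadoop's FileInputFormat, which textFile delegates to for HDFS input. The function name and the exact constants are paraphrased from the Hadoop source for illustration, not a public API:

```python
def num_splits(total_size, min_splits, block_size, min_split_size=1):
    """Approximate how many input splits a single file produces.

    Sketch of Hadoop FileInputFormat.getSplits(): the goal size is
    totalSize / minSplits, so with minSplits=2 even a 2.6 KB file
    is divided into two splits (and hence two RDD partitions).
    """
    goal_size = total_size // max(min_splits, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    SPLIT_SLOP = 1.1  # the last chunk may be up to 10% larger than split_size
    splits = 0
    remaining = total_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits

# A 2.6 KB file with minSplits=2 on a 64 MB-block HDFS: two partitions.
print(num_splits(2600, 2, 64 * 1024 * 1024))  # -> 2
# With minSplits=1, the whole file stays in one partition.
print(num_splits(2600, 1, 64 * 1024 * 1024))  # -> 1
```

The key point is that the goal size scales with minSplits, not with the file's block count, which is why a single-block file can still produce two partitions.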
[1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext

On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:

> I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb.
>
> Given the size, and that it is a single file, I assumed it would only be
> in a single partition. But when I cache it, I can see in the Spark App UI
> that it actually splits it into two partitions:
>
> [image: Inline image 1]
>
> Is this correct behavior? How does Spark decide how big a partition
> should be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my "cluster", so both
> partitions are stored on the same worker.
>
> The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana