Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data.
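To see why even a tiny file gets cut in two, here is a rough sketch (plain Python, not Spark itself) of the split computation used by Hadoop's FileInputFormat, which textFile delegates to for HDFS input. The function name and the exact constants are paraphrased from the Hadoop source for illustration, not a public API:

```python
def num_splits(total_size, min_splits, block_size, min_split_size=1):
    """Approximate how many input splits a single file produces.

    Sketch of Hadoop FileInputFormat.getSplits(): the goal size is
    totalSize / minSplits, so with minSplits=2 even a 2.6 KB file
    is divided into two splits (and hence two RDD partitions).
    """
    goal_size = total_size // max(min_splits, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    SPLIT_SLOP = 1.1  # the last chunk may be up to 10% larger than split_size
    splits = 0
    remaining = total_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits

# A 2.6 KB file with minSplits=2 on a 64 MB-block HDFS: two partitions.
print(num_splits(2600, 2, 64 * 1024 * 1024))  # -> 2
# With minSplits=1, the whole file stays in one partition.
print(num_splits(2600, 1, 64 * 1024 * 1024))  # -> 1
```

The key point is that the goal size scales with minSplits, not with the file's block count, which is why a single-block file can still produce two partitions.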
[1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext

On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:

> I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb.
>
> Given the size, and that it is a single file, I assumed it would only be
> in a single partition. But when I cache it, I can see in the Spark App UI
> that it actually splits it into two partitions:
>
> [image: Inline image 1]
>
> Is this correct behavior? How does Spark decide how big a partition
> should be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my "cluster", so both
> partitions are stored on the same worker.
>
> The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana