I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.

Given the size, and that it is a single file, I assumed it would end up in
a single partition. But when I cache it, I can see in the Spark App UI
that Spark actually splits it into two partitions:

[image: Spark App UI screenshot showing the cached RDD stored as two partitions]
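For reference, this is roughly what I'm running -- a minimal sketch, with a
placeholder path and app name rather than my exact code:

  import org.apache.spark.{SparkConf, SparkContext}

  object TinyFilePartitions {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("tiny-file-partitions"))

      // Load the 23-line, 2.6 KB file from HDFS and cache it.
      val lines = sc.textFile("hdfs:///path/to/tiny-file.txt")
      lines.cache()

      // Force materialization so the cached blocks show up in the Storage tab,
      // then print how many partitions the RDD actually has.
      println("count = " + lines.count() + ", partitions = " + lines.partitions.length)

      sc.stop()
    }
  }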

Is this the correct behavior? How does Spark decide how big a partition should
be, or how many partitions to create for an RDD?
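In case it clarifies what I'm asking: I know textFile also takes a minimum
number of partitions as a second argument, which I'm not setting, so I assume
the default is what produces the two partitions. A fragment just to illustrate
the knob I mean (using the same sc and placeholder path as the sketch above):

  // Pass 1 as the minimum number of partitions; for a single-block file
  // I'd expect this to yield a single partition.
  val oneSplit = sc.textFile("hdfs:///path/to/tiny-file.txt", 1)
  println(oneSplit.partitions.length)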

If it matters, I have only a single worker in my "cluster", so both
partitions are stored on the same worker.

The file was on HDFS and was only a single block.

Thanks for any insight.

Diana

