Yup, one reason the default is 2 is to give people an experience similar to working with large files, in case their code doesn't deal well with the file being partitioned. You can pass 1 as shown in the sketch below if you want a single partition.
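For example, something like this (a minimal sketch, assuming a local master; the app name and file path are placeholders, not taken from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object SinglePartitionExample {
      def main(args: Array[String]): Unit = {
        // Placeholder app name and local master, just for illustration.
        val conf = new SparkConf().setAppName("SinglePartitionExample").setMaster("local")
        val sc = new SparkContext(conf)

        // textFile's second argument (minSplits) defaults to 2;
        // passing 1 asks Spark to read the file as a single partition.
        // The path below is a placeholder.
        val lines = sc.textFile("hdfs:///path/to/tiny-file.txt", 1)

        // For a single-block file this should print 1.
        println(lines.partitions.length)

        sc.stop()
      }
    }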
Matei

On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Take a look at the minSplits argument for SparkContext#textFile [1] -- the
> default value is 2. You can simply set this to 1 if you'd prefer not to split
> your data.
>
> [1]
> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>
>
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.
>
> Given the size, and that it is a single file, I assumed it would only be in a
> single partition. But when I cache it, I can see in the Spark App UI that
> it is actually split into two partitions:
>
> <sparkdev_2014-04-11.png>
>
> Is this correct behavior? How does Spark decide how big a partition should
> be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my "cluster", so both
> partitions are stored on the same worker.
>
> The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana