Looking at the Python version of
textFile()<http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile>,
shouldn't it be "*max*(self.defaultParallelism, 2)"?

If the default parallelism is, say, 4, wouldn't we want to use that for
minSplits instead of 2?
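
To make the question concrete, here is a small standalone sketch of the two
behaviors (my paraphrase of the default as I read it, not the actual pyspark
source; default_parallelism stands in for self.defaultParallelism):

    # Roughly what the current default appears to do: cap the default at 2.
    def current_default(min_splits, default_parallelism):
        return min_splits or min(default_parallelism, 2)

    # What the question above suggests instead: use the larger of the two.
    def proposed_default(min_splits, default_parallelism):
        return min_splits or max(default_parallelism, 2)

    print(current_default(None, 4))   # 2
    print(proposed_default(None, 4))  # 4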


On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yup, one reason it’s actually 2 is to give people a similar experience to
> working with large files, in case their code doesn’t deal well with the
> file being partitioned.
>
> Matei
>
> On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
> Take a look at the minSplits argument for SparkContext#textFile [1] -- the
> default value is 2. You can simply set this to 1 if you'd prefer not to
> split your data.
>
> [1]
> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
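>
> A minimal PySpark illustration of passing minSplits explicitly (the file
> path is just a placeholder):
>
>     from pyspark import SparkContext
>
>     sc = SparkContext("local", "min-splits-example")
>     # Ask for a single split so the small file stays in one partition
>     # instead of the default of 2.
>     rdd = sc.textFile("hdfs:///path/to/small-file.txt", 1).cache()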
>
>
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB
>>
>> Given the size, and that it is a single file, I assumed it would only be
>> in a single partition.  But when I cache it,  I can see in the Spark App UI
>> that it actually splits it into two partitions:
>>
>> <sparkdev_2014-04-11.png>
>>
>> Is this correct behavior?  How does Spark decide how big a partition
>> should be, or how many partitions to create for an RDD?
>>
>> If it matters, I have only a single worker in my "cluster", so both
>> partitions are stored on the same worker.
>>
>> The file was on HDFS and was only a single block.
>>
>> Thanks for any insight.
>>
>> Diana
>>
>>
>>
>
>
