The partitions parameter to textFile is a minimum ("minPartitions"), so you get at least that level of parallelism. Spark delegates split computation for the file to Hadoop (yes, even for a text file on local disk and not HDFS). You can take a look at the code in FileInputFormat - but briefly, it computes a split size from the requested number of splits and the block size, and creates at least the number of partitions passed in. It can create more splits than requested.
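To make the "at least" concrete, here is a minimal sketch of the split arithmetic in FileInputFormat (old mapred API), simplified: goalSize is the file size divided (integer division) by the requested number of splits, and the last chunk is only merged into the previous split if it is within a 10% slop factor. The 3645-byte file size below is a hypothetical example chosen to reproduce the 100 -> 102 behavior, not the actual size of README.md.

```scala
// Simplified sketch of Hadoop FileInputFormat.getSplits for a single file.
// Shows why asking for 100 splits can yield more than 100.
object SplitSketch {
  val SPLIT_SLOP = 1.1 // last split may be up to 10% larger than splitSize

  def numSplits(totalSize: Long, requestedSplits: Int,
                blockSize: Long = 32L * 1024 * 1024,
                minSize: Long = 1): Int = {
    val goalSize  = totalSize / requestedSplits              // integer division
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    var bytesRemaining = totalSize
    var splits = 0
    // Carve off full-size splits while more than SPLIT_SLOP of a split remains
    while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
      splits += 1
      bytesRemaining -= splitSize
    }
    if (bytesRemaining != 0) splits += 1                     // trailing short split
    splits
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 3645-byte file, 100 requested splits:
    // goalSize = 36, so 101 chunks of 36 bytes plus a trailing 9-byte chunk.
    println(numSplits(3645, 100)) // 102
  }
}
```

Because goalSize is rounded down, the file doesn't divide evenly into the requested count, and the leftover bytes spill into extra splits - which is how 100 becomes 102.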
Hope this helps,
Kostas

On Mon, Feb 9, 2015 at 8:00 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
> Hi folks, puzzled by something pretty simple:
>
> I have a standalone cluster with default parallelism of 2, spark-shell
> running with 2 cores
>
> sc.textFile("README.md").partitions.size returns 2 (this makes sense)
> sc.textFile("README.md").coalesce(100,true).partitions.size returns 100,
> also makes sense
>
> but
>
> sc.textFile("README.md",100).partitions.size
> gives 102 -- I was expecting this to be equivalent to the last statement
> (i.e. result in 100 partitions)
>
> I'd appreciate if someone can enlighten me as to why I end up with 102
> This is on Spark 1.2
>
> thanks