The partitions parameter to textFile is a minimum ("minPartitions"), so you get at least that level of parallelism. Spark delegates split computation for the file to Hadoop (yes, even for a text file on local disk and not HDFS). You can take a look at the code in FileInputFormat - but briefly, it computes a split size from the requested number of splits and the block size, and creates at least the number of partitions passed in. It can create more splits than requested.
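To make the "at least" concrete, here is a minimal sketch of the split arithmetic in FileInputFormat (old mapred API), simplified: goalSize is the file size divided (integer division) by the requested number of splits, and the last chunk is only merged into the previous split if it is within a 10% slop factor. The 3645-byte file size below is a hypothetical example chosen to reproduce the 100 -> 102 behavior, not the actual size of README.md.

```scala
// Simplified sketch of Hadoop FileInputFormat.getSplits for a single file.
// Shows why asking for 100 splits can yield more than 100.
object SplitSketch {
  val SPLIT_SLOP = 1.1 // last split may be up to 10% larger than splitSize

  def numSplits(totalSize: Long, requestedSplits: Int,
                blockSize: Long = 32L * 1024 * 1024,
                minSize: Long = 1): Int = {
    val goalSize  = totalSize / requestedSplits              // integer division
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    var bytesRemaining = totalSize
    var splits = 0
    // Carve off full-size splits while more than SPLIT_SLOP of a split remains
    while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
      splits += 1
      bytesRemaining -= splitSize
    }
    if (bytesRemaining != 0) splits += 1                     // trailing short split
    splits
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 3645-byte file, 100 requested splits:
    // goalSize = 36, so 101 chunks of 36 bytes plus a trailing 9-byte chunk.
    println(numSplits(3645, 100)) // 102
  }
}
```

Because goalSize is rounded down, the file doesn't divide evenly into the requested count, and the leftover bytes spill into extra splits - which is how 100 becomes 102.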
Hope this helps,
Kostas

On Mon, Feb 9, 2015 at 8:00 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
> Hi folks, puzzled by something pretty simple:
>
> I have a standalone cluster with default parallelism of 2, spark-shell
> running with 2 cores
>
> sc.textFile("README.md").partitions.size returns 2 (this makes sense)
> sc.textFile("README.md").coalesce(100,true).partitions.size returns 100,
> also makes sense
>
> but
>
> sc.textFile("README.md",100).partitions.size
> gives 102 -- I was expecting this to be equivalent to the last statement
> (i.e. result in 100 partitions)
>
> I'd appreciate if someone can enlighten me as to why I end up with 102
> This is on Spark 1.2
>
> thanks