The files sound too small to span two blocks in HDFS.
Did you set defaultParallelism to 3 in your Spark configuration?
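For what it's worth, you can check both values from the shell (just a quick sketch; the path below is a placeholder):

scala> sc.defaultParallelism                             // backed by spark.default.parallelism when that is set
scala> sc.defaultMinPartitions                           // math.min(defaultParallelism, 2), used by textFile
scala> sc.textFile("path/to/dir", 2).partitions.length   // or request a minimum number of partitions explicitly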
Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +0000

Hi Rares,

The number of partitions is controlled by the HDFS input format, and one file may 
have multiple partitions if it consists of multiple blocks. In your case, I think 
one of the files is read as 2 splits.
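As a rough sketch (the directory name is a placeholder, and the comments are only what I would expect for small files, not results from a real run), you can see how the requested minimum number of splits interacts with the files:

scala> sc.textFile("outdir").partitions.length      // default minPartitions = sc.defaultMinPartitions
scala> sc.textFile("outdir", 1).partitions.length   // small files typically end up as one split per file
scala> sc.textFile("outdir", 4).partitions.length   // with a larger hint, FileInputFormat may cut each file into more splits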

Thanks.

Zhan Zhang


On Mar 27, 2015, at 3:12 PM, Rares Vernica <rvern...@gmail.com> wrote:


Hello,

I am using the Spark shell in Scala on localhost. I am using 
sc.textFile to read a directory. The directory looks like this (generated by 
another Spark script):

part-00000
part-00001
_SUCCESS

The part-00000 file has four short lines of text, while
part-00001 has two short lines of text. The
_SUCCESS file is empty. When I check the number of partitions on the RDD, I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3

I wonder why the two input files generate three partitions. Does Spark check 
the number of lines in each file and try to generate three balanced partitions?
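One way I could look at how the lines actually ended up distributed (a hypothetical follow-up in the same shell session, just printing per-partition line counts):

scala> foo.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect
// one (partition index, line count) pair per partition, so three pairs here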

Thanks!
Rares