Re: 2 input paths generate 3 partitions

2015-03-27 Thread Zhan Zhang
Hi Rares,

The number of partitions is controlled by the HDFS input format, and one file may 
have multiple partitions if it consists of multiple blocks. In your case, I think 
there is one file with 2 splits.
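
For illustration, here is a rough Scala sketch of how the old-API Hadoop FileInputFormat (which sc.textFile relies on) sizes splits. It works on bytes, not lines: the target size is roughly the total input size divided by the requested number of splits, capped by the block size, and each file is cut separately. The object and method names, the 40-byte and 20-byte file sizes, and the 32 MB block size are assumptions made up for this example; only the 1.1 slack factor and the overall shape follow Hadoop's logic as I understand it.

```scala
object SplitSketch {
  // Same slack factor Hadoop uses: the last chunk may be up to 10% larger than splitSize.
  val SplitSlop = 1.1

  /** Target split size in bytes for a set of files and a requested number of splits. */
  def splitSize(fileLengths: Seq[Long], numSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = fileLengths.sum / math.max(numSplits, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  /** Byte lengths of the splits cut from one non-empty file (empty files are not modeled). */
  def splitsForFile(length: Long, size: Long): List[Long] = {
    val out = scala.collection.mutable.ListBuffer.empty[Long]
    var remaining = length
    // Keep cutting full-size splits while clearly more than one split's worth is left.
    while (remaining.toDouble / size > SplitSlop) {
      out += size
      remaining -= size
    }
    // Whatever is left becomes the final (possibly smaller) split.
    if (remaining != 0) out += remaining
    out.toList
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: part-0 holds four short lines, part-1 two (about 10 bytes per line).
    val lengths = Seq(40L, 20L)
    val size = splitSize(lengths, numSplits = 2, blockSize = 32L * 1024 * 1024)
    val perFile = lengths.map(splitsForFile(_, size))
    println(s"splitSize = $size, splits per file = $perFile, total = ${perFile.map(_.size).sum}")
  }
}
```

With these assumed sizes and a request for 2 splits, the target size is 30 bytes, so the 40-byte file is cut in two and the 20-byte file stays whole, giving three splits in total. Files whose names start with an underscore, such as _SUCCESS, are filtered out before this step, which matches the "Total input paths to process : 2" log line.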

Thanks.

Zhan Zhang
On Mar 27, 2015, at 3:12 PM, Rares Vernica rvern...@gmail.com wrote:

Hello,

I am using the Spark shell in Scala on the localhost. I am using sc.textFile to 
read a directory. The directory looks like this (generated by another Spark 
script):

part-0
part-1
_SUCCESS

part-0 has four short lines of text, while part-1 has two short 
lines of text. The _SUCCESS file is empty. When I check the number of 
partitions on the RDD, I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3

I wonder why the two input files generate three partitions. Does Spark check 
the number of lines in each file and try to generate three balanced partitions?

Thanks!
Rares



RE: 2 input paths generate 3 partitions

2015-03-27 Thread java8964
The files sound too small to be 2 blocks in HDFS.
Did you set defaultParallelism to 3 in your Spark configuration?
Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +

Re: 2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hi,

I am not using HDFS; I am using the local file system. Moreover, I did not
modify defaultParallelism. The Spark instance is the default one
started by the Spark shell.
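
For reference, defaultParallelism does not have to be 3 for this to happen. As far as I know, in the Spark 1.x API sc.textFile(path) passes sc.defaultMinPartitions as its split hint, and that value is defined as the minimum of defaultParallelism and 2. The hint is a goal for the input format rather than an exact count, so the resulting number of partitions can be higher. The sketch below only mirrors that one definition in plain Scala; the object name and the list of core counts are hypothetical.

```scala
object MinPartitionsSketch {
  /** Mirrors SparkContext.defaultMinPartitions: the default split hint never exceeds 2. */
  def defaultMinPartitions(defaultParallelism: Int): Int = math.min(defaultParallelism, 2)

  def main(args: Array[String]): Unit = {
    // Hypothetical core counts for a local shell.
    for (cores <- Seq(1, 2, 4, 8))
      println(s"defaultParallelism = $cores -> minPartitions hint = ${defaultMinPartitions(cores)}")
  }
}
```

In the shell itself, sc.defaultParallelism and sc.defaultMinPartitions show the actual values, and passing an explicit hint, for example sc.textFile(path, 1), is one way to see how the partition count changes.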

Thanks!
Rares


On Fri, Mar 27, 2015 at 4:48 PM, java8964 java8...@hotmail.com wrote:

 The files sound too small to be 2 blocks in HDFS.

 Did you set defaultParallelism to 3 in your Spark configuration?

 Yong
