2 input paths generate 3 partitions
Hello, I am using the Spark shell in Scala on localhost. I am using sc.textFile to read a directory generated by another Spark script, which looks like this:

part-0  part-1  _SUCCESS

part-0 has four short lines of text, part-1 has two short lines, and the _SUCCESS file is empty. When I check the number of partitions on the RDD I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3

I wonder why the two input files generate three partitions. Does Spark check the number of lines in each file and try to generate three balanced partitions?

Thanks!
Rares
Re: 2 input paths generate 3 partitions
Hi Rares,

The number of partitions is controlled by the HDFS input format, and one file may produce multiple partitions if it consists of multiple blocks. In your case, I think one of the files is being read as 2 splits.

Thanks.

Zhan Zhang

On Mar 27, 2015, at 3:12 PM, Rares Vernica rvern...@gmail.com wrote:
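For reference, the split count comes from the arithmetic in Hadoop's FileInputFormat.getSplits. Below is a simplified stand-alone sketch of that logic (the goalSize, splitSize, and 1.1 slack factor mirror the real implementation, but this toy object is my own reconstruction, not actual Hadoop code, so treat the details as assumptions):

```scala
// Simplified sketch of org.apache.hadoop.mapred.FileInputFormat.getSplits:
// goalSize = totalSize / minSplits, and a file is split whenever the
// bytes remaining exceed splitSize by more than a 1.1 slack factor.
object SplitMath {
  val SplitSlop = 1.1 // slack factor used by FileInputFormat

  def numSplits(fileSizes: Seq[Long], minSplits: Int,
                blockSize: Long = 32L * 1024 * 1024, // local-FS-sized block
                minSize: Long = 1): Int = {
    val totalSize = fileSizes.sum
    val goalSize  = totalSize / math.max(minSplits, 1)
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    fileSizes.map { size =>
      var remaining = size
      var splits    = 0
      while (remaining.toDouble / splitSize > SplitSlop) {
        splits += 1
        remaining -= splitSize
      }
      if (remaining > 0) splits += 1
      splits
    }.sum
  }
}

// Hypothetical sizes for the two part files (e.g. 40 and 20 bytes):
// goalSize = 60 / 2 = 30, so the 40-byte file yields 2 splits and the
// 20-byte file yields 1, for 3 splits total.
println(SplitMath.numSplits(Seq(40L, 20L), minSplits = 2)) // 3
```

So even with tiny files and no HDFS, a file larger than goalSize * 1.1 gets cut into two splits, which would explain 2 + 1 = 3 partitions.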
RE: 2 input paths generate 3 partitions
The files sound too small to be 2 blocks in HDFS. Did you set defaultParallelism to 3 in your Spark configuration?

Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +
Re: 2 input paths generate 3 partitions
Hi,

I am not using HDFS; I am using the local file system. Moreover, I did not modify defaultParallelism. The Spark instance is the default one started by the Spark shell.

Thanks!
Rares

On Fri, Mar 27, 2015 at 4:48 PM, java8964 java8...@hotmail.com wrote:
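One note on the defaultParallelism theory: if I recall the Spark source correctly (worth verifying against your version), sc.textFile does not pass defaultParallelism straight through as the minimum split count; it passes defaultMinPartitions, which is capped at 2. A minimal sketch of that definition:

```scala
// SparkContext.defaultMinPartitions is defined as
// math.min(defaultParallelism, 2), and textFile hands this to the
// Hadoop input format as the minimum number of splits. So even a
// multi-core local shell asks for at least 2 splits, never 3 --
// the third partition must come from the input format splitting
// one of the files, not from defaultParallelism being 3.
def defaultMinPartitions(defaultParallelism: Int): Int =
  math.min(defaultParallelism, 2)

// e.g. a local shell with 8 cores still requests only 2 splits:
println(defaultMinPartitions(8)) // 2
```

With minSplits = 2 and two small files, FileInputFormat's goalSize math can then split the larger file in two, giving the observed 3 partitions.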