Re: Number Of Partitions in RDD
Local mode.

-- Vikash Pareek
Re: Number Of Partitions in RDD
Cluster mode with HDFS, or local mode?
Re: Number Of Partitions in RDD
Spark 1.6.1
Re: Number Of Partitions in RDD
What version of Spark are you using?
Re: Number Of Partitions in RDD
While I'm not sure why you're seeing an increase in partitions with such a small data file, it's worth noting that the second parameter to textFile is the *minimum* number of partitions, so there's no guarantee you'll get exactly that number.

--
Michael Mior
mm...@apache.org

2017-06-01 6:28 GMT-04:00 Vikash Pareek <vikash.par...@infoobjects.com>:
> Hi,
>
> I am creating an RDD from a text file by specifying the number of
> partitions, but it gives me a different number of partitions than the
> one specified.
>
> [... spark-shell transcript snipped; see the original message below ...]
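For what it's worth, the increase most likely comes from how Hadoop's FileInputFormat computes splits, which is what textFile uses under the hood: the requested minimum becomes a goal size of totalSize / numSplits (integer division), and the file is carved into chunks of that size, with the last chunk allowed to be up to 10% larger (SPLIT_SLOP = 1.1). A rough sketch of that arithmetic, assuming the 52-byte test file quoted in the original message below:

// Sketch of the split computation in Hadoop's FileInputFormat#getSplits.
// Assumes minSize = 1 and a filesystem block size far larger than the
// file, so that splitSize ends up equal to goalSize.
def numSplits(totalSize: Long, requested: Int): Int = {
  val SPLIT_SLOP = 1.1                               // last split may be up to 10% larger
  val goalSize  = totalSize / math.max(requested, 1) // integer division
  val splitSize = math.max(1L, goalSize)
  var remaining = totalSize
  var splits    = 0
  while (remaining.toDouble / splitSize > SPLIT_SLOP) {
    splits += 1
    remaining -= splitSize
  }
  if (remaining != 0) splits += 1                    // final (possibly larger) split
  splits
}

(0 to 11).foreach(n => println(s"requested $n -> got ${numSplits(52, n)}"))

For a 52-byte file this reproduces the counts observed below, e.g. requested 9 -> got 11 and requested 11 -> got 13: integer division makes the goal size small, and the slop rule then yields more chunks than were asked for.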
Number Of Partitions in RDD
Hi,

I am creating an RDD from a text file by specifying the number of partitions, but it gives me a different number of partitions than the one specified.

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27

scala> people.getNumPartitions
res47: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27

scala> people.getNumPartitions
res36: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27

scala> people.getNumPartitions
res37: Int = 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27

scala> people.getNumPartitions
res38: Int = 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27

scala> people.getNumPartitions
res39: Int = 4

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27

scala> people.getNumPartitions
res40: Int = 6

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27

scala> people.getNumPartitions
res41: Int = 7

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27

scala> people.getNumPartitions
res42: Int = 8

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27

scala> people.getNumPartitions
res43: Int = 9

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27

scala> people.getNumPartitions
res44: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27

scala> people.getNumPartitions
res45: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27

scala> people.getNumPartitions
res46: Int = 13

Contents of the file /home/pvikash/data/test.txt:
"
This is a test file.
Will be used for rdd partition
"

I am trying to understand why the number of partitions changes here, and why Spark creates empty partitions when the data is small enough to fit into a single partition.

Any explanation would be appreciated.

--Vikash
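On the empty-partitions part of the question, a small sketch (not from the original thread, reusing the same test file) that shows how many records each partition actually holds:

val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)

// Count the records held by each of the resulting partitions.
people
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .foreach { case (i, n) => println(s"partition $i: $n record(s)") }

Because a text line is (roughly speaking) read by the split in which it starts, only two of the eleven partitions report any records here; the rest are empty placeholders, each of which still costs a cheap no-op task.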
Number of partitions in RDD for input DStreams
Hi list,

In an excellent blog post on Kafka and Spark Streaming integration (http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/), Michael Noll poses an assumption about the number of partitions of the RDDs created by input DStreams. His hypothesis is that the number of partitions per RDD is batchInterval / spark.streaming.blockInterval.

I guess this is based on the following extract from the Spark Streaming Programming Guide, in the section "Level of Parallelism in Data Receiving":

"Another parameter that should be considered is the receiver's blocking interval. For most receivers, the received data is coalesced together into large blocks of data before storing inside Spark's memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation. This blocking interval is determined by the configuration parameter spark.streaming.blockInterval (https://spark.apache.org/docs/1.1.0/configuration.html), and the default value is 200 milliseconds."

Could someone confirm whether that hypothesis is true or false? And if it is false, is there any way to know the number of partitions per RDD for an input DStream?

Thanks a lot for your help.

Greetings,
Juan Rodriguez
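One empirical way to check the hypothesis is to print each batch RDD's partition count from foreachRDD. A minimal sketch, assuming a socket text stream on a hypothetical localhost:9999 and a 2-second batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("PartitionCountCheck")
val ssc  = new StreamingContext(conf, Seconds(2)) // default blockInterval is 200 ms

val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  // If the hypothesis holds, this should print about 2000 / 200 = 10
  // for every batch in which the receiver actually stored data.
  println(s"partitions in this batch: ${rdd.partitions.length}")
}

ssc.start()
ssc.awaitTermination()

Note that blocks are only generated while data is arriving, so an idle or half-idle batch can show fewer partitions than batchInterval / blockInterval.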