My use case as mentioned below. 1. Read input data from local file system using sparkContext.textFile(input path). 2. partition the input data(80 million records) into partitions using RDD.coalesce(numberOfPArtitions) before submitting it to mapper/reducer function. Without using coalesce() or repartition() on the input data spark executes really slow and fails with out of memory exception.
The issue i am facing here is in deciding the number of partitions to be applied on the input data. *The input data size varies every time and hard coding a particular value is not an option. And spark performs really well only when certain optimum partition is applied on the input data for which i have to perform lots of iteration(trial and error). Which is not an option in a production environment.* My question: Is there a thumb rule to decide the number of partitions required depending on the input data size and cluster resources available(executors,cores, etc...)? If yes please point me in that direction. Any help is much appreciated. I am using spark 1.0 on yarn. Thanks, AG -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org