A "task" is the work to be done on a partition for a given stage. You should expect the number of tasks to equal the number of partitions in each stage, though a task might need to be rerun (due to a failure, or the need to recompute some data).
2-4 times the number of cores in your cluster should be a good starting place. Then you can try different values and see how they affect your performance.

On Mon, Sep 29, 2014 at 5:01 PM, anny9699 <anny9...@gmail.com> wrote:
> Hi,
>
> I read the past posts about partition number, but am still a little
> confused about partitioning strategy.
>
> I have a cluster with 8 workers and 2 cores for each worker. Is it true
> that the optimal partition number should be 2-4 * total_coreNumber, or
> should it approximately equal total_coreNumber? Or is it the task number
> that really determines the speed rather than the partition number?
>
> Thanks a lot!

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
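As a rough sketch of the rule of thumb above, using the 8-worker, 2-core cluster from the question (the helper function name here is my own, not a Spark API):

```python
def suggested_partitions(num_workers, cores_per_worker, factor):
    """Rule-of-thumb partition count: 2-4x the total cores in the cluster."""
    total_cores = num_workers * cores_per_worker
    return total_cores * factor

# Cluster from the question: 8 workers x 2 cores = 16 total cores.
low = suggested_partitions(8, 2, factor=2)   # 32 partitions
high = suggested_partitions(8, 2, factor=4)  # 64 partitions
print(low, high)
```

In Spark you would then pass a value in that range where a partition count is accepted, e.g. as the `minPartitions` argument to `sc.textFile(...)` or via `rdd.repartition(n)`, and benchmark to see what works best for your workload.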