Hi,

The input data has 2048 partitions. The final step loads the processed data into HBase via saveAsNewAPIHadoopDataset(). Every step except the last one runs in parallel across the cluster, but the last step has only 1 task, which runs on a single node using one core.
Environment: Spark 1.1.1 + CDH 5.3.0.

Should I set numPartitions in the reduceByKey call to some large number? I did not set that parameter in the current code. This reduceByKey is the one that runs right before the saveAsNewAPIHadoopDataset() call. Any ideas? Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-could-cause-number-of-tasks-to-go-down-from-2k-to-1-tp21430.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
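For reference, a minimal sketch of the shape of those last two steps with an explicit partition count passed to reduceByKey. The names here (processed, hbaseConf, the "cf"/"count" column) are placeholders, not the actual job, and it assumes an RDD of (String, Long) pairs being written as HBase Puts:

```scala
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Assumed inputs (hypothetical names):
//   processed: RDD[(String, Long)] produced by the earlier parallel stages
//   hbaseConf: a Hadoop Configuration with TableOutputFormat.OUTPUT_TABLE set
def writeToHBase(processed: RDD[(String, Long)],
                 hbaseConf: org.apache.hadoop.conf.Configuration): Unit = {

  // Passing numPartitions explicitly keeps the shuffle output (and hence
  // the HBase write stage that follows) spread over many tasks instead of
  // collapsing to whatever default the shuffle picks.
  val reduced = processed.reduceByKey(_ + _, 2048)

  val puts = reduced.map { case (key, count) =>
    val put = new Put(Bytes.toBytes(key))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count))
    (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
  }

  puts.saveAsNewAPIHadoopDataset(hbaseConf)
}
```

The write stage inherits its task count from the partitioning of the RDD being saved, so if reduceByKey (or anything after it, e.g. a coalesce(1)) produces one partition, the save runs as one task.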