Hi, 

The input data has 2048 partitions. The final step loads the processed data
into HBase through saveAsNewAPIHadoopDataset(). Every step except the last
one runs in parallel across the cluster, but the last step has only 1 task,
which runs on a single node using a single core.
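
For context, the write step looks roughly like this (table name, column
family, and variable names are placeholders rather than the actual code):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    // Configure the HBase output table for the new Hadoop API.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Convert each (key, value) pair into an HBase Put and write it out.
    val puts = reduced.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }
    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)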

Spark 1.1.1 + CDH 5.3.0.

Should I perhaps set the numPartitions argument in the reduceByKey call to a
larger number? I did not set this parameter in the current code. This
reduceByKey call is the one that runs right before the
saveAsNewAPIHadoopDataset() call.
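
Something along these lines is what I have in mind (the pair RDD and the
reduce function are placeholders, not the actual code):

    // Pass an explicit partition count so the shuffle keeps 2048 partitions.
    val reduced = pairs.reduceByKey((a, b) => a + b, 2048)

    // Or force the partition count just before the write instead:
    val repartitioned = reduced.repartition(2048)
    repartitioned.saveAsNewAPIHadoopDataset(job.getConfiguration)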

Any idea? Thanks!


