Hi Team, I am running the k-means algorithm on the KDD Cup 1999 data set (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) for different values of k: 5, 10, 15, ..., 40. The data set is 709 MB. I have placed the file in HDFS with a block size of 128 MB (6 blocks).
The cluster has 4 nodes (1 master and 3 workers). The workers have a total of 20 GB of memory and 12 cores available. When k is 2 (the default), the job takes 4.2 min. When k takes the values 5, 10, 15, ..., 40, the total duration is 42 min, which seems very long. Are there any optimizations that need to be done? I continuously see the following in the logs:

16/03/03 15:58:28 INFO scheduler.TaskSchedulerImpl: Adding task set 316.0 with 6 tasks
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 316.0 (TID 1898, pg-poc-04, partition 0,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 316.0 (TID 1899, pg-poc-02, partition 5,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 316.0 (TID 1900, pg-poc-04, partition 1,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 316.0 (TID 1901, pg-poc-04, partition 2,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 316.0 (TID 1902, pg-poc-04, partition 3,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_471_piece0 in memory on pg-poc-02:53774 (size: 1680.0 B, free: 177.1 MB)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_471_piece0 in memory on pg-poc-04:32800 (size: 1680.0 B, free: 333.9 MB)
16/03/03 15:58:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 106 to pg-poc-02:51863
16/03/03 15:58:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 106 to pg-poc-04:42511
16/03/03 15:58:28 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 106 is 209 bytes
16/03/03 15:58:28 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 106 is 209 bytes
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 316.0 (TID 1899) in 21 ms on pg-poc-02 (1/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 316.0 (TID 1903, pg-poc-04, partition 4,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 316.0 (TID 1902) in 22 ms on pg-poc-04 (2/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 316.0 (TID 1901) in 23 ms on pg-poc-04 (3/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 316.0 (TID 1900) in 23 ms on pg-poc-04 (4/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 316.0 (TID 1898) in 24 ms on pg-poc-04 (5/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 316.0 (TID 1903) in 7 ms on pg-poc-04 (6/6)
16/03/03 15:58:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 316.0, whose tasks have all completed, from pool
16/03/03 15:58:28 INFO scheduler.DAGScheduler: ResultStage 316 (collectAsMap at KMeans.scala:302) finished in 0.030 s
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Job 209 finished: collectAsMap at KMeans.scala:302, took 11.753212 s
16/03/03 15:58:28 INFO clustering.KMeans: Iterations took 205.479 seconds.
16/03/03 15:58:28 INFO clustering.KMeans: KMeans reached the max number of iterations: 20.
16/03/03 15:58:28 INFO clustering.KMeans: The cost for the best run is 2.340836927489414E13.
16/03/03 15:58:28 INFO rdd.MapPartitionsRDD: Removing RDD 365 from persistence list
16/03/03 15:58:28 INFO storage.BlockManager: Removing RDD 365
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_472 stored as values in memory (estimated size 13.3 KB, free 588.2 KB)
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_472_piece0 stored as bytes in memory (estimated size 5.6 KB, free 593.8 KB)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_472_piece0 in memory on 10.10.10.90:56180 (size: 5.6 KB, free: 511.4 MB)
16/03/03 15:58:28 INFO spark.SparkContext: Created broadcast 472 from broadcast at AnomalyDetection.scala:34
16/03/03 15:58:28 INFO spark.SparkContext: Starting job: mean at AnomalyDetection.scala:35
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Got job 210 (mean at AnomalyDetection.scala:35) with 6 output partitions
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Final stage: ResultStage 317 (mean at AnomalyDetection.scala:35)
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Missing parents: List()
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Submitting ResultStage 317 (MapPartitionsRDD[433] at mean at AnomalyDetection.scala:35), which has no missing parents
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_473 stored as values in memory (estimated size 4.2 KB, free 598.0 KB)
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_473_piece0 stored as bytes in memory (estimated size 2.3 KB, free 600.3 KB)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_473_piece0 in memory on 10.10.10.90:56180 (size: 2.3 KB, free: 511.4 MB)
16/03/03 15:58:28 INFO spark.SparkContext: Created broadcast 473 from broadcast at DAGScheduler.scala:1006
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 317 (MapPartitionsRDD[433] at mean at AnomalyDetection.scala:35)
16/03/03 15:58:28 INFO scheduler.TaskSchedulerImpl: Adding task set 317.0 with 6 tasks
. . . .

Regards,
Padma Ch
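
For reference, the figures quoted in this message can be put side by side with a little arithmetic. This is only a rough sketch: the variable names are made up for illustration, every number is copied from the message and the log excerpt, and the per-iteration estimate assumes that the reported "Iterations took 205.479 seconds" covers all 20 iterations of a single run.

```python
import math

# Input size and HDFS block size, as stated in the message.
file_mb = 709
block_mb = 128
blocks = math.ceil(file_mb / block_mb)  # matches the "6 tasks" per stage in the log

# The sweep over k: 5, 10, 15, ..., 40.
k_values = list(range(5, 41, 5))
runs = len(k_values)

# Wall-clock figures from the message (minutes).
sweep_min = 42.0      # total for the whole sweep
baseline_min = 4.2    # single run with k = 2

avg_min_per_run = sweep_min / runs          # average time per value of k
ratio_to_baseline = avg_min_per_run / baseline_min

# Figures from the log excerpt (assumption: one run, 20 iterations).
iterations_s = 205.479
max_iterations = 20
s_per_iteration = iterations_s / max_iterations

print(f"blocks: {blocks}, runs in sweep: {runs}")
print(f"average per run: {avg_min_per_run:.2f} min "
      f"({ratio_to_baseline:.2f}x the k=2 run)")
print(f"roughly {s_per_iteration:.1f} s per iteration")
```

Read this way, the sweep averages a little over five minutes per value of k, so the 42 min total seems to come mostly from doing eight separate runs rather than from any single run being much slower than the k = 2 case.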