Spark MLlib k-means taking too much time

Priya Ch Thu, 03 Mar 2016 02:41:36 -0800

Hi Team,

   I am running the k-means algorithm on the KDD Cup 1999 data set (
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) for different
values of k: 5, 10, 15, ..., 40. The data set is 709 MB. I have placed the
file in HDFS with a block size of 128 MB (6 blocks).


The cluster has 4 nodes (1 master and 3 workers). The workers have a total
of 20 GB of memory and 12 cores available. When k is 2 (the default), the
job takes 4.2 min.
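
As a quick check on the numbers above, the sketch below works out how many
HDFS blocks a 709 MB file occupies at a 128 MB block size and how that
compares with the 12 worker cores. It assumes one input partition (and so one
task per stage) per block and one core per task; neither is stated in this
message, although the "6 tasks" lines in the logs are consistent with it. The
variable names are illustrative only.

```python
import math

# Figures quoted in the message.
file_size_mb = 709
block_size_mb = 128
total_cores = 12

# Number of HDFS blocks; the last block is only partially filled.
num_blocks = math.ceil(file_size_mb / block_size_mb)

# Assumption: one input partition, hence one task per stage, per block.
tasks_per_stage = num_blocks

# Assumption: each running task occupies exactly one core.
busy_cores = min(tasks_per_stage, total_cores)
idle_cores = total_cores - busy_cores

print(f"blocks={num_blocks} busy_cores={busy_cores} idle_cores={idle_cores}")
```

Under these assumptions, about half of the available cores would have no task
to run in any given stage.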

When k takes the values 5, 10, 15, ..., 40, the total duration is 42 min,
which is very long. Are there any optimizations that need to be done?
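
To put the two timings side by side, the arithmetic below splits the 42 min
total across the values of k and compares the average with the 4.2 min
measured for k = 2. It is only a rough sketch: it assumes the 42 min is the
combined wall-clock time of all runs and that the sequence 5, 10, 15, ..., 40
contains eight values. Individual runs may well differ from the average.

```python
# Values of k from the message: 5, 10, 15, ..., 40.
k_values = list(range(5, 45, 5))
num_runs = len(k_values)

# Timings quoted in the message, in minutes.
total_minutes = 42.0    # assumed to cover all runs combined
baseline_minutes = 4.2  # single run with k = 2

# Average time per run and its ratio to the k = 2 run.
minutes_per_run = total_minutes / num_runs
ratio_to_baseline = minutes_per_run / baseline_minutes

print(f"runs={num_runs} avg={minutes_per_run:.2f} min "
      f"ratio={ratio_to_baseline:.2f}")
```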

I continuously see the following in the logs:
16/03/03 15:58:28 INFO scheduler.TaskSchedulerImpl: Adding task set 316.0
with 6 tasks
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
316.0 (TID 1898, pg-poc-04, partition 0,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 5.0 in stage
316.0 (TID 1899, pg-poc-02, partition 5,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 1.0 in stage
316.0 (TID 1900, pg-poc-04, partition 1,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 2.0 in stage
316.0 (TID 1901, pg-poc-04, partition 2,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 3.0 in stage
316.0 (TID 1902, pg-poc-04, partition 3,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_471_piece0
in memory on pg-poc-02:53774 (size: 1680.0 B, free: 177.1 MB)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_471_piece0
in memory on pg-poc-04:32800 (size: 1680.0 B, free: 333.9 MB)
16/03/03 15:58:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 106 to pg-poc-02:51863
16/03/03 15:58:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 106 to pg-poc-04:42511
16/03/03 15:58:28 INFO spark.MapOutputTrackerMaster: Size of output
statuses for shuffle 106 is 209 bytes
16/03/03 15:58:28 INFO spark.MapOutputTrackerMaster: Size of output
statuses for shuffle 106 is 209 bytes
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 5.0 in stage
316.0 (TID 1899) in 21 ms on pg-poc-02 (1/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Starting task 4.0 in stage
316.0 (TID 1903, pg-poc-04, partition 4,NODE_LOCAL, 1967 bytes)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 3.0 in stage
316.0 (TID 1902) in 22 ms on pg-poc-04 (2/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 2.0 in stage
316.0 (TID 1901) in 23 ms on pg-poc-04 (3/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage
316.0 (TID 1900) in 23 ms on pg-poc-04 (4/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 0.0 in stage
316.0 (TID 1898) in 24 ms on pg-poc-04 (5/6)
16/03/03 15:58:28 INFO scheduler.TaskSetManager: Finished task 4.0 in stage
316.0 (TID 1903) in 7 ms on pg-poc-04 (6/6)
16/03/03 15:58:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 316.0,
whose tasks have all completed, from pool
16/03/03 15:58:28 INFO scheduler.DAGScheduler: ResultStage 316
(collectAsMap at KMeans.scala:302) finished in 0.030 s
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Job 209 finished:
collectAsMap at KMeans.scala:302, took 11.753212 s
16/03/03 15:58:28 INFO clustering.KMeans: Iterations took 205.479 seconds.
16/03/03 15:58:28 INFO clustering.KMeans: KMeans reached the max number of
iterations: 20.
16/03/03 15:58:28 INFO clustering.KMeans: The cost for the best run is
2.340836927489414E13.
16/03/03 15:58:28 INFO rdd.MapPartitionsRDD: Removing RDD 365 from
persistence list
16/03/03 15:58:28 INFO storage.BlockManager: Removing RDD 365
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_472 stored as
values in memory (estimated size 13.3 KB, free 588.2 KB)
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_472_piece0
stored as bytes in memory (estimated size 5.6 KB, free 593.8 KB)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_472_piece0
in memory on 10.10.10.90:56180 (size: 5.6 KB, free: 511.4 MB)
16/03/03 15:58:28 INFO spark.SparkContext: Created broadcast 472 from
broadcast at AnomalyDetection.scala:34
16/03/03 15:58:28 INFO spark.SparkContext: Starting job: mean at
AnomalyDetection.scala:35
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Got job 210 (mean at
AnomalyDetection.scala:35) with 6 output partitions
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Final stage: ResultStage 317
(mean at AnomalyDetection.scala:35)
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Parents of final stage:
List()
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Missing parents: List()
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Submitting ResultStage 317
(MapPartitionsRDD[433] at mean at AnomalyDetection.scala:35), which has no
missing parents
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_473 stored as
values in memory (estimated size 4.2 KB, free 598.0 KB)
16/03/03 15:58:28 INFO storage.MemoryStore: Block broadcast_473_piece0
stored as bytes in memory (estimated size 2.3 KB, free 600.3 KB)
16/03/03 15:58:28 INFO storage.BlockManagerInfo: Added broadcast_473_piece0
in memory on 10.10.10.90:56180 (size: 2.3 KB, free: 511.4 MB)
16/03/03 15:58:28 INFO spark.SparkContext: Created broadcast 473 from
broadcast at DAGScheduler.scala:1006
16/03/03 15:58:28 INFO scheduler.DAGScheduler: Submitting 6 missing tasks
from ResultStage 317 (MapPartitionsRDD[433] at mean at
AnomalyDetection.scala:35)
16/03/03 15:58:28 INFO scheduler.TaskSchedulerImpl: Adding task set 317.0
with 6 tasks
.
.
.
.

Regards,
Padma Ch
