[ https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395639#comment-14395639 ]
Sean Owen commented on SPARK-6706:
----------------------------------

I tried your code locally vs master with k=1000 (you say >100, but it works at 500, so I tried 1000), which you can do by building Spark and running the shell. I don't see it stuck in any {{collect()}} stage; those complete quickly. But the driver does bog down for a long time in {{LocalKMeans}}:

{code}
at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:121)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:104)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:311)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:522)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:496)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:490)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.GenSeqViewLike$Sliced$class.foreach(GenSeqViewLike.scala:42)
at scala.collection.mutable.IndexedSeqView$$anon$2.foreach(IndexedSeqView.scala:80)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:490)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:513)
at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:53)
at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:52)
at scala.collection.GenTraversableViewLike$Mapped$$anonfun$foreach$2.apply(GenTraversableViewLike.scala:81)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at scala.collection.GenTraversableViewLike$Mapped$class.foreach(GenTraversableViewLike.scala:80)
at scala.collection.SeqViewLike$$anon$3.foreach(SeqViewLike.scala:78)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.SeqViewLike$AbstractTransformed.foldLeft(SeqViewLike.scala:43)
at scala.collection.TraversableOnce$class.sum(TraversableOnce.scala:203)
at scala.collection.SeqViewLike$AbstractTransformed.sum(SeqViewLike.scala:43)
at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:54)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:396)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:393)
{code}

I think this is what Derrick was getting at in SPARK-3220: this part doesn't scale.

> kmeans|| hangs for a long time if both k and vector dimension are large
> -----------------------------------------------------------------------
>
>                 Key: SPARK-6706
>                 URL: https://issues.apache.org/jira/browse/SPARK-6706
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1, 1.3.0
>        Environment: Windows 64bit, Linux 64bit
>            Reporter: Xi Shen
>            Assignee: Xiangrui Meng
>              Labels: performance
>        Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm, which is the default one, the algorithm hangs at some "collect" step for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and cluster mode. On Spark 1.3.0, I can also reproduce this issue in local mode.
> However, I do not have a 1.3.0 cluster environment to test.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
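The stack trace above shows why the driver stalls: {{LocalKMeans.kMeansPlusPlus}} loops over all k rounds ({{Range.foreach}}), and each round sums {{pointCost}} over every point, where {{findClosest}} scans every center chosen so far. A minimal pure-Python sketch of that shape (a hypothetical simplification for illustration, not Spark's actual code; the function names here are made up) makes the O(k² · n · d) cost of the seeding step visible:

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus(points, k, seed=42):
    """Naive k-means++ seeding with the same loop structure as the
    stack trace: k rounds, each rescanning every point against every
    center chosen so far -> O(k^2 * n * d) total work, all on one thread.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Cost of each point = squared distance to its *closest* center.
        # This inner scan over `centers` is the part that grows with k.
        costs = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(costs)
        # Sample the next center with probability proportional to cost.
        r = rng.random() * total
        acc = 0.0
        for p, cost in zip(points, costs):
            acc += cost
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])
    return centers
```

With k=1000 and d=360 the distance scans alone are on the order of k² · n · d multiply-adds, single-threaded on the driver, which matches the observed time spent in {{F2jBLAS.ddot}} rather than in any {{collect()}} stage.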