[ https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395639#comment-14395639 ]

Sean Owen commented on SPARK-6706:
----------------------------------

I tried your code locally vs. master with k=1000 (you say >100, but it works at 
500, so I tried 1000), which you can do by building Spark and running the 
shell. I don't see it stuck in any {{collect()}} stage; those complete quickly. 
But the driver does bog down for a very long time in {{LocalKMeans}}:

{code}
        at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:121)
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:104)
        at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:311)
        at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:522)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:496)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:490)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.GenSeqViewLike$Sliced$class.foreach(GenSeqViewLike.scala:42)
        at scala.collection.mutable.IndexedSeqView$$anon$2.foreach(IndexedSeqView.scala:80)
        at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:490)
        at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:513)
        at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:53)
        at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:52)
        at scala.collection.GenTraversableViewLike$Mapped$$anonfun$foreach$2.apply(GenTraversableViewLike.scala:81)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
        at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
        at scala.collection.GenTraversableViewLike$Mapped$class.foreach(GenTraversableViewLike.scala:80)
        at scala.collection.SeqViewLike$$anon$3.foreach(SeqViewLike.scala:78)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
        at scala.collection.SeqViewLike$AbstractTransformed.foldLeft(SeqViewLike.scala:43)
        at scala.collection.TraversableOnce$class.sum(TraversableOnce.scala:203)
        at scala.collection.SeqViewLike$AbstractTransformed.sum(SeqViewLike.scala:43)
        at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:54)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:396)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:393)
{code}
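
For reference, here is roughly how I reproduced it in the shell. This is only a 
sketch: the random dense vectors, the point count, and the seed are stand-ins I 
made up to match the reported shape (dimension ~360, k well above 100), not the 
reporter's actual data:

{code}
// Hypothetical reproduction in spark-shell (sc is provided by the shell).
// Random dense vectors are a stand-in for the reporter's ~100 MB dataset.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val dim = 360      // feature dimension from the report
val n   = 10000    // made-up point count, small enough to run locally
val data = sc.parallelize(0 until n).map { i =>
  val rnd = new java.util.Random(i)
  Vectors.dense(Array.fill(dim)(rnd.nextDouble()))
}.cache()

// With k = 1000 the collect() stages finish quickly, but the driver then
// sits in LocalKMeans.kMeansPlusPlus for a very long time.
val model = KMeans.train(data, 1000, 20)
{code}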

I think this is what Derrick was getting at in SPARK-3220: this driver-side 
k-means++ initialization step doesn't scale.
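
For context, my reading of why it blows up with large k (based on the trace 
above, not a claim about the exact source): k-means|| hands the driver on the 
order of 2*k*runs candidate points, and kMeansPlusPlus then picks the k final 
centers one at a time, re-scoring every candidate against all centers chosen so 
far on each pick. A simplified sketch of that cost shape, with illustrative 
names:

{code}
// Illustrative sketch of the driver-side k-means++ loop's cost shape
// (not the actual LocalKMeans source; names and structure are simplified).
// Each pick scans all n candidates against the |centers| chosen so far,
// so total work is about sum_{i=1..k} n*i*d = O(n * k^2 * d) distance
// computations; since n is itself ~2*k*runs, this grows very fast in k.
def kMeansPlusPlusSketch(points: Array[Array[Double]], k: Int): Array[Array[Double]] = {
  val rnd     = new scala.util.Random(42)
  val centers = scala.collection.mutable.ArrayBuffer(points(rnd.nextInt(points.length)))
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val diff = a(i) - b(i); s += diff * diff; i += 1 }
    s
  }
  while (centers.length < k) {
    // O(|centers| * n * d): each point's cost is the squared distance
    // to its nearest already-chosen center, recomputed from scratch.
    val costs = points.map(p => centers.map(c => sqDist(p, c)).min)
    // Sample the next center with probability proportional to cost.
    var r = rnd.nextDouble() * costs.sum
    var idx = 0
    while (idx < points.length - 1 && r > costs(idx)) { r -= costs(idx); idx += 1 }
    centers += points(idx)
  }
  centers.toArray
}
{code}

One standard fix for this pattern is to cache each point's distance to its 
nearest chosen center and update it only against the newly added center on each 
pick, which removes a factor of k from the loop.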

> kmeans|| hangs for a long time if both k and vector dimension are large
> -----------------------------------------------------------------------
>
>                 Key: SPARK-6706
>                 URL: https://issues.apache.org/jira/browse/SPARK-6706
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1, 1.3.0
>         Environment: Windows 64bit, Linux 64bit
>            Reporter: Xi Shen
>            Assignee: Xiangrui Meng
>              Labels: performance
>         Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), the 
> algorithm hangs at some "collect" step for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce the issue in local mode. 
> However, I do not have a 1.3.0 cluster environment to test with.


