Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xi Shen
For the same amount of data, if I set k=500, the job finished in about 3 hrs. I wonder whether, with k=5000, the job could finish in 30 hrs...the longest I waited was 12 hrs... If I use random initialization (kmeans-random) with the same amount of data and k=5000, the job finishes in less than 2 hrs. I think the current kmeans||
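A minimal sketch, assuming the Spark 1.2 MLlib Scala API, of switching the initializer from the default k-means|| to random; the data RDD[Vector] is assumed to be built as elsewhere in the thread:

  import org.apache.spark.mllib.clustering.KMeans

  // data: RDD[Vector], prepared as in the original post
  val model = new KMeans()
    .setK(5000)
    .setMaxIterations(500)
    .setInitializationMode(KMeans.RANDOM)  // default is KMeans.K_MEANS_PARALLEL ("k-means||")
    .run(data)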

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
We test with a large feature dimension but not a very large k (https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525). Again, please create a JIRA and post your test code and a link to your test dataset so we can work on it. It is hard to track the issue with multiple threads in

Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi, When I run k-means clustering with Spark, these are the last two lines in the log:

  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5

Then it hangs for a long time. There's no active job. The driver machine is

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak, After I added .repartition(sc.defaultParallelism), I can see from the log that the partition number is set to 32. But in the Spark UI, it seems all the data is loaded onto one executor. Previously it was spread across 4 executors. Any idea? Thanks, David
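For reference, a minimal sketch of the repartition step being discussed, assuming sc is the SparkContext and data is the RDD[Vector] from the original post:

  import org.apache.spark.mllib.clustering.KMeans

  // spread the vectors across all cores before training, and cache them
  // so each k-means iteration does not go back to the text file
  val repartitioned = data.repartition(sc.defaultParallelism).cache()
  val model = KMeans.train(repartitioned, 5000, 500)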

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
How do I get the number of cores that I specified at the command line? I want to use spark.default.parallelism. I have 4 executors, each with 8 cores. According to https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior, the spark.default.parallelism value will be 4 * 8 = 32...I
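A small sketch of where that value comes from, assuming a plain SparkConf/SparkContext setup (the app name is made up):

  import org.apache.spark.{SparkConf, SparkContext}

  // spark.default.parallelism can be pinned explicitly ...
  val conf = new SparkConf()
    .setAppName("kmeans-test")
    .set("spark.default.parallelism", "32")  // 4 executors * 8 cores
  val sc = new SparkContext(conf)

  // ... or left unset, in which case sc.defaultParallelism reports
  // the value the scheduler derived from the cluster (total cores here)
  println(sc.defaultParallelism)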

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak, My iteration count is set to 500. But I think it should also stop once the centroids converge, right? My Spark is 1.2.0, running on 64-bit Windows. My data set is about 40k vectors, each with about 300 features, all normalised. All worker nodes have sufficient memory and disk space.
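A minimal sketch, assuming the MLlib 1.2 builder API, of setting both stopping conditions: the iteration cap and the convergence tolerance (the tolerance shown is just a typical value):

  import org.apache.spark.mllib.clustering.KMeans

  // data: RDD[Vector], as in the original post
  val model = new KMeans()
    .setK(500)
    .setMaxIterations(500)  // hard cap on iterations
    .setEpsilon(1e-4)       // also stop early once no centroid moves more than this
    .run(data)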

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
OH, the job I talked about has run for more than 11 hrs without a result...it doesn't make sense.

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
The code is very simple:

  val data = sc.textFile("very/large/text/file") map { l =>
    // turn each line into a dense vector
    Vectors.dense(...)
  }
  // the resulting data set is about 40k vectors

  KMeans.train(data, k = 5000, maxIterations = 500)

I just killed my application. In the log I found this: