Re: Why k-means cluster hang for a long time?
Hi Xi,

Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k?

Best,
Xiangrui

On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote: [...]
Re: Why k-means cluster hang for a long time?
For the same amount of data, with k=500 the job finished in about 3 hrs. I wonder whether with k=5000 the job could finish in 30 hrs; the longest I have waited is 12 hrs...

If I use kmeans-random on the same amount of data with k=5000, the job finishes in less than 2 hrs. I think the current kmeans|| implementation may not handle large vector dimensions properly. In my case, my vectors have about 350 dimensions. I found another post complaining about k-means performance in Spark, and that person had vectors of 200 dimensions. It is possible the large-dimension case was never tested.

Thanks,
David

On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote: [...]
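Since kmeans-random finishes while kmeans|| stalls, one way to compare the two is to set the initialization mode explicitly via the MLlib KMeans builder API. A minimal sketch; `data` stands for the cached RDD of vectors built earlier in this thread, and k/maxIterations mirror the values discussed above:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Same data, same k; only the initialization strategy differs.
// "k-means||" (the default) runs several heavy initialization passes
// over the data, while "random" simply samples k starting points.
val modelRandom = new KMeans()
  .setK(5000)
  .setMaxIterations(500)
  .setInitializationMode("random")
  .run(data)

val modelParallel = new KMeans()
  .setK(5000)
  .setMaxIterations(500)
  .setInitializationMode("k-means||")
  .run(data)
```

Timing these two runs separately would isolate whether the stall is in kmeans|| initialization rather than in the Lloyd iterations themselves.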
Re: Why k-means cluster hang for a long time?
We test large feature dimensions but not a very large k (https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525). Again, please create a JIRA and post your test code and a link to your test dataset, so we can work on it. It is hard to track the issue across multiple threads on the mailing list.

-Xiangrui

On Mon, Mar 30, 2015 at 3:55 PM, Xi Shen davidshe...@gmail.com wrote: [...]
Why k-means cluster hang for a long time?
Hi,

When I run k-means clustering with Spark, these are the last two lines in the log:

15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5

Then it hangs for a long time. There is no active job, and the driver machine is idle. I cannot access the worker nodes, so I am not sure whether they are busy. I understand k-means may take a long time to finish, but why is there no active job, and no log output?

Thanks,
David
Re: Why k-means cluster hang for a long time?
Hi Burak,

After I added .repartition(sc.defaultParallelism), I can see from the log that the number of partitions is set to 32. But in the Spark UI, it seems all the data are loaded onto one executor; previously they were loaded onto 4 executors. Any idea?

Thanks,
David

On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote: [...]
Re: Why k-means cluster hang for a long time?
How do I get the number of cores that I specified at the command line? I want to use it for spark.default.parallelism. I have 4 executors, each with 8 cores. According to https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior, the spark.default.parallelism value will be 4 * 8 = 32... I think that is too large, or at least inappropriate. Please give me some suggestions.

I have already used cache, and count to materialise the cache. I can try a smaller k for testing, but eventually I will have to use k = 5000 or even larger, because I estimate our data set has about that many clusters.

Thanks,
David

On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:

Hi David,

The number of centroids (k=5000) seems too large and is probably the cause of the code taking too long. Can you please try the following:
1) Repartition the data to the number of available cores with .repartition(numCores)
2) Cache the data
3) Call .count() on the data right before k-means
4) Try k=500 (even less if possible)

Thanks,
Burak

On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote: [...]
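Burak's four steps can be sketched end to end as follows. A minimal sketch only: the file path comes from the snippet earlier in the thread, the space-separated parsing is an assumption about the input format, and numCores is hard-coded to the 4 * 8 setup described above:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val numCores = 32  // 4 executors * 8 cores each, per the discussion above

val data = sc.textFile("very/large/text/file")
  .map(l => Vectors.dense(l.split(' ').map(_.toDouble)))  // assumed space-separated features
  .repartition(numCores)  // 1) spread the data across all available cores
  .cache()                // 2) keep it in memory across k-means iterations

data.count()              // 3) force materialisation right before training

val model = KMeans.train(data, 500, 500)  // 4) start with k=500, maxIterations=500
```

The .count() matters because cache() is lazy; without it, the first k-means iteration pays the cost of reading and parsing the file.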
Re: Why k-means cluster hang for a long time?
Hi Burak,

My maxIterations is set to 500, but I think it should also stop early once the centroids converge, right? My Spark is 1.2.0, running on 64-bit Windows. My data set is about 40k vectors, each with about 300 features, all normalised. All worker nodes have sufficient memory and disk space.

Thanks,
David

On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

Hi David,

In my experience, K-Means appears to hang when the number of runs is large and the data is not properly partitioned; setting the number of runs to something high drastically increases the work done by the executors. If that's not the case, can you give more info on what Spark version you are using, your setup, and your dataset?

Thanks,
Burak

On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote: [...]
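On the convergence question: MLlib's k-means does stop early when no centre moves more than a tolerance (epsilon) between iterations, so maxIterations is only an upper bound. A hedged sketch of tightening or relaxing that tolerance via the builder API; note I am not certain setEpsilon is public in 1.2.0 (it is in recent MLlib versions), and 1e-4 is my assumption of the default:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Training stops when every centre moves less than epsilon between
// iterations, or when maxIterations is reached, whichever comes first.
// setEpsilon may not be exposed in Spark 1.2.0 (assumption).
val model = new KMeans()
  .setK(5000)
  .setMaxIterations(500)
  .setEpsilon(1e-4)  // assumed default convergence tolerance
  .run(data)
```

A larger epsilon trades cluster quality for fewer iterations, which may help distinguish a slow-but-progressing job from a genuinely hung one.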
Re: Why k-means cluster hang for a long time?
Oh, the job I talked about has run for more than 11 hrs without a result... it doesn't make sense.

On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote: [...]
Re: Why k-means cluster hang for a long time?
The code is very simple:

val data = sc.textFile("very/large/text/file").map { l =>
  // turn each line into a dense vector
  Vectors.dense(...)
}
// the resulting data set is about 40k vectors

KMeans.train(data, k = 5000, maxIterations = 500)

I just killed my application. In the log I found this:

15/03/26 *11:42:43* INFO storage.BlockManagerMaster: Updated info of block broadcast_26_piece0
15/03/26 *23:02:57* WARN server.TransportChannelHandler: Exception in connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
java.io.IOException: An existing connection was forcibly closed by the remote host

Notice the time gap. I think it means the worker nodes did not generate any log at all for about 12 hrs... does it mean they are not working at all? But when testing with a very small data set, my application works and outputs the expected data.

Thanks,
David

On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:

Can you share the code snippet of how you call k-means? Do you cache the data before k-means? Did you repartition the data?

On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote: [...]