Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
Hi Xi,

Please create a JIRA if it is going to take a while to locate the issue.
Did you try a smaller k?

Best,
Xiangrui

On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
 Hi Burak,

 After I added .repartition(sc.defaultParallelism), I can see from the log
 the partition number is set to 32. But in the Spark UI, it seems all the
 data are loaded onto one executor. Previously they were loaded onto 4
 executors.

 Any idea?


 Thanks,
 David


 On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote:

 How do I get the number of cores that I specified at the command line? I
 want to use spark.default.parallelism. I have 4 executors, each has 8
 cores. According to
 https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
 the spark.default.parallelism value will be 4 * 8 = 32...I think it is too
 large, or inappropriate. Please give some suggestion.

 I have already used cache, and count to pre-cache.

 I can try with smaller k for testing, but eventually I will have to use k
 = 5000 or even large. Because I estimate our data set would have that much
 of clusters.


 Thanks,
 David


 On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:

 Hi David,
 The number of centroids (k=5000) seems too large and is probably the
 cause of the code taking too long.

 Can you please try the following:
 1) Repartition data to the number of available cores with
 .repartition(numCores)
 2) cache data
 3) call .count() on data right before k-means
 4) try k=500 (even less if possible)
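
 In code, those four steps might look roughly like this (just a sketch; it
 assumes the RDD[Vector] is named data, as in your snippet, and that
 numCores matches the total cores available to your executors):

 import org.apache.spark.mllib.clustering.KMeans

 val numCores = 32                              // adjust to your cluster
 val prepared = data.repartition(numCores)      // 1) repartition to the core count
 prepared.cache()                               // 2) cache
 prepared.count()                               // 3) materialise before k-means
 val model = KMeans.train(prepared, 500, 500)   // 4) start with k=500, maxIterations=500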

 Thanks,
 Burak

 On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
 
  The code is very simple.
 
  val data = sc.textFile(very/large/text/file) map { l =
// turn each line into dense vector
Vectors.dense(...)
  }
 
  // the resulting data set is about 40k vectors
 
  KMeans.train(data, k=5000, maxIterations=500)
 
  I just kill my application. In the log I found this:
 
  15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
  block broadcast_26_piece0
  15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
  connection from
  workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
  java.io.IOException: An existing connection was forcibly closed by the
  remote host
 
  Notice the time gap. I think it means the work node did not generate
  any log at all for about 12hrs...does it mean they are not working at all?
 
  But when testing with very small data set, my application works and
  output expected data.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:
 
  Can you share the code snippet of how you call k-means? Do you cache
  the data before k-means? Did you repartition the data?
 
  On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
 
  OH, the job I talked about has ran more than 11 hrs without a
  result...it doesn't make sense.
 
 
  On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com
  wrote:
 
  Hi Burak,
 
  My iterations is set to 500. But I think it should also stop of the
  centroid coverages, right?
 
  My spark is 1.2.0, working in windows 64 bit. My data set is about
  40k vectors, each vector has about 300 features, all normalised. All 
  work
  node have sufficient memory and disk space.
 
  Thanks,
  David
 
 
  On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
 
  When the number of runs are large and the data is not properly
  partitioned, it seems that K-Means is hanging according to my 
  experience.
  Especially setting the number of runs to something high drastically
  increases the work in executors. If that's not the case, can you give 
  more
  info on what Spark version you are using, your setup, and your 
  dataset?
 
  Thanks,
  Burak
 
  On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:
 
  Hi,
 
  When I run k-means cluster with Spark, I got this in the last two
  lines in the log:
 
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
 
 
 
  Then it hangs for a long time. There's no active job. The driver
  machine is idle. I cannot access the work node, I am not sure if 
  they are
  busy.
 
  I understand k-means may take a long time to finish. But why no
  active job? no log?
 
 
  Thanks,
  David
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xi Shen
For the same amount of data, if I set k=500, the job finished in about
3 hrs. I wonder whether, with k=5000, the job could finish in 30 hrs...the
longest I have waited so far was 12 hrs...

If I use kmeans-random on the same amount of data with k=5000, the job
finished in less than 2 hrs.

I think the current kmeans|| implementation cannot handle large vector
dimensions properly. In my case, each vector has about 350 dimensions. I
found another post complaining about k-means performance in Spark, and that
person's vectors have 200 dimensions.

It is possible the large-dimension case has never been tested.
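
For reference, the only difference between the two runs above is the
initialization mode passed to KMeans.train. A minimal sketch (assuming
data is the 40k-vector RDD and a single run):

import org.apache.spark.mllib.clustering.KMeans

val modelParallel = KMeans.train(data, 5000, 500, 1, KMeans.K_MEANS_PARALLEL)
val modelRandom   = KMeans.train(data, 5000, 500, 1, KMeans.RANDOM)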


Thanks,
David




On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote:

 Hi Xi,

 Please create a JIRA if it takes longer to locate the issue. Did you
 try a smaller k?

 Best,
 Xiangrui

 On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
  Hi Burak,
 
  After I added .repartition(sc.defaultParallelism), I can see from the
 log
  the partition number is set to 32. But in the Spark UI, it seems all the
  data are loaded onto one executor. Previously they were loaded onto 4
  executors.
 
  Any idea?
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote:
 
  How do I get the number of cores that I specified at the command line? I
  want to use spark.default.parallelism. I have 4 executors, each has 8
  cores. According to
  https://spark.apache.org/docs/1.2.0/configuration.html#
 execution-behavior,
  the spark.default.parallelism value will be 4 * 8 = 32...I think it
 is too
  large, or inappropriate. Please give some suggestion.
 
  I have already used cache, and count to pre-cache.
 
  I can try with smaller k for testing, but eventually I will have to use
 k
  = 5000 or even large. Because I estimate our data set would have that
 much
  of clusters.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
  The number of centroids (k=5000) seems too large and is probably the
  cause of the code taking too long.
 
  Can you please try the following:
  1) Repartition data to the number of available cores with
  .repartition(numCores)
  2) cache data
  3) call .count() on data right before k-means
  4) try k=500 (even less if possible)
 
  Thanks,
  Burak
 
  On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
  
   The code is very simple.
  
   val data = sc.textFile(very/large/text/file) map { l =
 // turn each line into dense vector
 Vectors.dense(...)
   }
  
   // the resulting data set is about 40k vectors
  
   KMeans.train(data, k=5000, maxIterations=500)
  
   I just kill my application. In the log I found this:
  
   15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
   block broadcast_26_piece0
   15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
   connection from
   workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.
 net/100.72.84.107:56277
   java.io.IOException: An existing connection was forcibly closed by
 the
   remote host
  
   Notice the time gap. I think it means the work node did not generate
   any log at all for about 12hrs...does it mean they are not working
 at all?
  
   But when testing with very small data set, my application works and
   output expected data.
  
  
   Thanks,
   David
  
  
   On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com
 wrote:
  
   Can you share the code snippet of how you call k-means? Do you cache
   the data before k-means? Did you repartition the data?
  
   On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
  
   OH, the job I talked about has ran more than 11 hrs without a
   result...it doesn't make sense.
  
  
   On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com
   wrote:
  
   Hi Burak,
  
   My iterations is set to 500. But I think it should also stop of
 the
   centroid coverages, right?
  
   My spark is 1.2.0, working in windows 64 bit. My data set is about
   40k vectors, each vector has about 300 features, all normalised.
 All work
   node have sufficient memory and disk space.
  
   Thanks,
   David
  
  
   On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
  
   Hi David,
  
   When the number of runs are large and the data is not properly
   partitioned, it seems that K-Means is hanging according to my
 experience.
   Especially setting the number of runs to something high
 drastically
   increases the work in executors. If that's not the case, can you
 give more
   info on what Spark version you are using, your setup, and your
 dataset?
  
   Thanks,
   Burak
  
   On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com
 wrote:
  
   Hi,
  
   When I run k-means cluster with Spark, I got this in the last
 two
   lines in the log:
  
   15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast
 26
   15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
  
  
  
   Then it hangs for a long time. There's no 

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
We test large feature dimensions but not very large k
(https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525).
Again, please create a JIRA and post your test code and a link to your
test dataset so we can work on it. It is hard to track the issue across
multiple threads on the mailing list. -Xiangrui

On Mon, Mar 30, 2015 at 3:55 PM, Xi Shen davidshe...@gmail.com wrote:
 For the same amount of data, if I set the k=500, the job finished in about 3
 hrs. I wonder if I set k=5000, the job could finish in 30 hrs...the longest
 time I waited was 12 hrs...

 If I use kmeans-random, same amount of data, k=5000, the job finished in
 less than 2 hrs.

 I think current kmeans|| implementation could not handle large vector
 dimensions properly. In my case, my vector has about 350 dimensions. I found
 another post complaining about kmeans performance in Spark, and that guy has
 vectors of 200 dimensions.

 It is possible people never tested large dimension case.


 Thanks,
 David




 On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote:

 Hi Xi,

 Please create a JIRA if it takes longer to locate the issue. Did you
 try a smaller k?

 Best,
 Xiangrui

 On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
  Hi Burak,
 
  After I added .repartition(sc.defaultParallelism), I can see from the
  log
  the partition number is set to 32. But in the Spark UI, it seems all the
  data are loaded onto one executor. Previously they were loaded onto 4
  executors.
 
  Any idea?
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote:
 
  How do I get the number of cores that I specified at the command line?
  I
  want to use spark.default.parallelism. I have 4 executors, each has 8
  cores. According to
 
  https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
  the spark.default.parallelism value will be 4 * 8 = 32...I think it
  is too
  large, or inappropriate. Please give some suggestion.
 
  I have already used cache, and count to pre-cache.
 
  I can try with smaller k for testing, but eventually I will have to use
  k
  = 5000 or even large. Because I estimate our data set would have that
  much
  of clusters.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
  The number of centroids (k=5000) seems too large and is probably the
  cause of the code taking too long.
 
  Can you please try the following:
  1) Repartition data to the number of available cores with
  .repartition(numCores)
  2) cache data
  3) call .count() on data right before k-means
  4) try k=500 (even less if possible)
 
  Thanks,
  Burak
 
  On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
  
   The code is very simple.
  
   val data = sc.textFile(very/large/text/file) map { l =
 // turn each line into dense vector
 Vectors.dense(...)
   }
  
   // the resulting data set is about 40k vectors
  
   KMeans.train(data, k=5000, maxIterations=500)
  
   I just kill my application. In the log I found this:
  
   15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
   block broadcast_26_piece0
   15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
   connection from
  
   workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
   java.io.IOException: An existing connection was forcibly closed by
   the
   remote host
  
   Notice the time gap. I think it means the work node did not generate
   any log at all for about 12hrs...does it mean they are not working
   at all?
  
   But when testing with very small data set, my application works and
   output expected data.
  
  
   Thanks,
   David
  
  
   On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com
   wrote:
  
   Can you share the code snippet of how you call k-means? Do you
   cache
   the data before k-means? Did you repartition the data?
  
   On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
  
   OH, the job I talked about has ran more than 11 hrs without a
   result...it doesn't make sense.
  
  
   On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com
   wrote:
  
   Hi Burak,
  
   My iterations is set to 500. But I think it should also stop of
   the
   centroid coverages, right?
  
   My spark is 1.2.0, working in windows 64 bit. My data set is
   about
   40k vectors, each vector has about 300 features, all normalised.
   All work
   node have sufficient memory and disk space.
  
   Thanks,
   David
  
  
   On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
  
   Hi David,
  
   When the number of runs are large and the data is not properly
   partitioned, it seems that K-Means is hanging according to my
   experience.
   Especially setting the number of runs to something high
   drastically
   increases the work in executors. If that's not the case, can you
   give more
   info on what Spark version 

Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi,

When I run k-means clustering with Spark, these are the last two lines in
the log:

15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



Then it hangs for a long time. There's no active job. The driver machine is
idle. I cannot access the worker nodes, so I am not sure whether they are busy.

I understand k-means may take a long time to finish. But why is there no
active job, and no log output?


Thanks,
David


Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak,

After I added .repartition(sc.defaultParallelism), I can see from the log
that the number of partitions is set to 32. But in the Spark UI, it seems all
the data are loaded onto one executor. Previously they were spread across 4
executors.

Any idea?
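
One way to see how the 32 partitions are actually populated (a sketch,
assuming data is the repartitioned and cached RDD):

val perPartition = data.mapPartitionsWithIndex { (idx, it) =>
  // count the records in each partition without moving the data
  Iterator((idx, it.size))
}.collect()
perPartition.foreach { case (idx, n) => println(s"partition $idx: $n records") }

Which executor each cached block actually lives on should be visible under
the Storage tab of the UI.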


Thanks,
David


On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote:

 How do I get the number of cores that I specified at the command line? I
 want to use spark.default.parallelism. I have 4 executors, each has 8
 cores. According to
 https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
 the spark.default.parallelism value will be 4 * 8 = 32...I think it is
 too large, or inappropriate. Please give some suggestion.

 I have already used cache, and count to pre-cache.

 I can try with smaller k for testing, but eventually I will have to use k
 = 5000 or even large. Because I estimate our data set would have that much
 of clusters.


 Thanks,
 David


 On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:

 Hi David,
 The number of centroids (k=5000) seems too large and is probably the
 cause of the code taking too long.

 Can you please try the following:
 1) Repartition data to the number of available cores with
 .repartition(numCores)
 2) cache data
 3) call .count() on data right before k-means
 4) try k=500 (even less if possible)

 Thanks,
 Burak

 On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
 
  The code is very simple.
 
  val data = sc.textFile(very/large/text/file) map { l =
// turn each line into dense vector
Vectors.dense(...)
  }
 
  // the resulting data set is about 40k vectors
 
  KMeans.train(data, k=5000, maxIterations=500)
 
  I just kill my application. In the log I found this:
 
  15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
 block broadcast_26_piece0
  15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
 connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.
 net/100.72.84.107:56277
  java.io.IOException: An existing connection was forcibly closed by the
 remote host
 
  Notice the time gap. I think it means the work node did not generate
 any log at all for about 12hrs...does it mean they are not working at all?
 
  But when testing with very small data set, my application works and
 output expected data.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:
 
  Can you share the code snippet of how you call k-means? Do you cache
 the data before k-means? Did you repartition the data?
 
  On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
 
  OH, the job I talked about has ran more than 11 hrs without a
 result...it doesn't make sense.
 
 
  On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com
 wrote:
 
  Hi Burak,
 
  My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?
 
  My spark is 1.2.0, working in windows 64 bit. My data set is about
 40k vectors, each vector has about 300 features, all normalised. All work
 node have sufficient memory and disk space.
 
  Thanks,
  David
 
 
  On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
 
  When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?
 
  Thanks,
  Burak
 
  On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:
 
  Hi,
 
  When I run k-means cluster with Spark, I got this in the last two
 lines in the log:
 
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
 
 
 
  Then it hangs for a long time. There's no active job. The driver
 machine is idle. I cannot access the work node, I am not sure if they are
 busy.
 
  I understand k-means may take a long time to finish. But why no
 active job? no log?
 
 
  Thanks,
  David
 




Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
How do I get the number of cores that I specified at the command line? I
want to use it as spark.default.parallelism. I have 4 executors, each with 8
cores. According to
https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
the spark.default.parallelism value would be 4 * 8 = 32... I think that is
too large, or otherwise inappropriate. Please give some suggestions.

I have already used cache() and count() to pre-cache the data.

I can try a smaller k for testing, but eventually I will have to use k =
5000 or even larger, because I estimate our data set has that many
clusters.
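
To check what is actually used at runtime (a sketch; getOption is empty
unless spark.default.parallelism was set explicitly):

// the parallelism Spark falls back to when no partition count is given
println(sc.defaultParallelism)

// whether spark.default.parallelism was set explicitly on the conf
println(sc.getConf.getOption("spark.default.parallelism"))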


Thanks,
David


On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:

 Hi David,
 The number of centroids (k=5000) seems too large and is probably the cause
 of the code taking too long.

 Can you please try the following:
 1) Repartition data to the number of available cores with
 .repartition(numCores)
 2) cache data
 3) call .count() on data right before k-means
 4) try k=500 (even less if possible)

 Thanks,
 Burak

 On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
 
  The code is very simple.
 
  val data = sc.textFile(very/large/text/file) map { l =
// turn each line into dense vector
Vectors.dense(...)
  }
 
  // the resulting data set is about 40k vectors
 
  KMeans.train(data, k=5000, maxIterations=500)
 
  I just kill my application. In the log I found this:
 
  15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_26_piece0
  15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
 connection from
 workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
  java.io.IOException: An existing connection was forcibly closed by the
 remote host
 
  Notice the time gap. I think it means the work node did not generate any
 log at all for about 12hrs...does it mean they are not working at all?
 
  But when testing with very small data set, my application works and
 output expected data.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:
 
  Can you share the code snippet of how you call k-means? Do you cache
 the data before k-means? Did you repartition the data?
 
  On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
 
  OH, the job I talked about has ran more than 11 hrs without a
 result...it doesn't make sense.
 
 
  On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:
 
  Hi Burak,
 
  My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?
 
  My spark is 1.2.0, working in windows 64 bit. My data set is about
 40k vectors, each vector has about 300 features, all normalised. All work
 node have sufficient memory and disk space.
 
  Thanks,
  David
 
 
  On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
 
  When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?
 
  Thanks,
  Burak
 
  On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:
 
  Hi,
 
  When I run k-means cluster with Spark, I got this in the last two
 lines in the log:
 
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
 
 
 
  Then it hangs for a long time. There's no active job. The driver
 machine is idle. I cannot access the work node, I am not sure if they are
 busy.
 
  I understand k-means may take a long time to finish. But why no
 active job? no log?
 
 
  Thanks,
  David
 



Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak,

My maxIterations is set to 500. But I think it should also stop once the
centroids converge, right?

My Spark is 1.2.0, running on 64-bit Windows. My data set is about 40k
vectors; each vector has about 300 features, all normalised. All worker
nodes have sufficient memory and disk space.

Thanks,
David

On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?

 Thanks,
 Burak
 On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 When I run k-means cluster with Spark, I got this in the last two lines
 in the log:

 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



 Then it hangs for a long time. There's no active job. The driver machine
 is idle. I cannot access the work node, I am not sure if they are busy.

 I understand k-means may take a long time to finish. But why no active
 job? no log?


 Thanks,
 David




Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Oh, the job I talked about has run for more than 11 hrs without a
result...it doesn't make sense.


On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:

 Hi Burak,

 My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?

 My spark is 1.2.0, working in windows 64 bit. My data set is about 40k
 vectors, each vector has about 300 features, all normalised. All work node
 have sufficient memory and disk space.

 Thanks,
 David

 On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?

 Thanks,
 Burak
 On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 When I run k-means cluster with Spark, I got this in the last two lines
 in the log:

 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



 Then it hangs for a long time. There's no active job. The driver machine
 is idle. I cannot access the work node, I am not sure if they are busy.

 I understand k-means may take a long time to finish. But why no active
 job? no log?


 Thanks,
 David




Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
The code is very simple.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("very/large/text/file") map { l =>
  // turn each line into a dense vector
  Vectors.dense(...)
}

// the resulting data set is about 40k vectors

KMeans.train(data, k=5000, maxIterations=500)

I just kill my application. In the log I found this:

15/03/26 *11:42:43* INFO storage.BlockManagerMaster: Updated info of block
broadcast_26_piece0
15/03/26 *23:02:57* WARN server.TransportChannelHandler: Exception in
connection from
workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
java.io.IOException: An existing connection was forcibly closed by the
remote host

Notice the time gap. I think it means the worker nodes did not generate any
log at all for about 12 hrs... does it mean they are not working at all?

But when testing with a very small data set, my application works and
outputs the expected data.


Thanks,
David


On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:

 Can you share the code snippet of how you call k-means? Do you cache the
 data before k-means? Did you repartition the data?
 On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:

 OH, the job I talked about has ran more than 11 hrs without a result...it
 doesn't make sense.


 On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:

 Hi Burak,

 My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?

 My spark is 1.2.0, working in windows 64 bit. My data set is about 40k
 vectors, each vector has about 300 features, all normalised. All work node
 have sufficient memory and disk space.

 Thanks,
 David

 On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?

 Thanks,
 Burak
 On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 When I run k-means cluster with Spark, I got this in the last two
 lines in the log:

 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



 Then it hangs for a long time. There's no active job. The driver
 machine is idle. I cannot access the work node, I am not sure if they are
 busy.

 I understand k-means may take a long time to finish. But why no active
 job? no log?


 Thanks,
 David