[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

2014-07-03 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/1293

[SPARK-2355] Add checker for the number of clusters

When the number of clusters given to 
org.apache.spark.mllib.clustering.KMeans under the parallel initialization mode 
is greater than the number of data points, it throws an ArrayIndexOutOfBoundsException.

This PR adds a check on the number of clusters and throws an 
IllegalArgumentException when that number is greater than the number of data points.
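The shape of such a guard, as a rough sketch in Scala (the method name and message text here are illustrative, not the PR's actual diff):

```scala
// Hypothetical sketch of the proposed check; the name and message
// are illustrative, not the actual MLlib code.
def checkNumClusters(dataCount: Long, k: Int): Unit = {
  // require throws IllegalArgumentException when the condition is false,
  // matching the behavior the PR description calls for.
  require(k <= dataCount,
    s"Number of clusters ($k) must not exceed number of data points ($dataCount).")
}
```

Scala's `require` throws `IllegalArgumentException` on failure, which matches the exception type the PR describes.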



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 check_clusters_number

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1293


commit 582cd11e5331a8e2704a5603080eec41c9002cf4
Author: Liang-Chi Hsieh 
Date:   2014-07-03T16:27:22Z

simply add a checker for the number of clusters.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

2014-07-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1293#issuecomment-47958358
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

2014-07-03 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1293#issuecomment-47959115
  
data.count() is actually a very expensive operation, as it has to scan all 
the data. If the data is cached, it may not be as much of a problem, but it is 
still probably not worth it for this check. Which part throws the exception?
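One cheaper precondition, sketched here for illustration (this is not what the PR does): rather than `data.count()`, which scans the whole dataset, ask for at most k elements. On a real RDD, `take(k)` reads only as many partitions as needed, so the cost is bounded by k rather than by the dataset size. A `Seq` stands in for the RDD to keep the sketch self-contained:

```scala
// Sketch of a bounded-cost check (illustrative, not the PR's code):
// data.take(k) returns at most k elements; if fewer come back, the
// dataset has fewer than k points. On an RDD this avoids a full scan.
def hasAtLeastKPoints[T](data: Seq[T], k: Int): Boolean =
  data.take(k).length == k
```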




[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

2014-07-03 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/1293#issuecomment-47961640
  
The problem lies in `initKMeansParallel`, the implementation of the k-means|| 
algorithm. Since it selects at most as many centers as there are data points, 
when `LocalKMeans.kMeansPlusPlus` is called at the end of `initKMeansParallel`, 
`kMeansPlusPlus` throws this exception.

I could slightly modify `kMeansPlusPlus` to avoid this exception by reusing 
already-chosen centers to fill the gap between the number of clusters and the 
number of data points. But that approach might not be appropriate, because the 
problem does not lie in the algorithm itself.

I also wondered whether it is worth checking this by scanning all the data. 
But since it is only a count, with no other computation involved, it might 
still be acceptable. In fact, there are many map operations on the data later 
in the clustering; compared with those, `data.count()` should be relatively 
lightweight. Or is it unnecessary to check at all? Any suggestions?
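A toy illustration of the failure mode described in this thread (this is not MLlib's actual code): an initialization routine that indexes into its candidate centers assuming at least k of them exist will fail exactly as reported when the data supplies fewer than k candidates.

```scala
// Toy model of the bug (illustrative, not LocalKMeans.kMeansPlusPlus):
// the loop assumes `candidates` holds at least k entries, so with fewer
// points than clusters it throws ArrayIndexOutOfBoundsException.
def pickCenters(candidates: Array[Double], k: Int): Array[Double] =
  Array.tabulate(k)(i => candidates(i))
```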




[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1293#issuecomment-50691878
  
@viirya could you add `[MLlib]` to the title of this pull request? 
Otherwise it doesn't get sorted correctly

