[GitHub] spark pull request: [SPARK-2355] Add checker for the number of clu...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/1293 [SPARK-2355] Add checker for the number of clusters When the number of clusters given to org.apache.spark.mllib.clustering.KMeans under the parallel initialization mode is greater than the number of data points, it throws an ArrayIndexOutOfBoundsException. This PR adds a check on the number of clusters and throws an IllegalArgumentException when that number is greater than the number of data points. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 check_clusters_number Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1293.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1293 commit 582cd11e5331a8e2704a5603080eec41c9002cf4 Author: Liang-Chi Hsieh Date: 2014-07-03T16:27:22Z simply add checker for the number of clusters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
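The guard the PR describes can be sketched roughly as follows. This is an illustrative stand-in, not the actual MLlib patch: the object and method names (`ClusterCountCheck`, `checkNumClusters`) are hypothetical, and a plain point count replaces the `RDD.count()` call so the example is self-contained without a Spark cluster.

```scala
// Hypothetical sketch of the proposed check. In MLlib the data would be an
// RDD[Vector] and numPoints would come from data.count(); a plain Long stands
// in here so the example runs without Spark.
object ClusterCountCheck {
  def checkNumClusters(k: Int, numPoints: Long): Unit = {
    // require throws IllegalArgumentException when the condition is false,
    // which is the behavior the PR description asks for.
    require(k <= numPoints,
      s"Number of clusters ($k) must not exceed number of data points ($numPoints)")
  }
}
```

With a check like this, `KMeans` would fail fast with a descriptive message instead of surfacing an `ArrayIndexOutOfBoundsException` from deep inside the initialization code.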
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1293#issuecomment-47958358 Can one of the admins verify this patch?
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1293#issuecomment-47959115 data.count() is actually a very expensive operation, as it has to scan all the data. If the data is cached, it may not be as much of a problem, but it is still probably not worth it for this check. Which part throws the exception?
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/1293#issuecomment-47961640 The problem lies in `initKMeansParallel`, the implementation of the k-means|| algorithm. Since it selects at most as many centers as there are data points, `kMeansPlusPlus` throws this exception when `LocalKMeans.kMeansPlusPlus` is called at the end of `initKMeansParallel`. I could slightly modify `kMeansPlusPlus` to avoid the exception by reusing already-chosen centers to fill the gap between the number of clusters and the number of data points, but that approach might not be appropriate because the problem is not in the algorithm itself. I also wondered whether checking this is worth a scan of all the data. Since it is only a count, with no other computation involved, it might still be acceptable. In fact, many map operations run over the data later in the clustering; compared with those, `data.count()` should be relatively lightweight. Or is the check unnecessary? Any suggestions?
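The failure mode described above can be illustrated with a small self-contained sketch. This is not the MLlib source: `pickSeeds` is a naive hypothetical stand-in for the local seeding step, which in effect indexes into the candidate-center array k times. Because the parallel phase can yield at most n distinct candidates, asking for k > n seeds runs past the end of the array.

```scala
// Illustrative sketch (not the actual LocalKMeans.kMeansPlusPlus code) of
// why k > n breaks the k-means|| pipeline: the parallel phase produces at
// most n distinct candidate centers, and a seeding step that selects k of
// them indexes out of bounds.
object KGreaterThanN {
  // Naive stand-in for the local seeding step: take the first k candidates.
  def pickSeeds(candidates: Array[Array[Double]], k: Int): Array[Array[Double]] =
    (0 until k).map(i => candidates(i)).toArray  // i >= candidates.length blows up
}
```

Calling `KGreaterThanN.pickSeeds` with three candidate points and k = 5 throws an `ArrayIndexOutOfBoundsException`, which is the symptom reported in SPARK-2355; checking k against the data count up front turns it into a clear `IllegalArgumentException` instead.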
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1293#issuecomment-50691878 @viirya could you add `[MLlib]` to the title of this pull request? Otherwise it doesn't get sorted correctly