[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134264#comment-14134264 ]
RJ Nowling commented on SPARK-2308:
-----------------------------------

It is true that we will save on the distance calculations for high-dimensional data sets. There is also work under way to improve sampling in Spark, so mini-batch will benefit further from that.

Are you planning on creating a PR for your implementation? It would be valuable for the community. I closed mine due to the sampling issues, but I'd be happy to review and test yours.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>         Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the
> data points in each iteration instead of the full set of data points,
> improving performance (and in some cases, accuracy). The mini-batch version
> is compatible with the KMeans|| initialization algorithm currently
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
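
For readers unfamiliar with the technique, the mini-batch iteration described in the issue can be sketched roughly as below. This is a minimal single-machine illustration in plain Python, not the MLlib implementation and not code from any PR on this issue. The function name `mini_batch_kmeans`, its parameters, and the per-center running-mean update (one common formulation of mini-batch k-means) are assumptions made for the example. The initial centers are passed in, so they could come from any initializer, KMeans|| included.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mini_batch_kmeans(points, initial_centers, batch_size, iterations, seed=0):
    """Hypothetical sketch: refine centers using one random batch per iteration.

    Assumption: each center moves toward its assigned points with a
    learning rate of 1 / (number of points assigned to it so far).
    """
    rng = random.Random(seed)
    centers = [list(c) for c in initial_centers]
    counts = [0] * len(centers)

    for _ in range(iterations):
        # Sample a subset of the data instead of scanning the full set.
        batch = rng.sample(points, min(batch_size, len(points)))

        # Assign every batch point using the centers as they stand
        # at the start of this iteration.
        assignments = [nearest_center(p, centers) for p in batch]

        # Move each assigned center toward its point; the step shrinks
        # as the center accumulates more points.
        for p, i in zip(batch, assignments):
            counts[i] += 1
            eta = 1.0 / counts[i]
            centers[i] = [(1.0 - eta) * c + eta * x
                          for c, x in zip(centers[i], p)]

    return centers
```

Each iteration computes distances for only `batch_size` points rather than the whole data set, which is where the savings on distance calculations come from. In a distributed setting the batch would be drawn with Spark's sampling facilities, which is why sampling performance matters for this approach.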