[ https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488611#comment-17488611 ]
zhengruifeng commented on SPARK-30661:
--------------------------------------

Since the input datasets of KMeans are likely to be dense, I tend to add a dense impl as an alternative. I think we can do it this way:

step1: move the existing impl to the .ml side. I think we should keep the existing impl to avoid possible regressions;
step2: make .mllib.kmeans call .ml.kmeans internally. We also need to support initialization with an existing model on the .ml side, since .mllib.kmeans supports this function;
step3: add the new dense impl. Make .ml.kmeans extend HasSolver so that it supports three options: row-based (default), block-based, and auto. If the end user sets it to auto, the impl will check the sparsity and choose the underlying impl;
step4: sync the change to the Python side.

cc [~srowen]

> KMeans blockify input vectors
> -----------------------------
>
>                 Key: SPARK-30661
>                 URL: https://issues.apache.org/jira/browse/SPARK-30661
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>         Attachments: blockify_kmeans.png
>
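
The "auto" option described in step3 could be sketched roughly as follows. This is only an illustration in plain Python, not Spark code: the solver strings, the `density` and `choose_solver` helpers, and the 0.5 threshold are all assumptions made for the example, and the comment does not say how sparsity would actually be measured.

```python
from typing import Sequence

# Hypothetical solver names based on the comment's wording; the actual
# parameter values in Spark may differ.
ROW, BLOCK, AUTO = "row", "block", "auto"


def density(rows: Sequence[Sequence[float]]) -> float:
    """Return the fraction of non-zero entries across the sampled rows."""
    total = sum(len(r) for r in rows)
    if total == 0:
        return 0.0
    nonzero = sum(1 for r in rows for x in r if x != 0.0)
    return nonzero / total


def choose_solver(solver: str,
                  sample: Sequence[Sequence[float]],
                  dense_threshold: float = 0.5) -> str:
    """Resolve 'auto' to a concrete impl; pass explicit choices through.

    dense_threshold is an assumed cut-off, not a value from the issue.
    """
    if solver not in (ROW, BLOCK, AUTO):
        raise ValueError(f"unknown solver: {solver}")
    if solver != AUTO:
        return solver
    # Dense input favours the block-based impl, sparse input the row-based one.
    return BLOCK if density(sample) >= dense_threshold else ROW
```

With this sketch, an explicit "row" or "block" setting is always respected, so the existing row-based behaviour stays the default and only users who opt into "auto" get the sparsity check.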