[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488611#comment-17488611
 ] 

zhengruifeng commented on SPARK-30661:
--------------------------------------

Since the input datasets of kmeans are likely to be dense, I tend to add a 
dense impl as an alternative.

 

I think we can do it in this way:

step1: move the existing impl to the .ml side. I think we should keep the 
existing impl to avoid possible regressions;

step2: make .mllib.kmeans call .ml.kmeans internally. We also need to support 
initialization with an existing model on the .ml side, since .mllib.kmeans 
already supports this;

step3: add the new dense impl. Make .ml.kmeans extend HasSolver; it will 
support three options: row-based (default), block-based, auto. If the end user 
sets it to auto, the impl will check the sparsity of the input and choose the 
underlying impl;

step4: sync the changes to the Python side.
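
To make the "auto" option in step3 concrete, here is a rough sketch of the 
selection logic in plain Python. It is only an illustration, not Spark code: 
the option strings ("row", "block", "auto"), the helper names and the 0.5 
density threshold are all assumptions, and a real impl would probably check 
sparsity on a sample of the input rather than on the full dataset.

```python
from typing import Sequence

# Hypothetical option names; the real param values are not decided yet.
ROW_BASED = "row"
BLOCK_BASED = "block"
AUTO = "auto"


def nonzero_fraction(rows: Sequence[Sequence[float]]) -> float:
    """Fraction of entries in the sampled rows that are non-zero."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    nonzeros = sum(1 for row in rows for value in row if value != 0.0)
    return nonzeros / total


def choose_solver(
    solver: str,
    sample: Sequence[Sequence[float]],
    dense_threshold: float = 0.5,  # assumed cut-off, would need benchmarking
) -> str:
    """Resolve the solver option to a concrete impl name."""
    if solver in (ROW_BASED, BLOCK_BASED):
        # An explicit choice by the end user is always respected.
        return solver
    if solver != AUTO:
        raise ValueError(f"unknown solver: {solver!r}")
    # auto: dense enough input goes to the block-based (dense) impl,
    # otherwise fall back to the existing row-based impl.
    if nonzero_fraction(sample) >= dense_threshold:
        return BLOCK_BASED
    return ROW_BASED
```

With this sketch, choose_solver("auto", sample) resolves to the block-based 
impl only when at least half of the sampled entries are non-zero, so sparse 
input keeps the existing row-based behavior.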

 

cc [~srowen] 

> KMeans blockify input vectors
> -----------------------------
>
>                 Key: SPARK-30661
>                 URL: https://issues.apache.org/jira/browse/SPARK-30661
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>         Attachments: blockify_kmeans.png
>
>




