[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478437#comment-17478437
 ] 

zhengruifeng commented on SPARK-30661:
--------------------------------------

Recently I spent some time testing a blockified KMeans that applies GEMM to 
find the closest cluster.
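
For readers unfamiliar with the GEMM approach, here is a minimal sketch in plain Python. It is not the Spark code: the function names are made up for illustration, and the nested-loop matrix product only stands in for a real BLAS gemm call. It relies on the identity ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, so the cross terms for a whole block of rows come from one matrix product.

```python
def matmul_transposed(block, centers):
    """Return block * centers^T as a list of rows (stand-in for GEMM)."""
    return [[sum(x * c for x, c in zip(row, center)) for center in centers]
            for row in block]


def closest_clusters(block, centers):
    """Assign every row of a block to the index of its nearest center.

    ||x||^2 is the same for every center, so it is dropped before the argmin.
    """
    center_sq_norms = [sum(c * c for c in center) for center in centers]
    dots = matmul_transposed(block, centers)  # one matrix product per block
    assignments = []
    for row_dots in dots:
        scores = [n - 2.0 * d for n, d in zip(center_sq_norms, row_dots)]
        assignments.append(min(range(len(centers)), key=scores.__getitem__))
    return assignments


block = [[0.0, 0.1], [5.0, 5.2], [0.3, -0.2], [4.8, 5.1]]
centers = [[0.0, 0.0], [5.0, 5.0]]
print(closest_clusters(block, centers))  # [0, 1, 0, 1]
```

Note that every row-to-center product is always computed here, which is what makes the approach a good fit for native BLAS on dense data.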

In short:

1. For sparse datasets, blockifying KMeans still causes a regression in most 
cases. (The existing impl uses the triangle inequality to skip some distance 
computations, but the Scala-based sparse BLAS always computes all distances.)

2. For dense datasets and small k, blockified KMeans (without native BLAS) is 
competitive; with native BLAS, it should be significantly faster than the 
existing impl.
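
To make the triangle-inequality skipping in point 1 concrete, here is a rough sketch in plain Python. It is an illustration under my own assumptions, not the existing Spark impl, and all names are hypothetical. The rule used: if d(c_best, c_j) >= 2 * d(x, c_best), then c_j cannot be closer to x than c_best, so d(x, c_j) need not be computed.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def pairwise_center_distances(centers):
    """Center-to-center distances, computed once and shared by all points."""
    k = len(centers)
    return [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]


def closest_with_pruning(point, centers, center_dists):
    """Return (best_index, number of point-to-center distances computed)."""
    best = 0
    best_dist = dist(point, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # Triangle inequality: c_j cannot beat the current best, skip it.
        if center_dists[best][j] >= 2.0 * best_dist:
            continue
        d = dist(point, centers[j])
        computed += 1
        if d < best_dist:
            best, best_dist = j, d
    return best, computed


centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
center_dists = pairwise_center_distances(centers)
print(closest_with_pruning([0.5, 0.5], centers, center_dists))  # (0, 1)
print(closest_with_pruning([9.0, 1.0], centers, center_dists))  # (1, 2)
```

In both calls fewer than three distances are computed, while a GEMM over the block would compute all of them.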

 

So I plan to add a new parameter {{solver}} by making KMeans extend 
HasSolver, and to support both training impls, so that end users can switch to 
the blockified version.
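
As a rough picture of what switching between the two impls could look like, here is a small plain-Python sketch. Everything in it is hypothetical: the solver values "row" and "block", the function names, and the block size are placeholders chosen for illustration, not the proposed Spark API.

```python
def assign_row(points, centers):
    """Row-wise path: one point at a time, explicit squared distances."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in points]


def assign_block(points, centers, block_size=2):
    """Block path: walk fixed-size blocks and score rows via dot products."""
    norms = [sum(c * c for c in center) for center in centers]
    out = []
    for start in range(0, len(points), block_size):
        for x in points[start:start + block_size]:
            scores = [n - 2.0 * sum(a * b for a, b in zip(x, c))
                      for n, c in zip(norms, centers)]
            out.append(min(range(len(centers)), key=scores.__getitem__))
    return out


# Hypothetical solver names mapped to the two implementations.
SOLVERS = {"row": assign_row, "block": assign_block}


def assign(points, centers, solver="row"):
    """Dispatch to the implementation selected by the solver value."""
    if solver not in SOLVERS:
        raise ValueError("unsupported solver: %r" % (solver,))
    return SOLVERS[solver](points, centers)


points = [[0.0, 0.1], [5.0, 5.2], [0.3, -0.2]]
centers = [[0.0, 0.0], [5.0, 5.0]]
print(assign(points, centers, solver="row"))
print(assign(points, centers, solver="block"))
```

Both paths should return the same assignments; only the way the distances are computed differs.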

 

What do you think? [~srowen] [~WeichenXu123] [~mengxr] [~huaxingao] 

 

> KMeans blockify input vectors
> -----------------------------
>
>                 Key: SPARK-30661
>                 URL: https://issues.apache.org/jira/browse/SPARK-30661
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>



