[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM

2020-12-18 Thread GitBox


zhengruifeng commented on pull request #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-747989608


   @xwu99  Sorry for the late reply.
   
We have just made SVC/LiR/LoR/AFT use blocks instead of instances in 3.1.0, but with a new adaptive way to blockify instances (https://github.com/apache/spark/pull/30009 and https://github.com/apache/spark/pull/30355).
   
In your performance tests above, KMeans was about 2.25x faster with `MKL` on a dense dataset.
   
   But now I have some new concerns:
   
1, the performance of GEMM depends heavily on the underlying native BLAS, while the performance gap between native BLAS and `f2jBLAS` for GEMV and DOT is relatively small:
   
   1.1 in performance tests 
[SPARK-31783](https://issues.apache.org/jira/browse/SPARK-31783),
   
binary-lor: enabling native BLAS does not bring much speed-up
   
![image](https://user-images.githubusercontent.com/7322292/102595660-ec267880-4152-11eb-8a9e-d947a91df809.png)
   
multi-lor: enabling native BLAS doubles the performance
   
![image](https://user-images.githubusercontent.com/7322292/102595830-3a3b7c00-4153-11eb-845c-ef37b4a7e8a9.png)
   
1.2 [GMM](https://github.com/apache/spark/pull/27473) based on GEMM is 3X faster than the old impl only if native BLAS is used; with `f2jBLAS`, it is even slower than the old impl. Partially because of this, we then reverted it.
   
1.3 [ALS](https://github.com/apache/spark/pull/30468) based on GEMM is even slower than the DOT-based version when native BLAS is not used;
   
1.4 in my [previous attempts](https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1) to use GEMM in KMeans (also mentioned above), I found that using GEMM hurt performance when the dataset is sparse.
   
So I think using GEMM to accelerate KMeans will only help when 1, the input dataset is dense, and 2, native BLAS is enabled. But in my opinion, most likely, a Spark cluster does not support native BLAS by default.
   
2, large `k`
Different from multi-lor, whose number of classes is likely a relatively small number, `k` in KMeans can be large. In my recent practical work, we set `k`>5000 to group vectors, and then search nearest neighbors within each group (recall in this way is much better than with LSH).
In each block, GEMM needs a buffer of size `k`*`blockSize`, which may be dangerous.
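
   For concreteness, the following is a minimal sketch (my own illustration, not code from this PR or any linked one) of a GEMM-based distance computation for a single dense block, written against the public `org.apache.spark.ml.linalg` API; the cross-term matrix it allocates per block is exactly the `k`*`blockSize` buffer mentioned above:

```scala
import org.apache.spark.ml.linalg.DenseMatrix

// block:   blockSize x numFeatures, one instance per row (dense)
// centers: k x numFeatures, one center per row
// Returns a blockSize x k matrix of squared distances.
def blockSquaredDistances(block: DenseMatrix, centers: DenseMatrix): DenseMatrix = {
  val m = block.numRows      // blockSize
  val k = centers.numRows

  // Squared L2 norms of the rows of a dense matrix.
  def rowNorms(mat: DenseMatrix): Array[Double] =
    Array.tabulate(mat.numRows) { i =>
      var s = 0.0
      var j = 0
      while (j < mat.numCols) { val v = mat(i, j); s += v * v; j += 1 }
      s
    }

  val xNorms = rowNorms(block)
  val cNorms = rowNorms(centers)

  // The GEMM: (blockSize x numFeatures) * (numFeatures x k) = blockSize x k.
  // This is the k * blockSize buffer that must be materialized for every block.
  val cross = block.multiply(centers.transpose)

  // ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * <x, c>, stored column-major.
  val values = Array.ofDim[Double](m * k)
  var c = 0
  while (c < k) {
    var i = 0
    while (i < m) {
      values(c * m + i) = xNorms(i) + cNorms(c) - 2.0 * cross(i, c)
      i += 1
    }
    c += 1
  }
  new DenseMatrix(m, k, values)
}
```

   With a large `k` this buffer is not small: `k` = 5000 and `blockSize` = 4096 already gives about 20M doubles (~160 MB) per block.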
   
   
In summary, I am now very conservative here, and am considering optimizing KMeans in other ways, like using GEMV (which works in ALS, but vectors in ALS are always dense, so I am not sure about its performance on sparse datasets).
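
   As a rough sketch of what such a GEMV path could look like (again only my illustration, assuming dense vectors and precomputed squared norms), a single GEMV per instance produces all `k` dot products with only a `k`-sized buffer instead of `k`*`blockSize`:

```scala
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}

// centers: k x numFeatures; point: length numFeatures.
// pointNorm and centerNorms are precomputed squared L2 norms.
def pointSquaredDistances(
    point: DenseVector,
    pointNorm: Double,
    centers: DenseMatrix,
    centerNorms: Array[Double]): Array[Double] = {
  // The GEMV: (k x numFeatures) * (numFeatures) = k dot products in one call.
  val dots = centers.multiply(point)
  Array.tabulate(centers.numRows) { c =>
    pointNorm + centerNorms(c) - 2.0 * dots(c)
  }
}
```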
   
Also ping @srowen, since I guess you may be interested in this area.
   






[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM

2020-05-31 Thread GitBox


zhengruifeng commented on pull request #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-636473457


   @xwu99 Thanks for your work. The speedup is promising. 
   
Since this issue (blockify + gemv/gemm) needs more discussion with other committers, I am retesting those algorithms (current results are attached in https://issues.apache.org/jira/browse/SPARK-31783).
   
I'm afraid I can review this PR only after an agreement with the other committers is reached.
   
   






[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM

2020-05-06 Thread GitBox


zhengruifeng commented on pull request #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-624991855


   @xwu99 I think you can also refer to those two PRs, since some utils were 
added.






[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM

2020-05-06 Thread GitBox


zhengruifeng commented on pull request #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-624989291


@xwu99 There was a [ticket](https://issues.apache.org/jira/browse/SPARK-30641) for this.
Now I have merged high-level BLAS support for [LinearSVC](https://github.com/apache/spark/pull/27360) and [LogisticRegression](https://github.com/apache/spark/pull/28458).
   
I can help review this PR on KMeans; you can list some performance details, like dataset, numFeatures, numInstances, performance without native BLAS (MKL), performance with native BLAS...
   
   






[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM

2020-04-28 Thread GitBox


zhengruifeng commented on pull request #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-620990814


@xwu99 My previous work includes:
LinearSVC: https://github.com/apache/spark/pull/27360
LogisticRegression: https://github.com/apache/spark/pull/27374
LinearRegression: https://github.com/apache/spark/pull/27396
GaussianMixture: https://github.com/apache/spark/pull/27473
KMeans: https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1, not sent
   
I'm reworking `LinearSVC`/`LogisticRegression`/`LinearRegression`/`GaussianMixture`. For KMeans, I am glad you can take it over.
   
I just recreated a [PR](https://github.com/apache/spark/pull/28349) for LinearSVC; the main idea is to use the expert param `blockSize` to choose the path. The original path will be chosen by default to avoid performance regression on sparse datasets.
   
If nobody objects, I will merge it, and then the other three impls, `LogisticRegression`/`LinearRegression`/`GaussianMixture` (since they depend on the first one, I have not recreated their PRs yet).






[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM

2020-04-28 Thread GitBox


zhengruifeng commented on pull request #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-620975624


   > I saw your PR was merged, I will rebase. 
   
I had some PRs on using high-level BLAS in LoR/LiR/SVC/GMM; they were reverted because of performance regression on sparse datasets.
I am now working on it again, using param `blockSize == 1` to choose the impl.
I am also waiting for more feedback. If nobody objects, I will merge them.
   
There are some common utils in those PRs, which should also be used in KMeans. So I think you can rebase this PR after [SVC](https://github.com/apache/spark/pull/28349) gets merged.
   


