[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM
zhengruifeng commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-747989608 @xwu99 Sorry for the late reply. We had just make SVC/LiR/LoR/AFT using blocks instead of instances in 3.1.0, but in a new adaptive way to blockify instances (https://github.com/apache/spark/pull/30009 and https://github.com/apache/spark/pull/30355). In you above performance tests, KMeans was about 2.25 faster with `MKL` on a dense dataset. But now I have some new concerns: 1, the performance of GEMM is highly related to the underlying native blas, while the performance gap between native blas and `f2jBLAS` for GEMV and DOT is relatively small: 1.1 in performance tests [SPARK-31783](https://issues.apache.org/jira/browse/SPARK-31783), binary-lor: enabling native blas do not bring too much speed up ![image](https://user-images.githubusercontent.com/7322292/102595660-ec267880-4152-11eb-8a9e-d947a91df809.png) multi-lor: enabling native blas double the performance ![image](https://user-images.githubusercontent.com/7322292/102595830-3a3b7c00-4153-11eb-845c-ef37b4a7e8a9.png) 1.2 [GMM](https://github.com/apache/spark/pull/27473) based on GEMM is 3X faster than old impl, only if native blas is used; With `f2jBLAS`, it is even slower than the old impl. Partially beacause of this, we then reverted it. 1.3 [ALS](https://github.com/apache/spark/pull/30468) based on GEMM is even slower than that based on DOT, without native blas; 1.4 in my [previous attempts](https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1) to use GEMM in KMeans (also mentioned above), I found that using GEMM hurted performace when dataset is sparse. So I think, using GEMM to accelerate KMeans will only help when 1,input dataset is dense, and 2,the native blas is enabled. But in my opinion, most likely, a spark cluster does not support native blas by default. 2, large `k` Different from multi-lor whose number of classes is likely a relately small number, `k` in KMeans can be a large number. In my recently practical works, we set `k`>5000 to group vectors, and then serach nerest neighborhood within each group (recall in this way is much better than LSH). In each block, GEMM need a buffer of size `k`*`blockSize`, which maybe dangerous. In summary, I am now very conservative here, and am considering optimizing KMeans in other ways like using GEMV (which works in ALS, but vectors in ALS are always dense, I am not sure its performance on sparse dataset). also ping @srowen , since I guess you maybe interested in this field. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM
zhengruifeng commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-636473457 @xwu99 Thanks for your work. The speedup is promising. Since this issue (blockify+gemv/gemm) need more discussion with other committers, so I am working on retest those algorithms (current result was attached in https://issues.apache.org/jira/browse/SPARK-31783). I'm afraid I can review this PR only after an agreement with other committers get reached. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM
zhengruifeng commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-624991855 @xwu99 I think you can also refer to those two PRs, since some utils were added. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM
zhengruifeng commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-624989291 @xwu99 There was a [ticket](https://issues.apache.org/jira/browse/SPARK-30641) for this. Now I had merged high-level BLAS supports for [LinearSVC](https://github.com/apache/spark/pull/27360) and [LogisticRegression](https://github.com/apache/spark/pull/28458). I can help reviewing this PR on KMeans, you can list some performance details, like dataset, numFeatures, numInstances, performance without native-blas (mkl), performance with native-blas... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM
zhengruifeng commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-620990814 @xwu99 My previous works include: LinearSVC: https://github.com/apache/spark/pull/27360 LogisticRegression: https://github.com/apache/spark/pull/27374 LinearRegression: https://github.com/apache/spark/pull/27396 GaussianMixture: https://github.com/apache/spark/pull/27473 KMeans: https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1, not send I'm reworking on `LinearSVC`/`LogisticRegression`/`LinearRegression`/`GaussianMixture`. For KMeans I am glad you can take it over. I just recreate a new [PR](https://github.com/apache/spark/pull/28349) for LinearSVC, the main idea is to use expert param `blockSize` to choose the path. The original path will be choosen by default to avoid performance regression on sparse datasets. If nobody object, I will merge it, and then other three impls (since they depend on the first one, I do not recreate PRs right now) `LogisticRegression`/`LinearRegression`/`GaussianMixture`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #28229: [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM
zhengruifeng commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-620975624 > I saw your PR was merged, I will rebase. I had some reverted PRs on using high-level BLAS in LoR/LiR/SVC/GMM, they were reverted because of performance regression on sparse datasets; I am now working on it again, using param `blockSize==1` to choose the impl. I am also waiting for more feedbacks. If nobody object, I will merge them. There are some common utils in those PRs, which should also be used in KMeans. So I think you can rebase this PR after [SVC](https://github.com/apache/spark/pull/28349) get merged. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org