[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-840835158 IIUC, Spark uses the CSC representation. @fommil is that format represented in MTJ as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-840777158 @srowen makes perfect sense that such a change gets thoroughly tested! Next step is to add support for Sparse matrices and vectors. In your experience, how widely used are SparseMatrix and SparseVector used in Spark? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-839125900 It does show at https://github.com/apache/spark/blob/420802efbf9aabe3d3f709ec21102510b51dcfc0/dev/deps/spark-deps-hadoop-2.7-hive-2.3#L56 and https://github.com/apache/spark/blob/420802efbf9aabe3d3f709ec21102510b51dcfc0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L48. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-839056925 > So the `core` artifact is no longer part of the transitive deps? I would think `all` needs it, still, but, not sure. `core` is still part of the transitive dependencies through `breeze`. If we add `all`, then we made sure that it is on the classpath, and that `breeze` has access to the accelerated implementation in `com.github.fommil.netlib:all`. And `all` itself references `core`, so not referencing `core` isn't an issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-836932214 @srowen the profile is already back at https://github.com/apache/spark/pull/32415/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R3499. I think it's all done from my end, and tests are passing. Would you like to see something else before it gets merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-834226560 @fommil @srowen I got `com.github.fommil.netlib:all` running this morning, and here are the results with jmh: ``` Benchmark(implementation) (k) (m) (n) (transa) (transb) Mode CntScoreError Units DgemmBenchmark.blasnetlib101010 N N thrpt3 2007041.309 ± 223733.747 ops/s DgemmBenchmark.blasnetlib1010 1000 N N thrpt356665.072 ± 4830.521 ops/s DgemmBenchmark.blasnetlib10 100010 N N thrpt355195.724 ± 8013.092 ops/s DgemmBenchmark.blasnetlib10 1000 1000 N N thrpt3 487.054 ±193.030 ops/s DgemmBenchmark.blasnetlib 10001010 N N thrpt380996.035 ± 25222.496 ops/s DgemmBenchmark.blasnetlib 100010 1000 N N thrpt3 1137.933 ±247.434 ops/s DgemmBenchmark.blasnetlib 1000 100010 N N thrpt3 1007.815 ±478.144 ops/s DgemmBenchmark.blasnetlib 1000 1000 1000 N N thrpt3 29.972 ± 2.674 ops/s DgemmBenchmark.blasnative101010 N N thrpt3 2031140.286 ± 235323.704 ops/s DgemmBenchmark.blasnative1010 1000 N N thrpt356990.399 ± 8130.966 ops/s DgemmBenchmark.blasnative10 100010 N N thrpt354894.098 ± 3628.945 ops/s DgemmBenchmark.blasnative10 1000 1000 N N thrpt3 486.730 ± 16.599 ops/s DgemmBenchmark.blasnative 10001010 N N thrpt381452.518 ± 5438.969 ops/s DgemmBenchmark.blasnative 100010 1000 N N thrpt3 1154.301 ±198.139 ops/s DgemmBenchmark.blasnative 1000 100010 N N thrpt3 1013.764 ±210.225 ops/s DgemmBenchmark.blasnative 1000 1000 1000 N N thrpt3 30.047 ± 6.319 ops/s ``` We can observe there is no performance difference between `native` (`dev.ludovic.netlib.blas.JNIBLAS`) and `netlib` (`com.github.fommil.netlib.NativeSystemBLAS`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-833459432 > @luhenry btw netlib-java might still be pulled in transitively from the Breeze project. You might need to send PRs over there too (I'm not sure if it's still a dependency) It’s been done with https://github.com/scalanlp/breeze/pull/811. This transitive dependency is also why we need to keep the reference to `com.github.fommil.netlib:all` in `pom.xml`, as we wouldn’t want to slow down parts of Spark. I haven’t had a chance to test whether there is a slowdown, life obligations are catching up. I’ll do it by mid next week I expect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832996541 > Oh I see, this is native via ludovic netlib. Can you provide benchmarks vs netlib-java to confirm that there is no obvious perf regression? Let me get some numbers for that. I would appreciate your help in getting it to run as I remember having trouble trying to get it to pick the native library. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832527738 @srowen I am not observing much convergence even when ramping up `maxIter` to 1000. __`local[2]` + `maxIter=1000`:__ ![image](https://user-images.githubusercontent.com/660779/117117677-1e4a1700-ad90-11eb-98a5-544b142e4a2d.png) __`local[3]` + `maxIter=1000`:__ ![image](https://user-images.githubusercontent.com/660779/117117807-420d5d00-ad90-11eb-9147-21ba7248796f.png) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832290654 And the results for the last commits are at https://github.com/apache/spark/runs/2502327271 (I don't know why it doesn't show in the "All checks have passed" at the end of the discussion). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832285023 Some runs with various values of `max_iter`: __100:__ ![image](https://user-images.githubusercontent.com/660779/117075564-4fe1c480-ad35-11eb-889a-6df72eecacbc.png) __300:__ ![image](https://user-images.githubusercontent.com/660779/117075724-96cfba00-ad35-11eb-9831-6c0640c735ea.png) __1,000:__ ![image](https://user-images.githubusercontent.com/660779/117075621-6be56600-ad35-11eb-81e2-a537936593cd.png) However, when playing with the parallelism provided by Spark, that is where I can easily reproduce the huge variation for `summary.logLikelihood`. __`pyspark.SparkContext('local[1]')`:__ ![image](https://user-images.githubusercontent.com/660779/117076248-5f154200-ad36-11eb-8c47-38676dabede2.png) __`pyspark.SparkContext('local[2]')`:__ ![image](https://user-images.githubusercontent.com/660779/117076284-705e4e80-ad36-11eb-8bd1-291b7bd2a21d.png) __`pyspark.SparkContext('local[3]')`:__ ![image](https://user-images.githubusercontent.com/660779/117076424-a6033780-ad36-11eb-88c3-be61700a45fd.png) __`pyspark.SparkContext('local[4]')`:__ ![image](https://user-images.githubusercontent.com/660779/117076457-b61b1700-ad36-11eb-9346-d633153baa6b.png) That is all with Spark 3.1.1 and so unrelated to my changes. I would say that the flakiness observed in this PR is explained by this change as if it's a potential race in the implementation, changing the performance profile of the underlying operations can make the race more likely to happen. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832114209 And AFAIU, that's with sklearn: ![image](https://user-images.githubusercontent.com/660779/117044505-cc14e180-ad0e-11eb-8874-9df9c75dc2e5.png) I've no confidence it's computing the same thing though. It's based on the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832092686 From an older version of Spark (3.1.1), I get the following results: ![image](https://user-images.githubusercontent.com/660779/117040323-0d56c280-ad0a-11eb-92a7-4a5ef52173cf.png) This is not using anything from `dev.ludovic.netlib`. And it is running with `docker run --rm -it -p : jupyter/pyspark-notebook` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-832072136 From looking at https://github.com/luhenry/spark/runs/2500079223 and https://github.com/luhenry/spark/runs/2500065249, it looks like an intermittent failure. I haven't had a chance to successfully reproduce numbers on scikit-learn/sklearn. What in your experience could lead to this variability in the results? Is there some source of randomness in these algorithms? (I'm far from a data scientist and have only played with a handful of ML algorithms, but nothing serious. I'm trying to learn though!) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-831542712 > It's entirely possible that 93.3 is a more correct log-likelihood. Usually we check some other implementation if possible to verify. "Other implementation" as in another ML tool than Spark? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-831514938 The error is the following: ``` File "/__w/spark/spark/python/pyspark/ml/clustering.py", line 276, in __main__.GaussianMixture Failed example: summary.logLikelihood Expected: 65.02945... Got: 93.36008975083433 ``` I'll take a look tomorrow at where this discrepancies is coming from. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
luhenry commented on pull request #32415: URL: https://github.com/apache/spark/pull/32415#issuecomment-831212439 /cc @srowen I have release `dev.ludovic.netlib:2.0.0` and I've updated this PR accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org