[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-13 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-840835158


   IIUC, Spark uses the CSC representation. @fommil is that format represented 
in MTJ as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-13 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-840777158


   @srowen makes perfect sense that such a change gets thoroughly tested! Next 
step is to add support for Sparse matrices and vectors. In your experience, how 
widely used are SparseMatrix and SparseVector used in Spark?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-11 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-839125900


   It does show at 
https://github.com/apache/spark/blob/420802efbf9aabe3d3f709ec21102510b51dcfc0/dev/deps/spark-deps-hadoop-2.7-hive-2.3#L56
 and 
https://github.com/apache/spark/blob/420802efbf9aabe3d3f709ec21102510b51dcfc0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L48.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-11 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-839056925


   > So the `core` artifact is no longer part of the transitive deps? I would 
think `all` needs it, still, but, not sure.
   
   `core` is still part of the transitive dependencies through `breeze`. If we 
add `all`, then we made sure that it is on the classpath, and that `breeze` has 
access to the accelerated implementation in `com.github.fommil.netlib:all`. And 
`all` itself references `core`, so not referencing `core` isn't an issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-10 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-836932214


   @srowen the profile is already back at 
https://github.com/apache/spark/pull/32415/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R3499.
 I think it's all done from my end, and tests are passing. Would you like to 
see something else before it gets merged?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-07 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-834226560


   @fommil @srowen I got `com.github.fommil.netlib:all` running this morning, 
and here are the results with jmh:
   
   ```
   Benchmark(implementation)   (k)   (m)   (n)  (transa)  (transb)  
 Mode  CntScoreError  Units
   DgemmBenchmark.blasnetlib101010 N N  
thrpt3  2007041.309 ± 223733.747  ops/s
   DgemmBenchmark.blasnetlib1010  1000 N N  
thrpt356665.072 ±   4830.521  ops/s
   DgemmBenchmark.blasnetlib10  100010 N N  
thrpt355195.724 ±   8013.092  ops/s
   DgemmBenchmark.blasnetlib10  1000  1000 N N  
thrpt3  487.054 ±193.030  ops/s
   DgemmBenchmark.blasnetlib  10001010 N N  
thrpt380996.035 ±  25222.496  ops/s
   DgemmBenchmark.blasnetlib  100010  1000 N N  
thrpt3 1137.933 ±247.434  ops/s
   DgemmBenchmark.blasnetlib  1000  100010 N N  
thrpt3 1007.815 ±478.144  ops/s
   DgemmBenchmark.blasnetlib  1000  1000  1000 N N  
thrpt3   29.972 ±  2.674  ops/s
   DgemmBenchmark.blasnative101010 N N  
thrpt3  2031140.286 ± 235323.704  ops/s
   DgemmBenchmark.blasnative1010  1000 N N  
thrpt356990.399 ±   8130.966  ops/s
   DgemmBenchmark.blasnative10  100010 N N  
thrpt354894.098 ±   3628.945  ops/s
   DgemmBenchmark.blasnative10  1000  1000 N N  
thrpt3  486.730 ± 16.599  ops/s
   DgemmBenchmark.blasnative  10001010 N N  
thrpt381452.518 ±   5438.969  ops/s
   DgemmBenchmark.blasnative  100010  1000 N N  
thrpt3 1154.301 ±198.139  ops/s
   DgemmBenchmark.blasnative  1000  100010 N N  
thrpt3 1013.764 ±210.225  ops/s
   DgemmBenchmark.blasnative  1000  1000  1000 N N  
thrpt3   30.047 ±  6.319  ops/s
   ```
   
   We can observe there is no performance difference between `native` 
(`dev.ludovic.netlib.blas.JNIBLAS`) and `netlib` 
(`com.github.fommil.netlib.NativeSystemBLAS`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-06 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-833459432


   > @luhenry btw netlib-java might still be pulled in transitively from the 
Breeze project. You might need to send PRs over there too (I'm not sure if it's 
still a dependency) 
   
   It’s been done with https://github.com/scalanlp/breeze/pull/811. This 
transitive dependency is also why we need to keep the reference to 
`com.github.fommil.netlib:all` in `pom.xml`, as we wouldn’t want to slow down 
parts of Spark.
   
   I haven’t had a chance to test whether there is a slowdown, life obligations 
are catching up. I’ll do it by mid next week I expect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-05 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832996541


   > Oh I see, this is native via ludovic netlib. Can you provide benchmarks vs 
netlib-java to confirm that there is no obvious perf regression?
   
   Let me get some numbers for that. I would appreciate your help in getting it 
to run as I remember having trouble trying to get it to pick the native library.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-05 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832527738


   @srowen I am not observing much convergence even when ramping up `maxIter` 
to 1000.
   
   __`local[2]` + `maxIter=1000`:__
   
![image](https://user-images.githubusercontent.com/660779/117117677-1e4a1700-ad90-11eb-98a5-544b142e4a2d.png)
   
   __`local[3]` + `maxIter=1000`:__
   
![image](https://user-images.githubusercontent.com/660779/117117807-420d5d00-ad90-11eb-9147-21ba7248796f.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-04 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832290654


   And the results for the last commits are at 
https://github.com/apache/spark/runs/2502327271 (I don't know why it doesn't 
show in the "All checks have passed" at the end of the discussion).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-04 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832285023


   Some runs with various values of `max_iter`:
   __100:__
   
![image](https://user-images.githubusercontent.com/660779/117075564-4fe1c480-ad35-11eb-889a-6df72eecacbc.png)
   
   __300:__
   
![image](https://user-images.githubusercontent.com/660779/117075724-96cfba00-ad35-11eb-9831-6c0640c735ea.png)
   
   __1,000:__
   
![image](https://user-images.githubusercontent.com/660779/117075621-6be56600-ad35-11eb-81e2-a537936593cd.png)
   
   However, when playing with the parallelism provided by Spark, that is where 
I can easily reproduce the huge variation for `summary.logLikelihood`.
   
   __`pyspark.SparkContext('local[1]')`:__
   
![image](https://user-images.githubusercontent.com/660779/117076248-5f154200-ad36-11eb-8c47-38676dabede2.png)
   
   __`pyspark.SparkContext('local[2]')`:__
   
![image](https://user-images.githubusercontent.com/660779/117076284-705e4e80-ad36-11eb-8bd1-291b7bd2a21d.png)
   
   __`pyspark.SparkContext('local[3]')`:__
   
![image](https://user-images.githubusercontent.com/660779/117076424-a6033780-ad36-11eb-88c3-be61700a45fd.png)
   
   __`pyspark.SparkContext('local[4]')`:__
   
![image](https://user-images.githubusercontent.com/660779/117076457-b61b1700-ad36-11eb-9346-d633153baa6b.png)
   
   That is all with Spark 3.1.1 and so unrelated to my changes. I would say 
that the flakiness observed in this PR is explained by this change as if it's a 
potential race in the implementation, changing the performance profile of the 
underlying operations can make the race more likely to happen.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-04 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832114209


   And AFAIU, that's with sklearn:
   
   
![image](https://user-images.githubusercontent.com/660779/117044505-cc14e180-ad0e-11eb-8874-9df9c75dc2e5.png)
   
   I've no confidence it's computing the same thing though. It's based on the 
documentation at 
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-04 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832092686


   From an older version of Spark (3.1.1), I get the following results:
   
   
![image](https://user-images.githubusercontent.com/660779/117040323-0d56c280-ad0a-11eb-92a7-4a5ef52173cf.png)
   
   This is not using anything from `dev.ludovic.netlib`. And it is running with 
`docker run --rm -it -p : jupyter/pyspark-notebook` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-04 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-832072136


   From looking at https://github.com/luhenry/spark/runs/2500079223 and 
https://github.com/luhenry/spark/runs/2500065249, it looks like an intermittent 
failure. I haven't had a chance to successfully reproduce numbers on 
scikit-learn/sklearn.
   
   What in your experience could lead to this variability in the results? Is 
there some source of randomness in these algorithms? (I'm far from a data 
scientist and have only played with a handful of ML algorithms, but nothing 
serious. I'm trying to learn though!)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-03 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-831542712


   > It's entirely possible that 93.3 is a more correct log-likelihood. Usually 
we check some other implementation if possible to verify.
   
   "Other implementation" as in another ML tool than Spark?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-03 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-831514938


   The error is the following:
   ```
   File "/__w/spark/spark/python/pyspark/ml/clustering.py", line 276, in 
__main__.GaussianMixture
   Failed example:
   summary.logLikelihood
   Expected:
   65.02945...
   Got:
   93.36008975083433
   ```
   
   I'll take a look tomorrow at where this discrepancies is coming from.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] luhenry commented on pull request #32415: [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-03 Thread GitBox


luhenry commented on pull request #32415:
URL: https://github.com/apache/spark/pull/32415#issuecomment-831212439


   /cc @srowen I have release `dev.ludovic.netlib:2.0.0` and I've updated this 
PR accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org