[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...

2018-08-20 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/22136 @huaxingao thank you for your pull request. Can you please add a test to make sure this does not regress

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150650409 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150250118 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150247999 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r147661505 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r147661396 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r147661078 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software

[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-26 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh I recall now the reason for an extra `origin` field, which is to get around the standard issue of many small image files in S3 or other distributed file systems. It is standard to compact

[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-24 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh regarding the data representation, one could indeed have each of the representations encoded with the proper array information. This brings some additional complexity

[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-18 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh thank you for bringing up these questions. In response to your questions: > Does the current schema support or plan to support image feature data in Floats[] or Doub

[GitHub] spark pull request #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of...

2017-09-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r137603986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -109,31 +108,47 @@ object Summarizer extends Logging

[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-15 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 Thank you @yanboliang. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-10 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 @yanboliang do you feel comfortable merging this PR? I think that all the questions have been addressed.

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r131971123 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,587 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r131970836 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,587 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742319 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742836 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130741880 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742759 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742524 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742933 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130743131 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala --- @@ -0,0 +1,619 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130741348 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I am going to close this PR, since this is being taken over by @WeichenXu123 in #18798.

[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb closed the pull request at: https://github.com/apache/spark/pull/17419

[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 cc @hvanhovell as well.

[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 Thank you for the performance numbers @WeichenXu123 , I have a couple of comments: - you say that SQL uses adaptive compaction. How bad is that? I assume it adds some overhead. - did

[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 @WeichenXu123 thanks! Can you post some performance numbers as well?

[GitHub] spark pull request #18281: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-06-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18281#discussion_r121790182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -325,8 +343,13 @@ final class OneVsRest @Since("

[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-03-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17419#discussion_r109063248 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,746 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I looked a bit deeper into the performance aspect. Here are some quick insights: - there was an immediate bottleneck in `VectorUDT`, which boosts the performance already by 3x

[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-03-29 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17419#discussion_r108743634 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala --- @@ -335,4 +335,65 @@ class SummarizerSuite extends SparkFunSuite

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I have added a small perf test to find the performance bottlenecks. Note that this test works on the worst case (vectors of size 1) from the perspective of overhead. Here are the numbers I

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 @sethah it would have been nice, but I do not think we should merge it this late into the release cycle.

[GitHub] spark pull request #17419: [SPARK-19634][ML][WIP] Multivariate summarizer - ...

2017-03-24 Thread thunterdb
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/17419 [SPARK-19634][ML][WIP] Multivariate summarizer - dataframes API ## What changes were proposed in this pull request? This patch adds the DataFrames API to the multivariate summarizer

[GitHub] spark issue #17108: [SPARK-19636][ML] Feature parity for correlation statist...

2017-03-23 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17108 Tickets created: - https://issues.apache.org/jira/browse/SPARK-20076 - https://issues.apache.org/jira/browse/SPARK-20077

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505718 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/util/TestingUtils.scala --- @@ -32,6 +32,10 @@ object TestingUtils { * the relative

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505201 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505212 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505215 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505185 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505180 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/LinalgUtils.scala --- @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16483 It looks good to me. cc @jkbradley or @mengxr for final approval

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-17 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106746316 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16483 In addition, this introduces an extra reduction step at each iteration. I am fine with that since it is for correctness, but @jkbradley may want to comment as well.

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106529377 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -353,9 +365,19 @@ object PageRank extends Logging

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106532078 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106535595 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106528007 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -162,7 +162,15 @@ object PageRank extends Logging

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106309333 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala --- @@ -245,7 +245,7 @@ object

[GitHub] spark issue #17108: [SPARK-19636][ML] Feature parity for correlation statist...

2017-03-15 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17108 I moved the code `Correlations` as suggested. @imatiach-msft , I addressed your comments.

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106307446 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/StatisticsSuite.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106307517 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106306502 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/StatisticsSuite.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106306385 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106306111 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106305822 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

2017-03-14 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17110 @jkbradley LGTM, thanks!

[GitHub] spark issue #17215: [MINOR][ML] Improve MLWriter overwrite error message

2017-03-13 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17215 LGTM

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104813864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104813302 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -237,6 +237,41 @@ class

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812803 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812596 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812468 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812484 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-02-28 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r103596760 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,25 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-02-28 Thread thunterdb
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/17108 [SPARK-19636][ML] Feature parity for correlation statistics in MLlib ## What changes were proposed in this pull request? This patch adds the Dataframes-based support for the correlation

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-27 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 Note that any of these formats would cause trouble for a graph with high centrality (Lady Gaga in the Twitter graph). That being said, I do not have a strong opinion as to which option we pick

[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-02-22 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16971 @zhengruifeng thanks for looking into this issue. I have one comment above.

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102592174 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -78,7 +80,12 @@ object StatFunctions extends Logging

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102589719 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -89,18 +89,17 @@ final class DataFrameStatFunctions private

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 @wangmiao1981 yes I had seen the discussions there. I believe that eventually PIC should be moved into graphframes, but we can have a simple API in `spark.ml` for the time being.

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 You are right, I had forgotten that for this algorithm, the input is the edges, and the output is the label for each of the vertices. This is a tricky algorithm to put as a transformer

[GitHub] spark issue #14299: Ensure broadcasted variables are destroyed even in case ...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/14299 @AnthonyTruchet thank you for the PR. This is definitely worth fixing for large deployments. Now, as you noticed, this portion of code does not quite abide by the best engineering practices

[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16774 Thanks for working on this task, this is a much requested feature. While it will work for simple cases in the current shape, it is going to cause some issues for any complex deployments (Apache

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-02-17 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r101834675 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -121,6 +121,33 @@ class CrossValidatorSuite

[GitHub] spark issue #16973: [SPARKR][EXAMPLES] update examples to stop spark session

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16973 These changes look good to me, but my knowledge of R is very limited. @mengxr should confirm.

[GitHub] spark issue #16557: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16557 I agree, let's break up this PR. It will go faster, and some changes may require longer discussions.

[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16776 Sorry I missed the conversation here. LGTM.

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 @wangmiao1981 thanks a lot! I would be very happy to see that PR in Spark 2.2 and I will gladly help you with that.

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101666251 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101665899 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101664268 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101663790 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662332 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662273 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662038 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662018 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed

[GitHub] spark issue #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup ...

2016-11-11 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15826 @yanboliang that looks great, thank you. LGTM.

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r87267275 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since

[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-09 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15683 +1 for trying to get it into 2.1 (modulo tests)

[GitHub] spark issue #15809: [SPARK-18268][ML][MLLib] ALS fail with better message if...

2016-11-09 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15809 LGTM

[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15826#discussion_r87256910 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -110,21 +110,20 @@ class NaiveBayes @Since("

[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15826#discussion_r87256702 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -226,13 +206,33 @@ class NaiveBayes @Since("

[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15826#discussion_r87256589 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -226,13 +206,33 @@ class NaiveBayes @Since("

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87116504 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87113033 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87113839 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87113728 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py
