[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87116350 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,134 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87116745 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87116485 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark issue #15074: [SPARK-17520] Implement a better __eq__ for SparseMatrix

2016-11-08 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15074 @dbtsai and @holdenk have a good point about the equality being a actually an equivalence check (we want to check that two matrices have the same content, not that two python objects have

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

2016-11-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12819#discussion_r87029278 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -109,10 +118,89 @@ class NaiveBayes @Since("

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

2016-11-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12819#discussion_r87025609 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -109,10 +118,89 @@ class NaiveBayes @Since("

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

2016-11-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12819#discussion_r87025222 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -109,10 +118,89 @@ class NaiveBayes @Since("

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

2016-11-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12819#discussion_r87025073 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -109,10 +118,89 @@ class NaiveBayes @Since("

[GitHub] spark issue #15051: [SPARK-17499][SparkR][ML][MLLib] make the default params...

2016-09-14 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15051 LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78610830 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite

[GitHub] spark issue #15002: [SQL][SPARK-17439] Fixing compression issues with approx...

2016-09-13 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15002 Thank you @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #15002: [SQL][SPARK-17439] Fixing compression issues with...

2016-09-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15002#discussion_r77905310 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala --- @@ -258,7 +263,10 @@ object QuantileSummaries

[GitHub] spark pull request #15002: [ML][SPARK-17439] Fixing compression issues with ...

2016-09-07 Thread thunterdb
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/15002 [ML][SPARK-17439] Fixing compression issues with approximate quantiles and adding more tests ## What changes were proposed in this pull request? This PR build on #14976 and fixes

[GitHub] spark issue #15002: [ML][SPARK-17439] Fixing compression issues with approxi...

2016-09-07 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15002 cc @clockfly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #14976: [SPARK-17306] [SQL] QuantileSummaries doesn't com...

2016-09-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/14976#discussion_r77902612 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala --- @@ -59,9 +59,14 @@ class QuantileSummaries

[GitHub] spark issue #14976: [SPARK-17306] [SQL] QuantileSummaries doesn't compress

2016-09-06 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/14976 By the way, you gave some great advice here. Is there a page on the wiki where we collect all this internal knowledge? --- If your project is set up for it, you can reply to this email and have

[GitHub] spark issue #14976: [SPARK-17306] [SQL] QuantileSummaries doesn't compress

2016-09-06 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/14976 @srowen LGTM, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes

[GitHub] spark pull request #14976: [SPARK-17306] [SQL] QuantileSummaries doesn't com...

2016-09-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/14976#discussion_r77735611 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala --- @@ -236,7 +240,7 @@ object QuantileSummaries

[GitHub] spark pull request #14976: [SPARK-17306] [SQL] QuantileSummaries doesn't com...

2016-09-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/14976#discussion_r77734419 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala --- @@ -59,9 +59,14 @@ class QuantileSummaries

[GitHub] spark issue #14603: [SPARK-17021][SQL] simplify the constructor parameters o...

2016-08-11 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/14603 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #14298: [SPARK-16283][SQL] Implement `percentile_approx` ...

2016-07-25 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/14298#discussion_r72137984 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileApprox.scala --- @@ -0,0 +1,456

[GitHub] spark pull request #14298: [SPARK-16283][SQL] Implement `percentile_approx` ...

2016-07-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/14298#discussion_r71740068 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileApprox.scala --- @@ -0,0 +1,456

[GitHub] spark issue #14023: [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for...

2016-07-05 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/14023 This looks good to me (as the original reporter of the issue). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark pull request #13789: [SPARK-16074] [MLLIB] expose VectorUDT/MatrixUDT ...

2016-06-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13789#discussion_r67778583 --- Diff: mllib/src/main/scala/org/apache/spark/ml/linalg/dataTypes.scala --- @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/13176 I will try as well this afternoon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: Master

2016-05-31 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/13256 @Luwein can you please close this pull request? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark pull request: [SPARK-12922][SparkR][WIP] Implement gapply() ...

2016-05-26 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12836#issuecomment-221930188 That sounds good, we can add `aggregate` later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

2016-05-26 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13139#discussion_r64779485 --- Diff: docs/ml-classification-regression.md --- @@ -374,6 +374,148 @@ regression model and extracting model summary statistics

[GitHub] spark pull request: [SPARK-12922][SparkR][WIP] Implement gapply() ...

2016-05-23 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12836#issuecomment-221109113 @NarineK thank you for working on it, this is a great addition to SparkR! Regarding the API, it is very close to R's `aggregate` function in the `stats

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

2016-05-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13139#discussion_r64248278 --- Diff: docs/ml-classification-regression.md --- @@ -374,6 +374,197 @@ regression model and extracting model summary statistics

[GitHub] spark pull request: [SPARK-15402] [ML] [PySpark] PySpark ml.evalua...

2016-05-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13194#discussion_r64090534 --- Diff: python/pyspark/ml/evaluation.py --- @@ -316,17 +335,27 @@ def setParams(self, predictionCol="prediction", label

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

2016-05-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13139#discussion_r64086909 --- Diff: docs/ml-classification-regression.md --- @@ -374,6 +374,148 @@ regression model and extracting model summary statistics

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

2016-05-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13139#discussion_r64086417 --- Diff: docs/ml-classification-regression.md --- @@ -374,6 +374,197 @@ regression model and extracting model summary statistics

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

2016-05-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13139#discussion_r64086020 --- Diff: docs/ml-classification-regression.md --- @@ -374,6 +374,197 @@ regression model and extracting model summary statistics

[GitHub] spark pull request: [SPARK-15413] [ML] [MLLIB] Change `toBreeze` t...

2016-05-20 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/13198#issuecomment-220677539 It seems to me that there is not much cost (other than us arguing for or against) and a slightly positive benefit (it is more clear that no expensive copy occurs

[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13176#discussion_r64079535 --- Diff: docs/ml-features.md --- @@ -1064,7 +1069,8 @@ categorical features. The bin ranges are chosen by taking a sample of the data and dividing

[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13176#discussion_r64079509 --- Diff: docs/ml-features.md --- @@ -1064,7 +1069,8 @@ categorical features. The bin ranges are chosen by taking a sample of the data and dividing

[GitHub] spark pull request: [SPARK-15282][SQL] Make ScalaUDF nondeterminis...

2016-05-20 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/13087#issuecomment-220659554 @marmbrus +1 I suggest to change the documentation of UDFs instead to underline the expectation that they be deterministic for the time being. --- If your

[GitHub] spark pull request: [SPARK-13786][ML][PYTHON] Removed save/load fo...

2016-04-29 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12782#issuecomment-215917416 @jkbradley this looks to me --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark pull request: [SPARK-13786][ML][PYTHON] Removed save/load fo...

2016-04-29 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12782#discussion_r61661429 --- Diff: python/pyspark/ml/tests.py --- @@ -44,8 +44,7 @@ from pyspark import keyword_only from pyspark.ml import Estimator, Model

[GitHub] spark pull request: [SPARK-14646][ML] Modified Kmeans to store clu...

2016-04-29 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12792#issuecomment-215900920 As discussed in private, kmeans persistence is new for this release, so breaking the format is not an issue at this point. LGTM --- If your project is set

[GitHub] spark pull request: [WIP][SPARKR][SPARK-14831]Make the SparkR MLli...

2016-04-29 Thread thunterdb
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/12789 [WIP][SPARKR][SPARK-14831]Make the SparkR MLlib API more consistent with Spark ## What changes were proposed in this pull request? This is a work in progress, not intended to be merged

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-29 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12536#issuecomment-215826014 @BenFradet thanks! @jkbradley can you make a final pass? --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-29 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12426#issuecomment-215821456 @shivaram @felixcheung @dongjoon-hyun thank you for your comments on my first R pull request! Also, I put a note in the ticket about updating

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-27 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12426#discussion_r61347932 --- Diff: R/pkg/R/context.R --- @@ -226,6 +226,49 @@ setCheckpointDir <- function(sc, dirName) { invisible(callJMethod(sc, "setCheck

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-27 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12536#issuecomment-215183774 @BenFradet it looks like there are some compilation issues --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark pull request: [SPARK-14862][ML] Updated Classifiers to not r...

2016-04-27 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12663#issuecomment-215091481 A quick look at the source code of scikit-learn shows that it always reindexes, but it uses some efficient numpy primitive for doing that. I think assuming an index

[GitHub] spark pull request: [SPARK-14862][ML] Updated Classifiers to not r...

2016-04-27 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12663#discussion_r61259678 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ClassifierSuite.scala --- @@ -29,4 +69,31 @@ object ClassifierSuite

[GitHub] spark pull request: [SPARK-14862][ML] Updated Classifiers to not r...

2016-04-27 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12663#discussion_r61259664 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -126,16 +127,16 @@ final class GBTClassifier @Since("

[GitHub] spark pull request: [SPARK-14862][ML] Updated Classifiers to not r...

2016-04-27 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12663#discussion_r61259639 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -62,6 +65,57 @@ abstract class Classifier[ def

[GitHub] spark pull request: [SPARK-14862][ML] Updated Classifiers to not r...

2016-04-27 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12663#discussion_r61259624 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -62,6 +65,57 @@ abstract class Classifier[ def

[GitHub] spark pull request: [SPARK-14671][ML] Pipeline setStages should ha...

2016-04-26 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12430#issuecomment-214898522 LGTM. FYI, there are some corner cases (maps of arrays for example) in which the `_` shortcut syntax will now do what you want, but it is remote enough

[GitHub] spark pull request: [SPARK-12301][ML] Made all tree and ensemble c...

2016-04-26 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12711#issuecomment-214895988 @jkbradley LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-26 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r61168912 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -62,8 +64,9 @@ class DecisionTree private[spark] (private val

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-26 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r61168881 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala --- @@ -88,17 +88,30 @@ final class DecisionTreeRegressor

[GitHub] spark pull request: [SPARK-14571][ML]Log instrumentation in ALS

2016-04-26 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12560#issuecomment-214890439 In any case, we should also log the success with `instrLog.logSuccess(model)` --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark pull request: [SPARK-14571][ML]Log instrumentation in ALS

2016-04-26 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12560#discussion_r61166736 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -395,6 +395,10 @@ class ALS(@Since("1.4.0") override val u

[GitHub] spark pull request: [SPARK-14571][ML]Log instrumentation in ALS

2016-04-26 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12560#issuecomment-214890157 @wangmiao1981 sorry for the delay. I agree with @MLnick that the developer API of ALS is already brittle enough as it stands. One way to still do it is to add one

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-25 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12426#discussion_r60950867 --- Diff: R/pkg/R/context.R --- @@ -225,3 +225,26 @@ broadcast <- function(sc, object) { setCheckpointDir <- function(sc, dirName) { inv

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-25 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12426#discussion_r60950901 --- Diff: R/pkg/NAMESPACE --- @@ -287,6 +287,7 @@ export("as.DataFrame", "read.json", "read.parquet

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-25 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12426#discussion_r60950880 --- Diff: R/pkg/inst/tests/testthat/test_context.R --- @@ -136,3 +136,8 @@ test_that("sparkJars sparkPackages as comma-separated st

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12536#issuecomment-213021639 Oh sorry, I had not realized that @jkbradley had asked you to remove the parameter. @jkbradley there is no performance issue doing that? --- If your project is set

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12536#issuecomment-213020180 @BenFradet thanks for the PR. I have one comment about performance. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r60618838 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -610,7 +604,9 @@ private[spark] object RandomForest extends Logging

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r60618857 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -668,6 +664,7 @@ private[spark] object RandomForest extends Logging

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r60618828 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -80,6 +80,7 @@ private[spark] object RandomForest extends Logging

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r60618750 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -94,17 +94,33 @@ final class DecisionTreeClassifier

[GitHub] spark pull request: [SPARK-14570][ML] Log instrumentation in Rando...

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12536#discussion_r60618432 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -94,17 +94,33 @@ final class DecisionTreeClassifier

[GitHub] spark pull request: [SPARK-14571][ML]Log instrumentation in ALS

2016-04-21 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12560#issuecomment-213009688 @wangmiao1981 thanks for the PR. Can you see if you could make it work without having to evaluate the input data? --- If your project is set up for it, you can

[GitHub] spark pull request: [SPARK-14571][ML]Log instrumentation in ALS

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12560#discussion_r60616196 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -395,13 +395,21 @@ class ALS(@Since("1.4.0") override val u

[GitHub] spark pull request: [SPARK-14571][ML]Log instrumentation in ALS

2016-04-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12560#discussion_r60615798 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -395,13 +395,21 @@ class ALS(@Since("1.4.0") override val u

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-18 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12426#issuecomment-211506352 Forget my last comment, I found the other tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark pull request: [SPARK-14676] Wrap and re-throw Await.result e...

2016-04-18 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12433#issuecomment-211493984 +1, because we can then add a linter rule that disallows `Await.result` --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-18 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12426#issuecomment-211484675 @felixcheung do you have an example I could follow for testing in R? --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-14569][ML] Log instrumentation in KMean...

2016-04-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12432#discussion_r59950837 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -264,6 +264,9 @@ class KMeans @Since("1.5.0") ( overri

[GitHub] spark pull request: [SPARK-14569][ML] Log instrumentation in KMean...

2016-04-15 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12432#issuecomment-210671834 @keypointt thanks! I have one comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-15 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12426#issuecomment-210646935 Some other changes got merged, removing them --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-7264][ML] Parallel lapply for sparkR

2016-04-15 Thread thunterdb
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/12426 [SPARK-7264][ML] Parallel lapply for sparkR ## What changes were proposed in this pull request? This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function

[GitHub] spark pull request: [SPARK-14605][ML][PYTHON] Changed Python to us...

2016-04-14 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12368#issuecomment-210090347 @jkbradley sorry for the delay, I have one comment. Also, you have some merging issues. --- If your project is set up for it, you can reply to this email and have

[GitHub] spark pull request: [SPARK-14605][ML][PYTHON] Changed Python to us...

2016-04-14 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12368#discussion_r59767952 --- Diff: python/pyspark/ml/pipeline.py --- @@ -19,6 +19,7 @@ if sys.version > '3': basestring = str +unicode = str ---

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-13 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12331#issuecomment-209478984 @jkbradley comments addressed, thanks for the review! Regarding the other messages, I would rather not change them for now, because they are not instrumentation

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12331#discussion_r59557439 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12331#discussion_r59557153 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12331#discussion_r59557115 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12331#discussion_r59557130 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-12 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/12331#discussion_r59478228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -273,6 +273,10 @@ class LogisticRegression @Since

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-12 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/12331#issuecomment-209124853 cc @mengxr --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

2016-04-12 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10806#issuecomment-209081278 I see that the size of the blocks can be tuned and is fairly small by default (128). Out of curiosity, how did you pick this number, instead of the full size

[GitHub] spark pull request: [SPARK-14568][ML] Instrumentation framework fo...

2016-04-12 Thread thunterdb
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/12331 [SPARK-14568][ML] Instrumentation framework for logistic regression ## What changes were proposed in this pull request? This adds extra logging information about a `LogisticRegression

[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-22 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11890#issuecomment-200023253 @mengxr it looks great, I have no comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-21 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11549#issuecomment-199516534 @hhbyyh thanks for the split. I have two small comments. Also, can you include some tests in `test_mllib.R`? It should be very close to the manual testing you did

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r56908061 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala --- @@ -17,15 +17,34 @@ package org.apache.spark.ml.api.r

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-21 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r56907102 --- Diff: R/pkg/R/mllib.R --- @@ -29,15 +29,10 @@ setClass("PipelineModel", representation(model = "jobj")) #' @param formula A

[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...

2016-03-20 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11621#issuecomment-198108360 @BryanCutler thanks for the PR, I only have a few small comments --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [MINOR] Typo fixes

2016-03-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r56595542 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala --- @@ -277,7 +277,7 @@ abstract class DStream[T: ClassTag

[GitHub] spark pull request: [MINOR] Typo fixes

2016-03-20 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r56595456 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala --- @@ -269,7 +269,7 @@ abstract class DStream[T: ClassTag

[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...

2016-03-19 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11621#discussion_r56587628 --- Diff: python/pyspark/ml/classification.py --- @@ -231,6 +232,210 @@ def intercept(self): """ return

[GitHub] spark pull request: [Minor] [ML] When trainingSummary is None, it ...

2016-03-19 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11784#discussion_r56597344 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -508,12 +508,10 @@ class LogisticRegressionModel

[GitHub] spark pull request: [MINOR] Typo fixes

2016-03-19 Thread thunterdb
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11802#issuecomment-198135365 I had some small comments from my side. The changes in `ml` look good. Someone from streaming should take over for the rest. --- If your project is set up

[GitHub] spark pull request: [MINOR] Typo fixes

2016-03-19 Thread thunterdb
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r56595285 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala --- @@ -132,7 +132,7 @@ class StreamingContext private[streaming

<    1   2   3   4   >