[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140689908 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -223,20 +223,18 @@ class ImputerModel private[ml

[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19337 +1 for updating ML API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark pull request #19350: [SPARK-22126][ML] Fix model-specific optimization...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19350 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 LGTM. Thanks! ping @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r141356070 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -261,17 +290,40 @@ class CrossValidatorModel private[ml

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r141357565 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -276,12 +315,32 @@ object TrainValidationSplitModel

[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19020 I also vote to combine them as one estimator, here are my two cents: 1, Regression with Huber loss is one kind of linear regression. It makes sense to switch between different loss

[GitHub] spark issue #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17373 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15435 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Val...

2017-08-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17849 What do you think about this ? @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-08-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 Thanks! I will take a look later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r134449164 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15435 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #19018: [SPARK-21801][SPARKR][TEST] unit test randomly fail with...

2017-08-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19018 ping @felixcheung We can make all R tests for trees deterministic (not only random trees). Leave other problems to separate PR. It would be great to fix it soon, Thanks! --- If your project

[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19065 [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns ## What changes were proposed in this pull request? Add test for prediction using

[GitHub] spark issue #19065: [SPARK-21729][ML][TEST] Generic test for ProbabilisticCl...

2017-08-28 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19065 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...

2017-08-28 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17014#discussion_r135534873 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -85,6 +86,10 @@ abstract class Predictor[ M <: PredictionMo

[GitHub] spark issue #19018: [SPARK-21801][SPARKR][TEST] unit test randomly fail with...

2017-08-28 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19018 @felixcheung In Jenkins Log I only found Random forest and Decision Tree failed, random forest failed more frequently. thanks! --- If your project is set up for it, you can reply

[GitHub] spark pull request #19026: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-28 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19026 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark pull request #19072: [SPARK-17133][ML][FOLLOW-UP] Add convenient metho...

2017-08-28 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19072 [SPARK-17133][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary ## What changes were proposed in this pull request? add an "asB

[GitHub] spark issue #19018: [SPARK-21801][SPARKR][TEST] unit test randomly fail with...

2017-08-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19018 @felixcheung This error occur in the OneHotEncoder inside the RFormula I think. Only OneHotEncoder will print this error message after I search the project... --- If your project is set up

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15435 Jenkins test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

2017-08-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19029#discussion_r135186430 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -438,6 +438,10 @@ private[ml] object SummaryBuilderImpl extends Logging

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15435 Jenkins test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #18992: [SPARK-19762][ML][FOLLOWUP]Add necessary comments...

2017-08-20 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18992#discussion_r134109929 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala --- @@ -57,6 +61,11 @@ private[ml] class

[GitHub] spark issue #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19078 cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19078 [SPARK-21862] Add overflow check in PCA ## What changes were proposed in this pull request? add overflow check in PCA, otherwise it is possible to throw `NegativeArraySizeException

[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19078#discussion_r135751225 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala --- @@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4

[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...

2017-08-28 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17014#discussion_r135695930 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -85,6 +86,10 @@ abstract class Predictor[ M <: PredictionMo

[GitHub] spark issue #19072: [SPARK-17133][ML][FOLLOW-UP] Add convenient method `asBi...

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19072 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-31 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 I think about this double-cache issue for a few days. One big problem is that, we are hard get precise storage level info. For example, we may add `map` transform on cached dataset

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r136850665 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,23 @@ def _fit(self, dataset): randCol = self.uid + "_rand"

[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19108 cc @yanboliang Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 Sure. I will create JIRA after this perf gap is confirmed. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 hmm... that's interesting. So I found performance gap between dataframe codegen aggregation and the simple RDD aggregation. I will discuss with SQL team for this later. Thanks! --- If your

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng `KMeans` regarded as a bugfix(SPARK-21799) because the double-cache issue is introduced in 2.2 and cause perf regression. Other algos also have the same issue, but the issue

[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19110 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136719383 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136719561 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -87,37 +91,63 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136719485 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-04 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19122 [SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark

[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...

2017-09-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19106 [SPARK-21770][ML] ProbabilisticClassificationModel fix corner case: normalization of all-zero raw predictions ## What changes were proposed in this pull request

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng @jkbradley I create a PR #19107 for quick fix `KMeans` perf regression bug. This PR can continue to work on adding Param of `handlePersistence` which is not so emergent

[GitHub] spark pull request #19107: [SPARK-21799][ML] Fix `KMeans` performance regres...

2017-09-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19107 [SPARK-21799][ML] Fix `KMeans` performance regression caused by double-caching ## What changes were proposed in this pull request? Fix `KMeans` performance regression caused

[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19107 cc @jkbradley @smurching This should be merged and backport to 2.2 ASAP! Other improvement (e.g adding `handlePersistence` param) can be left in this PR #17014 --- If your project

[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16864 @Bcpoole Thanks for this PR. But I want to ask which place in spark can this extension apply to ? e.g. can this algo used in join cost estimating or somewhere else

[GitHub] spark pull request #19108: [SPARK-21898][ML] Feature parity for KolmogorovSm...

2017-09-02 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19108 [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTest in MLlib ## What changes were proposed in this pull request? Feature parity for KolmogorovSmirnovTest in MLlib

[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19106#discussion_r136696592 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -245,6 +245,13 @@ private[ml] object

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136071530 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HuberAggregatorSuite.scala --- @@ -0,0 +1,170 @@ +/* + * Licensed

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136067839 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136069679 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136072548 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -146,6 +161,8 @@ class LinearRegressionSuite

[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135782045 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -91,4 +94,54 @@ object

[GitHub] spark issue #19065: [SPARK-21729][ML][TEST] Generic test for ProbabilisticCl...

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19065 @smurching Code updated, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #19018: [SPARK-21801][SPARKR][TEST] unit test randomly fail with...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19018 cc @felixcheung I encounter RTest failed again even when this seed added. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81350/console error: ``` Failed

[GitHub] spark pull request #19110: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-09-02 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19110 [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark ## What changes were proposed in this pull request? Added tunable parallelism

[GitHub] spark issue #18281: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18281 I take this PR over in #19110 because the original author is busy but we need merge this PR soon. Thanks! --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18538 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19111 I found `NaiveBayes` also possible to fail. Founded here #18538 . Hope this can works! https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 +1 for using Dataframe-based version code. @zhengruifeng One thing I want to confirm is that, I check your testing code, both RDD-based version and Dataframe-based version code

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136243309 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -120,6 +120,33 @@ class CrossValidatorSuite

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-08-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136482755 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -120,6 +120,33 @@ class CrossValidatorSuite

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-31 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @smurching Yes this should be added as a `ml.Param`, we should not add as an argument. @zhengruifeng Would you mind update the PR according to our discussion result above ? Make

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r136536168 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r136532646 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,395 @@ +/* + * Licensed to the Apache

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 cc @zhengruifeng I update my comment you need check again, thanks! I read the PR again, it still do not resolve double-caching issue in KMeans. in KMean, your code

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 Oh, sorry for that, it should waiting @jkbradley to merge it. Don't worry, I will contact him! --- - To unsubscribe, e

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 @akopich LGTM. and do you have time to create a PR to resolve random seed not working issue mentioned by @hhbyyh ? Thanks

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r143426157 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,216 @@ +/* + * Licensed

[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19433 @smurching Does it still WIP ? If done remove "[WIP]", I will begin review, thanks! --- - To unsubscribe, e-mai

[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19106 @srowen Any other comments? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145371903 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("

[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145371704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col

[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145369694 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("

[GitHub] spark issue #10466: [SPARK-12375] [ML] add handleinvalid for vectorindexer

2017-10-23 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/10466 @hhbyyh Do you get time to continue this PR ? thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #19558: [SPARK-22332][ML][TEST] Fix NaiveBayes unit test ...

2017-10-23 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19558 [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic) ## What changes were proposed in this pull request? Fix NaiveBayes unit

[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-23 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh Thanks for your comments! > Another option is that to support all bytes[], short[], int[], float[] and double[] as data storage type candidates, and switch among them accord

[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

2017-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19516 I thought about this, because `ChiSqSelector` only work for categorical features, after processing it marked features without attributes as `NominalAttribute` is reasonable, the problem

[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19516#discussion_r146513000 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml

[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19516#discussion_r146531755 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml

[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19516#discussion_r146546628 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml

[GitHub] spark pull request #17972: [SPARK-20723][ML]Add intermediate storage level t...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17972#discussion_r150470524 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -129,7 +129,7 @@ private[recommendation] trait ALSModelParams

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r150445283 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField

[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r150486465 --- Diff: mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r150482756 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -476,6 +476,10 @@ class DenseMatrix @Since("

[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19753 @smurching The getter/setter is included in the super class `HasHandleInvalid`. I can add test

[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19588 Python API jira created here: https://issues.apache.org/jira/browse/SPARK-22521 --- - To unsubscribe, e-mail: reviews

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @viirya @MLnick Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r151055221 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 I want to ask, for option `StringIndexer.frequencyDesc`, in the case existing two labels which have the same frequency, which of them will be put in the front ? If this is not specified

[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19753 [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API ## What changes were proposed in this pull request? Add python api

[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151311569 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja

[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151311832 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 I checked the failed tests in sparkR. There's some trouble in the failed `glm` sparkR tests. These tests compare sparkR glm and R-lib glm results on test data "iris", b

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 Jenkins retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @MLnick Will RDD "count by value" aggregation be deterministic ? e.g., 2 RDD with the same elements, but with different element order and different partition number, will `rdd.co

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @MLnick Ah, I don't express it exactly, the first case, what I mean is, sort by frequency, but if the case frequency equal, sort by alphabet

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-23 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @viirya @MLnick Code updated. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...

2017-11-23 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19758#discussion_r152749515 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeSplitUtilsSuite.scala --- @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @MLnick How about this way: The case "fequencyAsc/Desc", sort first by frequency and then by alphabet, The case "alphabetAsc/Desc", sort by alphabet (and if

<    2   3   4   5   6   7   8   9   10   11   >