[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya I think it is possible. A similar example is the `HasRegParam` trait: `setRegParam` is not put in the trait but moved into the concrete estimator/transformer classes, which should be for the same reason
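The pattern mentioned in the comment above (getter in the shared trait, setter only in the concrete class) can be sketched roughly as follows. This is a minimal plain-Python illustration, not Spark code: `ToyRegression`, `ToyRegressionModel`, and the snake_case method names are hypothetical, and the motivation shown (a fitted model shares the getter without exposing a setter, and setters return the concrete type for chaining) is an assumption, since the thread does not spell out the reason.

```python
class HasRegParam:
    """Shared mixin: holds the param and its getter only (hypothetical sketch)."""

    def __init__(self):
        self._reg_param = 0.0

    def get_reg_param(self):
        return self._reg_param


class ToyRegression(HasRegParam):
    """Hypothetical concrete estimator that chooses to expose the setter."""

    def set_reg_param(self, value):
        if value < 0:
            raise ValueError("reg_param must be non-negative")
        self._reg_param = float(value)
        return self  # return the concrete instance so setters can be chained


class ToyRegressionModel(HasRegParam):
    """Hypothetical fitted model: shares the getter but exposes no setter."""

    def __init__(self, reg_param):
        super().__init__()
        self._reg_param = reg_param
```

With this layout, `ToyRegression().set_reg_param(0.1)` chains on the estimator, while `ToyRegressionModel` can only report the value it was fitted with.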

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya Yes. But if there is some better design I will be happy to listen. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 Great! That's it. thanks!

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya I guess the reason is that the old PR version used `df.withColumn(..).withColumn(..).withColumn(..)`, and the long df chain prevented shuffle reuse... but now you merge them into one step

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 ok to test.

[GitHub] spark issue #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 ParamMaps...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19152 @marktab You should close the merged PR. Thanks!

[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

2017-09-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18748#discussion_r139161851 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -356,6 +371,40 @@ class ALSModel private[ml

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 Jenkins, test this please.

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 @jkbradley I split this PR and removed the code for "dump models to disk", so the PR will be smaller and easier to review. Once this PR is merged, I will create a follow-up PR for "dump

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 Oh... sorry about that. I integrated @hhbyyh's old PR into this new one because I found that the "dump models to disk" and "collect models" code seem to be cohesive and s

[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r138763970 --- Diff: python/pyspark/ml/evaluation.py --- @@ -328,6 +329,87 @@ def setParams(self, predictionCol="prediction", label

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 ping @akopich This is a very useful improvement. Can you update the code while you're at it?

[GitHub] spark issue #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of datafr...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19156 ping @yanboliang Any other comments? We need to merge this before the 2.3 release.

[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19204 Jenkins, test this please.

[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138577518 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -483,24 +488,24 @@ class LogisticRegression @Since

[GitHub] spark issue #19214: [SPARK-21027][MINOR][FOLLOW-UP] add missing since tag

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19214 cc @srowen Thanks!

[GitHub] spark pull request #19214: [SPARK-21027][MINOR][FOLLOW-UP] add missing since...

2017-09-12 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19214 [SPARK-21027][MINOR][FOLLOW-UP] add missing since tag ## What changes were proposed in this pull request? add missing since tag for `setParallelism` in #19110 ## How was

[GitHub] spark pull request #19110: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19110#discussion_r138519719 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -297,6 +298,16 @@ final class OneVsRest @Since("

[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19122 @BryanCutler code updated. thanks!

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r138518283 --- Diff: python/pyspark/ml/tuning.py --- @@ -193,7 +194,8 @@ class CrossValidator(Estimator, ValidatorParams, MLReadable, MLWritable

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r138518235 --- Diff: python/pyspark/ml/tuning.py --- @@ -208,23 +210,23 @@ class CrossValidator(Estimator, ValidatorParams, MLReadable, MLWritable

[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17862 @hhbyyh Test result looks good! OWLQN takes longer for each iteration because, in each iteration's line search, it makes more passes on da

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r138391134 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala --- @@ -150,20 +150,14 @@ private[ml] object ValidatorParams

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r138393318 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -212,14 +238,12 @@ object CrossValidator extends MLReadable

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r138389265 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -261,17 +290,40 @@ class CrossValidatorModel private[ml

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 cc @jkbradley

[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18313 @hhbyyh I apologize; your PR is valuable (in the case where the model list is very big). But your PR is now stale, so I integrated it into my new PR #19208. Would you mind taking a

[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16774 @BryanCutler @MLnick I found a bug in this PR: after saving an estimator (CV or TVS) and then loading it again, the "Parallelism" setting is lost. But I fixed this in #19208

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-12 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19208 [SPARK-21087] [ML] CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala ## What changes were proposed in this pull request? 1. We add a parameter

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r138249937 --- Diff: python/pyspark/ml/param/_shared_params_code_gen.py --- @@ -152,6 +152,8 @@ def get$Name(self): ("varianceCol", "

[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19107 OK. Thanks @zhengruifeng. I will close this PR.

[GitHub] spark pull request #19107: [SPARK-21799][ML] Fix `KMeans` performance regres...

2017-09-11 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19107

[GitHub] spark issue #9183: [SPARK-11215] [ML] Add multiple columns support to String...

2017-09-10 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/9183 @minixalpha Sorry for the delay; I have been too busy recently. But I will try to finish and commit my new PR once I get time. Thanks

[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19110 Thanks @MLnick @BryanCutler. Would you mind helping review another similar PR, #19122? We need some other features that are blocked on that PR. Thanks

[GitHub] spark pull request #19172: [SPARK-21856] Add probability and rawPrediction t...

2017-09-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19172#discussion_r137922397 --- Diff: python/pyspark/ml/tests.py --- @@ -1655,6 +1655,25 @@ def test_multinomial_logistic_regression_with_bound(self

[GitHub] spark pull request #19172: [SPARK-21856] Add probability and rawPrediction t...

2017-09-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19172#discussion_r137922474 --- Diff: python/pyspark/ml/classification.py --- @@ -1425,11 +1425,13 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol

[GitHub] spark pull request #19172: [SPARK-21856] Add probability and rawPrediction t...

2017-09-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19172#discussion_r137922379 --- Diff: python/pyspark/ml/tests.py --- @@ -1655,6 +1655,25 @@ def test_multinomial_logistic_regression_with_bound(self

[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...

2017-09-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19172 Jenkins, test this please.

[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

2017-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18748#discussion_r137815796 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -356,6 +371,40 @@ class ALSModel private[ml

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r137800867 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,216 @@ +/* + * Licensed to the

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r137805843 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,216 @@ +/* + * Licensed to the

[GitHub] spark pull request #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of...

2017-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r137740578 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -94,46 +97,86 @@ object Summarizer extends Logging { * - min

[GitHub] spark issue #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of datafr...

2017-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19156 Thanks @thunterdb, code updated.

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15770 @wangmiao1981 Sorry for the delay, I will take a look later, thanks!

[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19107 cc @smurching Thanks!

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17383 @facaiy Can you do a benchmark first (by generating random test data)? Then we can see how much this can speed up

[GitHub] spark pull request #16158: [SPARK-18724][ML] Add TuningSummary for TrainVali...

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16158#discussion_r137546848 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala --- @@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends

[GitHub] spark pull request #16158: [SPARK-18724][ML] Add TuningSummary for TrainVali...

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16158#discussion_r137545402 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala --- @@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends

[GitHub] spark pull request #16158: [SPARK-18724][ML] Add TuningSummary for TrainVali...

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16158#discussion_r137542479 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala --- @@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends

[GitHub] spark issue #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of datafr...

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19156 cc @yanboliang @thunterdb Thanks!

[GitHub] spark pull request #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of...

2017-09-07 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19156 [SPARK-19634][FOLLOW-UP][ML] Improve interface of dataframe vectorized summarizer ## What changes were proposed in this pull request? Make several improvements in dataframe

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-06 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r137264588 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,23 @@ def _fit(self, dataset): randCol = self.uid + "_rand"

[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19110 @MLnick Conflict resolved. Thanks!

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r137175343 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,24 @@ def _fit(self, dataset): randCol = self.uid + "_rand"

[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...

2017-09-05 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19020 Looks good. cc @jkbradley Thanks!

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r136934638 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,24 @@ def _fit(self, dataset): randCol = self.uid + "_rand"

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r136933807 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,23 @@ def _fit(self, dataset): randCol = self.uid + "_rand"

[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794 +1 @jkbradley For now it is better to keep the current implementation of the 4 meta-algorithms in pyspark. @yinxusen Would you mind closing this PR? But I still appreciate your contribution

[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19108 cc @yanboliang Thanks!

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r136850665 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,23 @@ def _fit(self, dataset): randCol = self.uid + "_rand"

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-04 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19122 [SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 Sure. I will create a JIRA after this perf gap is confirmed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 hmm... that's interesting. So I found a performance gap between dataframe codegen aggregation and the simple RDD aggregation. I will discuss this with the SQL team later. Thanks! --- If

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng `KMeans` is regarded as a bugfix (SPARK-21799) because the double-cache issue was introduced in 2.2 and causes a perf regression. Other algos also have the same issue, but the issue

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 +1 for using the DataFrame-based version of the code. @zhengruifeng One thing I want to confirm: I checked your testing code, and both the RDD-based and DataFrame-based versions will

[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19111 I found that `NaiveBayes` can also fail. Found here: #18538. Hope this works! https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console

[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18538 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136719561 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -87,37 +91,63 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136719485 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136719383 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"

[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19110 Jenkins, test this please.

[GitHub] spark issue #19018: [SPARK-21801][SPARKR][TEST] unit test randomly fail with...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19018 cc @felixcheung I encountered an RTest failure again even with this seed added. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81350/console error: ``` Failed

[GitHub] spark issue #18281: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18281 I am taking this PR over in #19110 because the original author is busy and we need to merge it soon. Thanks!

[GitHub] spark pull request #19110: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-09-02 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19110 [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark ## What changes were proposed in this pull request? Added tunable parallelism to

[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...

2017-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19106#discussion_r136696592 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -245,6 +245,13 @@ private[ml] object

[GitHub] spark pull request #19108: [SPARK-21898][ML] Feature parity for KolmogorovSm...

2017-09-02 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19108 [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTest in MLlib ## What changes were proposed in this pull request? Feature parity for KolmogorovSmirnovTest in MLlib

[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19107 cc @jkbradley @smurching This should be merged and backported to 2.2 ASAP! Other improvements (e.g. adding the `handlePersistence` param) can be left to PR #17014

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng @jkbradley I created PR #19107 as a quick fix for the `KMeans` perf regression bug. This PR can continue to work on adding the `handlePersistence` Param, which is not so urgent

[GitHub] spark pull request #19107: [SPARK-21799][ML] Fix `KMeans` performance regres...

2017-09-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19107 [SPARK-21799][ML] Fix `KMeans` performance regression caused by double-caching ## What changes were proposed in this pull request? Fix `KMeans` performance regression caused by

[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...

2017-09-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19106 [SPARK-21770][ML] ProbabilisticClassificationModel fix corner case: normalization of all-zero raw predictions ## What changes were proposed in this pull request
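The corner case named in the PR title above can be illustrated with a small plain-Python sketch. Normalizing raw predictions means dividing each entry by the sum, which breaks when every entry is zero. The function name `normalize_to_probabilities` is hypothetical, and the uniform fallback shown here is only one plausible handling, assumed for illustration; the truncated description does not say which behavior the PR settled on.

```python
def normalize_to_probabilities(raw):
    """Scale non-negative raw predictions so that they sum to 1."""
    if not raw:
        return []
    if any(v < 0 for v in raw):
        raise ValueError("raw predictions must be non-negative")
    total = sum(raw)
    if total == 0:
        # Corner case: all-zero input. Falling back to a uniform distribution
        # is an assumption of this sketch, not a confirmed detail of the PR.
        return [1.0 / len(raw)] * len(raw)
    return [v / total for v in raw]
```

Without the `total == 0` branch, an all-zero vector would raise `ZeroDivisionError` in Python (or yield NaN entries with floating-point array division), which is the failure mode the corner case refers to.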

[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16864 @Bcpoole Thanks for this PR. But I want to ask where in Spark this extension can be applied. E.g., can this algo be used in join cost estimation or somewhere else? But if there is no

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r136536168 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed to the

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r136532646 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,395 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-08-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136482755 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -120,6 +120,33 @@ class CrossValidatorSuite

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-31 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @smurching Yes, this should be added as an `ml.Param`; we should not add it as an argument. @zhengruifeng Would you mind updating the PR according to the result of our discussion above? Make

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-31 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 I have been thinking about this double-cache issue for a few days. One big problem is that it is hard to get precise storage level info. For example, we may add a `map` transform on a cached dataset and then

[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136243309 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -120,6 +120,33 @@ class CrossValidatorSuite

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136071530 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HuberAggregatorSuite.scala --- @@ -0,0 +1,170 @@ +/* + * Licensed to the

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136072548 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -146,6 +161,8 @@ class LinearRegressionSuite

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136067839 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed to the

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r136069679 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed to the

[GitHub] spark pull request #19078: [SPARK-21862][ML] Add overflow check in PCA

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19078#discussion_r136032375 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala --- @@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4

[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-08-30 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17862 +1 for adding tests on large-scale datasets. Another thing I would like to see: you can compare the final loss value on the resulting coefficients between LIBLINEAR (scikit-learn), LBFGS

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng OK, so the `KMeans` part of this PR still works. No change needed, I think. --- If your project is set up for it, you can reply to this email and have your reply appear on

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 cc @zhengruifeng I updated my comment; please check again, thanks! I read the PR again, and it still does not resolve the double-caching issue in KMeans. In KMeans, your code

[GitHub] spark issue #19065: [SPARK-21729][ML][TEST] Generic test for ProbabilisticCl...

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19065 @smurching Code updated, thanks!

[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135782045 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -91,4 +94,54 @@ object

[GitHub] spark issue #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19078 cc @jkbradley

[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19078#discussion_r135751225 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala --- @@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4

[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19078 [SPARK-21862] Add overflow check in PCA ## What changes were proposed in this pull request? Add an overflow check in PCA; otherwise it is possible to throw `NegativeArraySizeException
