[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19381 cc @jkbradley @yanboliang thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144787368 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala --- @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144788264 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala --- @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144790803 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala --- @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144786233 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/FeatureVector.scala --- @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144790384 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144789990 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-16 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r144789015 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala --- @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19433 I made a rough pass. I have only a few issues for now, I haven't go into code details: - The `colStoreInit` currently ignore the `subsampleWeights`, it should be used, isn't it ? I

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 @akopich Yes you can wait this to be merged first. I think @jkbradley will have time to check this next week. Don't worry

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 @akopich LGTM. and do you have time to create a PR to resolve random seed not working issue mentioned by @hhbyyh ? Thanks

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r143727850 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -208,6 +292,26 @@ class LinearRegression @Since("

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r143726258 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -744,11 +754,20 @@ object LinearRegressionModel extends

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r143727489 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala --- @@ -0,0 +1,145 @@ +/* + * Licensed

[GitHub] spark issue #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17968 @gatorsmile Add this to white list! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17968#discussion_r143702830 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -976,14 +976,18 @@ def __getitem__(self, indices): return self.values[i + j

[GitHub] spark pull request #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17968#discussion_r143702719 --- Diff: python/pyspark/mllib/linalg/__init__.py --- @@ -1131,14 +1131,17 @@ def __getitem__(self, indices): return self.values[i + j

[GitHub] spark pull request #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17968#discussion_r143656736 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -976,14 +976,20 @@ def __getitem__(self, indices): return self.values[i + j

[GitHub] spark pull request #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17968#discussion_r143656820 --- Diff: python/pyspark/mllib/linalg/__init__.py --- @@ -1131,14 +1131,20 @@ def __getitem__(self, indices): return self.values[i + j

[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19438#discussion_r143492975 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala --- @@ -58,7 +58,7 @@ class

[GitHub] spark issue #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17968 @gglanzani And you the `ml.linalg.DenseMatrix` looks have the same bug. Can you also update it ? --- - To unsubscribe, e

[GitHub] spark issue #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17968 ping @gglanzani This bug need fixed ASAP. Can you update code when you're free ? Thanks. --- - To unsubscribe, e-mail

[GitHub] spark pull request #17968: [SPARK-9792] Make DenseMatrix equality semantical

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17968#discussion_r143465176 --- Diff: python/pyspark/mllib/linalg/__init__.py --- @@ -1131,14 +1131,21 @@ def __getitem__(self, indices): return self.values[i + j

[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19106 @srowen Any other comments? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r143426157 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,216 @@ +/* + * Licensed

[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

2017-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19433 @smurching Does it still WIP ? If done remove "[WIP]", I will begin review, thanks! --- - To unsubscribe, e-mai

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 Oh, sorry for that, it should waiting @jkbradley to merge it. Don't worry, I will contact him! --- - To unsubscribe, e

[GitHub] spark pull request #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrow...

2017-09-29 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13493#discussion_r141852693 --- Diff: python/pyspark/mllib/tests.py --- @@ -1762,6 +1763,18 @@ def test_pca(self): self.assertEqualUpToSign(pcs.toArray()[:, k

[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2017-09-29 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13493 ping @zjffdu Looks reasonable fix, But pls resolve the conflicts! Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...

2017-09-28 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19106#discussion_r141616141 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -230,21 +230,19 @@ private[ml] object

[GitHub] spark issue #8883: [SPARK-10884] [ML] Support prediction on single instance ...

2017-09-28 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/8883 I take this over in #19381 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-09-28 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19381 [SPARK-10884][ML] Support prediction on single instance for regression and classification related models ## What changes were proposed in this pull request? Support prediction

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 cc @smurching code updated, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r141391733 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -261,17 +290,40 @@ class CrossValidatorModel private[ml

[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19122 ping @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19020 I also vote to combine them as one estimator, here are my two cents: 1, Regression with Huber loss is one kind of linear regression. It makes sense to switch between different loss

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r141357565 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -276,12 +315,32 @@ object TrainValidationSplitModel

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r141356070 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -261,17 +290,40 @@ class CrossValidatorModel private[ml

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 LGTM. Thanks! ping @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands

[GitHub] spark pull request #19350: [SPARK-22126][ML] Fix model-specific optimization...

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19350 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19337 +1 for updating ML API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17503 Can you do some benchmark to show how much improvement this change will bring ? --- - To unsubscribe, e-mail: reviews

[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19337 `require` is better i think --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19337 Why not possible ? `meanGammaChange` is always positive. The loop will become `while (true) {...}` and should use `require` instead of `assert

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 I will update this PR after #19350 get merged. We need to address another issue first. Thanks! --- - To unsubscribe, e

[GitHub] spark issue #19350: [SPARK-22126][ML] Fix model-specific optimization suppor...

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19350 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19337 OK. But I'm afraid accidently negative value cause the code run into dead loop. Maybe adding `assert` is better

[GitHub] spark pull request #18034: [SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18034#discussion_r141016974 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala --- @@ -468,7 +469,16 @@ object LocalLDAModel extends Loader

[GitHub] spark pull request #19350: [SPARK-22126][ML] Fix model-specific optimization...

2017-09-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19350#discussion_r141015509 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala --- @@ -82,5 +85,32 @@ abstract class Estimator[M <: Model[M]] exte

[GitHub] spark pull request #19350: [SPARK-22126][ML] Fix model-specific optimization...

2017-09-26 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19350 [SPARK-22126][ML] Fix model-specific optimization support for ML tuning ## What changes were proposed in this pull request? Push down fitting parallelization code from CrossValidator

[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140690182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -223,20 +223,18 @@ class ImputerModel private[ml

[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140689908 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -223,20 +223,18 @@ class ImputerModel private[ml

[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140675967 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -223,20 +223,18 @@ class ImputerModel private[ml

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 It is better adding more perf test for `OpenHashSet` replacement to avoid perf regression. And I found `reduceByKeyLocally` also use `JHashSet`, I am not sure whether there is some special

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 @ConeyLiu Yes tree aggregate introduce extra shuffle. But it is possible to improve perf when driver total collecting data size from executors are large and there're many partitions. But I

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-23 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18924 @akopich follow-up JIRA created here https://issues.apache.org/jira/browse/SPARK-22111 Can you create follow up PR after this PR being merged

[GitHub] spark issue #19318: [SPARK-22096][ML] use aggregateByKeyLocally in feature f...

2017-09-23 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19318 The `aggregateByKeyLocally` haven't been implemented. Please wait until #19317 finished or merge that PR contents into this, otherwise this PR cannot use. We should not open a PR which cannot

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 Oh I get your point. This is different from `RDD.aggregate`, it directly return Map and avoid shuffling. it seems useful when numKeys is small. But, I check the final `reduce` step

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 Yes. I guess the perf gain is because, this PR use local hashmap which can use unlimited memory, but current spark aggregation impl, will auto spill local hashmap when exceeding a threshold

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 And I have to point out that your impl have high risk causing OOM. The current impl will auto spill when local hashmap is too large and can take advantage of spark auto memory management

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya Yeah the perf gap I only focus on `mean` which can take advantage of codegen. --- - To unsubscribe, e-mail

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 The performance gap issue (compared with RDD version), I create a separated JIRA to track: https://issues.apache.org/jira/browse/SPARK-22105 As the result of discussion with @cloud-fan

[GitHub] spark issue #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of d...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19156 @cloud-fan Can you help review the part of code which related to SQL interface ? --- - To unsubscribe, e-mail: reviews

[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 @jkbradley Sure I tested the backwards compatibility. Part of the reason I changed into `DefaultParamReader.getAndSetParams` is for backwards compatibility

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140402700 --- Diff: python/pyspark/ml/tests.py --- @@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self): loadedModel

[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794 cc @srowen Can you help close this ? We won't need this feature for now. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #19279: [SPARK-22061] [ML]add pipeline model of SVM

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19279 cc @srowen Can you help close this ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r140398933 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -159,12 +159,15 @@ class CrossValidatorSuite

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r140398857 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala --- @@ -160,11 +160,13 @@ class TrainValidationSplitSuite

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140249325 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140188363 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #14325: [SPARK-16692] [ML] Add multi label classification evalua...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14325 ping @gatorsmile Add this to test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #14325: [SPARK-16692] [ML] Add multi label classification evalua...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14325 You should override `override def evaluate(dataset: Dataset[_])` (without the label param). --- - To unsubscribe, e-mail

[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19106 ping @srowen Any other comments ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Chec...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19288 OK I agree. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140111453 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya Thanks very much! Although the perf gap exists (when numCols is large), it won't block this PR. I will create a JIRA to track

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick Yea, you're right, only move `setXXX` to concrete class also work fine. The root cause is the `setXXX` return type. But I think the multi / single logic can be merged, because single

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 @smurching I will update this PR after #19278 merged. Because now this PR depend on that one. Thanks! --- - To unsubscribe

[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 @BryanCutler The reason I add `skipParams` is that, if we don't use `DefaultParamReader.getAndSetParams`, we have to hardcoding all params which are very troublesome. And every time we add new

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139855087 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -303,16 +304,16 @@ object CrossValidatorModel extends

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139854984 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -212,14 +213,12 @@ object CrossValidator extends MLReadable

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139854872 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -122,7 +123,7 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139854853 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -283,6 +282,8 @@ object CrossValidatorModel extends MLReadable

[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19106 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark issue #12066: [SPARK-7424] [ML] ML ClassificationModel should add meta...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/12066 @yanboliang Are you still working on this ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 cc @BryanCutler Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19278 [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug ## What changes were proposed in this pull request? Currently the param of CrossValidator

[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19186 @zhengruifeng You should write a design doc succinctly and link to jira first. After we reach agreement with the design this PR can move forward. Thanks

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 Oh. That's what have done in the old PR #18902 .(Because the RDD version (not in master branch, only personal impl here (sorry for put wrong link, the code link is here: https://github.com

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 Yes you can only move `setInputCols` into the outer class to resolve this issue. But I prefer merge it together. I think we can unify the `transform` method. (First we check param `inputCol

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya Oh, I am not saying the compatibility against old version scala application. What I say is about new version `Bucketizer`, when spark user use java language(not scala language), call

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 @smurching Thanks! I will update later. And note that I will separate part of this PR to a new PR (the separated part will be a bugfix for #16774

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya Scala `with trait` is a complex mechanism and `trait` isn't equivalent to java's `interface`. Scala compiler will precompile and generate many other codes, so java-side code cannot

[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16774 OK. I will separate a PR. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya No, keep the dataframe version code. But I only want to confirm how much performance gap between this and RDD version. (for possible improvements in the future, because in similar test

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r139467949 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r139470472 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya I run the code, you're right, most of time cost on the executedPlan generation (The old version code). thanks! But can you append benchmark comparison with `RDD.aggregate` version

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 Looks not the reason. maybe issues somewhere else. Let me run test later. Thanks! But there is some small issues in test: Don't include gen data time: ``` val start

<    1   2   3   4   5   6   7   8   9   10   >