[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2018-04-25 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998 Closing this, since it has gone quite a long time without any activity. Thanks all the same. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-04-25 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18998 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-04-06 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r179903481 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Si

[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2018-03-03 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736 Closed, as waiting on #18998 is taking too long.

[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2018-03-03 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18736

[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2018-03-03 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 Closed, since its duplicate PR #20632 has been merged.

[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...

2018-03-03 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17503

[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-03-03 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r172015693 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Si

[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-03-03 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r172015644 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Si

[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-02-28 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r171412547 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Si

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 Thank you, @WeichenXu123 . You can also use the condition "include the first bin" to filter left splits. Perhaps it

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 In fact, I'm not sure whether the idea is right, so don't hesitate to correct me. I assume the algorithm requires O(N^2) complexity

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 Hi, I wrote a demo in Python. I'll be happy if it is useful. For N bins, say `[x_1, x_2, ..., x_N]`, since every split either contains `x_1` or does not, we can choose the half
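
The comment above is truncated and the author's own demo is not shown here. As a rough, independent sketch of the stated idea (not the code from the PR): each way of splitting N unordered bins into two non-empty sides appears twice if both sides are enumerated, so it is enough to keep `x_1` on the left side and vary the rest. The function name `unordered_splits` is hypothetical, and the bins are assumed to be distinct.

```python
from itertools import combinations


def unordered_splits(bins):
    """Yield each (left, right) partition of distinct `bins` exactly once.

    The first bin is pinned to the left side, which removes the mirror-image
    duplicate of every split. Hypothetical helper, not Spark's implementation.
    """
    first, rest = bins[0], tuple(bins[1:])
    # k == len(rest) is excluded because it would leave the right side empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(b for b in rest if b not in extra)
            yield left, right


splits = list(unordered_splits(("x_1", "x_2", "x_3")))
for left, right in splits:
    print(left, "|", right)
```

For N bins this yields 2^(N-1) - 1 splits, half of the 2^N - 2 ordered (left, right) pairs with both sides non-empty.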

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 I believe that unordered features will benefit a lot from the idea; however, I have two questions: 1. I'm a little confused by 964L in `traverseUnorderedSplits`. Is it a backtracking algorithm

[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149313427 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite

[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-11-07 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998 ping @yanboliang

[GitHub] spark pull request #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in ...

2017-09-26 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17383

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Hi, since the work was finished a long time ago, I reviewed it myself. After careful review: since SparseVector is in compressed sparse row format, the only benefit of the PR would

[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-09-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 Hi, @WeichenXu123. As @srowen said, the benefit of this would be for speed at predict time or for model storage. Hence I'm not sure whether a benchmark is really needed for the PR

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-08 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Sure, @WeichenXu123, perhaps one or two weeks from now. Is that OK? By the way, I think using a sparse representation can only reduce memory usage, and it comes at the cost of compute performance

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-06 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Thank you for the comment. Very good question; at least for me, the answer to both questions is no. In most cases, we feed dense raw data into the tree model. However, if large dimensions required

[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-29 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998 Hi, @yanboliang and @srowen. Thanks for your comments. For HashingTF, I agree that it is necessary to migrate its implementation so that new methods can be added easily. Thanks, any

[GitHub] spark pull request #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for...

2017-08-26 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18120 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-08-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 Hi, @yanboliang. Do you have time to take a look first? Thanks very much.

[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998 #18736 depends on this PR. To keep the `setxxx` methods consistent between Scala and Python, as @yanboliang suggested, it is better to migrate the HashingTF implementation from mllib to ml

[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998 Hi, @srowen . Could you take a look at the PR? Thanks.

[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-18 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998 cc @yanboliang @WeichenXu123 who I believe are interested in this PR. Could you take a look please?

[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2017-08-18 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18998 [SPARK-21748][ML] Migrate the implementation of HashingTF from MLlib to ML ## What changes were proposed in this pull request? Migrate the implementation of HashingTF from MLlib to ML

[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2017-08-15 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736 Sure, @yanboliang . Thanks for your suggestion. I'll work on it later, perhaps next week. Is it OK?

[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2017-08-13 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736 @yanboliang Hi, yanbo. Could you help review the PR? Thanks.

[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-10 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132618802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Si

[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-09 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132131171 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Si

[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-08-08 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18763

[GitHub] spark issue #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest should suppo...

2017-08-08 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18763 Thanks, all.

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-08 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Sure, thanks, @yanboliang !

[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-08-08 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18764

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-07 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Thanks, @yanboliang @gatorsmile

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-06 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 @SparkQA Please run the tests.

[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-08-05 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r131529768 --- Diff: python/pyspark/ml/classification.py --- @@ -1423,7 +1425,18 @@ def _fit(self, dataset): numClasses = int(dataset.agg({labelCol

[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-08-05 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r131529693 --- Diff: python/pyspark/ml/classification.py --- @@ -1344,7 +1346,19 @@ def _fit(self, dataset): numClasses = int(dataset.agg({labelCol

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-04 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 @yanboliang Thanks, yanbo. I am not familiar with Python 2.6, which is too outdated.

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-04 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Test failures in pyspark.ml.tests with python2.6, but I don't have the environment.

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-03 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Test failures in pyspark.ml.tests with python2.6, but I don't have the environment.

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-01 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Jenkins, test this please.

[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-07-31 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Thanks, @yanboliang . Could you give a hand, @srowen ?

[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130213337 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite

[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130202540 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite

[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch-2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130200461 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -157,6 +157,16 @@ class OneVsRestSuite extends SparkFunSuite

[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r130200379 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -143,6 +144,16 @@ class OneVsRestSuite extends SparkFunSuite

[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r130200288 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -33,6 +33,7 @@ import

[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18764 [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.0. ## What changes were proposed in this pull request

[GitHub] spark pull request #18763: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-28 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18763 [SPARK-21306][ML] OneVsRest should support setWeightCol for branch-2.1 The PR is related to #18554, and is modified for branch 2.1. ## What changes were proposed in this pull request

[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-26 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r129562237 --- Diff: python/pyspark/ml/tests.py --- @@ -1255,6 +1255,24 @@ def test_output_columns(self): output = model.transform(df

[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-26 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r129562189 --- Diff: python/pyspark/ml/classification.py --- @@ -1517,20 +1517,22 @@ class OneVsRest(Estimator, OneVsRestParams, MLReadable, MLWritable

[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should support setWeightCol

2017-07-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 ping @holdenk @yanboliang

[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-07-26 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18736 [SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF ## What changes were proposed in this pull request? Add indexOf method for ml.feature.HashingTF. The PR is a hotfix
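
For readers unfamiliar with what an `indexOf` method on a hashing term-frequency transformer would return, here is a minimal sketch of the hashing trick under stated assumptions: `zlib.crc32` stands in for the hash function purely because it is deterministic and in the standard library (it is not the hash Spark uses), and the names `index_of` and `term_frequencies` are hypothetical rather than the API proposed in the PR.

```python
import zlib


def index_of(term, num_features=1 << 18):
    """Map a term to a bucket index in [0, num_features).

    crc32 is an illustrative stand-in hash, NOT the hash used by Spark.
    """
    return zlib.crc32(term.encode("utf-8")) % num_features


def term_frequencies(terms, num_features=1 << 18):
    """Count terms per hashed bucket; colliding terms share one bucket."""
    counts = {}
    for term in terms:
        i = index_of(term, num_features)
        counts[i] = counts.get(i, 0.0) + 1.0
    return counts


tf = term_frequencies(["spark", "ml", "spark"], num_features=16)
print(index_of("spark", 16), tf)
```

Exposing the index lookup lets a caller find out which output column a given term contributes to, which is otherwise opaque after hashing.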

[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-18 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r128158473 --- Diff: python/pyspark/ml/tests.py --- @@ -1255,6 +1255,17 @@ def test_output_columns(self): output = model.transform(df

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-18 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127972263 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -598,8 +598,23 @@ class LogisticRegression @Since("

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-17 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127874833 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala --- @@ -32,40 +34,45 @@ private[ml] trait

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-17 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127873828 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -598,8 +598,23 @@ class LogisticRegression @Since("

[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r126863072 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -317,7 +318,12 @@ final class OneVsRest @Since("

[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126646511 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -36,7 +36,8 @@ import org.apache.spark.util.collection.OpenHashMap

[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126645714 --- Diff: python/pyspark/ml/feature.py --- @@ -3058,26 +3035,37 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM

[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126643882 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -460,16 +460,16 @@ object LinearRegression extends

[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126642928 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType

[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-11 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 @srowen @yanboliang Could you help review the PR? Thanks.

[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 I'm not familiar with R; I used grep to search for "OneVsRest" and found nothing. Hence it seems that nothing needs to be done for the R part.

[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 @SparkQA test again, please.

[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126050849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends

[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 @lins05 Thanks, that is a reasonable suggestion; I will fix it later.

[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126026388 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends

[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126023986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125860650 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0"

[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...

2017-07-06 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18554 [SPARK-21306][ML] OneVsRest should cache weightCol if necessary ## What changes were proposed in this pull request? Cache weightCol if the classifier inherits the HasWeightCol trait

[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 @SparkQA Jenkins, run tests again, please.

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-05 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125763918 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0"

[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 I don't know how to write a unit test for the PR. Is it necessary?

[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 Good idea!

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-05 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125584572 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0"

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-07-05 Thread facaiy
GitHub user facaiy reopened a pull request: https://github.com/apache/spark/pull/17383 [SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data ## What changes were proposed in this pull request? DecisionTree should take advantage of sparse feature vectors

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-07-04 Thread facaiy
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17383

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-04 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125539040 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0"

[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-07-04 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 @jkbradley Would you have time to review the PR? I believe it will be a small improvement for prediction. Thanks.

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-04 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125398010 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0"

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-03 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18523 [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type ## What changes were proposed in this pull request? add the column name in the exception which is raised

[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-23 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18288 Yes, here is an example: ```scala val df = spark.read.format("libsvm") .option("numFeatures", "780") .load("data/mllib/sample_libsvm_data.

[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18288 You might be mistaken. The aim of the code here is to encourage users to specify `numFeatures` in every case, rather than to encourage them to use only one file.

[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18288#discussion_r123474003 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends

[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18288 In my opinion, `numFeatures` is vital for sparse data. Say our feature space is actually 1000-dimensional, while in a small training set the largest feature index is 990. It is dangerous (or wrong) to train a 990
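
The concern in the comment above can be illustrated with a minimal sketch in plain Python (this is not Spark code). It assumes LIBSVM-style text lines with 1-based `index:value` pairs; the helper `infer_num_features`, the sample rows, and the figure of 1000 true dimensions are all hypothetical and chosen only for illustration.

```python
def infer_num_features(lines):
    """Return the largest 1-based feature index found in LIBSVM-style lines."""
    max_index = 0
    for line in lines:
        # The first token is the label; the rest are index:value pairs.
        for item in line.split()[1:]:
            index, _value = item.split(":")
            max_index = max(max_index, int(index))
    return max_index


# Hypothetical small training sample: no row uses an index above 990.
small_sample = ["1 3:0.5 990:1.0", "0 12:2.0 401:0.1"]

# Assumed true dimensionality of the feature space (illustrative only).
TRUE_NUM_FEATURES = 1000

inferred = infer_num_features(small_sample)

# A later row that uses feature 1000 would not fit a model sized by inference.
later_row_max_index = infer_num_features(["1 1000:1.0"])
fits_inferred_model = later_row_max_index <= inferred
fits_declared_model = later_row_max_index <= TRUE_NUM_FEATURES
```

In this sketch `inferred` is smaller than the assumed true dimension, so `fits_inferred_model` is false while `fits_declared_model` is true, which is the argument for declaring `numFeatures` explicitly rather than inferring it from the data.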

[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-20 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18288#discussion_r122909919 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends

[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-20 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18288#discussion_r122908140 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends

[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-31 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119346663 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,11 @@ def needConversion(self): def toInternal(self, dt): if dt

[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-30 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119263608 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,11 @@ def needConversion(self): def toInternal(self, dt): if dt

[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-30 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18120 Thanks, @BryanCutler. It seems that #17849 copies `Params` from `Estimator` to `Model` automatically, which is pretty useful. However, the `getter` method is still missing and needs to be added

[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-27 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18120 Hi, @keypointt. It's a feature of Python: a doctest serves as both documentation and a unit test.
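
As a rough illustration of that point, the sketch below uses Python's standard `doctest` module. The function `get_max_depth`, its default of 5, and the helper `run_examples` are hypothetical stand-ins, not the PySpark code from the pull request; the docstring examples read as documentation and are also executed as tests.

```python
import doctest


def get_max_depth(params):
    """Return the configured tree depth, falling back to 5 (assumed default).

    >>> get_max_depth({"maxDepth": 3})
    3
    >>> get_max_depth({})
    5
    """
    return params.get("maxDepth", 5)


def run_examples(func):
    """Run the docstring examples of one function and return the tally."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup; globs supplies the name used in examples.
    for test in finder.find(func, func.__name__, module=False,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_examples(get_max_depth)
```

`results.attempted` counts the examples that were executed and `results.failed` counts those whose output did not match the docstring, so a stale example in the documentation shows up as a test failure.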

[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18120 @keypointt Hi, could you help check that this PR is consistent with your #17207? Thanks.

[GitHub] spark pull request #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for...

2017-05-26 Thread facaiy
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18120 [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemble tree model in PySpark ## What changes were proposed in this pull request? add `getMaxDepth` method for ensemble tree models

[GitHub] spark issue #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert)...

2017-05-25 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18058 Resolved. By the way, which one is preferable, rebase or merge?

[GitHub] spark issue #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert)...

2017-05-24 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18058 Hi, I'm not familiar with PySpark. I wonder whether a unit test is needed for verification. If so, how should I check it? Thanks.

[GitHub] spark pull request #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (...

2017-05-24 Thread facaiy
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18058#discussion_r118416434 --- Diff: python/pyspark/ml/fpm.py --- @@ -49,6 +49,32 @@ def getMinSupport(self): return self.getOrDefault(self.minSupport
