[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998

Closing it, since quite a long time has passed without any activity. Thanks all the same.
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18998
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r179903481

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

@sethah Hi, thanks to all for your reviews and comments. However, since it has been quite a long time with no activity, would it be a good idea to close the PR?
[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736

Closed, as #18998 has been waiting too long.
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18736
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503

Closed, since its duplicate PR #20632 has been merged.
[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17503
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r172015693

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

By the way, as the PR doesn't change the function's behavior, I think it should be correct if all tests pass, right?
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r172015644

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

Sorry for the delay. Yes, the original implementation uses a mutable map in the mllib package, which is different from this PR. However, I think the PR is clearer, and their logic is the same. Which one was faster in your test?
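For reference, a minimal self-contained sketch of the mutable-map accumulation under discussion (my own illustration, not code from either implementation; the hash function here is a stand-in for whatever hashing the transformer actually uses):

```scala
import scala.collection.mutable

// Accumulate term frequencies keyed by hashed index; with binary = true the
// count saturates at 1.0, mirroring HashingTF's `binary` option.
def termFrequencies(terms: Seq[Any], binary: Boolean)(indexOf: Any => Int): Map[Int, Double] = {
  val counts = mutable.Map.empty[Int, Double]
  terms.foreach { term =>
    val i = indexOf(term)
    counts(i) = if (binary) 1.0 else counts.getOrElse(i, 0.0) + 1.0
  }
  counts.toMap
}

// Toy hash for illustration only: bucket by string length modulo 8.
val tf = termFrequencies(Seq("a", "bb", "a"), binary = false)(t => t.toString.length % 8)
// tf == Map(1 -> 2.0, 2 -> 1.0)
```

A single mutable map avoids building an intermediate sequence of ids and regrouping it, which is presumably where any speed difference between the two versions would come from.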
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r171412547

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

Sorry, I can't remember all the details exactly, since the PR is quite old. If my memory is correct, the ML implementation keeps consistent with the MLlib (old) one. As the "TODO" above says:

> Make the hashingTF.transform natively in ml framework to avoid extra conversion.
[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666

Thank you, @WeichenXu123. You could also use the condition "includes the first bin" to filter the left splits; perhaps that is better (a sketch follows below).
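A minimal sketch of that first-bin filter (my own illustration using a bitmask enumeration, not the PR's code): every non-empty subset that excludes the first bin is taken as a left split, which yields exactly the 2^(N-1) - 1 distinct splits.

```scala
// Enumerate candidate left splits for unordered categories: keep only the
// non-empty subsets whose bitmask excludes bit 0 (the first bin).
val bins = Seq("a", "b", "c")
val leftSplits = (1 until (1 << bins.length))
  .filter(mask => (mask & 1) == 0)  // drop subsets containing the first bin
  .map(mask => bins.indices.filter(i => (mask & (1 << i)) != 0).map(bins))
leftSplits.foreach(println)
// Vector(b), Vector(c), Vector(b, c)
```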
[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666

In fact, I'm not sure whether the idea is right, so don't hesitate to correct me. I assume the algorithm requires O(N^2) complexity.
[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666

Hi, I wrote a demo in Python; I'll be happy if it is useful. For N bins, say `[x_1, x_2, ..., x_N]`, every split either contains `x_1` or not, so we can choose the half of the splits that do not contain `x_1` as the left splits. If I understand correctly, the left splits are then exactly all non-empty combinations of the remaining bins, `[x_2, x_3, ..., x_N]`. The problem can be solved by the [backtracking algorithm](https://en.wikipedia.org/wiki/Backtracking). Please correct me if I'm wrong. Thanks very much.

```python
#!/usr/bin/env python


def gen_splits(bins):
    if len(bins) == 1:
        return []  # a single bin admits no split
    results = []
    partial_res = []
    gen_splits_iter(1, bins, partial_res, results)
    return results


def gen_splits_iter(dep, bins, partial_res, results):
    if partial_res:
        left_splits = partial_res[:]
        right_splits = [x for x in bins if x not in left_splits]
        results.append("left: {:20}, right: {}".format(str(left_splits), right_splits))
    for m in range(dep, len(bins)):
        partial_res.append(bins[m])
        gen_splits_iter(m + 1, bins, partial_res, results)
        partial_res.pop()


if __name__ == "__main__":
    print("first example:")
    bins = ["a", "b", "c"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)

    print("\n\n=")
    print("second example:")
    bins = ["a", "b", "c", "d", "e"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
```

logs:

```bash
~/Downloads ❯❯❯ python test.py
first example:
bins: ['a', 'b', 'c']
-
left: ['b']               , right: ['a', 'c']
left: ['b', 'c']          , right: ['a']
left: ['c']               , right: ['a', 'b']


=
second example:
bins: ['a', 'b', 'c', 'd', 'e']
-
left: ['b']               , right: ['a', 'c', 'd', 'e']
left: ['b', 'c']          , right: ['a', 'd', 'e']
left: ['b', 'c', 'd']     , right: ['a', 'e']
left: ['b', 'c', 'd', 'e'], right: ['a']
left: ['b', 'c', 'e']     , right: ['a', 'd']
left: ['b', 'd']          , right: ['a', 'c', 'e']
left: ['b', 'd', 'e']     , right: ['a', 'c']
left: ['b', 'e']          , right: ['a', 'c', 'd']
left: ['c']               , right: ['a', 'b', 'd', 'e']
left: ['c', 'd']          , right: ['a', 'b', 'e']
left: ['c', 'd', 'e']     , right: ['a', 'b']
left: ['c', 'e']          , right: ['a', 'b', 'd']
left: ['d']               , right: ['a', 'b', 'c', 'e']
left: ['d', 'e']          , right: ['a', 'b', 'c']
left: ['e']               , right: ['a', 'b', 'c', 'd']
```
[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666

I believe that unordered features will benefit a lot from the idea; however, I have two questions:

1. I'm a little confused by 964L in `traverseUnorderedSplits`. Is it a backtracking algorithm?
```scala
dfs(binIndex + 1, combNumber, stats)
```
2. `traverseUnorderedSplits` succeeds in decoupling an abstraction from its implementation. Cool! However, I wonder whether we could write a simpler function, say, remove `seqOp`, `finalizer` and `T`, and collect all the logic together in one place?

Anyway, thanks for your good work, @WeichenXu123.
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149313427

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
     assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

Since `traverseUnorderedSplits` is a private method, I wonder whether we can check the unordered splits on DecisionTree directly. For example, create a tiny dataset and generate a shallow tree (depth = 1?), as sketched below. I know such a test case is difficult (maybe impossible) to design, but it focuses on behavior instead of implementation.
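A hypothetical sketch of such a behavior-level check, assuming a live SparkContext `sc`:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Tiny end-to-end check on categorical splits: labels 1, 0, 1 force the
// depth-1 split to group categories {a, c} vs {b}, a non-contiguous grouping.
// Note: exercising the truly unordered code path would additionally require
// numClasses > 2 and a maxBins large enough for all subset splits.
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0)),  // category a
  LabeledPoint(0.0, Vectors.dense(1.0)),  // category b
  LabeledPoint(1.0, Vectors.dense(2.0))   // category c
))
val model = DecisionTree.trainClassifier(points, numClasses = 2,
  categoricalFeaturesInfo = Map(0 -> 3), impurity = "gini",
  maxDepth = 1, maxBins = 3)
assert(points.collect().forall(p => model.predict(p.features) == p.label))
```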
[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998

ping @yanboliang
[GitHub] spark pull request #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in ...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17383
[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383

Hi, since this work has been inactive for a long time, I took a review pass myself. After careful review: because SparseVector is a compressed sparse format, the only benefit of the PR would be data storage, at the cost of performance. For tree methods it is uncommon to handle extremely high-dimensional features, so this does not satisfy me. I prefer [SPARK-3717: DecisionTree, RandomForest: Partition by feature](https://issues.apache.org/jira/browse/SPARK-3717) as an alternative, which should benefit both performance and storage, if I understand correctly. So the PR is closed. Thanks everyone for the reviews and comments.
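To illustrate the storage-versus-speed trade-off (my own example, not from the PR): random access into a `SparseVector` binary-searches its index array, so per-element reads inside the training loop cost O(log nnz) instead of the O(1) of a dense array.

```scala
import org.apache.spark.ml.linalg.Vectors

// A 1M-dimensional vector stored as 3 (index, value) pairs: tiny footprint,
// but each sv(i) lookup is a binary search over the sorted index array.
val sv = Vectors.sparse(1000000, Array(3, 17, 998), Array(1.0, 2.0, 3.0))
println(sv(17))  // 2.0, found by binary search
println(sv(42))  // 0.0, an implicit zero
```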
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503

Hi, @WeichenXu123. As @srowen said, the benefit of this would be speed at predict time, or model storage. Hence I'm not sure whether a benchmark is really needed for the PR.
[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383

Sure, @WeichenXu123, perhaps in one or two weeks; is that OK? By the way, I think a sparse representation can only reduce memory usage, at the cost of compute performance. Hence, it cannot speed up calculation for low-dimensional data. In my opinion, the only advantage of a sparse representation is for very high-dimensional data.
[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383

Thank you for the comment. Very good questions; at least for me, the answer to both is no. In most cases we feed dense raw data into tree models. However, if high dimensionality is required, I believe a sparse representation is necessary.
[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998

Hi, @yanboliang and @srowen. Thanks for your comments. For HashingTF, I agree that it is necessary to migrate its implementation so that new methods can be added easily. I'm glad to get feedback on the work; any review will be appreciated.
[GitHub] spark pull request #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18120
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503

Hi, @yanboliang. Do you have time to take a look first? Thanks very much.
[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998

This PR is a prerequisite of #18736. To keep the `setXXX` methods consistent between Scala and Python, as @yanboliang suggested, it is better to first migrate the HashingTF implementation from mllib to ml, and then add the `indexOf` function. See the discussion in #18736 for more details.
[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998

Hi, @srowen. Could you take a look at the PR? Thanks.
[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18998

cc @yanboliang @WeichenXu123, who I believe are interested in this PR. Could you take a look, please?
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18998

[SPARK-21748][ML] Migrate the implementation of HashingTF from MLlib to ML

## What changes were proposed in this pull request?

Migrate the implementation of HashingTF from MLlib to ML.

## How was this patch tested?

+ [ ] Pass all unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/migrate_hash_tf_to_ml

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18998.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18998

commit 6a25da2cc96a744aaf047280ac414e5ff4515434
Author: Yan Facai (颜发才)
Date: 2017-08-19T02:24:14Z

    ENH: implement HashingTF in ml
[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736

Sure, @yanboliang. Thanks for your suggestion. I'll work on it later, perhaps next week; is that OK?
[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736

@yanboliang Hi, Yanbo. Could you help review the PR? Thanks.
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132618802

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
--- End diff --

@WeichenXu123 How about adding a `setNumFeatures` method to mllib?
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132131171

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   def setBinary(value: Boolean): this.type = set(binary, value)

+  /**
+   * Returns the index of the input term.
+   */
+  @Since("2.3.0")
+  def indexOf(term: Any): Int = {
+    if (hashingTF != null) {
+      hashingTF.indexOf(term)
--- End diff --

Thanks for your suggestion; I'd like to modify the code later.
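For comparison, a self-contained sketch of what a native ml-side `indexOf` could look like, without delegating to a stored mllib instance (the hash function here is a placeholder assumption; mllib's actual default is murmur3-based):

```scala
// Map a term to a bucket in [0, numFeatures) via a non-negative modulo,
// so that negative hash codes still land in the valid range.
def indexOf(term: Any, numFeatures: Int, hashFunc: Any => Int = _.hashCode): Int = {
  val raw = hashFunc(term) % numFeatures
  if (raw < 0) raw + numFeatures else raw
}

println(indexOf("spark", 1 << 18))  // some stable bucket in [0, 262144)
```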
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18763
[GitHub] spark issue #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18763

Thanks, all.
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

Sure, thanks, @yanboliang!
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18764
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

Thanks, @yanboliang @gatorsmile
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

@SparkQA Take a test, please.
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r131529768

--- Diff: python/pyspark/ml/classification.py ---
@@ -1423,7 +1425,18 @@ def _fit(self, dataset):
         numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
-        multiclassLabeled = dataset.select(labelCol, featuresCol)
+        weightCol = None
+        if (self.isDefined(self.weightCol) and self.getWeightCol()):
+            if isinstance(classifier, HasWeightCol):
+                weightCol = self.getWeightCol()
+            else:
+                warnings.warn("weightCol is ignored, "
+                              "as it is not supported by {} now.".format(classifier))
--- End diff --

Modified as suggested, thanks very much!
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r131529693

--- Diff: python/pyspark/ml/classification.py ---
@@ -1344,7 +1346,19 @@ def _fit(self, dataset):
         numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
-        multiclassLabeled = dataset.select(labelCol, featuresCol)
+        weightCol = None
+        if (self.isDefined(self.weightCol) and self.getWeightCol()):
+            if isinstance(classifier, HasWeightCol):
+                weightCol = self.getWeightCol()
+            else:
+                warnings.warn("weightCol is ignored, "
+                              "as it is not supported by {} now.".format(
--- End diff --

Thank you very much for your help, @yanboliang!
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

@yanboliang Thanks, Yanbo. I am not familiar with Python 2.6, which is quite outdated.
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

Test failures in pyspark.ml.tests with Python 2.6, but I don't have that environment.
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

Jenkins, test this please.
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

Thanks, @yanboliang. Could you give a hand, @srowen?
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130213337

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   }

   test("SPARK-21306: OneVsRest should support setWeightCol") {
-    val dataset2 = dataset.withColumn("weight", lit(1))
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

OK. Could you test and merge the PR into branch-2.1?
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130202540

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   }

   test("SPARK-21306: OneVsRest should support setWeightCol") {
-    val dataset2 = dataset.withColumn("weight", lit(1))
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

@yanboliang @srowen My mistake. It would be better to use the double value 1.0 in master / branch-2.2 / branch-2.3 in the same way.
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch-2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130200461

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -157,6 +157,16 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
     assert(output.schema.fieldNames.toSet === Set("label", "features", "prediction"))
   }

+  test("SPARK-21306: OneVsRest should support setWeightCol") {
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

Use the double value, `lit(1.0)`.
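The reason behind the suggestion (my illustration, reusing the suite's `dataset`): `lit(1)` creates an `IntegerType` column, while weight columns are read as doubles, so the literal should be a `Double` to match.

```scala
import org.apache.spark.sql.functions.lit

// lit(1) yields an IntegerType column; lit(1.0) yields DoubleType.
val withIntWeight = dataset.withColumn("weight", lit(1))
val withDblWeight = dataset.withColumn("weight", lit(1.0))
withIntWeight.printSchema()  // weight: integer
withDblWeight.printSchema()  // weight: double
```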
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r130200379

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -143,6 +144,16 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
     assert(output.schema.fieldNames.toSet === Set("label", "features", "prediction"))
   }

+  test("SPARK-21306: OneVsRest should support setWeightCol") {
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

Use the double value, `lit(1.0)`.
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r130200288

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -33,6 +33,7 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
--- End diff --

Add the missing import for branch 2.0.
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18764

[SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol

The PR is related to #18554, and is modified for branch 2.0.

## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.

## How was this patch tested?

+ [x] Added a unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark BUG/branch-2.0_OneVsRest_support_setWeightCol

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18764.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18764

commit 0c60c28ca6a3da84401f4947a21ace6980be08fa
Author: Yan Facai (颜发才)
Date: 2017-07-28T02:10:35Z

    [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol
[GitHub] spark pull request #18763: [SPARK-21306][ML] OneVsRest should support setWei...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18763

[SPARK-21306][ML] OneVsRest should support setWeightCol for branch-2.1

The PR is related to #18554, and is modified for branch 2.1.

## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.

## How was this patch tested?

+ [x] Added a unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark BUG/branch-2.1_OneVsRest_support_setWeightCol

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18763.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18763

commit ddb83da765f38546a96a63b26478f594601c7e2b
Author: Yan Facai (颜发才)
Date: 2017-07-28T02:10:35Z

    [SPARK-21306][ML] OneVsRest should support setWeightCol
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r129562237

--- Diff: python/pyspark/ml/tests.py ---
@@ -1255,6 +1255,24 @@ def test_output_columns(self):
         output = model.transform(df)
         self.assertEqual(output.columns, ["label", "features", "prediction"])

+    def test_support_for_weightCol(self):
+        df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8), 1.0),
+                                         (1.0, Vectors.sparse(2, [], []), 1.0),
+                                         (2.0, Vectors.dense(0.5, 0.5), 1.0)],
+                                        ["label", "features", "weight"])
+        # classifier inherits hasWeightCol
+        lr = LogisticRegression(maxIter=5, regParam=0.01)
+        ovr = OneVsRest(classifier=lr, weightCol="weight")
+        self.assertIsNotNone(ovr.fit(df))
+        ovr2 = OneVsRest(classifier=lr).setWeightCol("weight")
--- End diff --

Cleaned.
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r129562189

--- Diff: python/pyspark/ml/classification.py ---
@@ -1517,20 +1517,22 @@ class OneVsRest(Estimator, OneVsRestParams, MLReadable, MLWritable):
     @keyword_only
     def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
-                 classifier=None):
+                 weightCol=None, classifier=None):
--- End diff --

Moved.
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should support setWeightCol
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554

ping @holdenk @yanboliang
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18736

[SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF

## What changes were proposed in this pull request?

Add an `indexOf` method for ml.feature.HashingTF. The PR is a hotfix that stores the hashingTF instance. I believe it is better to migrate the native implementation from mllib to ml. I can work on it, but I don't know whether or not to open a new JIRA for the migration.

## How was this patch tested?

+ [x] Added a test case.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18736.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18736

commit f08861d64390f0eb4d7e34de13d78db67e046457
Author: Yan Facai (颜发才)
Date: 2017-07-26T05:45:37Z

    hotfix: store hashingTF to reuse indexOf

commit 174a2d57ebeccba10b8f1435970cde69f5829c79
Author: Yan Facai (颜发才)
Date: 2017-07-26T06:14:37Z

    TST: add test case
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r128158473

--- Diff: python/pyspark/ml/tests.py ---
@@ -1255,6 +1255,17 @@ def test_output_columns(self):
         output = model.transform(df)
         self.assertEqual(output.columns, ["label", "features", "prediction"])

+    def test_support_for_weightCol(self):
--- End diff --

Sure. I used DecisionTreeClassifier in the test.
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127972263

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
     val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
     val bcFeaturesStd = instances.context.broadcast(featuresStd)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-      $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+    val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+      multinomial = isMultinomial)(_)
+    val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+      featuresStd(j / numCoefficientSets)
+    } else {
+      0.0
+    }
+
+    val regularization = if (regParamL2 != 0.0) {
+      val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
--- End diff --

Thanks for clarifying.
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127874833

--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala ---
@@ -32,40 +34,45 @@ private[ml] trait DifferentiableRegularization[T] extends DiffFunction[T] {
 }

 /**
- * A Breeze diff function for computing the L2 regularized loss and gradient of an array of
+ * A Breeze diff function for computing the L2 regularized loss and gradient of a vector of
  * coefficients.
  *
  * @param regParam The magnitude of the regularization.
  * @param shouldApply A function (Int => Boolean) indicating whether a given index should have
  *                    regularization applied to it.
- * @param featuresStd Option indicating whether the regularization should be scaled by the standard
- *                    deviation of the features.
+ * @param applyFeaturesStd Option for a function which maps coefficient index (column major) to the
+ *                         feature standard deviation. If `None`, no standardization is applied.
 */
 private[ml] class L2Regularization(
-    val regParam: Double,
+    override val regParam: Double,
     shouldApply: Int => Boolean,
-    featuresStd: Option[Array[Double]]) extends DifferentiableRegularization[Array[Double]] {
+    applyFeaturesStd: Option[Int => Double]) extends DifferentiableRegularization[Vector] {

-  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
-    var sum = 0.0
-    val gradient = new Array[Double](coefficients.length)
-    coefficients.indices.filter(shouldApply).foreach { j =>
-      val coef = coefficients(j)
-      featuresStd match {
-        case Some(stds) =>
-          val std = stds(j)
-          if (std != 0.0) {
-            val temp = coef / (std * std)
-            sum += coef * temp
-            gradient(j) = regParam * temp
-          } else {
-            0.0
+  override def calculate(coefficients: Vector): (Double, Vector) = {
+    coefficients match {
+      case dv: DenseVector =>
+        var sum = 0.0
+        val gradient = new Array[Double](dv.size)
+        dv.values.indices.filter(shouldApply).foreach { j =>
+          val coef = coefficients(j)
+          applyFeaturesStd match {
+            case Some(getStd) =>
+              val std = getStd(j)
+              if (std != 0.0) {
+                val temp = coef / (std * std)
+                sum += coef * temp
+                gradient(j) = regParam * temp
+              } else {
+                0.0
+              }
+            case None =>
+              sum += coef * coef
+              gradient(j) = coef * regParam
--- End diff --

Trivial: to match `regParam * temp` above, how about using `regParam * coef`?
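My reading of the standardized branch (not stated in the patch itself): with penalty strength $\lambda$ (`regParam`) and feature standard deviation $\sigma_j$, the code computes

```latex
L_{reg}(w) = \frac{\lambda}{2} \sum_{j :\ \mathrm{shouldApply}(j)} \frac{w_j^2}{\sigma_j^2},
\qquad
\frac{\partial L_{reg}}{\partial w_j} = \lambda \, \frac{w_j}{\sigma_j^2}
```

so `sum` accumulates $\sum_j w_j^2 / \sigma_j^2$ (`coef * temp`), each gradient entry is `regParam * temp`, and the non-standardized branch is the same formula with $\sigma_j = 1$, which is why `coef * regParam` and `regParam * coef` are interchangeable there. The $\tfrac{1}{2}\lambda$ factor is presumably applied when the loss is returned.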
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127873828

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
     val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
     val bcFeaturesStd = instances.context.broadcast(featuresStd)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-      $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+    val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+      multinomial = isMultinomial)(_)
+    val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+      featuresStd(j / numCoefficientSets)
+    } else {
+      0.0
+    }
+
+    val regularization = if (regParamL2 != 0.0) {
+      val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
--- End diff --

It seems that the `regularization` contains the `intercept`, right? However, the comment in [LogisticRegression.scala: 1903L](https://github.com/apache/spark/pull/18305/files#diff-3734f1689cb8a80b07974eb93de0795dL1903) says:

> // We do not apply regularization to the intercepts
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r126863072

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -317,7 +318,12 @@ final class OneVsRest @Since("1.4.0") (
     val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity)
     instr.logNumClasses(numClasses)
-    val multiclassLabeled = dataset.select($(labelCol), $(featuresCol))
+    val multiclassLabeled = getClassifier match {
+      // SPARK-21306: cache weightCol if necessary
+      case c: HasWeightCol if c.isDefined(c.weightCol) && c.getWeightCol.nonEmpty =>
+        dataset.select($(labelCol), $(featuresCol), c.getWeightCol)
+      case _ => dataset.select($(labelCol), $(featuresCol))
+    }
--- End diff --

Hi, @yanboliang. As @MLnick said, not all classifiers inherit HasWeightCol, so this might cause confusion. In my opinion, `setWeightCol` is an attribute owned by one specific classifier, like `setProbabilityCol` and `setRawPredictionCol` for Logistic Regression. So I'd suggest that users configure the classifier itself, rather than OneVsRest. Is that OK?
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126646511

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -36,7 +36,8 @@ import org.apache.spark.util.collection.OpenHashMap
 /**
  * Base trait for [[StringIndexer]] and [[StringIndexerModel]].
 */
-private[feature] trait StringIndexerBase extends Params with HasInputCol with HasOutputCol {
+private[feature] trait StringIndexerBase extends Params with HasHandleInvalid with HasInputCol
--- End diff --

Perhaps it is better to break the line at `extends`.
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126645714 --- Diff: python/pyspark/ml/feature.py --- @@ -3058,26 +3035,37 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM "RFormula drops the same category as R when encoding strings.", typeConverter=TypeConverters.toString) +handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid entries. " + --- End diff -- If I understand correctly, `handleInvalid` has already been declared in the `HasHandleInvalid` mixin, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
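For context, the shared-param pattern being referred to looks roughly like this on the Scala side (a simplified sketch of the mixin idea, not Spark's generated source; the trait name is suffixed to avoid claiming it is the real one):

```scala
import org.apache.spark.ml.param.{Param, Params}

// Declaring the Param once in a mixin lets every stage that mixes it in
// share the same declaration and getter instead of redeclaring it.
trait HasHandleInvalidSketch extends Params {
  val handleInvalid: Param[String] =
    new Param[String](this, "handleInvalid", "how to handle invalid entries")
  final def getHandleInvalid: String = $(handleInvalid)
}
```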
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126643882 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -460,16 +460,16 @@ object LinearRegression extends DefaultParamsReadable[LinearRegression] { val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES /** String name for "auto". */ - private[regression] val AUTO = "auto" + private[regression] val Auto = "auto" --- End diff -- Is `solver` related to `handleInvalid`? This change seems a little confusing here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126642928 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType, StructField, StructType} */ @Since("1.4.0") final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String) - extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable { + extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol +with DefaultParamsWritable { --- End diff -- How about aligning `with` with `extends`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
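A sketch of the suggested layout, with toy stand-ins so the snippet compiles on its own (only the line-break style is the point here):

```scala
// Toy traits standing in for the real mixins.
trait HasHandleInvalid; trait HasInputCol; trait HasOutputCol; trait DefaultParamsWritable
abstract class Model[M]

// The continuation `with` sits at the same indentation as `extends`.
final class Bucketizer(val uid: String)
  extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol
  with DefaultParamsWritable
```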
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 @srowen @yanboliang Could you help review the PR? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 I'm not familiar with R, so I used grep to search for "OneVsRest" and found nothing. Hence it seems that nothing needs to be done for the R part. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 @SparkQA test again, please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126050849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour files: Seq[FileStatus]): Option[StructType] = { val libSVMOptions = new LibSVMOptions(options) val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { - // Infers number of features if the user doesn't specify (a valid) one. - val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + if (files.isEmpty) { throw new IOException("No input path specified for libsvm data") --- End diff -- In my opinion, it is safe / necessary to check whether the parameter is valid in advance. Perhaps `IllegalArgumentException` is more suitable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 @lins05 thanks, reasonable suggestion, I will fix it later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126026388 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour files: Seq[FileStatus]): Option[StructType] = { val libSVMOptions = new LibSVMOptions(options) val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { - // Infers number of features if the user doesn't specify (a valid) one. - val dataFiles = files.filterNot(_.getPath.getName startsWith "_") --- End diff -- Do you mean that the HDFS _SUCCESS file, if it exists, will be safely ignored by Spark? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126023986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour files: Seq[FileStatus]): Option[StructType] = { val libSVMOptions = new LibSVMOptions(options) val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { - // Infers number of features if the user doesn't specify (a valid) one. - val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + if (files.isEmpty) { throw new IOException("No input path specified for libsvm data") - } else { -throw new IOException("Multiple input paths are not supported for libsvm data.") + } else if (files.length > 1) { +logWarning( + "Multiple input paths detected, determining the number of features by going " + --- End diff -- Good. How about warning only when `numFeatures` is missing? I mean, remove the `files.length > 1` condition:
```scala
if (files.isEmpty) {
  ...
} else {
  ... logWarning ...
}
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
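Spelled out, the suggestion amounts to something like the following (a sketch only; `scanForNumFeatures` is a hypothetical stand-in for the real pass over the data, not Spark's actual code):

```scala
import java.io.IOException

// Hypothetical helper standing in for the real scan over the input files.
def scanForNumFeatures(files: Seq[String]): Int = 780 // stub value

def resolveNumFeatures(userNumFeatures: Option[Int], files: Seq[String]): Int =
  userNumFeatures.getOrElse {
    if (files.isEmpty) {
      throw new IOException("No input path specified for libsvm data")
    } else {
      // Warn whenever inference is needed, no matter how many files there are.
      Console.err.println("WARN: numFeatures is not set; scanning the input to infer it may be slow.")
      scanForNumFeatures(files)
    }
  }
```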
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125860650 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) -inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") +val incorrectColumns = inputColNames.flatMap { name => + schema(name).dataType match { +case _: NumericType | BooleanType => None +case t if t.isInstanceOf[VectorUDT] => None +case other => Some(s"Data type $other of column $name is not supported.") + } +} +if (!incorrectColumns.isEmpty) { --- End diff -- Yes, modified. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18554 [SPARK-21306][ML] OneVsRest should cache weightCol if necessary ## What changes were proposed in this pull request? Cache the weightCol if the classifier inherits the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark BUG/oneVsRest_missing_weightCol Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18554.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18554 commit f1785959e04e462877fd74e6c767e47e0fb207d6 Author: Yan Facai (颜发才) Date: 2017-07-06T09:38:01Z BUG: cache weightCol if necessary commit a3e6e967a23f06dbd50aee08c34b07055c5646ee Author: Yan Facai (颜发才) Date: 2017-07-06T09:55:12Z TST: add unit test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 @SparkQA Jenkins, run tests again, please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125763918 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) -inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") +val incorrectColumns = inputColNames.flatMap { name => --- End diff -- Hi, yanbo. Using `flatMap` here is aimed at unpacking the values from the `Option`s. Is it OK? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
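A tiny standalone illustration of the idiom (toy data; in the PR the function maps a column name to an optional error message):

```scala
// flatMap with a function returning Option keeps only the Some values, already unwrapped.
val inputs = Seq(1, -2, 3, -4)
val errors = inputs.flatMap { x =>
  if (x < 0) Some(s"value $x is negative") else None
}
println(errors) // List(value -2 is negative, value -4 is negative)
```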
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 I don't know how to write a unit test for this PR. Is one necessary? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 Good idea! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125584572 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) +val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType)) inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") + case (_, _: NumericType | BooleanType) => + case (_, t) if t.isInstanceOf[VectorUDT] => --- End diff -- Hey, it throws a "not found" error if `t: VectorUDT` is used. Do you have any idea why? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...
GitHub user facaiy reopened a pull request: https://github.com/apache/spark/pull/17383 [SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data ## What changes were proposed in this pull request? DecisionTree should take advantage of sparse feature vectors. Aggregation over training data could handle the empty/zero-valued data elements more efficiently. ## How was this patch tested? Modifying the inner implementation won't change the behavior of the DecisionTree module, hence all existing unit tests should pass. Some performance benchmarks are perhaps still needed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17383 commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb Author: 颜发才（Yan Facai） Date: 2017-03-22T05:45:58Z CLN: use Vector to implement binnedFeatures in TreePoint commit 9ce6b813beffb9d58e7b2907425a1262610256be Author: 颜发才（Yan Facai） Date: 2017-03-22T09:15:30Z BUG: fix for incompatible argument of predictImpl method commit 37f05f9b0386acc8bea048e72aff2b9c37ca4ca6 Author: 颜发才（Yan Facai） Date: 2017-03-22T09:18:04Z CLN: create sparse vector when converting to TreePoint commit c9664ce6c94b98cbc76253817e637d9a968e4bd6 Author: 颜发才（Yan Facai） Date: 2017-03-22T09:21:59Z CLN: change Array to Vector in TreePoint when created commit d6ef9e512ea4a58db2dccf3e7cca95f9e8b0df8f Author: 颜发才（Yan Facai） Date: 2017-03-23T02:12:22Z PREP: use Vector[Int] to store binnedFeature commit 59eb779a9d4f711e7b28d31d579cc49e3d3cc370 Author: 颜发才（Yan Facai） Date: 2017-03-23T03:50:14Z CLN: change binnedFeatures from def to val commit 9cbe577b408e987f3026d01316f5a7f2d4c5cfb2 Author: 颜发才（Yan Facai） Date: 2017-03-28T00:57:42Z CLN: use filter to select non-zero bits commit b5b0dc8683b6e2d7d274aa8d39932dec61e6193d Author: 颜发才（Yan Facai） Date: 2017-03-28T01:03:55Z BUG: fix, compile fails commit cf7e3d8e03f73df725336d0d5a9dd6cc16e7bf95 Author: Yan Facai (颜发才) Date: 2017-07-05T05:42:09Z Merge branch 'master' into ENH/use_sparsity_in_decision_tree commit 032d50d8c8a851671ba2754cec817d0f6e9ae70f Author: Yan Facai (颜发才) Date: 2017-07-05T06:20:38Z CLN: use BSV in predictImpl commit 257ddf773eb47499962d6cc57fd1323324dd4ab8 Author: Yan Facai (颜发才) Date: 2017-07-05T06:42:24Z ENH: create subclass TreeSparsePoint commit 8a919735f9474283d263df78feb2e176f66917f3 Author: Yan Facai (颜发才) Date: 2017-07-05T06:58:54Z ENH: use TreeDensePoint when numFeatures < 1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
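As a rough illustration of the intended gain, iterating only the active entries of a sparse vector instead of every slot (a sketch with assumed toy values, not the PR's actual aggregation code):

```scala
import org.apache.spark.ml.linalg.Vectors

// A 10-dimensional binned point with only two non-zero entries (toy values).
val binnedFeatures = Vectors.sparse(10, Seq((2, 1.0), (7, 3.0)))
// foreachActive visits only the stored entries, so aggregation work scales with
// the number of non-zeros rather than with numFeatures.
binnedFeatures.foreachActive { (featureIndex, binValue) =>
  println(s"aggregate stats for feature $featureIndex, bin ${binValue.toInt}")
}
```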
[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17383 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125539040 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) +val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType)) inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") + case (_, _: NumericType | BooleanType) => + case (_, _: VectorUDT) => + case (name, other) => +throw new IllegalArgumentException(s"Data type $other of column $name is not supported.") --- End diff -- Fine, modified. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 @jkbradley Might you have time to review the PR? I believe it will be a small improvement for prediction. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125398010 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) +val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType)) inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") + case (_, _: NumericType | BooleanType) => + case (_, t) if t.isInstanceOf[VectorUDT] => --- End diff -- Good suggestion. fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18523 [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type ## What changes were proposed in this pull request? Add the column name to the exception that is raised for an unsupported data type. ## How was this patch tested? [ ] pass all tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/vectorassembler_add_col Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18523.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18523 commit 95dbf6c7b287d0010af9de377ff6b93dec760808 Author: Yan Facai (颜发才) Date: 2017-07-04T05:42:07Z ENH: report the name of missing column --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18288 Yes. An example:
```scala
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18288 You might be mistaken. The aim of the code here is to encourage users to specify `numFeatures` in any case, rather than to encourage using only one file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18288#discussion_r123474003 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { // Infers number of features if the user doesn't specify (a valid) one. val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + val path = if (dataFiles.isEmpty) { throw new IOException("No input path specified for libsvm data") } else { -throw new IOException("Multiple input paths are not supported for libsvm data.") +dataFiles.map(_.getPath.toUri.toString).mkString(",") --- End diff -- @srowen I see your point: it's OK to infer `numFeatures` from either one file or several. However, it is costly (and arguably not worth it). Here is a compromise: + For ONLY one file, `numFeatures` is optional. + For multiple files, `numFeatures` is required. It doesn't sample just one of many files. To be honest, this is inconvenient but somewhat reasonable. Hence, I prefer to revise the exception message. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18288 In my opinion, `numFeatures` is vital for sparse data. Say our feature space is truly 1000-dimensional, while in a small training set the maximum observed index yields only 990 dimensions. It is dangerous (or simply wrong) to train a 990-dim model, as it might fail on the coming test data. Hence, in most cases, `numFeatures` should be given by the user. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
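A concrete sketch of the failure mode and the fix (a SparkSession `spark` is assumed to be in scope; the paths and the dimension 1000 are hypothetical):

```scala
// Without numFeatures, the loader infers the dimension from the data it scans.
// If the training data's largest feature index yields 990 but the true space is
// 1000-dimensional, a model trained on 990-dim vectors cannot score 1000-dim rows.
val train = spark.read.format("libsvm")
  .option("numFeatures", "1000") // pin the true dimension explicitly
  .load("data/train.txt")
val test = spark.read.format("libsvm")
  .option("numFeatures", "1000")
  .load("data/test.txt")
```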
[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18288#discussion_r122909919 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { // Infers number of features if the user doesn't specify (a valid) one. val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + val path = if (dataFiles.isEmpty) { throw new IOException("No input path specified for libsvm data") } else { -throw new IOException("Multiple input paths are not supported for libsvm data.") +dataFiles.map(_.getPath.toUri.toString).mkString(",") --- End diff -- The exception is misleading. It should suggest that the user specify `numFeatures`, rather than warn that multiple paths are unsupported. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18288#discussion_r122908140 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { // Infers number of features if the user doesn't specify (a valid) one. val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + val path = if (dataFiles.isEmpty) { throw new IOException("No input path specified for libsvm data") } else { -throw new IOException("Multiple input paths are not supported for libsvm data.") +dataFiles.map(_.getPath.toUri.toString).mkString(",") --- End diff -- For efficiency, inferring `numFeatures` is limited to scanning only one file. If multiple files are needed, please set the `numFeatures` parameter manually. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119346663 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,11 @@ def needConversion(self): def toInternal(self, dt): if dt is not None: -seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo - else time.mktime(dt.timetuple())) +seconds = calendar.timegm(dt.utctimetuple()) +# Avoiding the invalid range of years (100-1899) for mktime in Python < 3 +if dt.year > 1899 or dt.year <= 100: +seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo + else time.mktime(dt.timetuple())) --- End diff -- How about adding an `else` branch to eliminate the repeated computation of `seconds`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119263608 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,11 @@ def needConversion(self): def toInternal(self, dt): if dt is not None: -seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo - else time.mktime(dt.timetuple())) +seconds = calendar.timegm(dt.utctimetuple()) --- End diff -- Perhaps it needs to be confirmed by @davies? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18120 Thanks, @BryanCutler. It seems that #17849 copies `Params` from `Estimator` to `Model` automatically, which is pretty useful. However, the `getter` methods are still missing and need to be added manually, as in this PR. It would be better, in my opinion, if the model could inherit both the params and their getter methods. Anyway, it is OK whether or not the PR gets merged, as it is trivial. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18120 Hi, @keypointt. It's a feature of Python: a doctest serves as both documentation and a unit test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18120 @keypointt Hi, could you help check the pr is consistent with your #17207 ? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18120 [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemble tree model in PySpark ## What changes were proposed in this pull request? Add a `getMaxDepth` method for the ensemble tree models: + `RandomForestClassifierModel` + `RandomForestRegressionModel` + `GBTClassificationModel` + `GBTRegressionModel` ## How was this patch tested? + [ ] pass all unit tests. + [ ] add new doctest. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/pyspark_rf_add_get_max_depth Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18120.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18120 commit 96e797f79000964ea226efced7318f24a1722535 Author: Yan Facai (颜发才) Date: 2017-05-26T07:48:02Z ENH: add maxDepth for ensemble tree model commit 50815928f070f89ae71e235d51e833cef596a361 Author: Yan Facai (颜发才) Date: 2017-05-26T08:45:11Z TST: add doctest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert)...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18058 Resolved. By the way, which one is preferable, rebase or merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert)...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18058 Hi, I'm not familiar with pyspark. I just wonder whether a unit test is needed for verification. If yes, how should I check it? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18058#discussion_r118416434 --- Diff: python/pyspark/ml/fpm.py --- @@ -49,6 +49,32 @@ def getMinSupport(self): return self.getOrDefault(self.minSupport) +class HasNumPartitions(Params): +""" +Mixin for param support. +""" + +numPartitions = Param( +Params._dummy(), +"numPartitions", +"""Number of partitions (at least 1) used by parallel FP-growth. +By default the param is not set, +and partition number of the input dataset is used.""", +typeConverter=TypeConverters.toInt) + +def setNumPartitions(self, value): +""" +Sets the value of :py:attr:`numPartitions`. +""" +return self._set(numPartitions=value) + +def getNumPartitions(self): +""" +Gets the value of numPartitions or its default value. --- End diff -- added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org