[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

2017-05-24 Thread actuaryzhang
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/17161 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-24 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 This is what we get from the current doc: ![image](https://cloud.githubusercontent.com/assets/11082368/26430799/dd49fa4c-40a4-11e7-95c6-66def9a8f588.png) --- If your project is set

[GitHub] spark pull request #12414: [SPARK-14657][SPARKR][ML] RFormula w/o intercept ...

2017-05-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/12414#discussion_r118392116 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -129,6 +129,23 @@ class RFormulaSuite extends SparkFunSuite

[GitHub] spark pull request #12414: [SPARK-14657][SPARKR][ML] RFormula w/o intercept ...

2017-05-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/12414#discussion_r118391881 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -163,12 +163,20 @@ class RFormula @Since("1.5.0") (@Si

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-24 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 ![image](https://cloud.githubusercontent.com/assets/11082368/26429381/64dd117e-409b-11e7-9661-659b5fbe8206.png) ![image](https://cloud.githubusercontent.com/assets/11082368/26429388

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-24 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung I just made a new commit which I think has the cleanest solution so far. In this one, I implemented grouping for all aggregate functions for Column, except those that are also

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-24 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 I tried using ``, but in Scaladoc, it is not correctly formatted. I tried a few other options, but it seems the html attributes are ignored in Scaladoc. ![image](https

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-23 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17967#discussion_r117925930 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -37,6 +37,42 @@ import org.apache.spark.sql.types

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 Great point. - For a method that is defined in one class and belongs in a group like `cov`, we can document it in its own Rd, and add a link to in the `SeeAlso` section of the group

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-23 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17967#discussion_r117909338 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -37,6 +37,42 @@ import org.apache.spark.sql.types

[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 Ping folks for comments/review. Many thanks. @viirya @MLnick @jkbradley @hhbyyh @yanboliang @BenFradet --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung I think we may want to distinguish a few cases: 1. For methods that are mainly defined by only one class, e.g., most function methods for Column, it makes sense to group

[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18051 Maybe I'm missing something completely, but I still don't get the point why we are removing the `xx-method` link since we are defining methods as S4 using `setMethod`. Lots of packages have

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17967#discussion_r117892629 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -37,6 +37,42 @@ import org.apache.spark.sql.types

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @yanboliang I updated the example in the param doc. I hope it is clear now that it is `alphabetDesc` that drops the same category as R. That is, RFormula with `alphabetDesc` drops the first

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17967#discussion_r117873500 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -37,6 +37,42 @@ import org.apache.spark.sql.types

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @felixcheung Is the html tag `` supported? Tried this but failed to compile... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17967#discussion_r117775336 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -37,6 +37,42 @@ import org.apache.spark.sql.types

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @yanboliang I understand your points. The issue is `OneHotEncoder` only supports `dropLast`. The ideal solution to match R exactly (both the category dropped and ordering of feature

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @holdenk Thanks for the comment. Added default value in docstring. @felixcheung Please let me know if there is anything else needed for this PR. Thanks everyone for the review

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @HyukjinKwon @felixcheung I confirm it works for Javadoc. ![image](https://cloud.githubusercontent.com/assets/11082368/26277962/21dbe70e-3d46-11e7-978f-e422b9122e87.png) --- If your

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung I'm curious what is the motivation to remove the style `avg,Column-method`? This is the default and preferred (I believe) way to reference to S4 methods in R. Suppose

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @felixcheung @HyukjinKwon Thanks much for pointing out the documentation issues. I still prefer to have a table to clearly illustrate what each option is doing. Made a new commit

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @yanboliang Thanks for the review and suggestion. Makes lots of sense. I made a new commit to address these. --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @yanboliang Thanks for the question. The alphabetically ascending order in R is very convenient for display purpose. For example, when you do a summary of model results, the results

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @yanboliang Thanks for the update. This is a big change, although lots of then are on the test side. Let me know if there is anything I can do to help make the review easier

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @felixcheung Would you help merge this? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung Thanks for asking this. I should have been more clear. `@name` is the "name" of the Rd object represented, which is unique. An Rd object can have multip

[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18025#discussion_r117566353 --- Diff: R/pkg/R/generics.R --- @@ -396,67 +397,81 @@ setGeneric("agg", function (x, ...) { standardGeneric("agg") })

[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18025#discussion_r117565367 --- Diff: R/pkg/R/generics.R --- @@ -385,6 +385,7 @@ setGeneric("value", function(bcast) { standardGeneric("value"

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @viirya Great point. Added a comment to explain this in the doc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-18 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung Thanks for your feedback. - This does not affect discoverability: the name of the method is still on the index list - No problem with help either, e.g., one can use `?avg

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-18 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 This is what the `'column_aggregate_functions.Rd'` doc looks like: ![image](https://cloud.githubusercontent.com/assets/11082368/26190195/fd353224-3b5c-11e7-9a78-2607cc665f49.png

[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions

2017-05-18 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025 @felixcheung @HyukjinKwon Per this [suggestion](https://github.com/apache/spark/pull/18003#discussion-diff-116853922L57), I'm creating more meaningful examples for the SQL functions

[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...

2017-05-18 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/18025 [WIP][SparkR] Update doc and examples for sql functions ## What changes were proposed in this pull request? Create better examples for sql functions. You can merge this pull

[GitHub] spark issue #18003: [SparkR] Fix bad examples in DataFrame methods and style...

2017-05-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18003 Thanks for the update. I will work on better examples on these functions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark pull request #18003: [SparkR] Fix bad examples in DataFrame methods an...

2017-05-16 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18003#discussion_r116888993 --- Diff: R/pkg/R/functions.R --- @@ -54,7 +54,8 @@ setMethod("lit", signature("ANY"), #' @name abs #' @family normal_f

[GitHub] spark issue #18003: [SparkR] Fix bad examples in DataFrame methods and style...

2017-05-16 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18003 @felixcheung Thanks for your suggestion. I reverted back the change on single line example. I agree that the examples in sql are not informative. I will put in better examples in another

[GitHub] spark pull request #18003: [SparkR] Fix bad examples in DataFrame methods an...

2017-05-16 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18003#discussion_r116886110 --- Diff: R/pkg/R/DataFrame.R --- @@ -2814,7 +2815,7 @@ setMethod("except", #' path <- "path/to/file.json"

[GitHub] spark issue #18003: [SparkR] Fix bad examples in DataFrame methods

2017-05-16 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18003 @bdwyer2 Thanks for the suggestion. I ran a check on all the R scripts and fixed the style issues. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #18003: [SparkR] Fix bad examples in DataFrame methods

2017-05-16 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18003 @felixcheung --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #18003: [SparkR] Fix bad examples in DataFrame methods

2017-05-16 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/18003 [SparkR] Fix bad examples in DataFrame methods ## What changes were proposed in this pull request? Some examples in the DataFrame methods are syntactically wrong, even though

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-16 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @viirya @felixcheung Any additional changes needed for this one? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @felixcheung Thanks so much for the review. I addressed most of the comments except auto generating code for defining `stringOrderType`. This parameter is not a shared trait on the Scala side

[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

2017-05-15 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17978#discussion_r116655031 --- Diff: python/pyspark/ml/feature.py --- @@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid

[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

2017-05-15 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17978#discussion_r116654554 --- Diff: python/pyspark/ml/feature.py --- @@ -2082,8 +2082,9 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid

[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

2017-05-15 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17978#discussion_r116654542 --- Diff: python/pyspark/ml/feature.py --- @@ -2139,6 +2159,20 @@ def setParams(self, inputCol=None, outputCol=None, handleInvalid="

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @felixcheung Could you take a look? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17978#discussion_r116411579 --- Diff: python/pyspark/ml/feature.py --- @@ -2115,22 +2115,32 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @viirya Thanks much for your review. I corrected the typo and added some tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @viirya @MLnick @BryanCutler @yinxusen @brkyvz @HyukjinKwon @srowen Ping for reviews or comments. Thanks much. --- If your project is set up for it, you can reply to this email and have

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @felixcheung Once this PR gets in, I'll update the SparkR side and include some test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark issue #17978: [SPARK-20736] PySpark StringIndexer supports StringOrder...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @felixcheung --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #17978: [SPARK-20736] PySpark StringIndexer supports Stri...

2017-05-14 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17978 [SPARK-20736] PySpark StringIndexer supports StringOrderType ## What changes were proposed in this pull request? PySpark StringIndexer supports StringOrderType added in #17879. You can

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17967#discussion_r116389971 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -37,6 +37,29 @@ import org.apache.spark.sql.types

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @felixcheung Thanks for the review. I fixed some typo. Below is an example to show the difference in model estimates due to different string ordering between R and RFormula

[GitHub] spark issue #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17084 @imatiach-msft Thanks for the PR. Added a couple of comments. Sorry for the delayed review. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364047 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala --- @@ -77,12 +87,16 @@ class

[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364179 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala --- @@ -22,22 +22,22 @@ package

[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364061 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala --- @@ -36,12 +36,18 @@ import

[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364140 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -41,13 +41,27 @@ import

[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364224 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -146,11 +160,13 @@ class

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @yanboliang @MLnick @HyukjinKwon @jkbradley @sethah --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula allows to drop the same categ...

2017-05-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 Questions I have: 1. StringIndexer and RFormula use repetitive code for defining `stringOrderType`. Should this be moved to a trait? 2. When setting `stringOrderType = "alphab

[GitHub] spark issue #17967: [SPARK-14659] RFormula allows to drop the same category ...

2017-05-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @felixcheung @viirya @ericl @yinxusen ``` val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5)

[GitHub] spark pull request #17967: [SPARK-14659] RFormula allows to drop the same ca...

2017-05-12 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17967 [SPARK-14659] RFormula allows to drop the same category as R when handling strings ## What changes were proposed in this pull request? When handling strings, the category dropped

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 Thanks much! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-11 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @felixcheung It would be great if you could help merge this. I could address comments (if any) in a future PR. This seems a pretty straightforward change that removes a big blocker. I

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-11 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @felixcheung Thanks much for your review. @yanboliang @jkbradley Since there are two approvals, could you guys take a look and merge if it's good? We really need this for a couple

[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @hhbyyh @sethah @MLnick Could you take a look at the new commit? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-09 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @felixcheung @gatorsmile @MLnick @jkbradley @holdenk @yanboliang @srowen @sethah Would you please take another look and let me know any additional suggestions? Thanks much! --- If your

[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

2017-05-09 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17879#discussion_r115414436 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -59,6 +59,29 @@ private[feature] trait StringIndexerBase extends

[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

2017-05-09 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17879#discussion_r115413770 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -131,6 +167,12 @@ object StringIndexer extends

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 Thanks much @felixcheung and @viirya. I have addressed your comments. - update from 2.2 to 2.3 - change `freq_desc` to `frequency_desc`. - move toLowerCase to the getter method

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @felixcheung Thanks. I will update the annotation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

2017-05-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17879#discussion_r115402478 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -131,6 +163,12 @@ object StringIndexer extends

[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 ping @yanboliang @felixcheung This is needed for one-hot encoding to be consistent with R, therefore enabling directly comparison of Spark results to R. Could you guys please take a look

[GitHub] spark pull request #17884: [SparkR][Doc] fix typo in vignettes

2017-05-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17884#discussion_r115157036 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -405,7 +405,7 @@ result <- gapply( head(arrange(result, "max_mpg", decr

[GitHub] spark pull request #17884: [SparkR][Doc] fix typo in vignettes

2017-05-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17884#discussion_r115157013 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -1079,19 +1079,19 @@ There are three main object classes in SparkR you may be working

[GitHub] spark issue #17884: [SparkR][Doc] fix typo in vignettes

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17884 @felixcheung I ran a quick QA on the vignettes and fixed some additional typos and styles. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark issue #17884: [SparkR][Doc] fix typo in vignettes

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17884 @HyukjinKwon Thanks for pointing this out. I will keep this in mind next time. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark issue #17884: [SparkR][Doc] fix typo in vignettes

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17884 @felixcheung --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #17884: [SparkR][Doc] fix typo in vignettes

2017-05-06 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17884 [SparkR][Doc] fix typo in vignettes ## What changes were proposed in this pull request? Fix typo in vignettes You can merge this pull request into a Git repository by running: $ git

[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17879#discussion_r115124754 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends

[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17879#discussion_r115116222 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -107,11 +135,15 @@ class StringIndexer @Since("

[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17879#discussion_r115116212 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends

[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @viirya Thanks much for your comments. Made a new commit to address them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @yanboliang Since you have reported a few issues due to different encoding between Spark and R (e.g., #SPARK-14659 and #SPARK-14657), probably you could add some comments? --- If your

[GitHub] spark pull request #17005: [SPARK-14659][ML] RFormula supports setting base ...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/17005 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @holdenk The main motivation for this PR is that the behavior of StringIndexer will affect OneHotEncoder, RFormula and models estimated based on these transformers. There have been a few

[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

2017-05-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17879 @jkbradley @MLnick @holdenk @pnpritchard @yanboliang @sethah @imatiach-msft @srowen An example to illustrate the idea: ``` val data = Seq((0, "b"), (1, "b&

[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

2017-05-06 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17879 [SPARK-20619][ML]StringIndexer supports multiple ways of label ordering ## What changes were proposed in this pull request? StringIndexer maps labels to numbers according

[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @hhbyyh Thanks for the suggestion. I have made a new commit that always casts the input to double and outputs the imputed column as double. --- If your project is set up for it, you can

[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @sethah Thanks for summarizing the previous discussions. What are you suggesting for this PR? I think it makes sense to log a warning when imputing integer types with mean. In addition

[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @yanboliang @srowen @MLnick @jkbradley The example below shows failure of Imputer on integer data. ``` val df = spark.createDataFrame( Seq( (0, 1.0, 1.0, 1.0

[GitHub] spark pull request #17864: [SPARK-20604][ML] Allow imputer to handle numeric...

2017-05-04 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17864 [SPARK-20604][ML] Allow imputer to handle numeric types ## What changes were proposed in this pull request? Imputer currently requires input column to be Double or Float

[GitHub] spark issue #17840: [SPARK-20574][ML] Allow Bucketizer to handle non-Double ...

2017-05-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17840 Thanks @yanboliang and @MLnick. I have included the additional numeric types and updated the error msg. --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #17840: [SPARK-20574][ML] Allow Bucketizer to handle non-Double ...

2017-05-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17840 Weird failure message, the log shows all tests passed... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #17840: [SPARK-20574][ML] Allow Bucketizer to handle non-Double ...

2017-05-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17840 @yinxusen @srowen @mengxr @jkbradley @VinceShieh @yanboliang The example below shows failure of Bucketizer on integer data. ``` val splits = Array(-3.0, 0.0, 3.0) val

[GitHub] spark pull request #17840: [SPARK-20574][ML] Allow Bucketizer to handle non-...

2017-05-03 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17840 [SPARK-20574][ML] Allow Bucketizer to handle non-Double column ## What changes were proposed in this pull request? Bucketizer currently requires input column to be Double, but the logic

<    1   2   3   4   5   6   >