[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-02 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/18831

[SPARK-21622][ML][SparkR] Support offset in SparkR GLM

## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699 
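
A minimal usage sketch of the proposed `offsetCol` argument (it mirrors the test added in this PR and assumes an active SparkR session):

```
library(SparkR)
sparkR.session()

# createDataFrame replaces the dots in the iris column names with underscores
training <- suppressWarnings(createDataFrame(iris))
model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                   family = poisson(), offsetCol = "Petal_Length")
summary(model)
```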

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark sparkROffset

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18831.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18831


commit 6ec068e5f48d393d539f4600bca3cbd1ea7d65a3
Author: actuaryzhang 
Date:   2017-08-03T06:37:41Z

add offset to SparkR







[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-03 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18831#discussion_r131211844
  
--- Diff: R/pkg/R/mllib_regression.R ---
@@ -159,10 +161,16 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
   weightCol <- as.character(weightCol)
 }
 
+if (!is.null(offsetCol) && offsetCol == "") {
+  offsetCol <- NULL
+} else if (!is.null(offsetCol)) {
+  offsetCol <- as.character(offsetCol)
+}
--- End diff --

perhaps
```
if (!is.null(offsetCol)) {
  offsetCol <- as.character(offsetCol)
  if (nchar(offsetCol) == 0) {
    offsetCol <- NULL
  }
}
```

Not sure if you want to cover other cases where offsetCol cannot be coerced - e.g. NA.
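
A variant of the snippet above that also covers the NA case (a sketch only, not what this PR implements):

```
if (!is.null(offsetCol)) {
  offsetCol <- as.character(offsetCol)
  # treat NA (not coercible to a usable name) the same as an empty string:
  # both mean "no offset column"
  if (is.na(offsetCol) || nchar(offsetCol) == 0) {
    offsetCol <- NULL
  }
}
```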





[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-03 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18831#discussion_r131209057
  
--- Diff: R/pkg/R/mllib_regression.R ---
@@ -125,7 +127,7 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
 #' @seealso \link{glm}, \link{read.ml}
 setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
           function(data, formula, family = gaussian, tol = 1e-6, maxIter = 25, weightCol = NULL,
-                   regParam = 0.0, var.power = 0.0, link.power = 1.0 - var.power,
+                   offsetCol = NULL, regParam = 0.0, var.power = 0.0, link.power = 1.0 - var.power,
--- End diff --

I'd avoid adding a param in the middle - it breaks code that passes params by position.
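
For illustration (a hypothetical call against the pre-change signature; `df` stands for any SparkDataFrame):

```
# Positional call where the 7th argument is regParam
model <- spark.glm(df, Sepal_Width ~ Sepal_Length, gaussian, 1e-6, 25, NULL, 0.1)
# If offsetCol were inserted before regParam, this same call would silently
# bind 0.1 to offsetCol instead of regParam, which is why the new argument
# should be appended at the end of the signature.
```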





[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-03 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18831#discussion_r131313635
  
--- Diff: R/pkg/tests/fulltests/test_mllib_regression.R ---
@@ -173,6 +173,14 @@ test_that("spark.glm summary", {
   expect_equal(stats$df.residual, rStats$df.residual)
   expect_equal(stats$aic, rStats$aic)
 
+  # Test spark.glm works with offset
+  training <- suppressWarnings(createDataFrame(iris))
+  stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                             family = poisson(), offsetCol = "Petal_Length"))
+  rStats <- suppressWarnings(summary(glm(Sepal.Width ~ Sepal.Length + Species,
+                                         data = iris, family = poisson(), offset = iris$Petal.Length)))
--- End diff --

that's interesting - perhaps we should take `col` in addition to `col name` too
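
For reference, what this PR supports versus the suggestion (the Column form is hypothetical and not implemented here):

```
# Supported in this PR: pass the offset column by name
stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                           family = poisson(), offsetCol = "Petal_Length"))
# Suggested above: additionally accept the Column object itself,
# e.g. offsetCol = training$Petal_Length (hypothetical, not implemented)
```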





[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-04 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18831#discussion_r131386220
  
--- Diff: R/pkg/tests/fulltests/test_mllib_regression.R ---
@@ -173,6 +173,14 @@ test_that("spark.glm summary", {
   expect_equal(stats$df.residual, rStats$df.residual)
   expect_equal(stats$aic, rStats$aic)
 
+  # Test spark.glm works with offset
+  training <- suppressWarnings(createDataFrame(iris))
+  stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                             family = poisson(), offsetCol = "Petal_Length"))
+  rStats <- suppressWarnings(summary(glm(Sepal.Width ~ Sepal.Length + Species,
+                                         data = iris, family = poisson(), offset = iris$Petal.Length)))
--- End diff --

Then do you want to make the change for weight as well?





[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18831#discussion_r131516514
  
--- Diff: R/pkg/tests/fulltests/test_mllib_regression.R ---
@@ -173,6 +173,14 @@ test_that("spark.glm summary", {
   expect_equal(stats$df.residual, rStats$df.residual)
   expect_equal(stats$aic, rStats$aic)
 
+  # Test spark.glm works with offset
+  training <- suppressWarnings(createDataFrame(iris))
+  stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                             family = poisson(), offsetCol = "Petal_Length"))
+  rStats <- suppressWarnings(summary(glm(Sepal.Width ~ Sepal.Length + Species,
+                                         data = iris, family = poisson(), offset = iris$Petal.Length)))
--- End diff --

probably across everything in ML.
Let's discuss this in a new JIRA.





[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18831#discussion_r131520948
  
--- Diff: R/pkg/tests/fulltests/test_mllib_regression.R ---
@@ -173,6 +173,14 @@ test_that("spark.glm summary", {
   expect_equal(stats$df.residual, rStats$df.residual)
   expect_equal(stats$aic, rStats$aic)
 
+  # Test spark.glm works with offset
+  training <- suppressWarnings(createDataFrame(iris))
+  stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                             family = poisson(), offsetCol = "Petal_Length"))
+  rStats <- suppressWarnings(summary(glm(Sepal.Width ~ Sepal.Length + Species,
+                                         data = iris, family = poisson(), offset = iris$Petal.Length)))
--- End diff --

I vote to keep the name as it is, because it's the column name of the `offset`
rather than the `offset` itself. We would like to keep SparkR MLlib wrappers'
argument names consistent with R only when it's applicable. I'm OK with
creating a new JIRA to discuss it. Thanks.





[GitHub] spark pull request #18831: [SPARK-21622][ML][SparkR] Support offset in Spark...

2017-08-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18831

