[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14881


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-09-02 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77388738
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
+  distParams <- as.numeric(distParams)
+  mu <- ifelse(length(distParams) < 1, 0, distParams[1])
+  sigma <- ifelse(length(distParams) < 2, 1, distParams[2])
+  jobj <- callJStatic("org.apache.spark.ml.r.KSTestWrapper",
+  "test", data@sdf, testCol, 
nullHypothesis,
+  as.array(c(mu, sigma)))
+  new("KSTest", jobj = jobj)
+}
+})
+
+#  Get the summary of Kolmogorov-Smirnov (KS) Test.
+#' @param object test result object of KS.
--- End diff --

It is, it's your call - I thought it would be better to be consistent with 
all other summary methods but it wasn't clear why that was done initially.






[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-09-02 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77384344
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
+  distParams <- as.numeric(distParams)
+  mu <- ifelse(length(distParams) < 1, 0, distParams[1])
+  sigma <- ifelse(length(distParams) < 2, 1, distParams[2])
+  jobj <- callJStatic("org.apache.spark.ml.r.KSTestWrapper",
+  "test", data@sdf, testCol, 
nullHypothesis,
+  as.array(c(mu, sigma)))
+  new("KSTest", jobj = jobj)
+}
+})
+
+#  Get the summary of Kolmogorov-Smirnov (KS) Test.
+#' @param object test result object of KS.
--- End diff --

It seems the summary method is in the `spark.kstest` doc? The summary Rd only 
includes methods for `SparkDataFrame`.





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77272122
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
+  distParams <- as.numeric(distParams)
+  mu <- ifelse(length(distParams) < 1, 0, distParams[1])
+  sigma <- ifelse(length(distParams) < 2, 1, distParams[2])
+  jobj <- callJStatic("org.apache.spark.ml.r.KSTestWrapper",
+  "test", data@sdf, testCol, 
nullHypothesis,
+  as.array(c(mu, sigma)))
+  new("KSTest", jobj = jobj)
+}
+})
+
+#  Get the summary of Kolmogorov-Smirnov (KS) Test.
+#' @param object test result object of KS.
--- End diff --

The earlier 
[comment](https://github.com/apache/spark/pull/14881#discussion_r77084448) was 
about this line in summary actually.





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-09-01 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77212162
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
--- End diff --

I did this intentionally in case we add more distributions in the future.
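
A minimal sketch of how a per-distribution branch could stay open to more distributions, written in plain R so it runs without a Spark session. `resolve_dist_params` is a hypothetical helper, not part of the PR; the defaulting rules (mean 0, standard deviation 1) follow the `distParams` documentation in the diff.

```r
# Hypothetical helper (not in the PR): resolve distribution parameters
# per null-hypothesis name, with one switch() arm per supported distribution.
resolve_dist_params <- function(nullHypothesis = c("norm"),
                                distParams = numeric(0)) {
  # match.arg() errors on unsupported names; the result is assigned back.
  nullHypothesis <- match.arg(nullHypothesis)
  distParams <- as.numeric(distParams)
  switch(nullHypothesis,
    norm = {
      # Missing mean defaults to 0, missing standard deviation to 1.
      mu <- if (length(distParams) < 1) 0 else distParams[1]
      sigma <- if (length(distParams) < 2) 1 else distParams[2]
      c(mu, sigma)
    })
}

params_default <- resolve_dist_params("norm")
params_mean_only <- resolve_dist_params("norm", -0.5)
params_both <- resolve_dist_params("norm", c(2, 3))
```

Adding another distribution would then mean extending the `nullHypothesis` choices and adding one more `switch()` arm.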





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77084764
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -736,4 +736,24 @@ test_that("spark.als", {
   unlink(modelPath)
 })
 
+test_that("spark.kstest", {
+  data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
+  df <- createDataFrame(data)
+  testResult <- spark.kstest(df, "test", "norm")
+  stats <- summary(testResult)
+
+  rStats <- ks.test(data$test, "pnorm", alternative = "two.sided")
+
+  expect_equal(stats$p.value, rStats$p.value, tolerance = 1e-4)
+  expect_equal(stats$statistic, unname(rStats$statistic), tolerance = 1e-4)
+
+  testResult <- spark.kstest(df, "test", "norm", -0.5)
+  stats <- summary(testResult)
--- End diff --

could you add a test for `print.summary` too?
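
One way such a check might look, sketched in base R so it runs without Spark or testthat. The `stats` object and the body of `print.summary.KSTest` below are hypothetical stand-ins for illustration, not the PR's implementation; the idea is only to capture the printed output and assert on it.

```r
# Hypothetical stand-in for the list returned by summary() on a KSTest.
stats <- structure(list(p.value = 0.5, statistic = 0.2),
                   class = "summary.KSTest")

# Hypothetical print method; the real one in the PR may format differently.
print.summary.KSTest <- function(x, ...) {
  cat("Kolmogorov-Smirnov test summary:\n")
  cat("p-value =", x$p.value, "\n")
  cat("statistic =", x$statistic, "\n")
  invisible(x)
}

# capture.output() collects what print() writes, one element per line.
out <- capture.output(print(stats))
has_header <- any(grepl("Kolmogorov-Smirnov test summary", out))
has_pvalue <- any(grepl("p-value", out))
```

In a testthat suite the same idea would typically be expressed with `expect_match` on the captured output.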





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77084448
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
+  distParams <- as.numeric(distParams)
+  mu <- ifelse(length(distParams) < 1, 0, distParams[1])
+  sigma <- ifelse(length(distParams) < 2, 1, distParams[2])
+  jobj <- callJStatic("org.apache.spark.ml.r.KSTestWrapper",
+  "test", data@sdf, testCol, 
nullHypothesis,
+  as.array(c(mu, sigma)))
+  new("KSTest", jobj = jobj)
+}
+})
+
+#  Get the summary of Kolmogorov-Smirnov (KS) Test.
+#' @param object test result object of KS.
--- End diff --

Seems like we usually call out `\code{spark.kstest}` here, I think because 
the summary Rd is documenting a bunch of functions.





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77083234
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
+  distParams <- as.numeric(distParams)
+  mu <- ifelse(length(distParams) < 1, 0, distParams[1])
+  sigma <- ifelse(length(distParams) < 2, 1, distParams[2])
+  jobj <- callJStatic("org.apache.spark.ml.r.KSTestWrapper",
+  "test", data@sdf, testCol, 
nullHypothesis,
+  as.array(c(mu, sigma)))
+  new("KSTest", jobj = jobj)
+}
+})
+
+#  Get the summary of Kolmogorov-Smirnov (KS) Test.
+#' @param object test result object of KS.
+#' @return \code{summary} returns a list containing the p-value, test 
statistic computed for the
+#' test, the null hypothesis with its parameters tested against
+#' and degrees of freedom of the test.
+#' @rdname spark.kstest
+#' @aliases summary,KSTest-method
+#' @export
+#' @note summary(KSTest) since 2.1.0
+setMethod("summary", signature(object = "KSTest"),
+  function(object) {
+jobj <- object@jobj
+pValue <- callJMethod(jobj, "pValue")
+statistic <- callJMethod(jobj, "statistic")
+nullHypothesis <- callJMethod(jobj, "nullHypothesis")
+distName <- callJMethod(jobj, "distName")
+distParams <- unlist(callJMethod(jobj, "distParams"))
+degreesOfFreedom <- callJMethod(jobj, "degreesOfFreedom")
+
+list(p.value = pValue, statistic = statistic, nullHypothesis = 
nullHypothesis,
+ nullHypothesis.name = distName, nullHypothesis.parameters 
= distParams,
+ degreesOfFreedom = degreesOfFreedom)
+  })
+
+#  Prints the summary of GeneralizedLinearRegressionModel

[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77083054
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
--- End diff --

use `match.arg()` instead?
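
A small sketch of the `match.arg()` pattern in plain R. `check_dist` is a hypothetical function used only for illustration; the point is that `match.arg()` returns the matched choice, so its result is usually assigned back to the argument.

```r
# Hypothetical function (not in the PR): validate a distribution name.
check_dist <- function(nullHypothesis = c("norm")) {
  # Choices come from the default vector in the signature. An unsupported
  # value raises an error; a unique partial match such as "n" resolves
  # to the full choice.
  nullHypothesis <- match.arg(nullHypothesis)
  nullHypothesis
}

matched <- check_dist("norm")
matched_default <- check_dist()
matched_partial <- check_dist("n")

# An unsupported name signals an error, captured here as its message.
err_msg <- tryCatch(check_dist("exp"),
                    error = function(e) conditionMessage(e))
```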





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77082946
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
+#'
+#' @param data a SparkDataFrame of user data.
+#' @param testCol column name where the test data is from. It should be a 
column of double type.
+#' @param nullHypothesis name of the theoretical distribution tested 
against. Currently only
+#'   \code{"norm"} for normal distribution is 
supported.
+#' @param distParams parameters(s) of the distribution. For 
\code{nullHypothesis = "norm"},
+#'   we can provide as a vector the mean and standard 
deviation of
+#'   the distribution. If none is provided, then standard 
normal will be used.
+#'   If only one is provided, then the standard deviation 
will be set to be one.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.kstest} returns a test result object.
+#' @rdname spark.kstest
+#' @aliases spark.kstest,SparkDataFrame-method
+#' @name spark.kstest
+#' @export
+#' @examples
+#' \dontrun{
+#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
+#' df <- createDataFrame(data)
+#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#'
+#' # get a summary of the test result
+#' testSummary <- summary(test)
+#' testSummary
+#'
+#' # print out the summary in an organized way
+#' print.summary.KSTest(test)
+#' }
+#' @note spark.kstest since 2.1.0
+setMethod("spark.kstest", signature(data = "SparkDataFrame"),
+  function(data, testCol = "test", nullHypothesis = c("norm"), 
distParams = c(0, 1)) {
+tryCatch(match.arg(nullHypothesis),
+ error = function(e) {
+   msg <- paste("Distribution", nullHypothesis, "is 
not supported.")
+   stop(msg)
+ })
+if (nullHypothesis == "norm") {
--- End diff --

should it `stop` if `nullHypothesis` is not `norm`?
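
To illustrate the concern in plain R: an `if` with no `else` silently returns `NULL` when the condition is false, so an unsupported name would not fail on its own. Both functions below are hypothetical and only mimic the shape of the branch in the diff.

```r
# Hypothetical: only the "norm" branch exists, nothing handles other names.
ks_branch_only <- function(nullHypothesis) {
  if (nullHypothesis == "norm") {
    "normal branch taken"
  }
}

# Hypothetical: same branch, with an explicit stop() for other names.
ks_with_stop <- function(nullHypothesis) {
  if (nullHypothesis == "norm") {
    "normal branch taken"
  } else {
    stop(paste("Distribution", nullHypothesis, "is not supported."))
  }
}

silent <- ks_branch_only("exp")   # NULL, with no error raised
msg <- tryCatch(ks_with_stop("exp"),
                error = function(e) conditionMessage(e))
```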





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14881#discussion_r77082768
  
--- Diff: R/pkg/R/mllib.R ---
@@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = 
"ALSModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' (One-Sample) Kolmogorov-Smirnov Test
+#'
+#' @description
+#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test 
for data sampled from a
+#' continuous distribution.
+#'
+#' By comparing the largest difference between the empirical cumulative
+#' distribution of the sample data and the theoretical distribution we can 
provide a test for the
+#' the null hypothesis that the sample data comes from that theoretical 
distribution.
+#'
+#' Users can call \code{summary} to obtain a summary of the test, and 
\code{print.summary.KSTest}
+#' to print out a summary result.
+#'
+#' @details
+#' For more details, see
+#' 
\href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
+#' MLlib: Hypothesis Testing}.
--- End diff --

Maybe put this in `@seealso`? That seems to be the typical way to add a link 
in our docs.
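
A sketch of how the link might read as a `@seealso` entry, reusing the URL and link text from the diff above. The exact layout is an assumption, and this is a roxygen documentation fragment rather than executable code.

```r
#' @seealso \href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
#'          MLlib: Hypothesis Testing}
```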





[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

2016-08-30 Thread junyangq
GitHub user junyangq opened a pull request:

https://github.com/apache/spark/pull/14881

[SPARK-17315][SparkR] Kolmogorov-Smirnov test SparkR wrapper

## What changes were proposed in this pull request?

This PR adds a Kolmogorov-Smirnov test wrapper to SparkR. The wrapper 
currently supports only the one-sample test against a normal 
distribution.
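
For reference, the one-sample two-sided statistic can be computed by hand in base R and compared with `ks.test`. This is a sketch, not the Spark implementation: it uses the sample data from the PR's unit test and the standard normal CDF, and the variable names are illustrative.

```r
# Sample data taken from the PR's unit test.
x <- c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5)
n <- length(x)
xs <- sort(x)

# Theoretical (standard normal) CDF evaluated at the sorted sample points.
cdf <- pnorm(xs)

# Largest gap above and below the empirical CDF step function.
d_plus <- max(seq_len(n) / n - cdf)
d_minus <- max(cdf - (seq_len(n) - 1) / n)
d <- max(d_plus, d_minus)

# Reference value from base R's one-sample two-sided KS test.
ref <- ks.test(x, "pnorm", alternative = "two.sided")
same_statistic <- isTRUE(all.equal(d, unname(ref$statistic)))
```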

## How was this patch tested?

R unit test. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/junyangq/spark SPARK-17315

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14881.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14881


commit d4b14596e9517e6ec272dd530e569b3373c31fcb
Author: Junyang Qian 
Date:   2016-08-29T21:09:22Z

Add Kolmogorov-Smirnov Test wrapper to SparkR. Currently only support 
normal distribution as null hypothesis.



