[GitHub] spark pull request #14881: [SPARK-17315][SparkR] Kolmogorov-Smirnov test Spa...

felixcheung Fri, 02 Sep 2016 11:07:10 -0700

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14881#discussion_r77388738
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -1308,3 +1315,104 @@ setMethod("write.ml", signature(object = "ALSModel", path = "character"),
               function(object, path, overwrite = FALSE) {
                 write_internal(object, path, overwrite)
               })
    +
    +#' (One-Sample) Kolmogorov-Smirnov Test
    +#'
    +#' @description
    +#' \code{spark.kstest} Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a
    +#' continuous distribution.
    +#'
    +#' By comparing the largest difference between the empirical cumulative
    +#' distribution of the sample data and the theoretical distribution we can provide a test for the
    +#' null hypothesis that the sample data comes from that theoretical distribution.
    +#'
    +#' Users can call \code{summary} to obtain a summary of the test, and \code{print.summary.KSTest}
    +#' to print out a summary result.
    +#'
    +#' @details
    +#' For more details, see
    +#' \href{http://spark.apache.org/docs/latest/mllib-statistics.html#hypothesis-testing}{
    +#' MLlib: Hypothesis Testing}.
    +#'
    +#' @param data a SparkDataFrame of user data.
    +#' @param testCol column name where the test data is from. It should be a column of double type.
    +#' @param nullHypothesis name of the theoretical distribution tested against. Currently only
    +#'                       \code{"norm"} for normal distribution is supported.
    +#' @param distParams parameter(s) of the distribution. For \code{nullHypothesis = "norm"},
    +#'                   we can provide as a vector the mean and standard deviation of
    +#'                   the distribution. If none is provided, then standard normal will be used.
    +#'                   If only one is provided, then the standard deviation will be set to be one.
    +#' @param ... additional argument(s) passed to the method.
    +#' @return \code{spark.kstest} returns a test result object.
    +#' @rdname spark.kstest
    +#' @aliases spark.kstest,SparkDataFrame-method
    +#' @name spark.kstest
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
    +#' df <- createDataFrame(data)
    +#' test <- spark.kstest(df, "test", "norm", c(0, 1))
    +#'
    +#' # get a summary of the test result
    +#' testSummary <- summary(test)
    +#' testSummary
    +#'
    +#' # print out the summary in an organized way
    +#' print.summary.KSTest(testSummary)
    +#' }
    +#' @note spark.kstest since 2.1.0
    +setMethod("spark.kstest", signature(data = "SparkDataFrame"),
    +          function(data, testCol = "test", nullHypothesis = c("norm"), distParams = c(0, 1)) {
    +            tryCatch(match.arg(nullHypothesis),
    +                     error = function(e) {
    +                       msg <- paste("Distribution", nullHypothesis, "is not supported.")
    +                       stop(msg)
    +                     })
    +            if (nullHypothesis == "norm") {
    +              distParams <- as.numeric(distParams)
    +              mu <- ifelse(length(distParams) < 1, 0, distParams[1])
    +              sigma <- ifelse(length(distParams) < 2, 1, distParams[2])
    +              jobj <- callJStatic("org.apache.spark.ml.r.KSTestWrapper",
    +                                  "test", data@sdf, testCol, nullHypothesis,
    +                                  as.array(c(mu, sigma)))
    +              new("KSTest", jobj = jobj)
    +            }
    +})
    +
    +#  Get the summary of Kolmogorov-Smirnov (KS) Test.
    +#' @param object test result object of KS.
    --- End diff --
    
    It is, it's your call - I thought it would be better to be consistent with all other summary methods, but it wasn't clear why that was done initially.



