[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14392


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r75123679
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +660,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k number of independent Gaussians in the mixture model.
+#' @param maxIter maximum iteration number.
+#' @param tol the convergence tolerance.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model.
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.gaussianMixture since 2.1.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), as.numeric(tol))
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object a fitted gaussian mixture model.
+#' @param ... currently not used argument(s) passed to the method.
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @rdname spark.gaussianMixture
+#' @export
+#' @note summary(GaussianMixtureModel) since 2.1.0
+setMethod("summary", signature(object = "GaussianMixtureModel"),
+  function(object, ...) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+lambda <- unlist(callJMethod(jobj, "lambda"))
+muList <- callJMethod(jobj, "mu")
+sigmaList <- callJMethod(jobj, "sigma")
+k <- callJMethod(jobj, "k")
+dim <- callJMethod(jobj, "dim")
+mu <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim + 1
+  end <- i * dim
+  mu[[i]] <- unlist(muList[start : end])
--- End diff --

Either is OK; I use ```[[``` to keep the output format consistent with R's 
```mvnormalmixEM```.
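
A small base-R sketch of the two indexing forms under discussion. The values in `muList`, `k`, and `dim` are made-up stand-ins for what the wrapper returns, and the result lists are preallocated with `vector("list", k)` for clarity, which differs from the `c()` used in the diff.

```r
# Hypothetical stand-ins: k = 2 components in dim = 2 dimensions,
# with the component means flattened into one list.
muList <- list(-0.26, 0.51, 2.65, 4.54)
k <- 2
dim <- 2

muDouble <- vector("list", k)
muSingle <- vector("list", k)
for (i in 1:k) {
  start <- (i - 1) * dim + 1
  end <- i * dim
  means <- unlist(muList[start:end])
  # `[[<-` stores the whole numeric vector as element i.
  muDouble[[i]] <- means
  # `[<-` needs the value wrapped in a list to do the same thing.
  muSingle[i] <- list(means)
}

# Both forms yield a list of k numeric vectors.
str(muDouble)
```

Under these assumptions the two results are identical, which is in line with "either is OK"; the `[[` form simply avoids the extra `list()` wrapper.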





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r75123510
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +660,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k number of independent Gaussians in the mixture model.
+#' @param maxIter maximum iteration number.
+#' @param tol the convergence tolerance.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model.
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
--- End diff --

Yep, I think it's OK, since users can load other libraries in a SparkR session, 
and this one is not necessary if users have their own dataset.





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r75122731
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +660,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k number of independent Gaussians in the mixture model.
+#' @param maxIter maximum iteration number.
+#' @param tol the convergence tolerance.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model.
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
--- End diff --

It looks like it's only needed to build sample data?





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-16 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74992582
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +660,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k number of independent Gaussians in the mixture model.
+#' @param maxIter maximum iteration number.
+#' @param tol the convergence tolerance.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model.
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.gaussianMixture since 2.1.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), as.numeric(tol))
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object a fitted gaussian mixture model.
+#' @param ... currently not used argument(s) passed to the method.
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @rdname spark.gaussianMixture
+#' @export
+#' @note summary(GaussianMixtureModel) since 2.1.0
+setMethod("summary", signature(object = "GaussianMixtureModel"),
+  function(object, ...) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+lambda <- unlist(callJMethod(jobj, "lambda"))
+muList <- callJMethod(jobj, "mu")
+sigmaList <- callJMethod(jobj, "sigma")
+k <- callJMethod(jobj, "k")
+dim <- callJMethod(jobj, "dim")
+mu <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim + 1
+  end <- i * dim
+  mu[[i]] <- unlist(muList[start : end])
--- End diff --

Any specific reason to choose `[[` rather than `[`?





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-16 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74990669
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -454,4 +454,66 @@ test_that("spark.survreg", {
   }
 })
 
+test_that("spark.gaussianMixture", {
+  # R code to reproduce the result.
+  # nolint start
+  #' library(mvtnorm)
+  #' set.seed(100)
+  #' a <- rmvnorm(4, c(0, 0))
+  #' b <- rmvnorm(6, c(3, 4))
+  #' data <- rbind(a, b)
+  #' model <- mvnormalmixEM(data, k = 2)
+  #' model$lambda
+  #
+  #  [1] 0.4 0.6
+  #
+  #' model$mu
+  #
+  #  [1] -0.2614822  0.5128697
+  #  [1] 2.647284 4.544682
+  #
+  #' model$sigma
+  #
+  #  [[1]]
+  #  [,1]   [,2]
+  #  [1,] 0.08427399 0.00548772
+  #  [2,] 0.00548772 0.09090715
+  #
+  #  [[2]]
+  #  [,1]   [,2]
+  #  [1,]  0.1641373 -0.1673806
+  #  [2,] -0.1673806  0.7508951
+  # nolint end
+  data <- list(list(-0.50219235, 0.1315312), list(-0.07891709, 0.8867848),
+   list(0.11697127, 0.3186301), list(-0.58179068, 0.7145327),
+   list(2.17474057, 3.6401379), list(3.08988614, 4.0962745),
+   list(2.79836605, 4.7398405), list(3.12337950, 3.9706833),
+   list(2.61114575, 4.5108563), list(2.08618581, 6.3102968))
+  df <- createDataFrame(data, c("x1", "x2"))
+  model <- spark.gaussianMixture(df, ~ x1 + x2, k = 2)
+  stats <- summary(model)
+  rLambda <- c(0.4, 0.6)
+  rMu <- c(-0.2614822, 0.5128697, 2.647284, 4.544682)
+  rSigma <- c(0.08427399, 0.00548772, 0.00548772, 0.09090715,
+  0.1641373, -0.1673806, -0.1673806, 0.7508951)
+  expect_equal(stats$lambda, rLambda)
+  expect_equal(as.vector(unlist(stats$mu)), rMu, tolerance = 1e-3)
+  expect_equal(as.vector(unlist(stats$sigma)), rSigma, tolerance = 1e-3)
--- End diff --

Same here: `unlist` already returns a vector?





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-16 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74990013
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -454,4 +454,66 @@ test_that("spark.survreg", {
   }
 })
 
+test_that("spark.gaussianMixture", {
+  # R code to reproduce the result.
+  # nolint start
+  #' library(mvtnorm)
+  #' set.seed(100)
+  #' a <- rmvnorm(4, c(0, 0))
+  #' b <- rmvnorm(6, c(3, 4))
+  #' data <- rbind(a, b)
+  #' model <- mvnormalmixEM(data, k = 2)
+  #' model$lambda
+  #
+  #  [1] 0.4 0.6
+  #
+  #' model$mu
+  #
+  #  [1] -0.2614822  0.5128697
+  #  [1] 2.647284 4.544682
+  #
+  #' model$sigma
+  #
+  #  [[1]]
+  #  [,1]   [,2]
+  #  [1,] 0.08427399 0.00548772
+  #  [2,] 0.00548772 0.09090715
+  #
+  #  [[2]]
+  #  [,1]   [,2]
+  #  [1,]  0.1641373 -0.1673806
+  #  [2,] -0.1673806  0.7508951
+  # nolint end
+  data <- list(list(-0.50219235, 0.1315312), list(-0.07891709, 0.8867848),
+   list(0.11697127, 0.3186301), list(-0.58179068, 0.7145327),
+   list(2.17474057, 3.6401379), list(3.08988614, 4.0962745),
+   list(2.79836605, 4.7398405), list(3.12337950, 3.9706833),
+   list(2.61114575, 4.5108563), list(2.08618581, 6.3102968))
+  df <- createDataFrame(data, c("x1", "x2"))
+  model <- spark.gaussianMixture(df, ~ x1 + x2, k = 2)
+  stats <- summary(model)
+  rLambda <- c(0.4, 0.6)
+  rMu <- c(-0.2614822, 0.5128697, 2.647284, 4.544682)
+  rSigma <- c(0.08427399, 0.00548772, 0.00548772, 0.09090715,
+  0.1641373, -0.1673806, -0.1673806, 0.7508951)
+  expect_equal(stats$lambda, rLambda)
+  expect_equal(as.vector(unlist(stats$mu)), rMu, tolerance = 1e-3)
+  expect_equal(as.vector(unlist(stats$sigma)), rSigma, tolerance = 1e-3)
+  p <- collect(select(predict(model, df), "prediction"))
+  expect_equal(p$prediction, c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1))
+
+  # Test model save/load
+  modelPath <- tempfile(pattern = "spark-gaussianMixture", fileext = 
".tmp")
+  write.ml(model, modelPath)
+  expect_error(write.ml(model, modelPath))
+  write.ml(model, modelPath, overwrite = TRUE)
+  model2 <- read.ml(modelPath)
+  stats2 <- summary(model2)
+  expect_equal(stats$lambda, stats2$lambda)
+  expect_equal(as.vector(unlist(stats$mu)), as.vector(unlist(stats2$mu)))
+  expect_equal(as.vector(unlist(stats$sigma)), 
as.vector(unlist(stats2$sigma)))
--- End diff --

`unlist` already returns a vector, so is `as.vector` needed?
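
A base-R sketch of the point being raised, using made-up numbers shaped like the summary output (a list of mean vectors and a list of covariance matrices). It only covers unnamed input; the names `mu`, `sigma`, `flatMu`, and `flatSigma` are illustrative.

```r
# Made-up values: a list of mean vectors and a list of covariance matrices.
mu <- list(c(-0.26, 0.51), c(2.65, 4.54))
sigma <- list(matrix(c(0.08, 0.01, 0.01, 0.09), ncol = 2),
              matrix(c(0.16, -0.17, -0.17, 0.75), ncol = 2))

flatMu <- unlist(mu)
flatSigma <- unlist(sigma)

# unlist() already drops the list structure and the matrix dimensions.
is.vector(flatMu)
is.null(dim(flatSigma))

# For unnamed input like this, as.vector() changes nothing.
identical(as.vector(flatMu), flatMu)
identical(as.vector(flatSigma), flatSigma)
```

One caveat: `unlist` keeps element names while `as.vector` strips them, so the extra call might still matter if the lists ever carried names.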





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-16 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74989154
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +660,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k number of independent Gaussians in the mixture model.
+#' @param maxIter maximum iteration number.
+#' @param tol the convergence tolerance.
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model.
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
--- End diff --

Should we be concerned that the package is not listed among the package 
dependencies?
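
One possible way to keep such an example independent of an undeclared package is to guard the call, sketched below. This is not what the pull request does; the base-R fallback and the name `localData` are assumptions for illustration, and the fallback draws different numbers than `rmvnorm` would.

```r
set.seed(100)
if (requireNamespace("mvtnorm", quietly = TRUE)) {
  # Same calls as the roxygen example, without attaching the package.
  a <- mvtnorm::rmvnorm(4, c(0, 0))
  b <- mvtnorm::rmvnorm(6, c(3, 4))
} else {
  # Base-R fallback: independent unit-variance normals around each mean.
  # The draws differ from rmvnorm's, so fitted values will not match.
  a <- cbind(rnorm(4, mean = 0), rnorm(4, mean = 0))
  b <- cbind(rnorm(6, mean = 3), rnorm(6, mean = 4))
}
localData <- as.data.frame(rbind(a, b))
names(localData)  # columns are named V1 and V2 either way
```

The resulting local data frame could then be passed to `createDataFrame` as in the example under review.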





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74882538
  
--- Diff: R/pkg/R/mllib.R ---
@@ -717,8 +717,9 @@ setMethod("spark.gaussianMixture", signature(data = 
"SparkDataFrame", formula =
 
 #  Get the summary of a multivariate gaussian mixture model
 
-#' @param object A fitted gaussian mixture model
-#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @param object a fitted gaussian mixture model.
+#' @param ... additional argument(s) passed to the method.
--- End diff --

I'd say "Currently not used" instead for this case.
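
A minimal S4 sketch of why `...` appears in the method even though nothing uses it: the `summary` generic is defined as `function(object, ...)`, so a method for it carries the same formals. The class `DemoModel` and its slot `k` are hypothetical and unrelated to SparkR.

```r
library(methods)

# Hypothetical class standing in for a fitted model.
setClass("DemoModel", representation(k = "numeric"))

setMethod("summary", signature(object = "DemoModel"),
  function(object, ...) {
    # `...` is accepted to match the generic but is currently not used.
    list(k = object@k)
  })

s <- summary(new("DemoModel", k = 2))
names(formals(base::summary))
```

Documenting the argument as "currently not used" tells readers that passing extra arguments has no effect.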





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74873932
  
--- Diff: R/pkg/R/generics.R ---
@@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, 
formula, ...) { standardGeneric("s
 #' @export
 setGeneric("spark.survreg", function(data, formula, ...) { 
standardGeneric("spark.survreg") })
 
+#' @rdname spark.gaussianMixture
+#' @export
+setGeneric("spark.gaussianMixture",
+   function(data, formula, ...) {
+ standardGeneric("spark.gaussianMixture")
--- End diff --

It cannot fit on one line, since ```lint-r``` requires that lines be no longer 
than 100 characters.
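
The length claim can be checked with a short base-R sketch. The helper `oneLine` is hypothetical; it just rebuilds the single-line form of the `setGeneric` call as a string so that it can be measured.

```r
# Build the one-line form as a string so its length can be measured.
oneLine <- function(name) {
  paste0('setGeneric("', name, '", function(data, formula, ...) { ',
         'standardGeneric("', name, '") })')
}

survreg <- oneLine("spark.survreg")
gaussianMixture <- oneLine("spark.gaussianMixture")

nchar(survreg) <= 100          # TRUE: fits within the limit
nchar(gaussianMixture) <= 100  # FALSE: the longer name pushes it past 100
```

This is why the new generic is split across several lines while the shorter-named ones above it stay on one.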





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74870049
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.gaussianMixture since 2.1.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @rdname spark.gaussianMixture
+#' @export
+#' @note summary(GaussianMixtureModel) since 2.1.0
+setMethod("summary", signature(object = "GaussianMixtureModel"),
+  function(object, ...) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+lambda <- unlist(callJMethod(jobj, "lambda"))
+muList <- callJMethod(jobj, "mu")
+sigmaList <- callJMethod(jobj, "sigma")
+k <- callJMethod(jobj, "k")
+dim <- callJMethod(jobj, "dim")
+mu <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim + 1
+  end <- i * dim
+  mu[[i]] <- unlist(muList[start : end])
+}
+sigma <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim * dim + 1
+  end <- i * dim * dim
+  sigma[[i]] <- t(matrix(sigmaList[start : end], ncol = dim))
+}
+posterior <- if (is.loaded) {
+  NULL
+} else {
+  dataFrame(callJMethod(jobj, "posterior"))
+}
+return(list(lambda = lambda, mu = mu, sigma = sigma,
+   posterior = posterior, is.loaded = is.loaded))
+  })
+
+#  Predicted values based on a gaussian mixture model
+
+#' @param newData SparkDataFrame for testing
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
labels in a column named
+#' "prediction"
+#' @return \code{predict} returns the predicted values based

[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74869983
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,110 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.gaussianMixture.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
+#' @return \code{spark.gaussianMixture} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.gaussianMixture
+#' @name spark.gaussianMixture
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.gaussianMixture since 2.1.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
--- End diff --

Add `as.numeric(tol)` if we could, since `tol` is not in the signature.
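
A base-R sketch of what the coercion guards against. It assumes that the R type of an argument influences how it is handed to the JVM, so an integer `tol` might not match a method expecting a double; `tolInt` is an illustrative name.

```r
# A caller might pass the tolerance as an integer.
tolInt <- 1L

is.double(tolInt)              # FALSE: still an integer
is.double(as.numeric(tolInt))  # TRUE: coerced to double

# A value that is already double passes through unchanged.
identical(as.numeric(0.01), 0.01)
```

Because `tol` is not part of the method signature, S4 dispatch does not check its type, so an explicit coercion is the simplest safeguard.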





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74869872
  
--- Diff: R/pkg/R/mllib.R ---
@@ -526,6 +533,24 @@ setMethod("write.ml", signature(object = 
"KMeansModel", path = "character"),
 invisible(callJMethod(writer, "save", path))
   })
 
+#  Save fitted MLlib model to the input path
+
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. 
Default is FALSE
+#'  which means throw exception if the output path exists.
+#'
+#' @rdname spark.gaussianMixture
--- End diff --

Let's add `@aliases`.





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r74869843
  
--- Diff: R/pkg/R/generics.R ---
@@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 #' @export
 setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
 
+#' @rdname spark.gaussianMixture
+#' @export
+setGeneric("spark.gaussianMixture",
+   function(data, formula, ...) {
+ standardGeneric("spark.gaussianMixture")
--- End diff --

Does it fit on one line, like the others?





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-01 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r73079016
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
--- End diff --

You can also run the `check-cran.sh` script in `R/` and see if there are 
any warnings related to the methods being added in this PR.





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r73039485
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
--- End diff --

We really should add `@aliases` for all functions; it helps with
discoverability when users type `?spark.glm` and so on in the R shell. Also, I
think a missing alias is one of the warnings raised when publishing to CRAN
(cc @shivaram), which I'll be checking for in all the other functions.
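
As a rough sketch of what this could look like for a method documented under a shared `@rdname`: the class `GmmSketch` below is a hypothetical stand-in for `GaussianMixtureModel`, and the alias follows the `generic,Class-method` form that the `@aliases spark.mvnormalmixEM,SparkDataFrame,formula-method` line in the diff already uses.

```r
library(methods)

# Hypothetical class standing in for GaussianMixtureModel.
setClass("GmmSketch", representation(k = "numeric"))

#' @param object a fitted model
#' @return \code{summary} returns a list describing the model
#' @rdname spark.mvnormalmixEM
#' @aliases summary,GmmSketch-method
#' @export
setMethod("summary", signature(object = "GmmSketch"),
  function(object, ...) {
    # Return the number of components as a one-element list.
    list(k = object@k)
  })
```

The roxygen comments have no effect at run time; they only matter once the documentation is regenerated.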





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72994997
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
--- End diff --

We have grouped ```spark.mvnormalmixEM```, ```summary``` and ```predict```
together in the docs. Do we also need to add @aliases for each function? I see
we did not do this for other SparkR ML wrappers such as ```spark.glm```.
Please correct me if I have misunderstood.





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72880444
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -454,4 +454,66 @@ test_that("spark.survreg", {
   }
 })
 
+test_that("spark.mvnormalmixEM", {
+  # R code to reproduce the result.
+  #
+  #' library(mvtnorm)
--- End diff --

If you have a block of code in a comment, try doing
```
# nolint start
# code...
# nolint end
```
instead. Switching between `#` and `#'` is a bit confusing.





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72880430
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
+#' @export
+#' @note summary(GaussianMixtureModel) since 2.1.0
+setMethod("summary", signature(object = "GaussianMixtureModel"),
+  function(object, ...) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+lambda <- callJMethod(jobj, "lambda")
+muList <- callJMethod(jobj, "mu")
+sigmaList <- callJMethod(jobj, "sigma")
+k <- callJMethod(jobj, "k")
+dim <- callJMethod(jobj, "dim")
+lambda <- as.vector(unlist(lambda))
+mu <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim + 1
+  end <- i * dim
+  mu[[i]] <- as.vector(unlist(muList[start : end]))
+}
+sigma <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim * dim + 1
+  end <- i * dim * dim
+  sigma[[i]] <- t(matrix(sigmaList[start : end], ncol = dim))
+}
+posterior <- if (is.loaded) {
+  NULL
+} else {
+  dataFrame(callJMethod(jobj, "posterior"))
+}
+return(list(lambda = lambda, mu = mu, sigma = sigma,
+   posterior = posterior, is.loaded = is.loaded))
+  })
+
+#  Predicted values based on a gaussian mixture model
+
+#' @return \code{predict} returns the predicted values based on a gaussian 
mixture model
+#' @rdname spark.mvnormalmixEM
--- End diff --

same here for @aliases



[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72880418
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
+#' @export
+#' @note summary(GaussianMixtureModel) since 2.1.0
+setMethod("summary", signature(object = "GaussianMixtureModel"),
+  function(object, ...) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+lambda <- callJMethod(jobj, "lambda")
+muList <- callJMethod(jobj, "mu")
+sigmaList <- callJMethod(jobj, "sigma")
+k <- callJMethod(jobj, "k")
+dim <- callJMethod(jobj, "dim")
+lambda <- as.vector(unlist(lambda))
+mu <- c()
+for (i in 1 : k) {
+  start <- (i - 1) * dim + 1
+  end <- i * dim
+  mu[[i]] <- as.vector(unlist(muList[start : end]))
--- End diff --

`unlist` should already give you a vector, so is the `as.vector` needed?
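
A small sketch of the point, with made-up values standing in for the flat list that comes back from the JVM side (the real contents of `muList` are not shown in this thread):

```r
# Made-up stand-in for the flat, unnamed list of means; k = 2 components, dim = 2.
muList <- list(0.1, 0.2, 3.0, 4.1)
dim <- 2
i <- 2

# Same slicing arithmetic as in the diff.
start <- (i - 1) * dim + 1
end <- i * dim

# unlist() on an unnamed list of numbers already yields a plain numeric vector.
mu_i <- unlist(muList[start:end])
is.vector(mu_i)
identical(mu_i, as.vector(mu_i))
```

Under that assumption about the input, wrapping the result in `as.vector` changes nothing.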





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72880413
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
+#' @export
+#' @note summary(GaussianMixtureModel) since 2.1.0
+setMethod("summary", signature(object = "GaussianMixtureModel"),
+  function(object, ...) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+lambda <- callJMethod(jobj, "lambda")
+muList <- callJMethod(jobj, "mu")
+sigmaList <- callJMethod(jobj, "sigma")
+k <- callJMethod(jobj, "k")
+dim <- callJMethod(jobj, "dim")
+lambda <- as.vector(unlist(lambda))
--- End diff --

nit: merge this line with L728?





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72880405
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' library(mvtnorm)
+#' set.seed(100)
+#' a <- rmvnorm(4, c(0, 0))
+#' b <- rmvnorm(6, c(3, 4))
+#' data <- rbind(a, b)
+#' df <- createDataFrame(as.data.frame(data))
+#' model <- spark.mvnormalmixEM(df, ~ V1 + V2, k = 2)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "V1", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mvnormalmixEM since 2.1.0
+#' @seealso mixtools: 
\url{https://cran.r-project.org/web/packages/mixtools/}
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.mvnormalmixEM", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 2, maxIter = 100, tol = 0.01) {
+formula <- paste(deparse(formula), collapse = "")
+jobj <- 
callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf,
+formula, as.integer(k), 
as.integer(maxIter), tol)
+return(new("GaussianMixtureModel", jobj = jobj))
+  })
+
+#  Get the summary of a multivariate gaussian mixture model
+
+#' @param object A fitted gaussian mixture model
+#' @return \code{summary} returns the model's lambda, mu, sigma and 
posterior
+#' @rdname spark.mvnormalmixEM
--- End diff --

probably would be good to add @aliases here





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14392#discussion_r72880330
  
--- Diff: R/pkg/R/mllib.R ---
@@ -632,3 +659,106 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
   function(object, newData) {
 return(dataFrame(callJMethod(object@jobj, "transform", 
newData@sdf)))
   })
+
+#' Multivariate Gaussian Mixture Model (GMM)
+#'
+#' Fits multivariate gaussian mixture model against a Spark DataFrame, 
similarly to R's
+#' mvnormalmixEM(). Users can call \code{summary} to print a summary of 
the fitted model,
+#' \code{predict} to make predictions on new data, and 
\code{write.ml}/\code{read.ml}
+#' to save/load fitted models.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.mvnormalmixEM.
+#' @param k Number of independent Gaussians in the mixture model.
+#' @param maxIter Maximum iteration number
+#' @param tol The convergence tolerance
+#' @aliases spark.mvnormalmixEM,SparkDataFrame,formula-method
+#' @return \code{spark.mvnormalmixEM} returns a fitted multivariate 
gaussian mixture model
+#' @rdname spark.mvnormalmixEM
+#' @name spark.mvnormalmixEM
+#' @export
+#' @examples
--- End diff --

perhaps a @seealso to R mvnormalmixEM?





[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...

2016-07-28 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/14392

[SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapper in SparkR

## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similar to R's ```mvnormalmixEM```.

## How was this patch tested?
Unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-16446

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14392


commit 7a2f51be8c31cfe7faf639ca9e7fb7665f4e3f4d
Author: Yanbo Liang 
Date:   2016-07-28T11:33:51Z

Gaussian Mixture Model wrapper in SparkR

commit 96e50a5cbde50d645fe1873b092b244615f1f8a9
Author: Yanbo Liang 
Date:   2016-07-28T12:21:08Z

Fix typos



