spark git commit: [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data

2016-11-13 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 0c69224ed -> 8fc6455c0


[SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data

## What changes were proposed in this pull request?
* Fix the following exception, which was thrown when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) were fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail on how to reproduce this bug; a minimal reproduction sketch also follows this list.
* Refactor ```getFeaturesAndLabels``` out into ```RWrapperUtils```, since many ML algorithm wrappers use this function.
* Drop some unwanted columns when making predictions.
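
Below is a minimal SparkR sketch of the failure mode, assuming the sample libsvm file path resolves relative to your Spark installation. A libsvm source loads with ready-made ```label``` and ```features``` columns, which used to clash with the R wrappers' forced label indexing; before this fix the ```spark.randomForest``` call below threw the exception quoted above.
```
library(SparkR)
sparkR.session()

# libsvm data already carries "label" and "features" columns.
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")

# Previously failed with: java.lang.IllegalArgumentException: requirement failed:
# If label column already exists, forceIndexLabel can not be set with true.
model <- spark.randomForest(df, label ~ features, "classification")
summary(model)$numFeatures  # 4 features in the sample multiclass dataset
```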

## How was this patch tested?
Add unit test.

Author: Yanbo Liang 

Closes #15851 from yanboliang/spark-18412.

(cherry picked from commit 07be232ea12dfc8dc3701ca948814be7dbebf4ee)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8fc6455c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8fc6455c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8fc6455c

Branch: refs/heads/branch-2.1
Commit: 8fc6455c0b77f81be79908bb65e6264bf61c90e7
Parents: 0c69224
Author: Yanbo Liang 
Authored: Sun Nov 13 20:25:12 2016 -0800
Committer: Yanbo Liang 
Committed: Sun Nov 13 20:25:30 2016 -0800

--
 R/pkg/inst/tests/testthat/test_mllib.R  | 18 --
 .../spark/ml/r/GBTClassificationWrapper.scala   | 18 --
 .../r/GeneralizedLinearRegressionWrapper.scala  |  5 ++-
 .../apache/spark/ml/r/NaiveBayesWrapper.scala   | 14 +++-
 .../org/apache/spark/ml/r/RWrapperUtils.scala   | 36 +---
 .../r/RandomForestClassificationWrapper.scala   | 18 --
 6 files changed, 68 insertions(+), 41 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8fc6455c/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R b/R/pkg/inst/tests/testthat/test_mllib.R
index 33e85b7..4831ce2 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -881,7 +881,8 @@ test_that("spark.kstest", {
   expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:")
 })
 
-test_that("spark.randomForest Regression", {
+test_that("spark.randomForest", {
+  # regression
   data <- suppressWarnings(createDataFrame(longley))
  model <- spark.randomForest(data, Employed ~ ., "regression", maxDepth = 5, maxBins = 16,
                              numTrees = 1)
@@ -923,9 +924,8 @@ test_that("spark.randomForest Regression", {
   expect_equal(stats$treeWeights, stats2$treeWeights)
 
   unlink(modelPath)
-})
 
-test_that("spark.randomForest Classification", {
+  # classification
   data <- suppressWarnings(createDataFrame(iris))
  model <- spark.randomForest(data, Species ~ Petal_Length + Petal_Width, "classification",
                              maxDepth = 5, maxBins = 16)
@@ -971,6 +971,12 @@ test_that("spark.randomForest Classification", {
   predictions <- collect(predict(model, data))$prediction
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
+
+  # spark.randomForest classification can work on libsvm data
+  data <- read.df(absoluteSparkPath("data/mllib/sample_multiclass_classification_data.txt"),
+                  source = "libsvm")
+  model <- spark.randomForest(data, label ~ features, "classification")
+  expect_equal(summary(model)$numFeatures, 4)
 })
 
 test_that("spark.gbt", {
@@ -1039,6 +1045,12 @@ test_that("spark.gbt", {
  expect_equal(iris2$NumericSpecies, as.double(collect(predict(m, df))$prediction))
   expect_equal(s$numFeatures, 5)
   expect_equal(s$numTrees, 20)
+
+  # spark.gbt classification can work on libsvm data
+  data <- read.df(absoluteSparkPath("data/mllib/sample_binary_classification_data.txt"),
+                  source = "libsvm")
+  model <- spark.gbt(data, label ~ features, "classification")
+  expect_equal(summary(model)$numFeatures, 692)
 })
 
 sparkR.session.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/8fc6455c/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala b/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
index 8946025..aacb41e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
