[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18496


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-07 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r126198035
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

On the Scala side, I created a case where an unseen label is used in the test data:
`val data: Seq[(Int, String)] = Seq((0, "a"), (1, "b"), (2, "b"), (3, null))
val data2: Seq[(Int, String)] = Seq((0, "a"), (1, "b"), (3, "d"))
val df = data.toDF("id", "label")
val df2 = data2.toDF("id", "label")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")

indexer.setHandleInvalid("error")
indexer.fit(df).transform(df2).collect()
`
It also fails with the same error message as the R case, so I think this is the expected behavior for `"error"`.

Failed Messages:

Failed to execute user defined function($anonfun$9: (string) => double)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:139)
at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125954907
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

Let me check how the "error" option is handled. It seems that no exception is thrown.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125815887
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

hm, it looks like the task has failed.

is this the proper or expected behavior on the ML side? it seems odd that the error is not reported and the column is instead given a "wrong type"






[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125802201
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

The console prints out:

Error in handleErrors(returnStatus, conn) : 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)

Shall I match this?
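For what it's worth, a minimal base-R sketch of what matching an error message amounts to (the helper `throws_matching` is hypothetical, not testthat's API; matching a stable fragment such as "Failed to execute user defined function" is less brittle than matching the full wrapped stage-failure text):

```r
# Hypothetical helper: returns TRUE if evaluating expr raises an error whose
# message contains pattern. R's lazy evaluation defers expr until it is
# forced inside tryCatch, so the error is caught rather than propagated.
throws_matching <- function(expr, pattern) {
  msg <- tryCatch({ expr; NULL }, error = function(e) conditionMessage(e))
  !is.null(msg) && grepl(pattern, msg, fixed = TRUE)
}

boom <- function() stop("Failed to execute user defined function")
throws_matching(boom(), "user defined function")  # TRUE
throws_matching(1 + 1, "user defined function")   # FALSE (no error raised)
```

`expect_error(collect(predictions), "Failed to execute user defined function")` would behave analogously, failing the test if the collected error message does not match.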





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125703030
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

The training call has no error because the training data has no unseen label.

I think the internals have logic for handling unseen labels, but when doing the collection (an action), the internal value can't be mapped back to the unseen label. That is why it only fails when the predictions are collected.

I will add the error string.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125702340
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #'                       nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #'                       can speed up training of deeper trees. Users can set how often should the
 #'                       cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
+#'                      Supported options: "skip" (filter out rows with invalid data),
+#'                      "error" (throw an error), "keep" (put invalid data in a special additional
--- End diff --

Yes, `error` is the default behavior; the backend code has setDefault. I will reorder the options and add that text to the documentation.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125522676
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

actually, this is a bit strange - so the spark.randomForest call and predict run successfully, and it only fails when the predictions are collected?





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125522252
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #'                       nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #'                       can speed up training of deeper trees. Users can set how often should the
 #'                       cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
+#'                      Supported options: "skip" (filter out rows with invalid data),
+#'                      "error" (throw an error), "keep" (put invalid data in a special additional
--- End diff --

is "error" the default behavior? since we are doing `handleInvalid = c("error", "skip", "keep")`, could you add text to say it's the default. also consider sorting the options - default first ("error"), then "keep", "skip" if we are alphabetical?





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125522351
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -439,7 +445,8 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo
                          as.numeric(minInfoGain), as.integer(checkpointInterval),
                          as.character(featureSubsetStrategy), seed,
                          as.numeric(subsamplingRate),
-                         as.integer(maxMemoryInMB), as.logical(cacheNodeIds))
+                         as.integer(maxMemoryInMB), as.logical(cacheNodeIds),
+                         as.character(handleInvalid))
--- End diff --

nit: don't need as.character if calling match.arg





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125522472
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+                     someString = base::sample(c("this", "that"), 10, replace = TRUE),
+                     stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+                              maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

can you add the error string to match with expect_error?





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154756
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -409,7 +413,7 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo
                        maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
                        featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
                        minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
-                       maxMemoryInMB = 256, cacheNodeIds = FALSE) {
+                       maxMemoryInMB = 256, cacheNodeIds = FALSE, handleInvalid = "error") {
--- End diff --

Let me check how to use match.arg().





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154735
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #'                       nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #'                       can speed up training of deeper trees. Users can set how often should the
 #'                       cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
--- End diff --

I think `labels` here means the string label of a feature, which is categorical (e.g., `white`, `black`).





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154606
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #'                       nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #'                       can speed up training of deeper trees. Users can set how often should the
 #'                       cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
--- End diff --

is this on "features" or "labels"? it seems it's only set on RFormula.terms, which are features





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154616
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -409,7 +413,7 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo
                        maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
                        featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
                        minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
-                       maxMemoryInMB = 256, cacheNodeIds = FALSE) {
+                       maxMemoryInMB = 256, cacheNodeIds = FALSE, handleInvalid = "error") {
--- End diff --

use match.arg(), and then there's no need for as.character(handleInvalid)

also, perhaps handleInvalid is a bit generic? maybe something that says it has to do with labels? or label string indexing?




[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/18496

[SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

## What changes were proposed in this pull request?

For the randomForest classifier, if the test data contains unseen labels, it throws an error. StringIndexer already has the handleInvalid logic; this patch adds a new method to set the underlying StringIndexer's handleInvalid behavior.

This should also apply to other classifiers. This PR focuses on the main logic and the randomForest classifier; I will do follow-up PRs for the other classifiers.

## How was this patch tested?

Added a new unit test based on the error case in the JIRA.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark handle

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18496


commit a2cdf511f6ad346efcb81d51f3b805a34063fa0f
Author: wangmiao1981 
Date:   2017-07-01T04:00:27Z

handle unseen labels



