GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/15851

    [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms 
training on libsvm data

    ## What changes were proposed in this pull request?
    * Fix the following exception, which is thrown when 
```spark.randomForest``` (classification), ```spark.gbt``` (classification), 
```spark.naiveBayes``` and ```spark.glm``` (binomial family) are fitted on 
libsvm data:
    ```
    java.lang.IllegalArgumentException: requirement failed: If label column 
already exists, forceIndexLabel can not be set with true.
    ```
    See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for 
details on how to reproduce this bug.
    * Move ```getFeaturesAndLabels``` into ```RWrapperUtils```, since many 
ML algorithm wrappers use this function.
    * Drop some unwanted columns when making predictions.
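
A minimal sketch of the conflict behind the first item, not code from this patch. Data loaded from libsvm already has a `label` column, so a wrapper that must index the label under the default name collides with it. One way to avoid the requirement failure is to write the indexed label to a freshly named column. The helper `resolve_label_col` and its parameters are hypothetical names used only for illustration.

```python
import uuid


def resolve_label_col(existing_cols, label_col, force_index_label):
    """Return a label column name that does not collide with the input data.

    Hypothetical helper: it models the case where the data already has a
    column named like the label column and the label must still be indexed.
    """
    if force_index_label and label_col in existing_cols:
        # Pick a fresh name so the indexed label goes to a new column
        # instead of clashing with the existing one.
        return f"{label_col}_{uuid.uuid4().hex[:12]}"
    # No conflict: keep the requested name unchanged.
    return label_col


# libsvm data exposes these two columns.
libsvm_cols = ["label", "features"]
indexed = resolve_label_col(libsvm_cols, "label", force_index_label=True)
untouched = resolve_label_col(libsvm_cols, "label", force_index_label=False)
```

Here `indexed` is a new name with the `label_` prefix, while `untouched` stays `label` because no indexing was requested.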
    
    ## How was this patch tested?
    Added unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-18412

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15851.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15851
    
----
commit 4752fe2c1e0e211ae2e27a0a7807f141c91430a2
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-11-11T10:29:27Z

    Handle the case label column already exists and forceIndexLabel = true.

commit 6262178be4b2a085fb48ad0be8b1bf61c7812689
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-11-11T10:42:17Z

    Add unit tests.

commit 26eb40aaca3b8e4de4d2f1922a83dc2198754c6a
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-11-11T11:16:12Z

    Set correct label column for classification algorithms.

commit d0d7c28b05bbba51266a9a1364b7fe9e4c452ed9
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-11-11T11:47:57Z

    Divide spark.gbt test into two parts: classification and regression.

----


