This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new b3b62ba  [SPARK-19591][ML][MLLIB][FOLLOWUP] Add sample weights to decision trees - fix tolerance
b3b62ba is described below

commit b3b62ba303af9daad4826d274856c61acb88a6a1
Author: Ilya Matiach <il...@microsoft.com>
AuthorDate: Thu Jan 31 05:44:55 2019 -0600

    [SPARK-19591][ML][MLLIB][FOLLOWUP] Add sample weights to decision trees - fix tolerance
    
    This is a follow-up to PR:
    https://github.com/apache/spark/pull/21632
    
    ## What changes were proposed in this pull request?
    
    This PR tunes the tolerance used for deciding whether to add zero feature values to a value-count map (where the key is the feature value and the value is the weighted count of those feature values).
    In the previous PR the tolerance scaled with the square of the unweighted number of samples, which is too aggressive for a large number of samples. Unfortunately, using just "Utils.EPSILON * unweightedNumSamples" is not enough either, so I multiplied that by a constant factor tuned by the testing procedure below.
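    To illustrate the scaling difference, here is a small standalone sketch (not Spark code; `Epsilon` mirrors Spark's `Utils.EPSILON` as double-precision machine epsilon, an assumption for this example). It shows how the old quadratic tolerance dwarfs the new linear one as the sample count grows:

```scala
// Machine epsilon, standing in for Spark's Utils.EPSILON (assumption).
val Epsilon: Double = 2.220446049250313e-16

// Old tolerance: grows quadratically with the unweighted sample count.
def oldTolerance(unweightedNumSamples: Double): Double =
  Epsilon * unweightedNumSamples * unweightedNumSamples

// New tolerance: linear in the sample count, times a tuned constant factor.
def newTolerance(unweightedNumSamples: Double): Double =
  Epsilon * unweightedNumSamples * 100

// With ten million samples the quadratic tolerance is five orders of
// magnitude larger, so a genuine weighted-count shortfall could be
// swallowed and the zero-value entry never added.
val n = 1e7
println(f"old: ${oldTolerance(n)}%.3e, new: ${newTolerance(n)}%.3e")
```

    Note that the two formulas coincide exactly when the sample count is 100; above that, the old tolerance is strictly larger.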
    
    ## How was this patch tested?
    
    This involved manually running the sample weight tests for the decision tree regressor to see whether the tolerance was large enough to exclude zero feature values.
    
    Eg in SBT:
    ```
    ./build/sbt
    > project mllib
    > testOnly *DecisionTreeRegressorSuite -- -z "training with sample weights"
    ```
    
    For validation, I added a print statement inside the if branch in the code below and validated that the tolerance was large enough that we would not include zero features (which do not exist in that test):
    ```
          val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
            print("should not print this")
            partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
          } else {
            partValueCountMap
          }
    ```
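    To make the branch concrete, here is a self-contained sketch of the same conditional with hypothetical numbers (the map contents and totals below are made up for illustration, not taken from the Spark test suite):

```scala
// Hypothetical partial statistics: feature value -> weighted count.
val partValueCountMap = Map(1.0 -> 3.0, 2.0 -> 2.0)
val weightedNumSamples = 10.0 // expected total weighted count
val partNumSamples = 5.0      // weighted count actually observed
val tolerance = 1e-7          // threshold for treating the gap as real

// If the observed weighted count falls short of the expected total by more
// than the tolerance, the shortfall is attributed to unobserved zero-valued
// features and recorded under the 0.0 key.
val valueCountMap =
  if (weightedNumSamples - partNumSamples > tolerance)
    partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
  else partValueCountMap

println(valueCountMap) // here the 0.0 key carries the missing weight of 5.0
```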
    
    Closes #23682 from imatiach-msft/ilmat/sample-weights-tol.
    
    Authored-by: Ilya Matiach <il...@microsoft.com>
    Signed-off-by: Sean Owen <sean.o...@databricks.com>
---
 .../src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala  | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
index fb4c321..b041dd4 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
@@ -1050,8 +1050,11 @@ private[spark] object RandomForest extends Logging with Serializable {
       // Calculate the expected number of samples for finding splits
       val weightedNumSamples = samplesFractionForFindSplits(metadata) *
         metadata.weightedNumExamples
+      // scale tolerance by number of samples with constant factor
+      // Note: constant factor was tuned by running some tests where there were no zero
+      // feature values and validating we are never within tolerance
+      val tolerance = Utils.EPSILON * unweightedNumSamples * 100
       // add expected zero value count and get complete statistics
-      val tolerance = Utils.EPSILON * unweightedNumSamples * unweightedNumSamples
       val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
         partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
       } else {


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
