[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

jkbradley Fri, 10 Nov 2017 14:36:57 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150309552
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.{CategoricalSplit, Split}
    +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Utility methods for choosing splits during local & distributed tree training. */
    +private[impl] object SplitUtils {
    +
    +  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
    +  private def sortByCentroid(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int): List[Int] = {
    +    /* Each bin is one category (feature value).
    +     * The bins are ordered based on centroidForCategories, and this ordering determines which
    +     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
    +     *
    +     * centroidForCategories is a list: (category, centroid)
    +     */
    +    val numCategories = binAggregates.metadata.numBins(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +
    +    val centroidForCategories = Range(0, numCategories).map { featureValue =>
    +      val categoryStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
    +      (featureValue, centroid)
    +    }
    +    // TODO(smurching): How to handle logging statements like these?
    --- End diff --
    
    What's the issue?  You should be able to call logDebug if this object inherits from org.apache.spark.internal.Logging
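
To make the suggestion concrete, here is a minimal sketch of an object that mixes in a logging trait and calls logDebug around the centroid sort. It is an illustration under stated assumptions, not the PR's code: the `Logging` trait below is a local stand-in so the snippet compiles without Spark on the classpath (it only mirrors the by-name `msg` parameter of the real `logDebug`), and `isDebugEnabled`, `SplitUtilsSketch`, the simplified `sortByCentroid(centroids)` signature, and the log messages are all hypothetical.

```scala
// Stand-in for org.apache.spark.internal.Logging (assumption: defined locally
// so the sketch is self-contained; the real trait logs through SLF4J).
trait Logging {
  // Hypothetical switch; debug output is off by default.
  protected def isDebugEnabled: Boolean = false

  // By-name parameter: the message string is only built when debug is enabled.
  protected def logDebug(msg: => String): Unit =
    if (isDebugEnabled) println(s"DEBUG $msg")
}

// Hypothetical, simplified counterpart of SplitUtils: an object can mix in
// the trait directly and then call logDebug from its methods.
object SplitUtilsSketch extends Logging {

  /** Returns category indices ordered by ascending centroid. */
  def sortByCentroid(centroids: Seq[Double]): List[Int] = {
    // Pair each category index with its centroid: (category, centroid).
    val centroidForCategories =
      centroids.zipWithIndex.map { case (centroid, category) => (category, centroid) }
    logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))

    val sorted = centroidForCategories.toList.sortBy(_._2)
    logDebug("Sorted centroids for categorical variable: " + sorted.mkString(","))
    sorted.map(_._1)
  }
}
```

With the real trait, the change would presumably amount to importing org.apache.spark.internal.Logging and adding `extends Logging` to the object declaration; the stand-in is only there to keep the sketch runnable on its own.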


