[GitHub] spark issue #20389: [SPARK-23205][ML] Update ImageSchema.readImages to corre...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/20389 Thanks for the reviews @srowen, @dongjoon-hyun! Would it make sense to merge this before Spark 2.3 is released, and if so, would one of you be able to do so?
[GitHub] spark pull request #20389: [SPARK-23205][ML] Update ImageSchema.readImages t...
GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/20389 [SPARK-23205][ML] Update ImageSchema.readImages to correctly set alpha values for four-channel images ## What changes were proposed in this pull request? When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color](https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)) constructor that sets alpha = 255, even for four-channel images (which may have different alpha values). This PR fixes this issue & adds a unit test to verify correctness of reading four-channel images. ## How was this patch tested? Updates an existing unit test ("readImages pixel values test" in `ImageSchemaSuite`) to also verify correctness when reading a four-channel image. You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurching/spark image-schema-bugfix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20389.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20389 commit 054c1dd457e5c95872a188f7978f678d7c7093d5 Author: Sid Murching <sid.murching@...> Date: 2018-01-24T22:34:08Z Bugfix + test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
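For readers less familiar with the `java.awt.Color` API at the center of this fix, the short sketch below (standalone Scala with illustrative names, not the actual `ImageSchema.decode()` code) shows why the single-argument constructor drops alpha and how the two-argument `(rgba, hasalpha)` constructor preserves it:

```scala
import java.awt.Color

// Hypothetical helper, not ImageSchema code: unpack one packed ARGB pixel, optionally keeping alpha.
def decodePixel(argb: Int, hasAlpha: Boolean): (Int, Int, Int, Int) = {
  // With hasAlpha = false, Color ignores bits 24-31 and reports alpha = 255 (fully opaque).
  val c = new Color(argb, hasAlpha)
  (c.getAlpha, c.getRed, c.getGreen, c.getBlue)
}

// A half-transparent red pixel (alpha = 0x80, red = 0xFF):
println(decodePixel(0x80FF0000, hasAlpha = false)) // (255, 255, 0, 0) -- alpha silently lost
println(decodePixel(0x80FF0000, hasAlpha = true))  // (128, 255, 0, 0) -- alpha preserved
```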
[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19758#discussion_r154284858 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeSplitUtilsSuite.scala --- @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.tree.{CategoricalSplit, ContinuousSplit, Split} +import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.mllib.tree.impurity.{Entropy, Impurity} +import org.apache.spark.mllib.tree.model.ImpurityStats +import org.apache.spark.mllib.util.MLlibTestSparkContext + +/** Suite exercising helper methods for making split decisions during decision tree training. */ +class TreeSplitUtilsSuite + extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + /** + * Get a DTStatsAggregator for sufficient stat collection/impurity calculation populated + * with the data from the specified training points. Assumes a feature index of 0 and that + * all training points have the same weights (1.0). + */ + private def getAggregator( + metadata: DecisionTreeMetadata, + values: Array[Int], + labels: Array[Double], + featureSplits: Array[Split]): DTStatsAggregator = { +// Create stats aggregator +val statsAggregator = new DTStatsAggregator(metadata, featureSubset = None) +// Update parent impurity stats +val featureIndex = 0 +val instanceWeights = Array.fill[Double](values.length)(1.0) +AggUpdateUtils.updateParentImpurity(statsAggregator, indices = values.indices.toArray, + from = 0, to = values.length, instanceWeights, labels) +// Update current aggregator's impurity stats +values.zip(labels).foreach { case (value: Int, label: Double) => + if (metadata.isUnordered(featureIndex)) { +AggUpdateUtils.updateUnorderedFeature(statsAggregator, value, label, + featureIndex = featureIndex, featureIndexIdx = 0, featureSplits, instanceWeight = 1.0) + } else { +AggUpdateUtils.updateOrderedFeature(statsAggregator, value, label, featureIndexIdx = 0, + instanceWeight = 1.0) + } +} +statsAggregator + } + + /** + * Check that left/right impurities match what we'd expect for a split. 
+ * @param labels Labels whose impurity information should be reflected in stats + * @param stats ImpurityStats object containing impurity info for the left/right sides of a split + */ + private def validateImpurityStats( + impurity: Impurity, + labels: Array[Double], + stats: ImpurityStats, + expectedLeftStats: Array[Double], + expectedRightStats: Array[Double]): Unit = { +// Compute impurity for our data points manually +val numClasses = (labels.max + 1).toInt +val fullImpurityStatsArray + = Array.tabulate[Double](numClasses)((label: Int) => labels.count(_ == label).toDouble) +val fullImpurity = Entropy.calculate(fullImpurityStatsArray, labels.length) +// Verify that impurity stats were computed correctly for split +assert(stats.impurityCalculator.stats === fullImpurityStatsArray) +assert(stats.impurity === fullImpurity) +assert(stats.leftImpurityCalculator.stats === expectedLeftStats) +assert(stats.rightImpurityCalculator.stats === expectedRightStats) +assert(stats.valid) + } + + /* * * * * * * * * * * Choosing Splits * * * * * * * * * * */ + + test("chooseSplit: choose correct type of split (continuous split)") { +// Construct (binned) continuous data +val labels = Array(0.0, 0.0, 1.0) +val values = Array(1, 2, 3) +val featureIndex = 0 +// Get an array of continuous splits corresponding to values in
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19753 This LGTM, @jkbradley would you be able to give this a look?
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151315757 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- Thanks for the explanation, that makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151298582 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- Another Q: I see there's a pattern of `setParams` using `None` as a default value for all/most of its arguments in other featurizers, perhaps we should do the same (i.e. have a default argument of `handleInvalid=None` here)? IMO specifying the default parameter value in one place is preferable to duplicating it.
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151298765 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- The same goes for the constructor (IMO we should default to `handleInvalid=None` there too), but open to hearing your thoughts. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...
GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/19758 [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor RandomForest.scala into utility classes ## What changes were proposed in this pull request? Breaks up #19433 to help unblock #19666; after this PR is merged, #19666 can be merged. This PR contains the changes made to migrate functionality from RandomForest.scala into the following utility classes: * AggUpdateUtils * ImpurityUtils * SplitUtils The PR also adds tests for split selection logic in TreeSplitUtilsSuite. A follow-up PR will include the other changes from #19433: * Local decision tree data structures & tests * Local tree training logic & tests ## How was this patch tested? Adds unit tests for split selection logic in TreeSplitUtilsSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurching/spark refactor-random-forest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19758.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19758 commit f2e3fbd40eea2919d249710eae5b5789d97543b7 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-11-15T17:52:01Z Local tree training part 1 (refactor RandomForest.scala into utility classes) commit a2357c95672e94a148051d00e26b89245eb8e204 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-11-15T17:57:55Z WIP adding TreeSplitUtilsSuite commit 320c32ee8d0ac9bde457b0286d064470648c73af Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-11-15T19:37:56Z WIP commit b93f9f3da9cca0887c0264162f5b032f14fa87d7 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-11-15T19:57:25Z Add TreeSplitUtilsSuite, refactor it to not depend on any local tree training code
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19753 Looking at this now, thanks @WeichenXu123!
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r151020618 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala --- @@ -19,14 +19,16 @@ package org.apache.spark.ml.regression import org.apache.spark.SparkFunSuite import org.apache.spark.ml.feature.LabeledPoint +import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.tree.impl.TreeTests import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} +import org.apache.spark.ml.util.TestingUtils._ --- End diff -- Nit: unused import --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r151020666 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GBTRegressorSuite.scala --- @@ -19,15 +19,16 @@ package org.apache.spark.ml.regression import org.apache.spark.SparkFunSuite import org.apache.spark.ml.feature.LabeledPoint -import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.tree.impl.TreeTests import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} +import org.apache.spark.ml.util.TestingUtils._ --- End diff -- Nit: unused import --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151019591 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging { } /** + * Find the best split for a node. + * + * @param binAggregates Bin statistics. + * @return tuple for best split: (Split, information gain, prediction at node) + */ + private[tree] def binsToBestSplit( + binAggregates: DTStatsAggregator, + splits: Array[Array[Split]], + featuresForNode: Option[Array[Int]], + node: LearningNode): (Split, ImpurityStats) = { +val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode) +// For each (feature, split), calculate the gain, and select the best (feature, split). +val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator) --- End diff -- I believe so, the nodes at the top level are created ([RandomForest.scala:178](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L178)) with [`LearningNode.emptyNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L341), which sets `node.stats = null`. I could change this to check node depth (via node index), but if we're planning on deprecating node indices in the future it might be best not to. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151017375 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala --- @@ -0,0 +1,215 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.ml.tree.{CategoricalSplit, Split} +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator +import org.apache.spark.mllib.tree.model.ImpurityStats + +/** Utility methods for choosing splits during local & distributed tree training. */ +private[impl] object SplitUtils { + + /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */ + private def sortByCentroid( + binAggregates: DTStatsAggregator, + featureIndex: Int, + featureIndexIdx: Int): List[Int] = { +/* Each bin is one category (feature value). + * The bins are ordered based on centroidForCategories, and this ordering determines which + * splits are considered. (With K categories, we consider K - 1 possible splits.) + * + * centroidForCategories is a list: (category, centroid) + */ +val numCategories = binAggregates.metadata.numBins(featureIndex) +val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx) + +val centroidForCategories = Range(0, numCategories).map { featureValue => + val categoryStats = +binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue) + val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats) + (featureValue, centroid) +} +// TODO(smurching): How to handle logging statements like these? +// logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(",")) +// bins sorted by centroids +val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1) +// logDebug("Sorted centroids for categorical variable = " + +// categoriesSortedByCentroid.mkString(",")) +categoriesSortedByCentroid + } + + /** + * Find the best split for an unordered categorical feature at a single node. + * + * Algorithm: + * - Considers all possible subsets (exponentially many) + * + * @param featureIndex Global index of feature being split. + * @param featureIndexIdx Index of feature being split within subset of features for current node. + * @param featureSplits Array of splits for the current feature + * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node + * @return (best split, statistics for split) If no valid split was found, the returned + * ImpurityStats instance will be invalid (have member valid = false). 
+ */ + private[impl] def chooseUnorderedCategoricalSplit( + binAggregates: DTStatsAggregator, + featureIndex: Int, + featureIndexIdx: Int, + featureSplits: Array[Split], + parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = { +// Unordered categorical feature +val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx) +val numSplits = binAggregates.metadata.numSplits(featureIndex) +var parentCalc = parentCalculator +val (bestFeatureSplitIndex, bestFeatureGainStats) = + Range(0, numSplits).map { splitIndex => +val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex) +val rightChildStats = binAggregates.getParentImpurityCalculator() + .subtract(leftChildStats) +val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc, + leftChildStats, rightChildStats, binAggregates.metadata) +// Comput
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151011913 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging { } /** - * Calculate the impurity statistics for a given (feature, split) based upon left/right - * aggregates. - * - * @param stats the recycle impurity statistics for this feature's all splits, - * only 'impurity' and 'impurityCalculator' are valid between each iteration - * @param leftImpurityCalculator left node aggregates for this (feature, split) - * @param rightImpurityCalculator right node aggregate for this (feature, split) - * @param metadata learning and dataset metadata for DecisionTree - * @return Impurity statistics for this (feature, split) + * Return a list of pairs (featureIndexIdx, featureIndex) where featureIndex is the global + * (across all trees) index of a feature and featureIndexIdx is the index of a feature within the + * list of features for a given node. Filters out constant features (features with 0 splits) */ - private def calculateImpurityStats( - stats: ImpurityStats, - leftImpurityCalculator: ImpurityCalculator, - rightImpurityCalculator: ImpurityCalculator, - metadata: DecisionTreeMetadata): ImpurityStats = { - -val parentImpurityCalculator: ImpurityCalculator = if (stats == null) { - leftImpurityCalculator.copy.add(rightImpurityCalculator) -} else { - stats.impurityCalculator -} - -val impurity: Double = if (stats == null) { - parentImpurityCalculator.calculate() -} else { - stats.impurity -} - -val leftCount = leftImpurityCalculator.count -val rightCount = rightImpurityCalculator.count - -val totalCount = leftCount + rightCount - -// If left child or right child doesn't satisfy minimum instances per node, -// then this split is invalid, return invalid information gain stats. -if ((leftCount < metadata.minInstancesPerNode) || - (rightCount < metadata.minInstancesPerNode)) { - return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator) -} - -val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0 -val rightImpurity = rightImpurityCalculator.calculate() - -val leftWeight = leftCount / totalCount.toDouble -val rightWeight = rightCount / totalCount.toDouble - -val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity - -// if information gain doesn't satisfy minimum information gain, -// then this split is invalid, return invalid information gain stats. -if (gain < metadata.minInfoGain) { - return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator) + private[impl] def getNonConstantFeatures( + metadata: DecisionTreeMetadata, + featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = { +Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx => --- End diff -- At some point when refactoring I was hitting errors caused by a stateful operation within a `map` over the output of this method (IIRC the result of the `map` was accessed repeatedly, causing the stateful operation to inadvertently be run multiple times). However using `withFilter` and `view` now seems to work, I'll change it back :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
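For context on the re-evaluation issue mentioned above, here is a minimal, self-contained Scala sketch (plain Scala, unrelated to the Spark code itself) of how a side-effecting function mapped over a lazy `view` runs again on every traversal of the result:

```scala
// A stateful operation inside a map over a lazy view:
var evaluations = 0
val mapped = (0 until 3).view.map { i => evaluations += 1; i * 2 }

mapped.foreach(_ => ()) // first traversal evaluates the mapping function 3 times
mapped.foreach(_ => ()) // a second traversal evaluates it 3 more times
println(evaluations)    // 6 -- the side effect ran once per traversal, not once per element

// Forcing the view into a strict collection up front (e.g. mapped.toIndexedSeq)
// evaluates the mapping function exactly once per element.
```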
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151011879 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala --- @@ -112,7 +113,7 @@ private[spark] object ImpurityStats { * minimum number of instances per node. */ def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): ImpurityStats = { -new ImpurityStats(Double.MinValue, impurityCalculator.calculate(), +new ImpurityStats(Double.MinValue, impurity = -1, --- End diff -- I changed this to be -1 here since node impurity would eventually get set to -1 anyways when `LearningNodes` with invalid `ImpurityStats` were converted into decision tree leaf nodes (see [`LearningNode.toNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L279)) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19666 Discussed with @jkbradley, I'll split up #19433 so that the parts of it that'd potentially conflict with this PR (refactoring RandomForest.scala into utility classes) can be merged first.
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149238531 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging { categories } + private[tree] def traverseUnorderedSplits[T]( --- End diff -- Could you please add a docstring for this method, since it's a bit complicated? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149238185 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging { (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) } else if (binAggregates.metadata.isUnordered(featureIndex)) { // Unordered categorical feature - val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx) - val (bestFeatureSplitIndex, bestFeatureGainStats) = -Range(0, numSplits).map { splitIndex => - val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex) - val rightChildStats = binAggregates.getParentImpurityCalculator() -.subtract(leftChildStats) + val numBins = binAggregates.metadata.numBins(featureIndex) + val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx) + + val binStatsArray = Array.tabulate(numBins) { binIndex => +binAggregates.getImpurityCalculator(featureOffset, binIndex) + } + val parentStats = binAggregates.getParentImpurityCalculator() + + var bestGain = Double.NegativeInfinity + var bestSet: BitSet = null + var bestLeftChildStats: ImpurityCalculator = null + var bestRightChildStats: ImpurityCalculator = null + + traverseUnorderedSplits[ImpurityCalculator](numBins, null, --- End diff -- Could you please add a comment explaining what this does? E.g.: `// Computes the best split for the current feature, storing the result across the vars above` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149237212 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging { (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) } else if (binAggregates.metadata.isUnordered(featureIndex)) { // Unordered categorical feature - val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx) - val (bestFeatureSplitIndex, bestFeatureGainStats) = -Range(0, numSplits).map { splitIndex => - val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex) - val rightChildStats = binAggregates.getParentImpurityCalculator() -.subtract(leftChildStats) + val numBins = binAggregates.metadata.numBins(featureIndex) + val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx) + + val binStatsArray = Array.tabulate(numBins) { binIndex => --- End diff -- Could you please add a comment explaining what this is? E.g.: `// Each element of binStatsArray stores pre-computed label statistics for a single bin of the current feature`
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149242575 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging { categories } + private[tree] def traverseUnorderedSplits[T]( --- End diff -- Also, does `traverseUnorderedSplits` need to take a type parameter / two different closures as method arguments? AFAICT the use of a type parameter/closures here allows us to unit test this functionality on a simple example, but I wonder if we could simplify this somehow.
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149238123 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging { (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) } else if (binAggregates.metadata.isUnordered(featureIndex)) { // Unordered categorical feature - val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx) - val (bestFeatureSplitIndex, bestFeatureGainStats) = -Range(0, numSplits).map { splitIndex => - val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex) - val rightChildStats = binAggregates.getParentImpurityCalculator() -.subtract(leftChildStats) + val numBins = binAggregates.metadata.numBins(featureIndex) + val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx) + + val binStatsArray = Array.tabulate(numBins) { binIndex => +binAggregates.getImpurityCalculator(featureOffset, binIndex) + } + val parentStats = binAggregates.getParentImpurityCalculator() + + var bestGain = Double.NegativeInfinity + var bestSet: BitSet = null + var bestLeftChildStats: ImpurityCalculator = null + var bestRightChildStats: ImpurityCalculator = null + + traverseUnorderedSplits[ImpurityCalculator](numBins, null, +(stats, binIndex) => { + val binStats = binStatsArray(binIndex) + if (stats == null) { +binStats + } else { +stats.copy.add(binStats) + } +}, +(set, leftChildStats) => { + val rightChildStats = parentStats.copy.subtract(leftChildStats) gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats, leftChildStats, rightChildStats, binAggregates.metadata) - (splitIndex, gainAndImpurityStats) -}.maxBy(_._2.gain) - (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) + if (gainAndImpurityStats.gain > bestGain) { +bestGain = gainAndImpurityStats.gain +bestSet = set | new BitSet(numBins) // copy set --- End diff -- Why not use `set.copy()`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149241680 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging { categories } + private[tree] def traverseUnorderedSplits[T]( + arity: Int, + zeroStats: T, + seqOp: (T, Int) => T, + finalizer: (BitSet, T) => Unit): Unit = { +assert(arity > 1) + +// numSplits = (1 << arity - 1) - 1 +val numSplits = DecisionTreeMetadata.numUnorderedSplits(arity) +val subSet: BitSet = new BitSet(arity) + +// dfs traverse +// binIndex: [0, arity) +def dfs(binIndex: Int, combNumber: Int, stats: T): Unit = { + if (binIndex == arity) { +// recursion exit when binIndex == arity +if (combNumber > 0) { + // we get an available unordered split, saved in subSet. + finalizer(subSet, stats) +} + } else { +subSet.set(binIndex) +val leftChildCombNumber = combNumber + (1 << binIndex) +// pruning: only need combNumber satisfy: 1 <= combNumber <= numSplits --- End diff -- If I understand correctly, the check `if (leftChildCombNumber <= numSplits)` helps us ensure that we consider each split only once, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
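To spell out that counting argument for other readers, here is a rough standalone sketch (plain Scala, not the PR's recursive DFS) of why restricting the bitmask value to [1, numSplits], with numSplits = 2^(arity-1) - 1, enumerates each unordered split exactly once:

```scala
// Every unordered split of `arity` categories is a {left subset, complement} pair, and exactly
// one side of each pair excludes the highest category, i.e. has a bitmask value in
// [1, 2^(arity-1) - 1]. Enumerating only those masks therefore visits each split once.
def enumerateUnorderedSplits(arity: Int): Seq[Set[Int]] = {
  val numSplits = (1 << (arity - 1)) - 1
  (1 to numSplits).map { combNumber =>
    (0 until arity).filter(bin => (combNumber & (1 << bin)) != 0).toSet
  }
}

// arity = 3 yields 3 splits: Set(0), Set(1), Set(0, 1). Their complements (Set(1, 2),
// Set(0, 2), Set(2)) describe the same splits from the right child's side, so they are skipped.
println(enumerateUnorderedSplits(3))
```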
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 jenkins retest this please
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148709163 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala --- @@ -202,6 +202,15 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau dataset.as[LabeledPoint], estimator, modelEquals, 42L) } + test("prediction on single instance") { +val trainer = new LinearSVC() +val model = trainer.fit(smallBinaryDataset) +model.transform(smallBinaryDataset).select("features", "prediction").collect().foreach { + case Row(features: Vector, prediction: Double) => +assert(prediction ~== model.predict(features) relTol 1E-5) --- End diff -- Could you please check for exact equality here (& in other tests)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148708939 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala --- @@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa Vector, NaiveBayesModel](model, testDataset) } + test("prediction on single instance") { +val nPoints = 1000 +val piArray = Array(0.5, 0.1, 0.4).map(math.log) +val thetaArray = Array( + Array(0.70, 0.10, 0.10, 0.10), // label 0 + Array(0.10, 0.70, 0.10, 0.10), // label 1 + Array(0.10, 0.10, 0.70, 0.10) // label 2 +).map(_.map(math.log)) +val pi = Vectors.dense(piArray) +val theta = new DenseMatrix(3, 4, thetaArray.flatten, true) + +val testDataset = + generateNaiveBayesInput(piArray, thetaArray, nPoints, seed, "multinomial").toDF() +val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial") +val model = nb.fit(testDataset) + +validateModelFit(pi, theta, model) --- End diff -- Do we need lines 184-186? They seem unrelated to what we want to test (that `predict` produces the same result as `transform` on a single instance). Similarly, I don't think we need to create `piArray`, `thetaArray`, `pi`, `theta`, etc; this test should just fit a model on a dataset and compare the fitted model's `predict` and `transform` outputs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148709125 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala --- @@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa Vector, NaiveBayesModel](model, testDataset) } + test("prediction on single instance") { +val nPoints = 1000 +val piArray = Array(0.5, 0.1, 0.4).map(math.log) +val thetaArray = Array( + Array(0.70, 0.10, 0.10, 0.10), // label 0 + Array(0.10, 0.70, 0.10, 0.10), // label 1 + Array(0.10, 0.10, 0.70, 0.10) // label 2 +).map(_.map(math.log)) +val pi = Vectors.dense(piArray) +val theta = new DenseMatrix(3, 4, thetaArray.flatten, true) + +val testDataset = --- End diff -- Suggestion: rename to `trainDataset`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 jenkins retest this please
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 Made a few updates; here's a quick summary and what I'd propose moving forward.

Right now:
* Shared row indices for all (categorical & continuous) features are stored & updated in `TrainingInfo`
* `LocalDecisionTree.computeBestSplits` computes best splits/sufficient stats for a single feature at a time
* A utility method (`LocalDecisionTreeUtils.updateArrayForSplit`) is used to sort both feature values and shared row indices

When we add support for raw continuous feature values:
* Add a subclass of `FeatureColumn` (e.g. `ContinuousFeatureColumn`) that stores and sorts its own array of row indices, and pass these row indices to the methods requiring them (see the sketch after this comment).

I also renamed `FeatureVector` to `FeatureColumn`, since the former seemed like it'd confuse developers (`FeatureVector` sounds like a single data point).
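Purely as an illustration of the proposal in the last bullet above (none of these definitions exist in the PR; the names just mirror the comment), the shape being described is roughly:

```scala
// Hypothetical sketch: categorical columns keep relying on the row indices shared via
// TrainingInfo, while a continuous column would carry (and sort) its own row indices.
sealed trait FeatureColumn {
  def featureIndex: Int
}

// Binned categorical values; the ordering of rows comes from the shared indices in TrainingInfo.
final case class CategoricalFeatureColumn(featureIndex: Int, values: Array[Int])
  extends FeatureColumn

// Raw continuous values sorted once up front, plus the per-column row indices that record
// which training row each sorted value came from.
final case class ContinuousFeatureColumn(
    featureIndex: Int,
    values: Array[Double],
    rowIndices: Array[Int])
  extends FeatureColumn
```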
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r147307553 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala --- @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.ml.tree._ +import org.apache.spark.mllib.tree.model.ImpurityStats + +/** Object exposing methods for local training of decision trees */ +private[ml] object LocalDecisionTree { + + /** + * Fully splits the passed-in node on the provided local dataset, returning + * an InternalNode/LeafNode corresponding to the root of the resulting tree. + * + * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset + * @param metadata learning and dataset metadata for DecisionTree + * @param splits splits(i) = array of splits for feature i + */ + private[ml] def fitNode( + input: Array[TreePoint], + instanceWeights: Array[Double], + node: LearningNode, + metadata: DecisionTreeMetadata, + splits: Array[Array[Split]]): Node = { + +// The case with 1 node (depth = 0) is handled separately. +// This allows all iterations in the depth > 0 case to use the same code. +// TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of +// other parameters). +if (metadata.maxDepth == 0) { + return node.toNode +} + +// Prepare column store. +// Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue. +val colStoreInit: Array[Array[Int]] += LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures)) +val labels = input.map(_.label) + +// Fit a regression model on the dataset, throwing an error if metadata indicates that +// we should train a classifier. +// TODO: Add support for training classifiers +if (metadata.numClasses > 1 && metadata.numClasses <= 32) { + throw new UnsupportedOperationException("Local training of a decision tree classifier is " + +"unsupported; currently, only regression is supported") +} else { + trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits) +} + } + + /** + * Locally fits a decision tree regressor. + * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference + * is impurity metric. Use the same logic for fitting a classifier. 
+ * + * @param rootNode Node to use as root of the tree fit on the passed-in dataset + * @param colStoreInit Array of columns of training data + * @param instanceWeights Array of weights for each training example + * @param metadata learning and dataset metadata for DecisionTree + * @param splits splits(i) = Array of possible splits for feature i + * @return LeafNode or InternalNode representation of rootNode + */ + private[ml] def trainRegressor( + rootNode: LearningNode, + colStoreInit: Array[Array[Int]], + instanceWeights: Array[Double], + labels: Array[Double], + metadata: DecisionTreeMetadata, + splits: Array[Array[Split]]): Node = { + +// Sort each column by decision tree node. +val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) => + val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0) + FeatureVector(featureIndex, featureArity, col) +} + +val numRows = colStore.headOption match { + case None => 0 + case Some(column) => column.values.length +} + +// Create a new PartitionInfo describing the status of our partially-trained subtree +
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r146986798 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -267,6 +268,24 @@ class DecisionTreeClassifierSuite Vector, DecisionTreeClassificationModel](newTree, newData) } + test("prediction on single instance") { +val rdd = continuousDataPointsForMulticlassRDD +val dt = new DecisionTreeClassifier() + .setImpurity("Gini") + .setMaxDepth(4) + .setMaxBins(100) +val categoricalFeatures = Map(0 -> 3) +val numClasses = 3 + +val newData: DataFrame = TreeTests.setMetadata(rdd, categoricalFeatures, numClasses) +val newTree = dt.fit(newData) + +newTree.transform(newData).select(dt.getFeaturesCol, dt.getPredictionCol).collect().foreach { + case Row(features: Vector, prediction: Double) => +assert(prediction ~== newTree.predict(features) relTol 1E-5) --- End diff -- Can we test exact equality (e.g. `prediction === newTree.predict(features)`) here and in other unit tests? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
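As a concrete sketch of the exact-equality pattern being requested (a hypothetical helper, not code from the PR; `model` and `df` are placeholders, and it assumes the single-instance `predict(Vector)` method this PR exposes is accessible), the relTol check could become something like:

```scala
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}

// For every row, the DataFrame-based transform() and the single-instance predict() evaluate the
// same fitted tree on the same features, so the resulting doubles should match exactly.
def checkSingleInstancePrediction(model: DecisionTreeClassificationModel, df: DataFrame): Unit = {
  model.transform(df).select(model.getFeaturesCol, model.getPredictionCol).collect().foreach {
    case Row(features: Vector, prediction: Double) =>
      assert(prediction == model.predict(features))
  }
}
```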
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19381 Looking at this now!
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r146981434 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala --- @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.ml.tree._ +import org.apache.spark.mllib.tree.model.ImpurityStats + +/** Object exposing methods for local training of decision trees */ +private[ml] object LocalDecisionTree { + + /** + * Fully splits the passed-in node on the provided local dataset, returning + * an InternalNode/LeafNode corresponding to the root of the resulting tree. + * + * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset + * @param metadata learning and dataset metadata for DecisionTree + * @param splits splits(i) = array of splits for feature i + */ + private[ml] def fitNode( + input: Array[TreePoint], + instanceWeights: Array[Double], + node: LearningNode, + metadata: DecisionTreeMetadata, + splits: Array[Array[Split]]): Node = { + +// The case with 1 node (depth = 0) is handled separately. +// This allows all iterations in the depth > 0 case to use the same code. +// TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of +// other parameters). +if (metadata.maxDepth == 0) { + return node.toNode +} + +// Prepare column store. +// Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue. +val colStoreInit: Array[Array[Int]] += LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures)) +val labels = input.map(_.label) + +// Fit a regression model on the dataset, throwing an error if metadata indicates that +// we should train a classifier. +// TODO: Add support for training classifiers +if (metadata.numClasses > 1 && metadata.numClasses <= 32) { + throw new UnsupportedOperationException("Local training of a decision tree classifier is " + +"unsupported; currently, only regression is supported") +} else { + trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits) +} + } + + /** + * Locally fits a decision tree regressor. + * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference + * is impurity metric. Use the same logic for fitting a classifier. 
+ * + * @param rootNode Node to use as root of the tree fit on the passed-in dataset + * @param colStoreInit Array of columns of training data + * @param instanceWeights Array of weights for each training example + * @param metadata learning and dataset metadata for DecisionTree + * @param splits splits(i) = Array of possible splits for feature i + * @return LeafNode or InternalNode representation of rootNode + */ + private[ml] def trainRegressor( + rootNode: LearningNode, + colStoreInit: Array[Array[Int]], + instanceWeights: Array[Double], + labels: Array[Double], + metadata: DecisionTreeMetadata, + splits: Array[Array[Split]]): Node = { + +// Sort each column by decision tree node. +val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) => + val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0) + FeatureVector(featureIndex, featureArity, col) +} + +val numRows = colStore.headOption match { + case None => 0 + case Some(column) => column.values.length +} + +// Create a new PartitionInfo describing the status of our partially-trained subtree +
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 @WeichenXu123 Thanks for the comments! I'll respond inline:

> In your doc, you said "Specifically, we only need to store sufficient stats for each bin of a single feature, as opposed to each bin of every feature", BUT, current implementation, you still allocate space for all features when computing: -- see DTStatsAggregator implementation, you pass featureSubset = None so DTStatsAggregator will allocate space for every features. According to your purpose, you should pass featureSubset = Some(Array(currentFeatureIndex)).

I like your proposed solution (pass `featureSubset = Some(Array(currentFeatureIndex))`). I'll go ahead & implement it.

> Current implementation still use binnedFeatures. You said in future it will be improved to sort feature values for continuous feature (for more precise tree training), if you want to consider every possible thresholds, you need hold rawFeatures instead of binnedFeatures in the columnar feature array, and in each split range offset, you need sort every continuous features. Is this the thing you want to do in the future ? This will increase calculation amount.

Yep, we'll have to pass raw (`Double`) continuous features to the local tree training methods, which will require them to accept an `Array[LabeledPoint]` instead of an `Array[TreePoint]` as input & increase memory usage (along with requiring us to store additional indices). We'll actually only have to run an `O(n log n)` sort on continuous feature values once (i.e. in the `FeatureVector` constructor), since once the continuous features are sorted we can update them as we would for categorical features when splitting nodes (in `O(n)` time) and they'll remain sorted.

> For current implementation(using binnedFeature) , there is no need to sort continuous features inside each split offset. So the indices for each feature is exactly the same. In order to save memory, I think these indices should be shared, no need to create separate indices array for each features. Even if you add the improvements for continuous features mentioned above, you can create separate indices array for only continuous features, the categorical features can still share the same indices array.

Agreed, I'll make this change.

> About locality advantage of columnar format, I have some doubts. Current implementation, you do not reorder the label and weight array, access label and weight value need use indices, when calculating DTStat, this break locality. (But I'm not sure how much impact to perf this will bring).

Yeah, I'm not sure if it'd be better to reorder the labels/weights arrays to achieve improved locality. I think we could experiment with both, but I'd prefer to save that for a follow-up PR unless you or another reviewer think it'll make a big perf difference.

> About the overhead of columnar format: when making reordering (when get new split, we need reorder left sub-tree samples into front), so you need reordering on each column, and at the same time, update the indices array. But, if we use row format, like: Array[(features, label, weight)], reordering will be much easier, and do not need indices. So, I am considering, whether we can use row format, but at the time when we need DTStatsAggregator computation, copy the data we need from the row format into columnar format array (only need to copy rows between sub-node offset and only copy the sampled features if using feature subsampling).

This is an interesting idea; my main concern is that on the first iteration of local tree training we'd need to copy the entire training data matrix from row -> columnar format, which negates any memory savings we get from not using indices. I'm also concerned about the overhead of repeatedly copying data from row -> columnar format.
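For concreteness, a hedged sketch of the change agreed on in the first exchange above (these are internal MLlib classes, shown here only to illustrate the intent; `metadata` and the current feature index are assumed to be in scope in the training loop):

```scala
// Before: one aggregator sized for every feature at the node.
// val agg = new DTStatsAggregator(metadata, featureSubset = None)

// Proposed: a per-feature aggregator, so sufficient-stats memory scales with the bins of the
// single feature currently being processed rather than with all features.
val currentFeatureIndex: Int = 0 // illustrative placeholder
val agg = new DTStatsAggregator(metadata, featureSubset = Some(Array(currentFeatureIndex)))
```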
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r146731101 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala --- @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.ml.tree.{LearningNode, Split} +import org.apache.spark.util.collection.BitSet + +/** + * Maintains intermediate state of data (columns) and tree during local tree training. + * Primary local tree training data structure; contains all information required to describe + * the state of the algorithm at any point during learning.?? + * + * Nodes are indexed left-to-right along the periphery of the tree, with 0-based indices. + * The "periphery" is the set of leaf nodes (active and inactive). + * + * @param columns Array of columns. + * Each column is sorted first by nodes (left-to-right along the tree periphery); + * all columns share this first level of sorting. + * Within each node's group, each column is sorted based on feature value; + * this second level of sorting differs across columns. + * @param instanceWeights Array of weights for each training example + * @param nodeOffsets Offsets into the columns indicating the first level of sorting (by node). + * The rows corresponding to the node activeNodes(i) are in the range + * [nodeOffsets(i)(0), nodeOffsets(i)(1)) . + * @param activeNodes Nodes which are active (still being split). + * Inactive nodes are known to be leaves in the final tree. + */ +private[impl] case class TrainingInfo( +columns: Array[FeatureVector], +instanceWeights: Array[Double], --- End diff -- Good call, I'll move `instanceWeights` outside `TrainingInfo` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r146731083 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala --- @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.ml.tree._ +import org.apache.spark.mllib.tree.model.ImpurityStats + +/** Object exposing methods for local training of decision trees */ +private[ml] object LocalDecisionTree { + + /** + * Fully splits the passed-in node on the provided local dataset, returning + * an InternalNode/LeafNode corresponding to the root of the resulting tree. + * + * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset + * @param metadata learning and dataset metadata for DecisionTree + * @param splits splits(i) = array of splits for feature i + */ + private[ml] def fitNode( + input: Array[TreePoint], + instanceWeights: Array[Double], + node: LearningNode, + metadata: DecisionTreeMetadata, + splits: Array[Array[Split]]): Node = { + +// The case with 1 node (depth = 0) is handled separately. +// This allows all iterations in the depth > 0 case to use the same code. +// TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of +// other parameters). +if (metadata.maxDepth == 0) { + return node.toNode +} + +// Prepare column store. +// Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue. +val colStoreInit: Array[Array[Int]] += LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures)) +val labels = input.map(_.label) + +// Fit a regression model on the dataset, throwing an error if metadata indicates that +// we should train a classifier. +// TODO: Add support for training classifiers +if (metadata.numClasses > 1 && metadata.numClasses <= 32) { + throw new UnsupportedOperationException("Local training of a decision tree classifier is " + +"unsupported; currently, only regression is supported") +} else { + trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits) +} + } + + /** + * Locally fits a decision tree regressor. + * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference + * is impurity metric. Use the same logic for fitting a classifier. 
+ * + * @param rootNode Node to use as root of the tree fit on the passed-in dataset + * @param colStoreInit Array of columns of training data + * @param instanceWeights Array of weights for each training example + * @param metadata learning and dataset metadata for DecisionTree + * @param splits splits(i) = Array of possible splits for feature i + * @return LeafNode or InternalNode representation of rootNode + */ + private[ml] def trainRegressor( + rootNode: LearningNode, + colStoreInit: Array[Array[Int]], + instanceWeights: Array[Double], + labels: Array[Double], + metadata: DecisionTreeMetadata, + splits: Array[Array[Split]]): Node = { + +// Sort each column by decision tree node. +val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) => + val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0) + FeatureVector(featureIndex, featureArity, col) +} + +val numRows = colStore.headOption match { + case None => 0 + case Some(column) => column.values.length +} + +// Create a new PartitionInfo describing the status of our partially-trained subtree +
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r146731070 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.ml.tree.Split + +/** + * Helpers for updating DTStatsAggregators during collection of sufficient stats for tree training. + */ +private[impl] object AggUpdateUtils { + + /** + * Updates the parent node stats of the passed-in impurity aggregator with the labels + * corresponding to the feature values at indices [from, to). + */ + private[impl] def updateParentImpurity( + statsAggregator: DTStatsAggregator, + col: FeatureVector, + from: Int, + to: Int, + instanceWeights: Array[Double], + labels: Array[Double]): Unit = { +from.until(to).foreach { idx => + val rowIndex = col.indices(idx) + val label = labels(rowIndex) + statsAggregator.updateParent(label, instanceWeights(rowIndex)) +} + } + + /** + * Update aggregator for an (unordered feature, label) pair + * @param splits Array of arrays of splits for each feature; splits(i) = splits for feature i. + */ + private[impl] def updateUnorderedFeature( + agg: DTStatsAggregator, + featureValue: Int, + label: Double, + featureIndex: Int, + featureIndexIdx: Int, + splits: Array[Array[Split]], --- End diff -- Good call, I'll make this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r146731072 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/FeatureVector.scala --- @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.util.collection.BitSet + +/** + * Stores values for a single training data column (a single continuous or categorical feature). + * + * Values are currently stored in a dense representation only. + * TODO: Support sparse storage (to optimize deeper levels of the tree), and maybe compressed + * storage (to optimize upper levels of the tree). + * + * TODO: Sort feature values to support more complicated splitting logic (e.g. considering every + * possible continuous split instead of discretizing continuous features). + * + * NOTE: We could add sorting of feature values in this PR; the only changed required would be to + * sort feature values at construction-time. Sorting might improve locality during stats + * aggregation (we'd frequently update the same O(statsSize) array for a (feature, bin), + * instead of frequently updating for the same feature). + * + * @param featureArity For categorical features, this gives the number of categories. + * For continuous features, this should be set to 0. + * @param rowIndices Optional: rowIndices(i) is the row index of the ith feature value (values(i)) + * If unspecified, feature values are assumed to be ordered by row (i.e. values(i) + * is a feature value from the ith row). + */ +private[impl] class FeatureVector( +val featureIndex: Int, +val featureArity: Int, +val values: Array[Int], +private val rowIndices: Option[Array[Int]]) + extends Serializable { + // Associates feature values with training point rows. 
indices(i) = training point index + // (row index) of ith feature value + val indices = rowIndices.getOrElse(values.indices.toArray) + + def isCategorical: Boolean = featureArity > 0 + + /** For debugging */ + override def toString: String = { +" FeatureVector(" + + s"featureIndex: $featureIndex,\n" + + s"featureType: ${if (featureArity == 0) "Continuous" else "Categorical"},\n" + + s"featureArity: $featureArity,\n" + + s"values: ${values.mkString(", ")},\n" + + s"indices: ${indices.mkString(", ")},\n" + + " )" + } + + def deepCopy(): FeatureVector = +new FeatureVector(featureIndex, featureArity, values.clone(), Some(indices.clone())) + + override def equals(other: Any): Boolean = { +other match { + case o: FeatureVector => +featureIndex == o.featureIndex && featureArity == o.featureArity && + values.sameElements(o.values) && indices.sameElements(o.indices) + case _ => false +} + } + + /** + * Reorders the subset of feature values at indices [from, to) in the passed-in column + * according to the split information encoded in instanceBitVector (feature values for rows + * that split left appear before feature values for rows that split right). + * + * @param numLeftRows Number of rows on the left side of the split + * @param tempVals Destination buffer for reordered feature values + * @param tempIndices Destination buffer for row indices corresponding to reordered feature values + * @param instanceBitVector instanceBitVector(i) = true if the row for the ith feature + * value splits right, false otherwise + */ + private[ml] def updateForSplit( + from: Int
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 Sorry, realized I conflated feature subsampling and `subsampleWeights` (instance weights for training examples). IMO feature subsampling can be added in a follow-up PR, but `subsampleWeights` should go in this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 Thanks for the comments!
- Yep, feature subsampling is necessary for using local tree training in distributed training. I was thinking of adding subsampling in a follow-up PR. You're right that we don't need to pass an array of `BaggedPoints` to local tree training; we should just pass an array of `subsampleWeights` (weights for the current tree) and an array of `TreePoints`. I'll push an update for this.
- Agreed that the logic for classification will be the same but with a different impurity metric. I can add support for classification & associated tests in a follow-up PR.
- IMO the primary advantage of the columnar storage format is that it'll eventually enable improvements to best-split calculations; specifically, for continuous features we could sort the unbinned feature values and consider every possible threshold. There are also the locality & memory advantages described in the design doc. In brief, `DTStatsAggregator` stores a flat array partitioned by (feature x bin). If we can iterate through all values for a single feature at once, most updates to `DTStatsAggregator` will occur within the same subarray.
- Multithreading could be a nice way to increase parallelism since we don't use Spark during local tree training. I think we could add it in a follow-up PR.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
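To illustrate the locality point above, here is a highly simplified sketch of a flat (feature x bin) stats layout (illustrative only; the real `DTStatsAggregator` differs in details, and the fixed stats size of 3 is just the variance-style example):

```scala
// Illustrative sketch only. Sufficient stats live in one flat array partitioned by
// (feature, bin); here each bin holds 3 values (count, weighted sum, weighted sum of
// squares), as for variance impurity.
class FlatStatsSketch(numBins: Array[Int]) {
  private val statsSize = 3
  // featureOffsets(f) = start of feature f's slice; feature f occupies
  // [featureOffsets(f), featureOffsets(f + 1)) in allStats.
  private val featureOffsets: Array[Int] =
    numBins.scanLeft(0)((offset, bins) => offset + bins * statsSize)
  private val allStats = new Array[Double](featureOffsets.last)

  def update(featureIndex: Int, bin: Int, label: Double, weight: Double): Unit = {
    val start = featureOffsets(featureIndex) + bin * statsSize
    allStats(start) += weight
    allStats(start + 1) += weight * label
    allStats(start + 2) += weight * label * label
  }
}
```

With column-major iteration, consecutive `update` calls share a `featureIndex`, so every write lands in the same contiguous slice of `allStats`; with row-major iteration the writes jump between feature slices on every row.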
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 The failing SparkR test (which compares `RandomForest` predictions to hardcoded values) fails not due to a correctness issue but (AFAICT) because of an implementation change in best-split selection. In this PR we recompute parent node impurity stats when considering each split for a feature, instead of computing parent impurity stats once per feature (see this by comparing `RandomForest.calculateImpurityStats` in Spark master and `ImpurityUtils.calculateImpurityStats` in this PR). Repeatedly recomputing the parent impurity stats produces slightly different values at each iteration due to Double precision limitations. This in turn can cause different splits to be selected: if two splits have mathematically equal gains, Double precision limitations can make one gain compare as slightly larger or smaller than the other, changing the tiebreak. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
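As a toy illustration (plain Scala, not Spark code) of how the order of floating-point operations alone can flip a tiebreak between mathematically equal quantities:

```scala
// Toy illustration of the tiebreak effect described above (not Spark code).
// The same three terms summed in different orders give slightly different Doubles.
object FloatingPointTieBreak {
  def main(args: Array[String]): Unit = {
    val a = 0.1 + 0.2 + 0.3   // 0.6000000000000001
    val b = 0.3 + 0.2 + 0.1   // 0.6
    println(a == b)           // false
    // If the gains of two candidate splits are derived via a and b respectively,
    // argmax-style selection prefers one of them even though they are equal
    // mathematically, which is enough to change the chosen split (and the
    // downstream predictions) without affecting correctness.
  }
}
```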
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 The failing tests (in `DecisionTreeSuite`) fail because we've historically handled a) splits that have 0 gain differently from b) splits that fail to achieve user-specified minimum gain (`metadata.minInfoGain`) or don't meet minimum instance-counts per node (`metadata.minInstancesPerNode`). Previously we'd create a leaf node with valid impurity stats in case a) and invalid impurity stats in case b). This PR creates a leaf node with invalid impurity stats in both cases. As a fix I'd suggest creating a `LeafNode` with correct impurity stats in case a), but with the `stats.valid` member set to `false` to indicate that the node should not be split. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 Thanks! I'll remove the WIP. To clear things up for the future, I'd thought [WIP] was the appropriate tag for a PR that's ready for review but not ready to be merged (based on https://spark.apache.org/contributing.html) -- have we stopped using the WIP tag? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r143398990 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTreeUtils.scala --- @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.internal.Logging + +/** + * Utility methods specific to local decision tree training. + */ +private[ml] object LocalDecisionTreeUtils extends Logging { + + /** + * Convert a dataset of binned feature values from row storage to column storage. + * Stores data as [[org.apache.spark.ml.linalg.DenseVector]]. + * + * + * @param rowStore An array of input data rows, each represented as an + * int array of binned feature values + * @return Transpose of rowStore as an array of columns consisting of binned feature values. + * + * TODO: Add implementation for sparse data. + * For sparse data, distribute more evenly based on number of non-zeros. + * (First collect stats to decide how to partition.) + */ + private[impl] def rowToColumnStoreDense(rowStore: Array[Array[Int]]): Array[Array[Int]] = { +// Compute the number of rows in the data +val numRows = { + val longNumRows: Long = rowStore.length + require(longNumRows < Int.MaxValue, s"rowToColumnStore given RDD with $longNumRows rows," + +s" but can handle at most ${Int.MaxValue} rows") + longNumRows.toInt +} + +// Check that the input dataset isn't empty (0 rows) or featureless (rows with 0 features) +require(numRows > 0, "Local decision tree training requires numRows > 0.") +val numFeatures = rowStore(0).length +require(numFeatures > 0, "Local decision tree training requires numFeatures > 0.") +// Return the transpose of the rowStore matrix +0.until(numFeatures).map { colIdx => --- End diff -- TODO: replace this with `rowStore.transpose`, which is more memory efficient (iterates over each row once, allowing for rows of the original matrix to be GC'd during the transpose operation). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
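For reference, a minimal standalone sketch of the dense row-to-column conversion being discussed (simplified relative to the real `rowToColumnStoreDense`; the built-in `transpose` mentioned in the TODO is shown alongside):

```scala
// Minimal standalone sketch of the dense row -> column conversion (simplified).
object RowToColumnSketch {
  def rowToColumnStoreDense(rowStore: Array[Array[Int]]): Array[Array[Int]] = {
    require(rowStore.nonEmpty && rowStore(0).nonEmpty,
      "Local decision tree training requires numRows > 0 and numFeatures > 0.")
    val numFeatures = rowStore(0).length
    // Column colIdx collects the colIdx-th binned feature value of every row.
    Array.tabulate(numFeatures) { colIdx => rowStore.map(row => row(colIdx)) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Array(Array(1, 2, 3), Array(4, 5, 6))
    // Equivalent built-in mentioned in the TODO above: rows.transpose
    println(rowToColumnStoreDense(rows).map(_.mkString(",")).mkString(" | "))
    // prints: 1,4 | 2,5 | 3,6
  }
}
```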
[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 @WeichenXu123 would you be able to take an initial look at this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...
GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/19433 [SPARK-3162] [MLlib][WIP] Add local tree training for decision tree regressors ## What changes were proposed in this pull request? WIP, DO NOT MERGE ### Overview This PR adds local tree training for decision tree regressors as a first step for addressing [SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision trees locally when possible). See [this design doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit) for a detailed description of the proposed changes. Distributed training logic has been refactored but only minimally modified; the local tree training implementation leverages existing distributed training logic for computing impurities and splits. This shared logic has been refactored into `...Utils` objects (e.g. `SplitUtils.scala`, `ImpurityUtils.scala`). ### How to Review Each commit in this PR adds non-overlapping functionality, so the PR should be reviewable commit-by-commit. Changes introduced by each commit: 1. Adds new data structures for local tree training (`FeatureVector`, `TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`) 2. Adds shared utility methods for computing splits/impurities (`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing distributed training code in `RandomForest.scala`. 3. Unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`) 4. Updates distributed training code in `RandomForest.scala` to depend on the utility methods introduced in 2. 5. Adds local tree training logic (`LocalDecisionTree`) 6. Local tree unit/integration tests (`LocalTreeUnitSuite`, `LocalTreeIntegrationSuite`) ## How was this patch tested? No existing tests were modified. The following new tests were added (also described above): * Unit tests for new data structures specific to local tree training (`LocalTreeDataSuite`, `LocalTreeUtilsSuite`) * Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`) * Unit tests for local tree training logic (`LocalTreeUnitSuite`) * Integration tests verifying that local & distributed tree training produce the same trees (`LocalTreeIntegrationSuite`) (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurching/spark pr-splitup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19433.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19433 commit 219a12001383017e70f10cd7c785272e70e64b28 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-10-04T20:55:35Z Add data structures for local tree training & associated tests (in LocalTreeDataSuite): * TrainingInfo: primary local tree training data structure, contains all information required to describe state of algorithm at any point during learning * FeatureVector: Stores data for an individual feature as an Array[Int] commit 710714395c966f664af7f7b62226336675ec2ea7 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-10-04T20:57:30Z Add utility methods used for impurity and split calculations during both local & distributed training: * AggUpdateUtils: Helper methods for updating sufficient stats for a given node * ImpurityUtils: Helper methods for impurity-related calcluations during node split decisions * SplitUtils: Helper methods for choosing splits given sufficient stats NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from RandomForest.scala, with slight modifications. Tests for SplitUtils are contained in the next commit. commit 49bf0ae9b275264e757de573f81b816437be77e7 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-10-04T21:36:15Z Add test suites for utility methods used during best-split computation: * TreeSplitUtilsSuite: Test suite for SplitUtils * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite Also add methods used by these tests in LocalDecisionTree.scala, RandomForest.scala commit bc54b165849202269b80bbac1a84afb857e87e31 Author: Sid Murching <sid.murch...@databricks.com> Date: 2017-10-04T21:48:33Z Update RandomForest.scala to use new utility methods for impurity/split calculations commit 6a68a5cc6a6b7087163bbe5681ad41ae
[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139816308 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -399,14 +399,17 @@ private[ml] object DefaultParamsReader { * This works if all Params implement [[org.apache.spark.ml.param.Param.jsonDecode()]]. --- End diff -- Update the docstring to state that params included in `skipParams` aren't set. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139809836 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -399,14 +399,17 @@ private[ml] object DefaultParamsReader { * This works if all Params implement [[org.apache.spark.ml.param.Param.jsonDecode()]]. * TODO: Move to [[Metadata]] method */ - def getAndSetParams(instance: Params, metadata: Metadata): Unit = { + def getAndSetParams(instance: Params, metadata: Metadata, + skipParams: List[String] = null): Unit = { --- End diff -- Use an `Option[List[String]]` that defaults to `None` instead of a `List[String]` that defaults to null? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
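For illustration, the shape of the suggested signature with a `None` default might look like the generic sketch below (not the actual `DefaultParamsReader` code; the param name in the usage note is just an example):

```scala
// Generic sketch of the Option-with-None-default pattern suggested above
// (not the actual ReadWrite.scala code).
def getAndSetParams(
    paramNames: Seq[String],
    skipParams: Option[List[String]] = None): Seq[String] = {
  // Only keep params that aren't listed in skipParams; None means "skip nothing".
  paramNames.filterNot(name => skipParams.exists(_.contains(name)))
}

// Existing call sites stay unchanged:
//   getAndSetParams(names)
// Callers that need to skip params pass, e.g.:
//   getAndSetParams(names, Some(List("estimatorParamMaps")))
```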
[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139817121 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -17,6 +17,7 @@ package org.apache.spark.ml.tuning +import java.io.IOException --- End diff -- This import is unused & can be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r139586600 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -276,12 +315,32 @@ object TrainValidationSplitModel extends MLReadable[TrainValidationSplitModel] { ValidatorParams.validateParams(instance) +protected var shouldPersistSubModels: Boolean = false + +/** + * Set option for persist sub models. + */ +@Since("2.3.0") +def persistSubModels(persist: Boolean): this.type = { + shouldPersistSubModels = persist + this +} + override protected def saveImpl(path: String): Unit = { import org.json4s.JsonDSL._ - val extraMetadata = "validationMetrics" -> instance.validationMetrics.toSeq + val extraMetadata = ("validationMetrics" -> instance.validationMetrics.toSeq) ~ +("shouldPersistSubModels" -> shouldPersistSubModels) ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata)) val bestModelPath = new Path(path, "bestModel").toString instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath) + if (shouldPersistSubModels) { +require(instance.subModels != null, "Cannot get sub models to persist.") +val subModelsPath = new Path(path, "subModels") +for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) { + val modelPath = new Path(subModelsPath, paramIndex.toString).toString + instance.subModels(paramIndex).asInstanceOf[MLWritable].save(modelPath) --- End diff -- @WeichenXu123 Actually I don't think we have to worry about this; Pipeline persistence doesn't clean up if a stage fails to persist (see [Pipeline.scala](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L242)) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r139578700 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -276,12 +315,32 @@ object TrainValidationSplitModel extends MLReadable[TrainValidationSplitModel] { ValidatorParams.validateParams(instance) +protected var shouldPersistSubModels: Boolean = false + +/** + * Set option for persist sub models. + */ +@Since("2.3.0") +def persistSubModels(persist: Boolean): this.type = { + shouldPersistSubModels = persist + this +} + override protected def saveImpl(path: String): Unit = { import org.json4s.JsonDSL._ - val extraMetadata = "validationMetrics" -> instance.validationMetrics.toSeq + val extraMetadata = ("validationMetrics" -> instance.validationMetrics.toSeq) ~ +("shouldPersistSubModels" -> shouldPersistSubModels) ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata)) val bestModelPath = new Path(path, "bestModel").toString instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath) + if (shouldPersistSubModels) { +require(instance.subModels != null, "Cannot get sub models to persist.") +val subModelsPath = new Path(path, "subModels") +for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) { + val modelPath = new Path(subModelsPath, paramIndex.toString).toString + instance.subModels(paramIndex).asInstanceOf[MLWritable].save(modelPath) --- End diff -- Should we clean up/remove the partially-persisted `subModels` if any of these `save()` calls fail? E.g. let's say we have four subModels and the first three `save()` calls succeed but the fourth fails - should we delete the folders for the first three submodels? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r139573779 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -261,17 +290,40 @@ class CrossValidatorModel private[ml] ( val copied = new CrossValidatorModel( uid, bestModel.copy(extra).asInstanceOf[Model[_]], - avgMetrics.clone()) + avgMetrics.clone(), + CrossValidatorModel.copySubModels(subModels)) copyValues(copied, extra).setParent(parent) } @Since("1.6.0") override def write: MLWriter = new CrossValidatorModel.CrossValidatorModelWriter(this) + + @Since("2.3.0") + @throws[IOException]("If the input path already exists but overwrite is not enabled.") + def save(path: String, persistSubModels: Boolean): Unit = { +write.asInstanceOf[CrossValidatorModel.CrossValidatorModelWriter] + .persistSubModels(persistSubModels).save(path) + } --- End diff -- I think users can still access `CrossValidatorModelWriter` through `CrossValidatorModel.write`, so the `save` method is unnecessary. The `private[CrossValidatorModel]` annotation on the `CrossValidatorModelWriter` constructor only means that users can't create instances of the class e.g. via `new CrossValidatorModel.CrossValidatorModelWriter(...)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r139556318 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala --- @@ -82,7 +82,10 @@ private[shared] object SharedParamsCodeGen { "all instance weights as 1.0"), ParamDesc[String]("solver", "the solver algorithm for optimization", finalFields = false), ParamDesc[Int]("aggregationDepth", "suggested depth for treeAggregate (>= 2)", Some("2"), -isValid = "ParamValidators.gtEq(2)", isExpertParam = true)) +isValid = "ParamValidators.gtEq(2)", isExpertParam = true), + ParamDesc[Boolean]("collectSubModels", "whether to collect sub models when tuning fitting", --- End diff -- Suggestion: reword "whether to collect sub models when tuning fitting" --> "whether to collect a list of sub-models trained during tuning" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r139568979 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -237,12 +251,17 @@ object CrossValidator extends MLReadable[CrossValidator] { class CrossValidatorModel private[ml] ( @Since("1.4.0") override val uid: String, @Since("1.2.0") val bestModel: Model[_], -@Since("1.5.0") val avgMetrics: Array[Double]) +@Since("1.5.0") val avgMetrics: Array[Double], +@Since("2.3.0") val subModels: Array[Array[Model[_]]]) extends Model[CrossValidatorModel] with CrossValidatorParams with MLWritable { /** A Python-friendly auxiliary constructor. */ private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: JList[Double]) = { -this(uid, bestModel, avgMetrics.asScala.toArray) +this(uid, bestModel, avgMetrics.asScala.toArray, null) --- End diff -- See earlier suggestion, use an Option set to `None` instead of setting the Array to null --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r139557219 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) instr.logParams(numFolds, seed, parallelism) logTuningParams(instr) +val collectSubModelsParam = $(collectSubModels) + +var subModels: Array[Array[Model[_]]] = if (collectSubModelsParam) { --- End diff -- Perhaps use an `Option[Array[Model[_]]]` instead of setting `subModels` to null? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138712707 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -483,24 +488,24 @@ class LogisticRegression @Since("1.2.0") ( this } - override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = { -val handlePersistence = dataset.storageLevel == StorageLevel.NONE -train(dataset, handlePersistence) - } - - protected[spark] def train( - dataset: Dataset[_], - handlePersistence: Boolean): LogisticRegressionModel = { + protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = { val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map { case Row(label: Double, weight: Double, features: Vector) => Instance(label, weight, features) } -if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) +if (dataset.storageLevel == StorageLevel.NONE) { + if ($(handlePersistence)) { +instances.persist(StorageLevel.MEMORY_AND_DISK) + } else { +logWarning("The input dataset is uncached, which may hurt performance if its upstreams " + + "are also uncached.") + } +} --- End diff -- Oops, yeah I had forgotten about that (thanks for the catch). One solution could be to extend `HasHandlePersistence` in `Predictor` and check `handlePersistence` / cache uncached data in `Predictor.fit()` instead of `Predictor.train()`. This has the drawback of limiting individual algorithms' ability to customize their caching behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19106 @sethah I haven't heard of anybody hitting this issue in practice, but it did seem best to ensure that valid probability distributions would be produced regardless of input. There was some discussion of this in the JIRA: https://issues.apache.org/jira/browse/SPARK-21770 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19186 Note: This PR follows up on the work/discussions in [https://github.com/apache/spark/pull/17014](https://github.com/apache/spark/pull/17014) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138139729 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -300,20 +300,23 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.3.0") + def setHandlePersistence(value: Boolean): this.type = set(handlePersistence, value) + @Since("2.0.0") override def fit(dataset: Dataset[_]): KMeansModel = { transformSchema(dataset.schema, logging = true) -val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE val instances: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => OldVectors.fromML(point) } -if (handlePersistence) { +if ($(handlePersistence)) { --- End diff -- See comment above, we should also check that `dataset.storageLevel == StorageLevel.NONE` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138136774 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -483,24 +488,17 @@ class LogisticRegression @Since("1.2.0") ( this } - override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = { -val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE -train(dataset, handlePersistence) - } - - protected[spark] def train( - dataset: Dataset[_], - handlePersistence: Boolean): LogisticRegressionModel = { + protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = { val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map { case Row(label: Double, weight: Double, features: Vector) => Instance(label, weight, features) } -if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) +if ($(handlePersistence)) instances.persist(StorageLevel.MEMORY_AND_DISK) --- End diff -- If `$(handlePersistence)` is `true`, we should still check that `dataset` is uncached (i.e. check that `dataset.storageLevel == StorageLevel.NONE`) before caching `instances`, or else we'll run into the issues described in [SPARK-21799](https://issues.apache.org/jira/browse/SPARK-21799) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
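A minimal sketch of the combined check being requested (a fragment assuming the `handlePersistence` param from this PR and the `dataset`/`instances` values from the diff above, not the final implementation):

```scala
// Fragment sketching the combined check: only cache derived data when the caller
// asked for it AND the input Dataset is not already persisted (avoids the
// double-caching described in SPARK-21799).
val shouldCache = $(handlePersistence) && dataset.storageLevel == StorageLevel.NONE
if (shouldCache) {
  instances.persist(StorageLevel.MEMORY_AND_DISK)
}
// ... fit the model on `instances` ...
if (shouldCache) {
  instances.unpersist()
}
```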
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138137893 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -163,9 +165,7 @@ final class OneVsRestModel private[ml] ( val initUDF = udf { () => Map[Int, Double]() } val newDataset = dataset.withColumn(accColName, initUDF()) -// persist if underlying dataset is not persistent. -val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE -if (handlePersistence) { +if ($(handlePersistence)) { --- End diff -- See comment above, we should also check that `dataset.storageLevel == StorageLevel.NONE` before caching `newDataset` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138139091 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala --- @@ -82,7 +82,8 @@ private[shared] object SharedParamsCodeGen { "all instance weights as 1.0"), ParamDesc[String]("solver", "the solver algorithm for optimization", finalFields = false), ParamDesc[Int]("aggregationDepth", "suggested depth for treeAggregate (>= 2)", Some("2"), -isValid = "ParamValidators.gtEq(2)", isExpertParam = true)) +isValid = "ParamValidators.gtEq(2)", isExpertParam = true), + ParamDesc[Boolean]("handlePersistence", "whether to handle data persistence", Some("true"))) --- End diff -- This description could be a bit clearer, how about "if true, will cache unpersisted input data before fitting estimator on it"? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138140113 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala --- @@ -165,8 +170,7 @@ class IsotonicRegression @Since("1.5.0") (@Since("1.5.0") override val uid: Stri transformSchema(dataset.schema, logging = true) // Extract columns from data. If dataset is persisted, do not persist oldDataset. val instances = extractWeightedLabeledPoints(dataset) -val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE -if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) +if ($(handlePersistence)) instances.persist(StorageLevel.MEMORY_AND_DISK) --- End diff -- See comment above, we should also check that `dataset.storageLevel == StorageLevel.NONE` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138139539 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -444,13 +444,13 @@ class LogisticRegressionWithLBFGS lr.setFitIntercept(addIntercept) lr.setMaxIter(optimizer.getNumIterations()) lr.setTol(optimizer.getConvergenceTol()) +// Determine if we should cache the DF +lr.setHandlePersistence(input.getStorageLevel == StorageLevel.NONE) --- End diff -- `handlePersistence` should be specified by the user rather than inferred by the algorithm. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19107 @jkbradley would you be able to give this a look? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19106 This looks good to me! @srowen would you be able to give it another look? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19106#discussion_r137988907 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -245,6 +245,13 @@ private[ml] object ProbabilisticClassificationModel { v.values(i) /= sum i += 1 } +} else { + var i = 0 + val size = v.size + while (i < size) { +v.values(i) = 1.0 / size --- End diff -- You could use `java.util.Arrays.fill` to update `v.values` in-place, but I'm not sure that it'll make a huge difference. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
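For reference, the suggested one-liner would look roughly like this (a fragment assuming the `v` and `size` values from the diff above):

```scala
// The java.util.Arrays.fill alternative mentioned above; v.values is the DenseVector's
// backing Array[Double], so this writes the uniform probability 1.0 / size in place.
java.util.Arrays.fill(v.values, 1.0 / size)
```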
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user smurching commented on the issue: https://github.com/apache/spark/pull/17014 Hi @zhengruifeng, thanks for your work on this! Now that we're introducing a new handlePersistence parameter (a new public API), it'd be good to track work in a separate JIRA/PR as @jkbradley suggested so others are aware of the proposed change. I've created a new JIRA ticket for adding the handlePersistence param here: [SPARK-21972](https://issues.apache.org/jira/browse/SPARK-21972). Would you mind resubmitting your work as a new PR that addresses the new JIRA ticket (SPARK-21972)? Thanks & sorry for the inconvenience! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19107 Sorry for the delay, this looks good to me -- thanks @WeichenXu123! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user smurching commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 That approach sounds reasonable to me. My main thought (& this might be obvious) is on the implementation level -- as long as we implement this by adding an `org.apache.spark.ml.Param` named `handlePersistence`, I think we can maintain binary compatibility. I'd be concerned about making `handlePersistence` an argument to `fit()`, which seems like it might [break binary compatibility](https://wiki.eclipse.org/Evolving_Java-based_APIs_2#Evolving_API_classes_-_API_methods_and_constructors). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
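For illustration, a rough sketch of what the Param-based approach could look like (simplified; the actual shared param would be generated via `SharedParamsCodeGen`, and the names here are approximate):

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Rough sketch of the Param-based approach. Adding a Param plus a getter leaves every
// existing fit(dataset) signature untouched, which is what preserves binary
// compatibility, in contrast to adding a new argument to fit().
trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(this, "handlePersistence",
    "if true, will cache unpersisted input data before fitting the estimator on it")
  setDefault(handlePersistence, true)
  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```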
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r136152805 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -91,4 +94,60 @@ object ProbabilisticClassifierSuite { "thresholds" -> Array(0.4, 0.6) ) + /** + * Add test for prediction using the model with all combinations of --- End diff -- Tiny nit: This could be reworded from the JIRA description. How about: Helper for testing that a ProbabilisticClassificationModel computes the same predictions across all combinations of output columns (rawPrediction/probability/prediction) turned on/off. Makes sure the output column values match by comparing vs. the case with all 3 output columns turned on. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135655220 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -91,4 +94,54 @@ object ProbabilisticClassifierSuite { "thresholds" -> Array(0.4, 0.6) ) + def probabilisticClassifierGenericTest[ + FeaturesType, + M <: ProbabilisticClassificationModel[FeaturesType, M]]( +model: M, testData: Dataset[_]): Unit = { + +val allColModel = model.copy(ParamMap.empty) + .setRawPredictionCol("rawPredictionAll") + .setProbabilityCol("probabilityAll") + .setPredictionCol("predictionAll") +val allColResult = allColModel.transform(testData) + +for (rawPredictionCol <- Seq("", "rawPredictionSingle")) { + for (probabilityCol <- Seq("", "probabilitySingle")) { +for (predictionCol <- Seq("", "predictionSingle")) { + val newModel = model.copy(ParamMap.empty) +.setRawPredictionCol(rawPredictionCol) +.setProbabilityCol(probabilityCol) +.setPredictionCol(predictionCol) + + val result = newModel.transform(allColResult) + + import org.apache.spark.sql.functions._ + + val resultRawPredictionCol = +if (rawPredictionCol.isEmpty) col("rawPredictionAll") else col(rawPredictionCol) + val resultProbabilityCol = +if (probabilityCol.isEmpty) col("probabilityAll") else col(probabilityCol) + val resultPredictionCol = +if (predictionCol.isEmpty) col("predictionAll") else col(predictionCol) + + result.select( +resultRawPredictionCol, col("rawPredictionAll"), +resultProbabilityCol, col("probabilityAll"), +resultPredictionCol, col("predictionAll") + ).collect().foreach { +case Row( + rawPredictionSingle: Vector, rawPredictionAll: Vector, + probabilitySingle: Vector, probabilityAll: Vector, + predictionSingle: Double, predictionAll: Double +) => { + assert(rawPredictionSingle.asInstanceOf[Vector] ~== rawPredictionAll relTol 1E-3) --- End diff -- Are these `asInstanceOf[]` casts necessary given that `rawPredictionSingle`, `rawPredictionAll` are explicitly typed in the case statement above? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
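A tiny self-contained example of why the casts are redundant (illustrative only):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// The case pattern already binds values with their declared types, so a further
// asInstanceOf[Vector] cast is a no-op.
object PatternMatchCastSketch {
  def main(args: Array[String]): Unit = {
    Row(Vectors.dense(0.3, 0.7), 1.0) match {
      case Row(v: Vector, p: Double) =>
        assert(v.size == 2 && p == 1.0) // v is statically a Vector here; no cast needed
    }
  }
}
```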
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135653663 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -91,4 +94,54 @@ object ProbabilisticClassifierSuite { "thresholds" -> Array(0.4, 0.6) ) + def probabilisticClassifierGenericTest[ --- End diff -- Could you add a comment explaining what this test does? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135656421 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -91,4 +94,54 @@ object ProbabilisticClassifierSuite { "thresholds" -> Array(0.4, 0.6) ) + def probabilisticClassifierGenericTest[ + FeaturesType, + M <: ProbabilisticClassificationModel[FeaturesType, M]]( +model: M, testData: Dataset[_]): Unit = { + +val allColModel = model.copy(ParamMap.empty) + .setRawPredictionCol("rawPredictionAll") + .setProbabilityCol("probabilityAll") + .setPredictionCol("predictionAll") +val allColResult = allColModel.transform(testData) + +for (rawPredictionCol <- Seq("", "rawPredictionSingle")) { + for (probabilityCol <- Seq("", "probabilitySingle")) { --- End diff -- Just to confirm, does setting `probabilityCol`, `rawPredictionCol`, `predictionCol` to empty strings work here because expressions like `$(probabilityCol)` (used in [ProbabilisticClassifier.scala](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala#L115)) return the String value of probabilityCol? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135653421 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -18,7 +18,10 @@ package org.apache.spark.ml.classification import org.apache.spark.SparkFunSuite -import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors} --- End diff -- It looks like DenseVector is an unused import and could be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135653044 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -262,6 +262,9 @@ class DecisionTreeClassifierSuite assert(Vectors.dense(rawPred.toArray.map(_ / sum)) === probPred, "probability prediction mismatch") } + +ProbabilisticClassifierSuite.probabilisticClassifierGenericTest[ --- End diff -- We should use a more descriptive name for this test. How about `ProbabilisticClassifierSuite.testPredictMethods`? @jkbradley may have other suggestions too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19065#discussion_r135653479 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala --- @@ -18,7 +18,10 @@ package org.apache.spark.ml.classification import org.apache.spark.SparkFunSuite -import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors} +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} --- End diff -- DataFrame is an unused import, could be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14872: [SPARK-3162][MLlib][WIP] Add local tree training for dec...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/14872 No worries -- apologies, I've been busy on my end. I'll leave the branch up & try to contribute in other ways when I have the time! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14872: [SPARK-3162][MLlib][WIP] Add local tree training ...
Github user smurching closed the pull request at: https://github.com/apache/spark/pull/14872 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14872: [SPARK-3162][MLlib][WIP] Add local tree training for dec...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/14872 Hi, I've stopped working on this PR - I can go ahead and close it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14872: [SPARK-3162][MLlib][WIP] Add local tree training ...
GitHub user smurching reopened a pull request: https://github.com/apache/spark/pull/14872 [SPARK-3162][MLlib][WIP] Add local tree training for decision tree regressors ## What changes were proposed in this pull request? Based on [Yggdrasil](https://github.com/fabuzaid21/yggdrasil), added local training of decision tree regressors. Some classes/objects largely correspond to Yggdrasil classes/objects. Specifically: * class LocalDecisionTreeRegressor --> class YggdrasilRegressor * object LocalDecisionTree --> object YggdrasilRegression * object LocalDecisionTreeUtils --> object Yggdrasil ## How was this patch tested? Added unit tests in (ml/tree/impl/LocalTreeTrainingSuite.scala) verifying that local & distributed training of a decision tree regressor produces the same tree. You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurching/spark local-trees-pr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14872.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14872 commit acf5b3e29a346a0cb86f621269855a6a98a9a74e Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-08-29T23:51:33Z Add local tree training for decision tree regressors commit aa4fcc8d401385f38fe0cdfdb9fe39062c3a9f96 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-08-30T01:19:07Z Fix setting of impurity values for leaf nodes to match values produced by distributed Random Forest algorithm commit f273fc6a4b5048ae577d03676def354dce5c87a7 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-08-31T07:01:26Z WIP refactoring single-machine tree code commit 5e61e3b29c236d27e0d655d15a48f2fe3e13d26a Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-01T01:22:48Z Remove unused imports, remove array of single-node impurity aggregators commit d2060fc460a97228a36bf81956cf8dd24c83106e Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-01T23:11:17Z WIP commit 634a3223374608d68018daac5500a429034bbc20 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-02T00:21:10Z More work, tests still pass commit eb7fde00e0db5aa5d04951f8f4a9cd62204f1609 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-02T17:02:06Z WIP: Added tests for classes upon which local tree training is dependent. Some integration tests fail commit b748f05e3eaa7d58b1ad86d269e0dda5f35ee885 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-02T17:37:31Z WIP debugging commit 297052242727e6693ccbacf89f44b3ff6db584f7 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-02T21:34:13Z Consolidate checking for valid splits commit 8d443ce38f958e7b83b502e614e01c824cb63c4b Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-02T21:52:47Z Delete empty test suite commit ee56ffe98756ed78cefbc3f782a471f04e80b256 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-09-02T22:49:38Z Fix some style errors --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
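As a hedged illustration of the testing approach described above (not the PR's actual test code), a structural comparison between a locally trained tree and a distributed-trained tree might look roughly like the following; the helper name `assertSameTree` is made up:

```scala
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Sketch: recursively verify that two trained decision trees agree on structure,
// split features, and leaf predictions.
def assertSameTree(a: Node, b: Node): Unit = (a, b) match {
  case (la: LeafNode, lb: LeafNode) =>
    assert(la.prediction == lb.prediction, "leaf predictions differ")
  case (ia: InternalNode, ib: InternalNode) =>
    assert(ia.split.featureIndex == ib.split.featureIndex, "split features differ")
    assertSameTree(ia.leftChild, ib.leftChild)
    assertSameTree(ia.rightChild, ib.rightChild)
  case _ =>
    assert(false, s"node types differ: ${a.getClass.getSimpleName} vs ${b.getClass.getSimpleName}")
}
```

A test along these lines would train the proposed `LocalDecisionTreeRegressor` and a standard `DecisionTreeRegressor` on the same data, then call `assertSameTree` on the two fitted models' root nodes (`rootNode`).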
[GitHub] spark pull request #14872: [SPARK-3162][MLlib][WIP] Add local tree training ...
Github user smurching closed the pull request at: https://github.com/apache/spark/pull/14872 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14872: Add local tree training for decision tree regress...
GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/14872 Add local tree training for decision tree regressors ## What changes were proposed in this pull request? Based on [Yggdrasil](https://github.com/fabuzaid21/yggdrasil), added local training of decision tree regressors. Some classes/objects largely correspond to Yggdrasil classes/objects. Specifically: * class LocalDecisionTreeRegressor --> class YggdrasilRegressor * object LocalDecisionTree --> object YggdrasilRegression * object LocalDecisionTreeUtils --> object Yggdrasil ## How was this patch tested? Added unit tests in (ml/tree/impl/LocalTreeTrainingSuite.scala) verifying that local & distributed training of a decision tree regressor produces the same tree. You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurching/spark local-trees-pr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14872.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14872 commit acf5b3e29a346a0cb86f621269855a6a98a9a74e Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-08-29T23:51:33Z Add local tree training for decision tree regressors commit aa4fcc8d401385f38fe0cdfdb9fe39062c3a9f96 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-08-30T01:19:07Z Fix setting of impurity values for leaf nodes to match values produced by distributed Random Forest algorithm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13881: [SPARK-3723] [MLlib] Adding instrumentation to random fo...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/13881 Does it make sense to perform instrumentation-related computations (e.g. updating the max/min number of nodes per group) only when the instrumentation argument to RandomForest.run (`instr`) is not `None`? The current implementation doesn't check for this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
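To illustrate the suggested guard with a minimal, self-contained sketch: here `Option[String => Unit]` stands in for the real instrumentation argument, and all names are made up for illustration.

```scala
// Sketch of the proposed pattern: do the min/max bookkeeping only when an
// instrumentation sink was actually provided, so passing None adds no overhead.
def trainGroups(groupSizes: Seq[Int], instr: Option[String => Unit]): Unit = {
  var minNodes = Int.MaxValue
  var maxNodes = 0
  groupSizes.foreach { size =>
    // ... the actual per-group training work would happen here either way ...
    if (instr.isDefined) { // skip the bookkeeping entirely when no instrumentation was requested
      minNodes = math.min(minNodes, size)
      maxNodes = math.max(maxNodes, size)
    }
  }
  instr.foreach(log => log(s"nodes per group: min=$minNodes, max=$maxNodes"))
}

// Example: trainGroups(Seq(10, 14, 6), Some(msg => println(msg))) prints
// "nodes per group: min=6, max=14"; trainGroups(Seq(10, 14, 6), None) skips the bookkeeping.
```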
[GitHub] spark pull request #13881: [SPARK-3723] [MLlib] Adding instrumentation to ra...
GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/13881 [SPARK-3723] [MLlib] Adding instrumentation to random forests ## What changes were proposed in this pull request? In RandomForest.run(), added instrumentation for the number of node groups, along with the min, max, and average number of nodes per group. Also fixed a typo in BaggedPoint.scala documentation. ## How was this patch tested? Tested by running RandomForestClassifierSuite, checking the test output manually to make sure instrumentation information was present and reasonable. You can merge this pull request into a Git repository by running: $ git pull https://github.com/smurching/spark random-forest-instrumentation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13881.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13881 commit 8f45533b9a5f7c3c1f46d0d15a9f1815fa6227d5 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-06-23T23:40:26Z Fix typo in BaggedPoint.scala, add simple instrumentation to Random Forests commit bd7d24d4f5a79eca6ff9629706c254beba74bc45 Author: Siddharth Murching <smurch...@databricks.com> Date: 2016-06-24T00:40:02Z Reorder instrumentation logging statements to look nicer --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
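For illustration only -- a hedged sketch of the summary statistics the PR describes logging (number of node groups plus min/max/average nodes per group); `groupStats` is a made-up helper, not the PR's implementation:

```scala
// Sketch: given the number of nodes processed in each node group during training,
// compute the values the instrumentation would report.
def groupStats(groupSizes: Seq[Int]): (Int, Int, Int, Double) = {
  require(groupSizes.nonEmpty, "expected at least one node group")
  val numGroups = groupSizes.length
  val minNodes = groupSizes.min
  val maxNodes = groupSizes.max
  val avgNodes = groupSizes.sum.toDouble / numGroups
  (numGroups, minNodes, maxNodes, avgNodes)
}

// Example: groupStats(Seq(10, 14, 6)) == (3, 6, 14, 10.0)
```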