[GitHub] spark issue #20389: [SPARK-23205][ML] Update ImageSchema.readImages to corre...

2018-01-24 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/20389
  
Thanks for the reviews @srowen, @dongjoon-hyun! Would it make sense to 
merge this before Spark 2.3 is released, and if so, would one of you be able to 
do so?


---




[GitHub] spark pull request #20389: [SPARK-23205][ML] Update ImageSchema.readImages t...

2018-01-24 Thread smurching
GitHub user smurching opened a pull request:

https://github.com/apache/spark/pull/20389

[SPARK-23205][ML] Update ImageSchema.readImages to correctly set alpha 
values for four-channel images

## What changes were proposed in this pull request?

When parsing raw image data in `ImageSchema.decode()`, we use a 
[java.awt.Color](https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)) 
constructor that sets alpha = 255, even for four-channel images (which may 
have different alpha values). This PR fixes the issue and adds a unit test to 
verify correctness when reading four-channel images.
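
For illustration, here's a minimal standalone sketch (not the actual `ImageSchema.decode()` code) of the constructor difference:

```scala
import java.awt.Color

// The one-argument Color(int) constructor ignores bits 24-31 and forces alpha to 255;
// the two-argument Color(int, boolean) form with hasalpha = true preserves them.
val argbPixel = 0x80FF0000                     // alpha = 0x80 (128), red = 0xFF
val withoutAlpha = new Color(argbPixel)        // alpha forced to 255
val withAlpha = new Color(argbPixel, true)     // alpha preserved as 128
assert(withoutAlpha.getAlpha == 255)
assert(withAlpha.getAlpha == 128)
```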

## How was this patch tested?

Updates an existing unit test ("readImages pixel values test" in 
`ImageSchemaSuite`) to also verify correctness when reading a four-channel 
image.
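
As a rough sketch of the kind of check involved (the directory path is hypothetical, and the column layout assumes Spark 2.3's image schema with `origin`/`height`/`width`/`nChannels`/`mode`/`data` fields):

```scala
import org.apache.spark.ml.image.ImageSchema

// Hypothetical path to a directory containing a four-channel (BGRA) test image.
val df = ImageSchema.readImages("data/mllib/images/multi-channel/")
val row = df.select("image.nChannels", "image.data").head()
val nChannels = row.getInt(0)
val data = row.getAs[Array[Byte]](1)
// For a four-channel image the OpenCV-style layout is B, G, R, A per pixel, so every
// fourth byte should carry the image's actual alpha value rather than a constant 255.
assert(nChannels == 4)
```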

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark image-schema-bugfix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20389.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20389


commit 054c1dd457e5c95872a188f7978f678d7c7093d5
Author: Sid Murching <sid.murching@...>
Date:   2018-01-24T22:34:08Z

Bugfix + test




---




[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...

2017-11-30 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19758#discussion_r154284858
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeSplitUtilsSuite.scala ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.tree.{CategoricalSplit, ContinuousSplit, Split}
+import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.mllib.tree.impurity.{Entropy, Impurity}
+import org.apache.spark.mllib.tree.model.ImpurityStats
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+/** Suite exercising helper methods for making split decisions during 
decision tree training. */
+class TreeSplitUtilsSuite
+  extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  /**
+   * Get a DTStatsAggregator for sufficient stat collection/impurity 
calculation populated
+   * with the data from the specified training points. Assumes a feature 
index of 0 and that
+   * all training points have the same weights (1.0).
+   */
+  private def getAggregator(
+  metadata: DecisionTreeMetadata,
+  values: Array[Int],
+  labels: Array[Double],
+  featureSplits: Array[Split]): DTStatsAggregator = {
+// Create stats aggregator
+val statsAggregator = new DTStatsAggregator(metadata, featureSubset = 
None)
+// Update parent impurity stats
+val featureIndex = 0
+val instanceWeights = Array.fill[Double](values.length)(1.0)
+AggUpdateUtils.updateParentImpurity(statsAggregator, indices = 
values.indices.toArray,
+  from = 0, to = values.length, instanceWeights, labels)
+// Update current aggregator's impurity stats
+values.zip(labels).foreach { case (value: Int, label: Double) =>
+  if (metadata.isUnordered(featureIndex)) {
+AggUpdateUtils.updateUnorderedFeature(statsAggregator, value, 
label,
+  featureIndex = featureIndex, featureIndexIdx = 0, featureSplits, 
instanceWeight = 1.0)
+  } else {
+AggUpdateUtils.updateOrderedFeature(statsAggregator, value, label, 
featureIndexIdx = 0,
+  instanceWeight = 1.0)
+  }
+}
+statsAggregator
+  }
+
+  /**
+   * Check that left/right impurities match what we'd expect for a split.
+   * @param labels Labels whose impurity information should be reflected 
in stats
+   * @param stats ImpurityStats object containing impurity info for the 
left/right sides of a split
+   */
+  private def validateImpurityStats(
+  impurity: Impurity,
+  labels: Array[Double],
+  stats: ImpurityStats,
+  expectedLeftStats: Array[Double],
+  expectedRightStats: Array[Double]): Unit = {
+// Compute impurity for our data points manually
+val numClasses = (labels.max + 1).toInt
+val fullImpurityStatsArray
+  = Array.tabulate[Double](numClasses)((label: Int) => labels.count(_ 
== label).toDouble)
+val fullImpurity = Entropy.calculate(fullImpurityStatsArray, 
labels.length)
+// Verify that impurity stats were computed correctly for split
+assert(stats.impurityCalculator.stats === fullImpurityStatsArray)
+assert(stats.impurity === fullImpurity)
+assert(stats.leftImpurityCalculator.stats === expectedLeftStats)
+assert(stats.rightImpurityCalculator.stats === expectedRightStats)
+assert(stats.valid)
+  }
+
+  /* * * * * * * * * * * Choosing Splits  * * * * * * * * * * */
+
+  test("chooseSplit: choose correct type of split (continuous split)") {
+// Construct (binned) continuous data
+val labels = Array(0.0, 0.0, 1.0)
+val values = Array(1, 2, 3)
+val featureIndex = 0
+// Get an array of continuous splits corresponding to values in 

[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-15 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19753
  
This LGTM. @jkbradley, would you be able to give this a look?


---




[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151315757
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

Thanks for the explanation, that makes sense.


---




[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151298582
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

Another question: I see a pattern in other featurizers of `setParams` using `None` 
as the default value for all/most of its arguments; perhaps we should do the same 
here (i.e. use a default argument of `handleInvalid=None`)? IMO specifying the 
default parameter value in one place is preferable to duplicating it.


---




[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151298765
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

The same goes for the constructor (IMO we should default to 
`handleInvalid=None` there too), but I'm open to hearing your thoughts.


---




[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...

2017-11-15 Thread smurching
GitHub user smurching opened a pull request:

https://github.com/apache/spark/pull/19758

[SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor RandomForest.scala 
into utility classes

## What changes were proposed in this pull request?

Breaks up #19433 to help unblock #19666; after this PR is merged, #19666 
can be merged.

This PR contains the changes made to migrate functionality from 
RandomForest.scala into the following utility classes:

* AggUpdateUtils
* ImpurityUtils
* SplitUtils

The PR also adds tests for split selection logic in TreeSplitUtilsSuite.

A follow-up PR will include the other changes from #19433:
* Local decision tree data structures & tests
* Local tree training logic & tests

## How was this patch tested?

Adds unit tests for split selection logic in TreeSplitUtilsSuite



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark refactor-random-forest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19758.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19758


commit f2e3fbd40eea2919d249710eae5b5789d97543b7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-11-15T17:52:01Z

Local tree training part 1 (refactor RandomForest.scala into utility 
classes)

commit a2357c95672e94a148051d00e26b89245eb8e204
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-11-15T17:57:55Z

WIP adding TreeSplitUtilsSuite

commit 320c32ee8d0ac9bde457b0286d064470648c73af
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-11-15T19:37:56Z

WIP

commit b93f9f3da9cca0887c0264162f5b032f14fa87d7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-11-15T19:57:25Z

Add TreeSplitUtilsSuite, refactor it to not depend on any local tree 
training code




---




[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-15 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19753
  
Looking at this now, thanks @WeichenXu123!


---




[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-14 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r151020618
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala
 ---
@@ -19,14 +19,16 @@ package org.apache.spark.ml.regression
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.feature.LabeledPoint
+import org.apache.spark.ml.linalg.Vector
 import org.apache.spark.ml.tree.impl.TreeTests
 import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
--- End diff --

Nit: unused import


---




[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-14 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r151020666
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GBTRegressorSuite.scala ---
@@ -19,15 +19,16 @@ package org.apache.spark.ml.regression
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.feature.LabeledPoint
-import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.tree.impl.TreeTests
 import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
--- End diff --

Nit: unused import


---




[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151019591
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging {
   }
 
   /**
+   * Find the best split for a node.
+   *
+   * @param binAggregates Bin statistics.
+   * @return tuple for best split: (Split, information gain, prediction at 
node)
+   */
+  private[tree] def binsToBestSplit(
+  binAggregates: DTStatsAggregator,
+  splits: Array[Array[Split]],
+  featuresForNode: Option[Array[Int]],
+  node: LearningNode): (Split, ImpurityStats) = {
+val validFeatureSplits = 
getNonConstantFeatures(binAggregates.metadata, featuresForNode)
+// For each (feature, split), calculate the gain, and select the best 
(feature, split).
+val parentImpurityCalc = if (node.stats == null) None else 
Some(node.stats.impurityCalculator)
--- End diff --

I believe so; the top-level nodes are created 
([RandomForest.scala:178](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L178)) 
with 
[`LearningNode.emptyNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L341), 
which sets `node.stats = null`.

I could change this to check node depth (via node index), but if we're 
planning on deprecating node indices in the future it might be best not to.
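
For reference, a minimal way to express the same check, using the names from the diff above (`node`, `ImpurityCalculator`); `Option(...)` maps `null` to `None`:

```scala
// node.stats is null only for root-level nodes created via LearningNode.emptyNode,
// so this yields None exactly when no parent impurity calculator is available.
val parentImpurityCalc: Option[ImpurityCalculator] =
  Option(node.stats).map(_.impurityCalculator)
```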


---




[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151017375
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree.{CategoricalSplit, Split}
+import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Utility methods for choosing splits during local & distributed tree 
training. */
+private[impl] object SplitUtils {
+
+  /** Sorts ordered feature categories by label centroid, returning an 
ordered list of categories */
+  private def sortByCentroid(
+  binAggregates: DTStatsAggregator,
+  featureIndex: Int,
+  featureIndexIdx: Int): List[Int] = {
+/* Each bin is one category (feature value).
+ * The bins are ordered based on centroidForCategories, and this 
ordering determines which
+ * splits are considered.  (With K categories, we consider K - 1 
possible splits.)
+ *
+ * centroidForCategories is a list: (category, centroid)
+ */
+val numCategories = binAggregates.metadata.numBins(featureIndex)
+val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+val centroidForCategories = Range(0, numCategories).map { featureValue 
=>
+  val categoryStats =
+binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
+  val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, 
categoryStats)
+  (featureValue, centroid)
    +}
+// TODO(smurching): How to handle logging statements like these?
+// logDebug("Centroids for categorical variable: " + 
centroidForCategories.mkString(","))
+// bins sorted by centroids
+val categoriesSortedByCentroid = 
centroidForCategories.toList.sortBy(_._2).map(_._1)
+// logDebug("Sorted centroids for categorical variable = " +
+//   categoriesSortedByCentroid.mkString(","))
+categoriesSortedByCentroid
+  }
+
+  /**
+   * Find the best split for an unordered categorical feature at a single 
node.
+   *
+   * Algorithm:
+   *  - Considers all possible subsets (exponentially many)
+   *
+   * @param featureIndex  Global index of feature being split.
+   * @param featureIndexIdx Index of feature being split within subset of 
features for current node.
+   * @param featureSplits Array of splits for the current feature
+   * @param parentCalculator Optional: ImpurityCalculator containing 
impurity stats for current node
+   * @return  (best split, statistics for split)  If no valid split was 
found, the returned
+   *  ImpurityStats instance will be invalid (have member valid = 
false).
+   */
+  private[impl] def chooseUnorderedCategoricalSplit(
+  binAggregates: DTStatsAggregator,
+  featureIndex: Int,
+  featureIndexIdx: Int,
+  featureSplits: Array[Split],
+  parentCalculator: Option[ImpurityCalculator] = None): (Split, 
ImpurityStats) = {
+// Unordered categorical feature
+val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+val numSplits = binAggregates.metadata.numSplits(featureIndex)
+var parentCalc = parentCalculator
+val (bestFeatureSplitIndex, bestFeatureGainStats) =
+  Range(0, numSplits).map { splitIndex =>
+val leftChildStats = 
binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
+val rightChildStats = binAggregates.getParentImpurityCalculator()
+  .subtract(leftChildStats)
+val gainAndImpurityStats = 
ImpurityUtils.calculateImpurityStats(parentCalc,
+  leftChildStats, rightChildStats, binAggregates.metadata)
+// Comput

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151011913
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging {
   }
 
   /**
-   * Calculate the impurity statistics for a given (feature, split) based 
upon left/right
-   * aggregates.
-   *
-   * @param stats the recycle impurity statistics for this feature's all 
splits,
-   *  only 'impurity' and 'impurityCalculator' are valid 
between each iteration
-   * @param leftImpurityCalculator left node aggregates for this (feature, 
split)
-   * @param rightImpurityCalculator right node aggregate for this 
(feature, split)
-   * @param metadata learning and dataset metadata for DecisionTree
-   * @return Impurity statistics for this (feature, split)
+   * Return a list of pairs (featureIndexIdx, featureIndex) where 
featureIndex is the global
+   * (across all trees) index of a feature and featureIndexIdx is the 
index of a feature within the
+   * list of features for a given node. Filters out constant features 
(features with 0 splits)
*/
-  private def calculateImpurityStats(
-  stats: ImpurityStats,
-  leftImpurityCalculator: ImpurityCalculator,
-  rightImpurityCalculator: ImpurityCalculator,
-  metadata: DecisionTreeMetadata): ImpurityStats = {
-
-val parentImpurityCalculator: ImpurityCalculator = if (stats == null) {
-  leftImpurityCalculator.copy.add(rightImpurityCalculator)
-} else {
-  stats.impurityCalculator
-}
-
-val impurity: Double = if (stats == null) {
-  parentImpurityCalculator.calculate()
-} else {
-  stats.impurity
-}
-
-val leftCount = leftImpurityCalculator.count
-val rightCount = rightImpurityCalculator.count
-
-val totalCount = leftCount + rightCount
-
-// If left child or right child doesn't satisfy minimum instances per 
node,
-// then this split is invalid, return invalid information gain stats.
-if ((leftCount < metadata.minInstancesPerNode) ||
-  (rightCount < metadata.minInstancesPerNode)) {
-  return 
ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
-}
-
-val leftImpurity = leftImpurityCalculator.calculate() // Note: This 
equals 0 if count = 0
-val rightImpurity = rightImpurityCalculator.calculate()
-
-val leftWeight = leftCount / totalCount.toDouble
-val rightWeight = rightCount / totalCount.toDouble
-
-val gain = impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
-
-// if information gain doesn't satisfy minimum information gain,
-// then this split is invalid, return invalid information gain stats.
-if (gain < metadata.minInfoGain) {
-  return 
ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
+  private[impl] def getNonConstantFeatures(
+  metadata: DecisionTreeMetadata,
+  featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
+Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
--- End diff --

At some point when refactoring I was hitting errors caused by a stateful 
operation within a `map` over the output of this method (IIRC the result of the 
`map` was accessed repeatedly, causing the stateful operation to inadvertently 
run multiple times).

However, using `withFilter` and `view` now seems to work, so I'll change it 
back :)
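
For context, a small standalone example (not Spark code) of the pitfall being described: mapping a side-effecting function over a lazy `view` re-runs the side effect on every traversal.

```scala
var evaluations = 0
val lazyMapped = (0 until 3).view.map { i =>
  evaluations += 1  // stateful operation inside the map
  i * 2
}
lazyMapped.foreach(_ => ())  // evaluations is now 3
lazyMapped.foreach(_ => ())  // evaluations is now 6: the map re-ran on the second traversal
```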


---




[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151011879
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala
 ---
@@ -112,7 +113,7 @@ private[spark] object ImpurityStats {
* minimum number of instances per node.
*/
   def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): 
ImpurityStats = {
-new ImpurityStats(Double.MinValue, impurityCalculator.calculate(),
+new ImpurityStats(Double.MinValue, impurity = -1,
--- End diff --

I changed this to -1 here since node impurity would eventually get set 
to -1 anyway when `LearningNode`s with invalid `ImpurityStats` were converted 
into decision tree leaf nodes (see 
[`LearningNode.toNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L279)).


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-13 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19666
  
Discussed with @jkbradley; I'll split up #19433 so that the parts of it 
that could potentially conflict with this PR (refactoring RandomForest.scala into 
utility classes) can be merged first.


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149238531
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
 categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
--- End diff --

Could you please add a docstring for this method, since it's a bit 
complicated?


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149238185
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
   (splits(featureIndex)(bestFeatureSplitIndex), 
bestFeatureGainStats)
 } else if (binAggregates.metadata.isUnordered(featureIndex)) {
   // Unordered categorical feature
-  val leftChildOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
-  val (bestFeatureSplitIndex, bestFeatureGainStats) =
-Range(0, numSplits).map { splitIndex =>
-  val leftChildStats = 
binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-  val rightChildStats = 
binAggregates.getParentImpurityCalculator()
-.subtract(leftChildStats)
+  val numBins = binAggregates.metadata.numBins(featureIndex)
+  val featureOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
+
+  val binStatsArray = Array.tabulate(numBins) { binIndex =>
+binAggregates.getImpurityCalculator(featureOffset, binIndex)
+  }
+  val parentStats = binAggregates.getParentImpurityCalculator()
+
+  var bestGain = Double.NegativeInfinity
+  var bestSet: BitSet = null
+  var bestLeftChildStats: ImpurityCalculator = null
+  var bestRightChildStats: ImpurityCalculator = null
+
+  traverseUnorderedSplits[ImpurityCalculator](numBins, null,
--- End diff --

Could you please add a comment explaining what this does? E.g.:
`// Computes the best split for the current feature, storing the result 
across the vars above`


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149237212
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
   (splits(featureIndex)(bestFeatureSplitIndex), 
bestFeatureGainStats)
 } else if (binAggregates.metadata.isUnordered(featureIndex)) {
   // Unordered categorical feature
-  val leftChildOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
-  val (bestFeatureSplitIndex, bestFeatureGainStats) =
-Range(0, numSplits).map { splitIndex =>
-  val leftChildStats = 
binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-  val rightChildStats = 
binAggregates.getParentImpurityCalculator()
-.subtract(leftChildStats)
+  val numBins = binAggregates.metadata.numBins(featureIndex)
+  val featureOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
+
+  val binStatsArray = Array.tabulate(numBins) { binIndex =>
--- End diff --

Could you please add a comment explaining what this is? E.g.:
`// Each element of binStatsArray stores pre-computed label statistics for 
a single bin of the current feature`


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149242575
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
 categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
--- End diff --

Also, does `traverseUnorderedSplits` need to take a type parameter / two 
different closures as method arguments? AFAICT the use of a type 
parameter/closures here allows us to unit test this functionality on a simple 
example, but I wonder if we could simplify this somehow.


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149238123
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
   (splits(featureIndex)(bestFeatureSplitIndex), 
bestFeatureGainStats)
 } else if (binAggregates.metadata.isUnordered(featureIndex)) {
   // Unordered categorical feature
-  val leftChildOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
-  val (bestFeatureSplitIndex, bestFeatureGainStats) =
-Range(0, numSplits).map { splitIndex =>
-  val leftChildStats = 
binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-  val rightChildStats = 
binAggregates.getParentImpurityCalculator()
-.subtract(leftChildStats)
+  val numBins = binAggregates.metadata.numBins(featureIndex)
+  val featureOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
+
+  val binStatsArray = Array.tabulate(numBins) { binIndex =>
+binAggregates.getImpurityCalculator(featureOffset, binIndex)
+  }
+  val parentStats = binAggregates.getParentImpurityCalculator()
+
+  var bestGain = Double.NegativeInfinity
+  var bestSet: BitSet = null
+  var bestLeftChildStats: ImpurityCalculator = null
+  var bestRightChildStats: ImpurityCalculator = null
+
+  traverseUnorderedSplits[ImpurityCalculator](numBins, null,
+(stats, binIndex) => {
+  val binStats = binStatsArray(binIndex)
+  if (stats == null) {
+binStats
+  } else {
+stats.copy.add(binStats)
+  }
+},
+(set, leftChildStats) => {
+  val rightChildStats = 
parentStats.copy.subtract(leftChildStats)
   gainAndImpurityStats = 
calculateImpurityStats(gainAndImpurityStats,
 leftChildStats, rightChildStats, binAggregates.metadata)
-  (splitIndex, gainAndImpurityStats)
-}.maxBy(_._2.gain)
-  (splits(featureIndex)(bestFeatureSplitIndex), 
bestFeatureGainStats)
+  if (gainAndImpurityStats.gain > bestGain) {
+bestGain = gainAndImpurityStats.gain
+bestSet = set | new BitSet(numBins) // copy set
--- End diff --

Why not use `set.copy()`?


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149241680
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
 categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
+  arity: Int,
+  zeroStats: T,
+  seqOp: (T, Int) => T,
+  finalizer: (BitSet, T) => Unit): Unit = {
+assert(arity > 1)
+
+// numSplits = (1 << arity - 1) - 1
+val numSplits = DecisionTreeMetadata.numUnorderedSplits(arity)
+val subSet: BitSet = new BitSet(arity)
+
+// dfs traverse
+// binIndex: [0, arity)
+def dfs(binIndex: Int, combNumber: Int, stats: T): Unit = {
+  if (binIndex == arity) {
+// recursion exit when binIndex == arity
+if (combNumber > 0) {
+  // we get an available unordered split, saved in subSet.
+  finalizer(subSet, stats)
+}
+  } else {
+subSet.set(binIndex)
+val leftChildCombNumber = combNumber + (1 << binIndex)
+// pruning: only need combNumber satisfy: 1 <= combNumber <= 
numSplits
--- End diff --

If I understand correctly, the check `if (leftChildCombNumber <= 
numSplits)` helps us ensure that we consider each split only once, right?
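
A self-contained sketch (not the Spark implementation) of why that bound enumerates each partition exactly once: a bitmask and its complement describe the same left/right split, so restricting the accumulated mask to `1 .. 2^(arity-1) - 1` keeps exactly one representative of each pair.

```scala
def enumerateUnorderedSplits(arity: Int): Seq[Set[Int]] = {
  val numSplits = (1 << (arity - 1)) - 1   // number of distinct unordered splits
  (1 to numSplits).map { mask =>
    // Categories whose bits are set in the mask go to the left child.
    (0 until arity).filter(bin => (mask & (1 << bin)) != 0).toSet
  }
}

// enumerateUnorderedSplits(3) returns Seq(Set(0), Set(1), Set(0, 1)):
// {0}|{1,2}, {1}|{0,2}, {0,1}|{2}; each two-sided partition appears once.
```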


---




[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-06 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
jenkins retest this please


---




[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-02 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r148709163
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -202,6 +202,15 @@ class LinearSVCSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
   dataset.as[LabeledPoint], estimator, modelEquals, 42L)
   }
 
+  test("prediction on single instance") {
+val trainer = new LinearSVC()
+val model = trainer.fit(smallBinaryDataset)
+model.transform(smallBinaryDataset).select("features", 
"prediction").collect().foreach {
+  case Row(features: Vector, prediction: Double) =>
+assert(prediction ~== model.predict(features) relTol 1E-5)
--- End diff --

Could you please check for exact equality here (& in other tests)?


---




[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-02 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r148708939
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala 
---
@@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defa
   Vector, NaiveBayesModel](model, testDataset)
   }
 
+  test("prediction on single instance") {
+val nPoints = 1000
+val piArray = Array(0.5, 0.1, 0.4).map(math.log)
+val thetaArray = Array(
+  Array(0.70, 0.10, 0.10, 0.10), // label 0
+  Array(0.10, 0.70, 0.10, 0.10), // label 1
+  Array(0.10, 0.10, 0.70, 0.10)  // label 2
+).map(_.map(math.log))
+val pi = Vectors.dense(piArray)
+val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
+
+val testDataset =
+  generateNaiveBayesInput(piArray, thetaArray, nPoints, seed, 
"multinomial").toDF()
+val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial")
+val model = nb.fit(testDataset)
+
+validateModelFit(pi, theta, model)
--- End diff --

Do we need lines 184-186? They seem unrelated to what we want to test (that 
`predict` produces the same result as `transform` on a single instance). 
Similarly, I don't think we need to create `piArray`, `thetaArray`, `pi`, 
`theta`, etc; this test should just fit a model on a dataset and compare the 
fitted model's `predict` and `transform` outputs.


---




[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-02 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r148709125
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala 
---
@@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defa
   Vector, NaiveBayesModel](model, testDataset)
   }
 
+  test("prediction on single instance") {
+val nPoints = 1000
+val piArray = Array(0.5, 0.1, 0.4).map(math.log)
+val thetaArray = Array(
+  Array(0.70, 0.10, 0.10, 0.10), // label 0
+  Array(0.10, 0.70, 0.10, 0.10), // label 1
+  Array(0.10, 0.10, 0.70, 0.10)  // label 2
+).map(_.map(math.log))
+val pi = Vectors.dense(piArray)
+val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
+
+val testDataset =
--- End diff --

Suggestion: rename to `trainDataset`?


---




[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-02 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
jenkins retest this please


---




[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-26 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
Made a few updates; here's a quick summary and what I'd propose moving 
forward:

Right now:
* Shared row indices for all (categorical & continuous) features are stored 
& updated in `TrainingInfo`
* `LocalDecisionTree.computeBestSplits` computes best splits/sufficient 
stats for a single feature at a time
* A utility method (`LocalDecisionTreeUtils.updateArrayForSplit`) is used 
to sort both feature values and shared row indices

When we add support for raw continuous feature values:
* Add a subclass of `FeatureColumn` (e.g. `ContinuousFeatureColumn`) that 
stores and sorts its own array of row indices, and pass these row indices to 
methods requiring them.

I also renamed `FeatureVector` to `FeatureColumn`, since the former seemed 
like it'd confuse developers (`FeatureVector` sounds like a single data point).

---




[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-26 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r147307553
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree._
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Object exposing methods for local training of decision trees */
+private[ml] object LocalDecisionTree {
+
+  /**
+   * Fully splits the passed-in node on the provided local dataset, 
returning
+   * an InternalNode/LeafNode corresponding to the root of the resulting 
tree.
+   *
+   * @param node LearningNode to use as the root of the subtree fit on the 
passed-in dataset
+   * @param metadata learning and dataset metadata for DecisionTree
+   * @param splits splits(i) = array of splits for feature i
+   */
+  private[ml] def fitNode(
+  input: Array[TreePoint],
+  instanceWeights: Array[Double],
+  node: LearningNode,
+  metadata: DecisionTreeMetadata,
+  splits: Array[Array[Split]]): Node = {
+
+// The case with 1 node (depth = 0) is handled separately.
+// This allows all iterations in the depth > 0 case to use the same 
code.
+// TODO: Check that learning works when maxDepth > 0 but learning 
stops at 1 node (because of
+//   other parameters).
+if (metadata.maxDepth == 0) {
+  return node.toNode
+}
+
+// Prepare column store.
+//   Note: rowToColumnStoreDense checks to make sure numRows < 
Int.MaxValue.
+val colStoreInit: Array[Array[Int]]
+= 
LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
+val labels = input.map(_.label)
+
+// Fit a regression model on the dataset, throwing an error if 
metadata indicates that
+// we should train a classifier.
+// TODO: Add support for training classifiers
+if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
+  throw new UnsupportedOperationException("Local training of a 
decision tree classifier is " +
+"unsupported; currently, only regression is supported")
+} else {
+  trainRegressor(node, colStoreInit, instanceWeights, labels, 
metadata, splits)
+}
+  }
+
+  /**
    +   * Locally fits a decision tree regressor.
+   * TODO(smurching): Logic for fitting a classifier & regressor is the 
same; only difference
+   * is impurity metric. Use the same logic for fitting a classifier.
+   *
+   * @param rootNode Node to use as root of the tree fit on the passed-in 
dataset
+   * @param colStoreInit Array of columns of training data
+   * @param instanceWeights Array of weights for each training example
+   * @param metadata learning and dataset metadata for DecisionTree
+   * @param splits splits(i) = Array of possible splits for feature i
+   * @return LeafNode or InternalNode representation of rootNode
+   */
+  private[ml] def trainRegressor(
+  rootNode: LearningNode,
+  colStoreInit: Array[Array[Int]],
+  instanceWeights: Array[Double],
+  labels: Array[Double],
+  metadata: DecisionTreeMetadata,
+  splits: Array[Array[Split]]): Node = {
+
+// Sort each column by decision tree node.
+val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { 
case (col, featureIndex) =>
+  val featureArity: Int = 
metadata.featureArity.getOrElse(featureIndex, 0)
+  FeatureVector(featureIndex, featureArity, col)
+}
+
+val numRows = colStore.headOption match {
+  case None => 0
+  case Some(column) => column.values.length
+}
+
+// Create a new PartitionInfo describing the status of our 
partially-trained subtree
+

[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-10-25 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r146986798
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -267,6 +268,24 @@ class DecisionTreeClassifierSuite
   Vector, DecisionTreeClassificationModel](newTree, newData)
   }
 
+  test("prediction on single instance") {
+val rdd = continuousDataPointsForMulticlassRDD
+val dt = new DecisionTreeClassifier()
+  .setImpurity("Gini")
+  .setMaxDepth(4)
+  .setMaxBins(100)
+val categoricalFeatures = Map(0 -> 3)
+val numClasses = 3
+
+val newData: DataFrame = TreeTests.setMetadata(rdd, 
categoricalFeatures, numClasses)
+val newTree = dt.fit(newData)
+
+newTree.transform(newData).select(dt.getFeaturesCol, 
dt.getPredictionCol).collect().foreach {
+  case Row(features: Vector, prediction: Double) =>
+assert(prediction ~== newTree.predict(features) relTol 1E-5)
--- End diff --

Can we test exact equality (e.g. `prediction === 
newTree.predict(features)`) here and in other unit tests?


---




[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2017-10-25 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19381
  
Looking at this now!


---




[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-25 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r146981434
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree._
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Object exposing methods for local training of decision trees */
+private[ml] object LocalDecisionTree {
+
+  /**
+   * Fully splits the passed-in node on the provided local dataset, 
returning
+   * an InternalNode/LeafNode corresponding to the root of the resulting 
tree.
+   *
+   * @param node LearningNode to use as the root of the subtree fit on the 
passed-in dataset
+   * @param metadata learning and dataset metadata for DecisionTree
+   * @param splits splits(i) = array of splits for feature i
+   */
+  private[ml] def fitNode(
+  input: Array[TreePoint],
+  instanceWeights: Array[Double],
+  node: LearningNode,
+  metadata: DecisionTreeMetadata,
+  splits: Array[Array[Split]]): Node = {
+
+// The case with 1 node (depth = 0) is handled separately.
+// This allows all iterations in the depth > 0 case to use the same 
code.
+// TODO: Check that learning works when maxDepth > 0 but learning 
stops at 1 node (because of
+//   other parameters).
+if (metadata.maxDepth == 0) {
+  return node.toNode
+}
+
+// Prepare column store.
+//   Note: rowToColumnStoreDense checks to make sure numRows < 
Int.MaxValue.
+val colStoreInit: Array[Array[Int]]
+= 
LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
+val labels = input.map(_.label)
+
+// Fit a regression model on the dataset, throwing an error if 
metadata indicates that
+// we should train a classifier.
+// TODO: Add support for training classifiers
+if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
+  throw new UnsupportedOperationException("Local training of a 
decision tree classifier is " +
+"unsupported; currently, only regression is supported")
+} else {
+  trainRegressor(node, colStoreInit, instanceWeights, labels, 
metadata, splits)
+}
+  }
+
+  /**
    +   * Locally fits a decision tree regressor.
+   * TODO(smurching): Logic for fitting a classifier & regressor is the 
same; only difference
+   * is impurity metric. Use the same logic for fitting a classifier.
+   *
+   * @param rootNode Node to use as root of the tree fit on the passed-in 
dataset
+   * @param colStoreInit Array of columns of training data
+   * @param instanceWeights Array of weights for each training example
+   * @param metadata learning and dataset metadata for DecisionTree
+   * @param splits splits(i) = Array of possible splits for feature i
+   * @return LeafNode or InternalNode representation of rootNode
+   */
+  private[ml] def trainRegressor(
+  rootNode: LearningNode,
+  colStoreInit: Array[Array[Int]],
+  instanceWeights: Array[Double],
+  labels: Array[Double],
+  metadata: DecisionTreeMetadata,
+  splits: Array[Array[Split]]): Node = {
+
+// Sort each column by decision tree node.
+val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { 
case (col, featureIndex) =>
+  val featureArity: Int = 
metadata.featureArity.getOrElse(featureIndex, 0)
+  FeatureVector(featureIndex, featureArity, col)
+}
+
+val numRows = colStore.headOption match {
+  case None => 0
+  case Some(column) => column.values.length
+}
+
+// Create a new PartitionInfo describing the status of our 
partially-trained subtree
+

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-24 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
@WeichenXu123 Thanks for the comments! I'll respond inline:

> In your doc, you said "Specifically, we only need to store sufficient 
stats for each bin of a single feature, as opposed to each bin of every 
feature", BUT, current implementation, you still allocate space for all 
features when computing: -- see DTStatsAggregator implementation, you pass 
featureSubset = None so DTStatsAggregator will allocate space for every 
features. According to your purpose, you should pass featureSubset = 
Some(Array(currentFeatureIndex)).

I like your proposed solution (pass `featureSubset = 
Some(Array(currentFeatureIndex))`). I'll go ahead & implement it.
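
A minimal sketch of that change, mirroring the existing `new DTStatsAggregator(metadata, featureSubset = None)` call (here `currentFeatureIndex` is the index of the feature being processed):

```scala
// Allocate sufficient-stat space only for the single feature currently being
// processed, instead of for every feature.
val statsAggregator = new DTStatsAggregator(
  metadata,
  featureSubset = Some(Array(currentFeatureIndex)))
```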

> Current implementation still use binnedFeatures. You said in future it 
will be improved to sort feature values for continuous feature (for more 
precise tree training), if you want to consider every possible thresholds, you 
need hold rawFeatures instead of binnedFeatures in the columnar feature array, 
and in each split range offset, you need sort every continuous features. Is 
this the thing you want to do in the future ? This will increase calculation 
amount.

Yep, we'll have to pass raw (`Double`) continuous features to the local 
tree training methods, which will require them to accept an 
`Array[LabeledPoint]` instead of an `Array[TreePoint]` as input and will increase 
memory usage (along with requiring us to store additional indices). 

We'll actually only have to run an `O(n log n)` sort on continuous feature 
values once (i.e. in the `FeatureVector` constructor): once the 
continuous features are sorted, we can update them as we would for categorical 
features when splitting nodes (in `O(n)` time) and they'll remain sorted.
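
A small self-contained sketch of that claim (not the actual `FeatureVector` code): once a column's (value, row index) pairs are sorted, assigning rows to children is a single stable pass and both halves stay sorted.

```scala
// goesLeft decides, per row index, which child a row belongs to after a split.
// partition preserves relative order, so if sortedColumn is sorted by feature value,
// both returned halves are still sorted and no re-sorting is needed after a split.
def splitSortedColumn(
    sortedColumn: Array[(Double, Int)],   // (continuous feature value, row index)
    goesLeft: Int => Boolean): (Array[(Double, Int)], Array[(Double, Int)]) =
  sortedColumn.partition { case (_, rowIndex) => goesLeft(rowIndex) }
```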

> For current implementation(using binnedFeature) , there is no need to 
sort continuous features inside each split offset. So the indices for each 
feature is exactly the same. In order to save memory, I think these indices 
should be shared, no need to create separate indices array for each features. 
Even if you add the improvements for continuous features mentioned above, you 
can create separate indices array for only continuous features, the categorical 
features can still share the same indices array.

Agreed, I'll make this change.

> About locality advantage of columnar format, I have some doubts. Current 
implementation, you do not reorder the label and weight array, access label and 
weight value need use indices, when calculating DTStat, this break locality. 
(But I'm not sure how much impact to perf this will bring).

Yeah, I'm not sure if it'd be better to reorder the labels/weights arrays 
to achieve improved locality.
I think we could experiment with both, but I'd prefer to save that for a 
follow-up PR unless you or another reviewer think it'll make a big perf 
difference.

> About the overhead of columnar format: when making reordering (when get 
new split, we need reorder left sub-tree samples into front), so you need 
reordering on each column, and at the same time, update the indices array. But, 
if we use row format, like:
Array[(features, label, weight)], reordering will be much easier, and do 
not need indices.
So, I am considering, whether we can use row format, but at the time when 
we need DTStatsAggregator computation, copy the data we need from the row 
format into columnar format array (only need to copy rows between sub-node 
offset and only copy the sampled features if using feature subsampling).

This is an interesting idea; my main concern is that on the first iteration 
of local tree training we'd need to copy the entire training data matrix from 
row -> columnar format, which negates any memory savings we get from not using 
indices. I'm also concerned about the overhead of repeatedly copying data from 
row -> columnar format.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-24 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r146731101
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.ml.tree.{LearningNode, Split}
+import org.apache.spark.util.collection.BitSet
+
+/**
+ * Maintains intermediate state of data (columns) and tree during local 
tree training.
+ * Primary local tree training data structure; contains all information 
required to describe
+ * the state of the algorithm at any point during learning.
+ *
+ * Nodes are indexed left-to-right along the periphery of the tree, with 
0-based indices.
+ * The "periphery" is the set of leaf nodes (active and inactive).
+ *
+ * @param columns  Array of columns.
+ * Each column is sorted first by nodes (left-to-right 
along the tree periphery);
+ * all columns share this first level of sorting.
+ * Within each node's group, each column is sorted based 
on feature value;
+ * this second level of sorting differs across columns.
+ * @param instanceWeights Array of weights for each training example
+ * @param nodeOffsets  Offsets into the columns indicating the first level 
of sorting (by node).
+ * The rows corresponding to the node activeNodes(i) 
are in the range
+ * [nodeOffsets(i)(0), nodeOffsets(i)(1)) .
+ * @param activeNodes  Nodes which are active (still being split).
+ * Inactive nodes are known to be leaves in the final 
tree.
+ */
+private[impl] case class TrainingInfo(
+columns: Array[FeatureVector],
+instanceWeights: Array[Double],
--- End diff --

Good call, I'll move `instanceWeights` outside `TrainingInfo`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-24 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r146731083
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree._
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Object exposing methods for local training of decision trees */
+private[ml] object LocalDecisionTree {
+
+  /**
+   * Fully splits the passed-in node on the provided local dataset, 
returning
+   * an InternalNode/LeafNode corresponding to the root of the resulting 
tree.
+   *
+   * @param node LearningNode to use as the root of the subtree fit on the 
passed-in dataset
+   * @param metadata learning and dataset metadata for DecisionTree
+   * @param splits splits(i) = array of splits for feature i
+   */
+  private[ml] def fitNode(
+  input: Array[TreePoint],
+  instanceWeights: Array[Double],
+  node: LearningNode,
+  metadata: DecisionTreeMetadata,
+  splits: Array[Array[Split]]): Node = {
+
+// The case with 1 node (depth = 0) is handled separately.
+// This allows all iterations in the depth > 0 case to use the same 
code.
+// TODO: Check that learning works when maxDepth > 0 but learning 
stops at 1 node (because of
+//   other parameters).
+if (metadata.maxDepth == 0) {
+  return node.toNode
+}
+
+// Prepare column store.
+//   Note: rowToColumnStoreDense checks to make sure numRows < 
Int.MaxValue.
+val colStoreInit: Array[Array[Int]]
+= 
LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
+val labels = input.map(_.label)
+
+// Fit a regression model on the dataset, throwing an error if 
metadata indicates that
+// we should train a classifier.
+// TODO: Add support for training classifiers
+if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
+  throw new UnsupportedOperationException("Local training of a 
decision tree classifier is " +
+"unsupported; currently, only regression is supported")
+} else {
+  trainRegressor(node, colStoreInit, instanceWeights, labels, 
metadata, splits)
+}
+  }
+
+  /**
    +   * Locally fits a decision tree regressor.
+   * TODO(smurching): Logic for fitting a classifier & regressor is the 
same; only difference
+   * is impurity metric. Use the same logic for fitting a classifier.
+   *
+   * @param rootNode Node to use as root of the tree fit on the passed-in 
dataset
+   * @param colStoreInit Array of columns of training data
+   * @param instanceWeights Array of weights for each training example
+   * @param metadata learning and dataset metadata for DecisionTree
+   * @param splits splits(i) = Array of possible splits for feature i
+   * @return LeafNode or InternalNode representation of rootNode
+   */
+  private[ml] def trainRegressor(
+  rootNode: LearningNode,
+  colStoreInit: Array[Array[Int]],
+  instanceWeights: Array[Double],
+  labels: Array[Double],
+  metadata: DecisionTreeMetadata,
+  splits: Array[Array[Split]]): Node = {
+
+// Sort each column by decision tree node.
+val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { 
case (col, featureIndex) =>
+  val featureArity: Int = 
metadata.featureArity.getOrElse(featureIndex, 0)
+  FeatureVector(featureIndex, featureArity, col)
+}
+
+val numRows = colStore.headOption match {
+  case None => 0
+  case Some(column) => column.values.length
+}
+
+// Create a new PartitionInfo describing the status of our 
partially-trained subtree
+

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-24 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r146731070
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree.Split
+
+/**
+ * Helpers for updating DTStatsAggregators during collection of sufficient 
stats for tree training.
+ */
+private[impl] object AggUpdateUtils {
+
+  /**
+   * Updates the parent node stats of the passed-in impurity aggregator 
with the labels
+   * corresponding to the feature values at indices [from, to).
+   */
+  private[impl] def updateParentImpurity(
+  statsAggregator: DTStatsAggregator,
+  col: FeatureVector,
+  from: Int,
+  to: Int,
+  instanceWeights: Array[Double],
+  labels: Array[Double]): Unit = {
+from.until(to).foreach { idx =>
+  val rowIndex = col.indices(idx)
+  val label = labels(rowIndex)
+  statsAggregator.updateParent(label, instanceWeights(rowIndex))
+}
+  }
+
+  /**
+   * Update aggregator for an (unordered feature, label) pair
+   * @param splits Array of arrays of splits for each feature; splits(i) = 
splits for feature i.
+   */
+  private[impl] def updateUnorderedFeature(
+  agg: DTStatsAggregator,
+  featureValue: Int,
+  label: Double,
+  featureIndex: Int,
+  featureIndexIdx: Int,
+  splits: Array[Array[Split]],
--- End diff --

Good call, I'll make this change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-10-24 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r146731072
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/FeatureVector.scala ---
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.util.collection.BitSet
+
+/**
+ * Stores values for a single training data column (a single continuous or 
categorical feature).
+ *
+ * Values are currently stored in a dense representation only.
+ * TODO: Support sparse storage (to optimize deeper levels of the tree), 
and maybe compressed
+ *   storage (to optimize upper levels of the tree).
+ *
+ * TODO: Sort feature values to support more complicated splitting logic 
(e.g. considering every
+ *   possible continuous split instead of discretizing continuous 
features).
+ *
+ * NOTE: We could add sorting of feature values in this PR; the only 
changed required would be to
+ * sort feature values at construction-time. Sorting might improve 
locality during stats
+ * aggregation (we'd frequently update the same O(statsSize) array for a 
(feature, bin),
+ * instead of frequently updating for the same feature).
+ *
+ * @param featureArity  For categorical features, this gives the number of 
categories.
+ *  For continuous features, this should be set to 0.
+ * @param rowIndices Optional: rowIndices(i) is the row index of the ith 
feature value (values(i))
+ *   If unspecified, feature values are assumed to be 
ordered by row (i.e. values(i)
+ *   is a feature value from the ith row).
+ */
+private[impl] class FeatureVector(
+val featureIndex: Int,
+val featureArity: Int,
+val values: Array[Int],
+private val rowIndices: Option[Array[Int]])
+  extends Serializable {
+  // Associates feature values with training point rows. indices(i) = 
training point index
+  // (row index) of ith feature value
+  val indices = rowIndices.getOrElse(values.indices.toArray)
+
+  def isCategorical: Boolean = featureArity > 0
+
+  /** For debugging */
+  override def toString: String = {
+"  FeatureVector(" +
+  s"featureIndex: $featureIndex,\n" +
+  s"featureType: ${if (featureArity == 0) "Continuous" else 
"Categorical"},\n" +
+  s"featureArity: $featureArity,\n" +
+  s"values: ${values.mkString(", ")},\n" +
+  s"indices: ${indices.mkString(", ")},\n" +
+  "  )"
+  }
+
+  def deepCopy(): FeatureVector =
+new FeatureVector(featureIndex, featureArity, values.clone(), 
Some(indices.clone()))
+
+  override def equals(other: Any): Boolean = {
+other match {
+  case o: FeatureVector =>
+featureIndex == o.featureIndex && featureArity == o.featureArity &&
+  values.sameElements(o.values) && indices.sameElements(o.indices)
+  case _ => false
+}
+  }
+
+  /**
+   * Reorders the subset of feature values at indices [from, to) in the 
passed-in column
+   * according to the split information encoded in instanceBitVector 
(feature values for rows
+   * that split left appear before feature values for rows that split 
right).
+   *
+   * @param numLeftRows Number of rows on the left side of the split
+   * @param tempVals Destination buffer for reordered feature values
+   * @param tempIndices Destination buffer for row indices corresponding 
to reordered feature values
+   * @param instanceBitVector instanceBitVector(i) = true if the row for 
the ith feature
+   *  value splits right, false otherwise
+   */
+  private[ml] def updateForSplit(
+  from: Int

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-12 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
Sorry, realized I conflated feature subsampling and `subsampleWeights` 
(instance weights for training examples). IMO feature subsampling can be added 
in a follow-up PR, but `subsampleWeights` should go in this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-12 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
Thanks for the comments!

- Yep, feature subsampling is necessary for using local tree training in 
distributed training. I was thinking of adding subsampling in a follow-up PR. 
You're right that we don't need to pass an array of `BaggedPoints` to local 
tree training; we should just pass an array of `subsampleWeights` (weights for 
the current tree) and an array of `TreePoints`. I'll push an update for this.

- Agreed that the logic for classification will be the same but with a 
different impurity metric. I can add support for classification & associated 
tests in a follow-up PR.

- IMO the primary advantage of the columnar storage format is that it'll 
eventually enable improvements to best split calculations; specifically, for 
continuous features we could sort the unbinned feature values and consider 
every possible threshold.  There are also the locality & memory advantages 
described in the design doc. In brief, `DTStatsAggregator` stores a flat array 
partitioned by (feature x bin). If we can iterate through all values for a 
single feature at once, most updates to `DTStatsAggregator` will occur within 
the same subarray (see the sketch after this list).

- Multithreading could be a nice way to increase parallelism since we don't 
use Spark during local tree training. I think we could add it in a follow-up PR.
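
To illustrate the `DTStatsAggregator` locality point from the third bullet, a simplified sketch of a flat (feature x bin) stats layout (not the real `DTStatsAggregator`; sizes are made up):

```scala
// Flat stats array partitioned by (feature, bin): all bins of one feature are contiguous,
// so iterating column-by-column keeps updates inside a single slice of `stats`.
val numFeatures = 3
val numBins = 4
val statsSize = 2  // e.g. (weighted count, weighted label sum) per bin
val stats = new Array[Double](numFeatures * numBins * statsSize)

def update(featureIndex: Int, bin: Int, label: Double, weight: Double): Unit = {
  val offset = (featureIndex * numBins + bin) * statsSize
  stats(offset) += weight
  stats(offset + 1) += weight * label
}
```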


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-11 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
The failing SparkR test (which compares `RandomForest` predictions to 
hardcoded values) fails not due to a correctness issue but (AFAICT) because of 
an implementation change in best-split selection. 

In this PR we recompute parent node impurity stats when considering each 
split for a feature, instead of computing parent impurity stats once per 
feature (see this by comparing `RandomForest.calculateImpurityStats` in Spark 
master and `ImpurityUtils.calculateImpurityStats` in this PR).

The process of repeatedly computing parent impurity stats results in 
slightly different results at each iteration due to Double precision 
limitations. This in turn can cause different splits to be selected (e.g. if 
two splits have mathematically equal gains, Double precision limitations can 
cause one split to have a slightly higher or lower gain than the other, influencing 
tiebreaking).
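
A tiny illustration of the effect (plain Scala, not Spark code):

```scala
// Summing the same terms in a different order can change the last bits of the result,
// which is enough to break a tie between two mathematically equal split gains.
val a = 0.1 + 0.2 + 0.3  // 0.6000000000000001
val b = 0.3 + 0.2 + 0.1  // 0.6
println(a == b)          // false
println(a > b)           // true -- the "tie" gets resolved by rounding error
```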


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-09 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
The failing tests (in `DecisionTreeSuite`) fail because we've historically 
handled

a) splits that have 0 gain 

differently from

b) splits that fail to achieve user-specified minimum gain 
(`metadata.minInfoGain`) or don't meet minimum instance-counts per node 
(`metadata.minInstancesPerNode`).

Previously we'd create a leaf node with valid impurity stats in case a) and 
invalid impurity stats in case b). This PR creates a leaf node with invalid 
impurity stats in both cases.

As a fix I'd suggest creating a `LeafNode` with correct impurity stats in 
case a), but with the `stats.valid` member set to `false` to indicate that the 
node should not be split.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

2017-10-09 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
Thanks! I'll remove the WIP. To clear things up for the future, I'd thought 
[WIP] was the appropriate tag for a PR that's ready for review but not ready to 
be merged (based on https://spark.apache.org/contributing.html) -- have we 
stopped using the WIP tag?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...

2017-10-09 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r143398990
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTreeUtils.scala 
---
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Utility methods specific to local decision tree training.
+ */
+private[ml] object LocalDecisionTreeUtils extends Logging {
+
+  /**
+   * Convert a dataset of binned feature values from row storage to column 
storage.
+   * Stores data as [[org.apache.spark.ml.linalg.DenseVector]].
+   *
+   *
+   * @param rowStore  An array of input data rows, each represented as an
+   *  int array of binned feature values
+   * @return Transpose of rowStore as an array of columns consisting of 
binned feature values.
+   *
+   * TODO: Add implementation for sparse data.
+   *   For sparse data, distribute more evenly based on number of 
non-zeros.
+   *   (First collect stats to decide how to partition.)
+   */
+  private[impl] def rowToColumnStoreDense(rowStore: Array[Array[Int]]): 
Array[Array[Int]] = {
+// Compute the number of rows in the data
+val numRows = {
+  val longNumRows: Long = rowStore.length
+  require(longNumRows < Int.MaxValue, s"rowToColumnStore given RDD 
with $longNumRows rows," +
+s" but can handle at most ${Int.MaxValue} rows")
+  longNumRows.toInt
+}
+
+// Check that the input dataset isn't empty (0 rows) or featureless 
(rows with 0 features)
+require(numRows > 0, "Local decision tree training requires numRows > 
0.")
+val numFeatures = rowStore(0).length
+require(numFeatures > 0, "Local decision tree training requires 
numFeatures > 0.")
+// Return the transpose of the rowStore matrix
+0.until(numFeatures).map { colIdx =>
--- End diff --

TODO: replace this with `rowStore.transpose`, which is more memory 
efficient (iterates over each row once, allowing for rows of the original 
matrix to be GC'd during the transpose operation).
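
Roughly, the TODO amounts to (a sketch against the dense `Array[Array[Int]]` row store above):

```scala
// Scala's Array.transpose produces the same column-major result as the manual loop,
// and it walks each row exactly once, which is the memory benefit mentioned above.
def rowToColumnStoreDense(rowStore: Array[Array[Int]]): Array[Array[Int]] = {
  require(rowStore.nonEmpty && rowStore(0).nonEmpty,
    "Local decision tree training requires numRows > 0 and numFeatures > 0.")
  rowStore.transpose
}
```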



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

2017-10-04 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19433
  
@WeichenXu123 would you be able to take an initial look at this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...

2017-10-04 Thread smurching
GitHub user smurching opened a pull request:

https://github.com/apache/spark/pull/19433

[SPARK-3162] [MLlib][WIP] Add local tree training for decision tree 
regressors

## What changes were proposed in this pull request?
 WIP, DO NOT MERGE

### Overview
This PR adds local tree training for decision tree regressors as a first 
step for addressing 
[SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision 
trees locally when possible). See [this design 
doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit)
 for a detailed description of the proposed changes.

Distributed training logic has been refactored but only minimally modified; 
the local tree training implementation leverages existing distributed training 
logic for computing impurities and splits. This shared logic has been 
refactored into `...Utils` objects (e.g. `SplitUtils.scala`, 
`ImpurityUtils.scala`). 

### How to Review

Each commit in this PR adds non-overlapping functionality, so the PR should 
be reviewable commit-by-commit.

Changes introduced by each commit:
1. Adds new data structures for local tree training (`FeatureVector`, 
`TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`)
2. Adds shared utility methods for computing splits/impurities 
(`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing 
distributed training code in `RandomForest.scala`.
3. Unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`)
4. Updates distributed training code in `RandomForest.scala` to depend on 
the utility methods introduced in 2.
5. Adds local tree training logic (`LocalDecisionTree`) 
6. Local tree unit/integration tests (`LocalTreeUnitSuite`, 
`LocalTreeIntegrationSuite`)

## How was this patch tested?
No existing tests were modified. The following new tests were added (also 
described above):
* Unit tests for new data structures specific to local tree training 
(`LocalTreeDataSuite`, `LocalTreeUtilsSuite`)
* Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`)
* Unit tests for local tree training logic (`LocalTreeUnitSuite`)
* Integration tests verifying that local & distributed tree training 
produce the same trees (`LocalTreeIntegrationSuite`)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark pr-splitup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19433.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19433


commit 219a12001383017e70f10cd7c785272e70e64b28
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T20:55:35Z

Add data structures for local tree training & associated tests (in 
LocalTreeDataSuite):
* TrainingInfo: primary local tree training data structure, contains 
all information required to describe state of
algorithm at any point during learning
* FeatureVector: Stores data for an individual feature as an Array[Int]

commit 710714395c966f664af7f7b62226336675ec2ea7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T20:57:30Z

Add utility methods used for impurity and split calculations during both 
local & distributed training:
 * AggUpdateUtils: Helper methods for updating sufficient stats for a given 
node
 * ImpurityUtils: Helper methods for impurity-related calculations during 
node split decisions
 * SplitUtils: Helper methods for choosing splits given sufficient stats

NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from 
RandomForest.scala, with slight modifications.
Tests for SplitUtils are contained in the next commit.

commit 49bf0ae9b275264e757de573f81b816437be77e7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:36:15Z

Add test suites for utility methods used during best-split computation:
 * TreeSplitUtilsSuite: Test suite for SplitUtils
 * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite

 Also add methods used by these tests in LocalDecisionTree.scala, 
RandomForest.scala

commit bc54b165849202269b80bbac1a84afb857e87e31
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:48:33Z

 Update RandomForest.scala to use new utility methods for impurity/split 
calculations

commit 6a68a5cc6a6b7087163bbe5681ad41ae

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19278#discussion_r139816308
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -399,14 +399,17 @@ private[ml] object DefaultParamsReader {
* This works if all Params implement 
[[org.apache.spark.ml.param.Param.jsonDecode()]].
--- End diff --

Update the docstring to state that params included in `skipParams` aren't 
set.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19278#discussion_r139809836
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -399,14 +399,17 @@ private[ml] object DefaultParamsReader {
* This works if all Params implement 
[[org.apache.spark.ml.param.Param.jsonDecode()]].
* TODO: Move to [[Metadata]] method
*/
-  def getAndSetParams(instance: Params, metadata: Metadata): Unit = {
+  def getAndSetParams(instance: Params, metadata: Metadata,
+  skipParams: List[String] = null): Unit = {
--- End diff --

Use an `Option[List[String]]` that defaults to `None` instead of a 
`List[String]` that defaults to null?
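
i.e. something along these lines (signature sketch only, body elided):

```scala
// Option instead of a null default; callers that don't skip anything just omit the argument.
def getAndSetParams(
    instance: Params,
    metadata: Metadata,
    skipParams: Option[List[String]] = None): Unit = {
  val toSkip = skipParams.getOrElse(Nil).toSet
  // ... same as today, but only set params whose names are not in `toSkip`.
}
```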


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19278#discussion_r139817121
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -17,6 +17,7 @@
 
 package org.apache.spark.ml.tuning
 
+import java.io.IOException
--- End diff --

This import is unused & can be removed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-18 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r139586600
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -276,12 +315,32 @@ object TrainValidationSplitModel extends 
MLReadable[TrainValidationSplitModel] {
 
 ValidatorParams.validateParams(instance)
 
+protected var shouldPersistSubModels: Boolean = false
+
+/**
+ * Set option for persist sub models.
+ */
+@Since("2.3.0")
+def persistSubModels(persist: Boolean): this.type = {
+  shouldPersistSubModels = persist
+  this
+}
+
 override protected def saveImpl(path: String): Unit = {
   import org.json4s.JsonDSL._
-  val extraMetadata = "validationMetrics" -> 
instance.validationMetrics.toSeq
+  val extraMetadata = ("validationMetrics" -> 
instance.validationMetrics.toSeq) ~
+("shouldPersistSubModels" -> shouldPersistSubModels)
   ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata))
   val bestModelPath = new Path(path, "bestModel").toString
   instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath)
+  if (shouldPersistSubModels) {
+require(instance.subModels != null, "Cannot get sub models to 
persist.")
+val subModelsPath = new Path(path, "subModels")
+for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
+  val modelPath = new Path(subModelsPath, 
paramIndex.toString).toString
+  
instance.subModels(paramIndex).asInstanceOf[MLWritable].save(modelPath)
--- End diff --

@WeichenXu123 Actually I don't think we have to worry about this; Pipeline 
persistence doesn't clean up if a stage fails to persist (see 
[Pipeline.scala](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L242))


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-18 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r139578700
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -276,12 +315,32 @@ object TrainValidationSplitModel extends 
MLReadable[TrainValidationSplitModel] {
 
 ValidatorParams.validateParams(instance)
 
+protected var shouldPersistSubModels: Boolean = false
+
+/**
+ * Set option for persist sub models.
+ */
+@Since("2.3.0")
+def persistSubModels(persist: Boolean): this.type = {
+  shouldPersistSubModels = persist
+  this
+}
+
 override protected def saveImpl(path: String): Unit = {
   import org.json4s.JsonDSL._
-  val extraMetadata = "validationMetrics" -> 
instance.validationMetrics.toSeq
+  val extraMetadata = ("validationMetrics" -> 
instance.validationMetrics.toSeq) ~
+("shouldPersistSubModels" -> shouldPersistSubModels)
   ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata))
   val bestModelPath = new Path(path, "bestModel").toString
   instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath)
+  if (shouldPersistSubModels) {
+require(instance.subModels != null, "Cannot get sub models to 
persist.")
+val subModelsPath = new Path(path, "subModels")
+for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
+  val modelPath = new Path(subModelsPath, 
paramIndex.toString).toString
+  
instance.subModels(paramIndex).asInstanceOf[MLWritable].save(modelPath)
--- End diff --

Should we clean up/remove the partially-persisted `subModels` if any of 
these `save()` calls fail? E.g. let's say we have four subModels and the first 
three `save()` calls succeed but the fourth fails - should we delete the 
folders for the first three submodels?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-18 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r139573779
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -261,17 +290,40 @@ class CrossValidatorModel private[ml] (
 val copied = new CrossValidatorModel(
   uid,
   bestModel.copy(extra).asInstanceOf[Model[_]],
-  avgMetrics.clone())
+  avgMetrics.clone(),
+  CrossValidatorModel.copySubModels(subModels))
 copyValues(copied, extra).setParent(parent)
   }
 
   @Since("1.6.0")
   override def write: MLWriter = new 
CrossValidatorModel.CrossValidatorModelWriter(this)
+
+  @Since("2.3.0")
+  @throws[IOException]("If the input path already exists but overwrite is 
not enabled.")
+  def save(path: String, persistSubModels: Boolean): Unit = {
+write.asInstanceOf[CrossValidatorModel.CrossValidatorModelWriter]
+  .persistSubModels(persistSubModels).save(path)
+  }
--- End diff --

I think users can still access `CrossValidatorModelWriter` through 
`CrossValidatorModel.write`, so the `save` method is unnecessary. 

The `private[CrossValidatorModel]` annotation on the 
`CrossValidatorModelWriter` constructor only means that users can't create 
instances of the class e.g. via `new 
CrossValidatorModel.CrossValidatorModelWriter(...)`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-18 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r139556318
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala 
---
@@ -82,7 +82,10 @@ private[shared] object SharedParamsCodeGen {
 "all instance weights as 1.0"),
   ParamDesc[String]("solver", "the solver algorithm for optimization", 
finalFields = false),
   ParamDesc[Int]("aggregationDepth", "suggested depth for 
treeAggregate (>= 2)", Some("2"),
-isValid = "ParamValidators.gtEq(2)", isExpertParam = true))
+isValid = "ParamValidators.gtEq(2)", isExpertParam = true),
+  ParamDesc[Boolean]("collectSubModels", "whether to collect sub 
models when tuning fitting",
--- End diff --

Suggestion: reword "whether to collect sub models when tuning fitting" --> 
"whether to collect a list of sub-models trained during tuning"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-18 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r139568979
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -237,12 +251,17 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 class CrossValidatorModel private[ml] (
 @Since("1.4.0") override val uid: String,
 @Since("1.2.0") val bestModel: Model[_],
-@Since("1.5.0") val avgMetrics: Array[Double])
+@Since("1.5.0") val avgMetrics: Array[Double],
+@Since("2.3.0") val subModels: Array[Array[Model[_]]])
   extends Model[CrossValidatorModel] with CrossValidatorParams with 
MLWritable {
 
   /** A Python-friendly auxiliary constructor. */
   private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: 
JList[Double]) = {
-this(uid, bestModel, avgMetrics.asScala.toArray)
+this(uid, bestModel, avgMetrics.asScala.toArray, null)
--- End diff --

See earlier suggestion, use an Option set to `None` instead of setting the 
Array to null


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-09-18 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r139557219
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") 
override val uid: String)
 instr.logParams(numFolds, seed, parallelism)
 logTuningParams(instr)
 
+val collectSubModelsParam = $(collectSubModels)
+
+var subModels: Array[Array[Model[_]]] = if (collectSubModelsParam) {
--- End diff --

Perhaps use an `Option[Array[Array[Model[_]]]]` instead of setting `subModels` to 
null?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138712707
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -483,24 +488,24 @@ class LogisticRegression @Since("1.2.0") (
 this
   }
 
-  override protected[spark] def train(dataset: Dataset[_]): 
LogisticRegressionModel = {
-val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-train(dataset, handlePersistence)
-  }
-
-  protected[spark] def train(
-  dataset: Dataset[_],
-  handlePersistence: Boolean): LogisticRegressionModel = {
+  protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel 
= {
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) 
else col($(weightCol))
 val instances: RDD[Instance] =
   dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
 case Row(label: Double, weight: Double, features: Vector) =>
   Instance(label, weight, features)
   }
 
-if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+if (dataset.storageLevel == StorageLevel.NONE) {
+  if ($(handlePersistence)) {
+instances.persist(StorageLevel.MEMORY_AND_DISK)
+  } else {
+logWarning("The input dataset is uncached, which may hurt 
performance if its upstreams " +
+  "are also uncached.")
+  }
+}
--- End diff --

Oops, yeah I had forgotten about that (thanks for the catch). 

One solution could be to extend `HasHandlePersistence` in `Predictor` and 
check `handlePersistence` / cache uncached data in `Predictor.fit()` instead of 
`Predictor.train()`. This has the drawback of limiting individual algorithms' 
ability to customize their caching behavior. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-12 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19106
  
@sethah I haven't heard of anybody hitting this issue in practice, but it 
did seem best to ensure that valid probability distributions would be produced 
regardless of input. There was some discussion of this in the JIRA: 
https://issues.apache.org/jira/browse/SPARK-21770


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19186
  
Note: This PR follows up on the work/discussions in 
[https://github.com/apache/spark/pull/17014](https://github.com/apache/spark/pull/17014)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138139729
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -300,20 +300,23 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.3.0")
+  def setHandlePersistence(value: Boolean): this.type = 
set(handlePersistence, value)
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): KMeansModel = {
 transformSchema(dataset.schema, logging = true)
 
-val handlePersistence = dataset.rdd.getStorageLevel == 
StorageLevel.NONE
 val instances: RDD[OldVector] = 
dataset.select(col($(featuresCol))).rdd.map {
   case Row(point: Vector) => OldVectors.fromML(point)
 }
 
-if (handlePersistence) {
+if ($(handlePersistence)) {
--- End diff --

See comment above, we should also check that `dataset.storageLevel == 
StorageLevel.NONE`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138136774
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -483,24 +488,17 @@ class LogisticRegression @Since("1.2.0") (
 this
   }
 
-  override protected[spark] def train(dataset: Dataset[_]): 
LogisticRegressionModel = {
-val handlePersistence = dataset.rdd.getStorageLevel == 
StorageLevel.NONE
-train(dataset, handlePersistence)
-  }
-
-  protected[spark] def train(
-  dataset: Dataset[_],
-  handlePersistence: Boolean): LogisticRegressionModel = {
+  protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel 
= {
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) 
else col($(weightCol))
 val instances: RDD[Instance] =
   dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
 case Row(label: Double, weight: Double, features: Vector) =>
   Instance(label, weight, features)
   }
 
-if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+if ($(handlePersistence)) 
instances.persist(StorageLevel.MEMORY_AND_DISK)
--- End diff --

If `$(handlePersistence)` is `true`, we should still check that `dataset` 
is uncached (i.e. check that `dataset.storageLevel == StorageLevel.NONE`) 
before caching `instances`, or else we'll run into the issues described in 
[SPARK-21799](https://issues.apache.org/jira/browse/SPARK-21799)
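
i.e. roughly (a sketch assuming the `handlePersistence` Param added in this PR):

```scala
// Cache only when the user asked for it AND the input isn't already persisted,
// avoiding the double-caching described in SPARK-21799.
if ($(handlePersistence) && dataset.storageLevel == StorageLevel.NONE) {
  instances.persist(StorageLevel.MEMORY_AND_DISK)
}
```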


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138137893
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -163,9 +165,7 @@ final class OneVsRestModel private[ml] (
 val initUDF = udf { () => Map[Int, Double]() }
 val newDataset = dataset.withColumn(accColName, initUDF())
 
-// persist if underlying dataset is not persistent.
-val handlePersistence = dataset.rdd.getStorageLevel == 
StorageLevel.NONE
-if (handlePersistence) {
+if ($(handlePersistence)) {
--- End diff --

See comment above, we should also check that `dataset.storageLevel == 
StorageLevel.NONE` before caching `newDataset`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138139091
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala 
---
@@ -82,7 +82,8 @@ private[shared] object SharedParamsCodeGen {
 "all instance weights as 1.0"),
   ParamDesc[String]("solver", "the solver algorithm for optimization", 
finalFields = false),
   ParamDesc[Int]("aggregationDepth", "suggested depth for 
treeAggregate (>= 2)", Some("2"),
-isValid = "ParamValidators.gtEq(2)", isExpertParam = true))
+isValid = "ParamValidators.gtEq(2)", isExpertParam = true),
+  ParamDesc[Boolean]("handlePersistence", "whether to handle data 
persistence", Some("true")))
--- End diff --

This description could be a bit clearer, how about "if true, will cache 
unpersisted input data before fitting estimator on it"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138140113
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala ---
@@ -165,8 +170,7 @@ class IsotonicRegression @Since("1.5.0") 
(@Since("1.5.0") override val uid: Stri
 transformSchema(dataset.schema, logging = true)
 // Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
 val instances = extractWeightedLabeledPoints(dataset)
-val handlePersistence = dataset.rdd.getStorageLevel == 
StorageLevel.NONE
-if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+if ($(handlePersistence)) 
instances.persist(StorageLevel.MEMORY_AND_DISK)
--- End diff --

See comment above, we should also check that `dataset.storageLevel == 
StorageLevel.NONE`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19186#discussion_r138139539
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -444,13 +444,13 @@ class LogisticRegressionWithLBFGS
 lr.setFitIntercept(addIntercept)
 lr.setMaxIter(optimizer.getNumIterations())
 lr.setTol(optimizer.getConvergenceTol())
+// Determine if we should cache the DF
+lr.setHandlePersistence(input.getStorageLevel == StorageLevel.NONE)
--- End diff --

`handlePersistence` should be specified by the user rather than inferred by 
the algorithm.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-11 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19107
  
@jkbradley would you be able to give this a look? Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-11 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19106
  
This looks good to me! @srowen would you be able to give it another look?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...

2017-09-11 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19106#discussion_r137988907
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
 ---
@@ -245,6 +245,13 @@ private[ml] object ProbabilisticClassificationModel {
 v.values(i) /= sum
 i += 1
   }
+} else {
+  var i = 0
+  val size = v.size
+  while (i < size) {
+v.values(i) = 1.0 / size
--- End diff --

You could use `java.util.Arrays.fill` to update `v.values` in-place, but 
I'm not sure that it'll make a huge difference.
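
e.g. (one-liner, reusing the `v` from the diff above):

```scala
// Fill the dense values in place with the uniform probability 1.0 / v.size.
java.util.Arrays.fill(v.values, 1.0 / v.size)
```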


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-10 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/17014
  
Hi @zhengruifeng, thanks for your work on this!

Now that we're introducing a new handlePersistence parameter (a new public 
API), it'd be good to track work in a separate JIRA/PR as @jkbradley suggested 
so others are aware of the proposed change.

I've created a new JIRA ticket for adding the handlePersistence param here: 
[SPARK-21972](https://issues.apache.org/jira/browse/SPARK-21972). Would you 
mind resubmitting your work as a new PR that addresses the new JIRA ticket 
(SPARK-21972)?

Thanks & sorry for the inconvenience!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-07 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19107
  
Sorry for the delay, this looks good to me -- thanks @WeichenXu123!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-31 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/17014
  
@WeichenXu123 That approach sounds reasonable to me. 

My main thought (& this might be obvious) is on the implementation level -- 
as long as we implement this by adding an `org.apache.spark.ml.Param` named 
`handlePersistence`, I think we can maintain binary compatibility. I'd be 
concerned about making `handlePersistence` an argument to `fit()`, which seems 
like it might [break binary 
compatibility](https://wiki.eclipse.org/Evolving_Java-based_APIs_2#Evolving_API_classes_-_API_methods_and_constructors).
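
Concretely, I'm picturing something like the sketch below; `HasHandlePersistence` is a hypothetical trait name, not existing code:

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// A Param adds no new fit() overload, so existing compiled callers keep linking.
trait HasHandlePersistence extends Params {

  final val handlePersistence: BooleanParam = new BooleanParam(this, "handlePersistence",
    "if true, cache unpersisted input data before fitting the estimator on it")

  setDefault(handlePersistence -> true)

  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```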



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-30 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r136152805
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
 ---
@@ -91,4 +94,60 @@ object ProbabilisticClassifierSuite {
 "thresholds" -> Array(0.4, 0.6)
   )
 
+  /**
+   * Add test for prediction using the model with all combinations of
--- End diff --

Tiny nit: This could be reworded from the JIRA description.

How about:

Helper for testing that a ProbabilisticClassificationModel computes the 
same predictions across all combinations of output columns 
(rawPrediction/probability/prediction) turned on/off. Makes sure the output 
column values match by comparing vs. the case with all 3 output columns turned 
on.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r135655220
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
 ---
@@ -91,4 +94,54 @@ object ProbabilisticClassifierSuite {
 "thresholds" -> Array(0.4, 0.6)
   )
 
+  def probabilisticClassifierGenericTest[
+      FeaturesType,
+      M <: ProbabilisticClassificationModel[FeaturesType, M]](
+    model: M, testData: Dataset[_]): Unit = {
+
+    val allColModel = model.copy(ParamMap.empty)
+      .setRawPredictionCol("rawPredictionAll")
+      .setProbabilityCol("probabilityAll")
+      .setPredictionCol("predictionAll")
+    val allColResult = allColModel.transform(testData)
+
+    for (rawPredictionCol <- Seq("", "rawPredictionSingle")) {
+      for (probabilityCol <- Seq("", "probabilitySingle")) {
+        for (predictionCol <- Seq("", "predictionSingle")) {
+          val newModel = model.copy(ParamMap.empty)
+            .setRawPredictionCol(rawPredictionCol)
+            .setProbabilityCol(probabilityCol)
+            .setPredictionCol(predictionCol)
+
+          val result = newModel.transform(allColResult)
+
+          import org.apache.spark.sql.functions._
+
+          val resultRawPredictionCol =
+            if (rawPredictionCol.isEmpty) col("rawPredictionAll") else col(rawPredictionCol)
+          val resultProbabilityCol =
+            if (probabilityCol.isEmpty) col("probabilityAll") else col(probabilityCol)
+          val resultPredictionCol =
+            if (predictionCol.isEmpty) col("predictionAll") else col(predictionCol)
+
+          result.select(
+            resultRawPredictionCol, col("rawPredictionAll"),
+            resultProbabilityCol, col("probabilityAll"),
+            resultPredictionCol, col("predictionAll")
+          ).collect().foreach {
+            case Row(
+              rawPredictionSingle: Vector, rawPredictionAll: Vector,
+              probabilitySingle: Vector, probabilityAll: Vector,
+              predictionSingle: Double, predictionAll: Double
+            ) => {
+              assert(rawPredictionSingle.asInstanceOf[Vector] ~== rawPredictionAll relTol 1E-3)
--- End diff --

Are these `asInstanceOf[]` casts necessary given that 
`rawPredictionSingle`, `rawPredictionAll` are explicitly typed in the case 
statement above?
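
(A tiny self-contained illustration of the point, with made-up values: a binding introduced by a typed pattern already has that static type, so the cast is a no-op.)

```scala
object TypedPatternExample {
  def main(args: Array[String]): Unit = {
    val pair: (Any, Any) = ("spark", 3)
    pair match {
      case (s: String, n: Int) =>
        // `s` already has static type String here; s.asInstanceOf[String] would be redundant.
        println(s.length + n)  // prints 8
      case _ => ()
    }
  }
}
```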





[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r135653663
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
 ---
@@ -91,4 +94,54 @@ object ProbabilisticClassifierSuite {
 "thresholds" -> Array(0.4, 0.6)
   )
 
+  def probabilisticClassifierGenericTest[
--- End diff --

Could you add a comment explaining what this test does? Thanks!





[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r135656421
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
 ---
@@ -91,4 +94,54 @@ object ProbabilisticClassifierSuite {
 "thresholds" -> Array(0.4, 0.6)
   )
 
+  def probabilisticClassifierGenericTest[
+      FeaturesType,
+      M <: ProbabilisticClassificationModel[FeaturesType, M]](
+    model: M, testData: Dataset[_]): Unit = {
+
+    val allColModel = model.copy(ParamMap.empty)
+      .setRawPredictionCol("rawPredictionAll")
+      .setProbabilityCol("probabilityAll")
+      .setPredictionCol("predictionAll")
+    val allColResult = allColModel.transform(testData)
+
+    for (rawPredictionCol <- Seq("", "rawPredictionSingle")) {
+      for (probabilityCol <- Seq("", "probabilitySingle")) {
--- End diff --

Just to confirm, does setting `probabilityCol`, `rawPredictionCol`, 
`predictionCol` to empty strings work here because expressions like 
`$(probabilityCol)` (used in 
[ProbabilisticClassifier.scala](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala#L115))
 return the String value of probabilityCol?
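
For context, the convention in question boils down to something like this simplified sketch (hypothetical names, not the real `ProbabilisticClassificationModel.transform` code): `$(param)` returns the configured String value, and an empty string means the corresponding output column is skipped.

```scala
object EmptyOutputColExample {
  // Stand-in for the three output-column params of a probabilistic classifier.
  final case class OutputCols(rawPredictionCol: String, probabilityCol: String, predictionCol: String)

  // Mirrors the "empty name means don't generate that column" convention.
  def columnsToGenerate(cols: OutputCols): Seq[String] =
    Seq(cols.rawPredictionCol, cols.probabilityCol, cols.predictionCol).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    println(columnsToGenerate(OutputCols("", "probabilitySingle", "")))
    // List(probabilitySingle)
  }
}
```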





[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r135653421
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
 ---
@@ -18,7 +18,10 @@
 package org.apache.spark.ml.classification
 
 import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}
--- End diff --

It looks like DenseVector is an unused import and could be removed.





[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r135653044
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -262,6 +262,9 @@ class DecisionTreeClassifierSuite
   assert(Vectors.dense(rawPred.toArray.map(_ / sum)) === probPred,
 "probability prediction mismatch")
 }
+
+ProbabilisticClassifierSuite.probabilisticClassifierGenericTest[
--- End diff --

We should use a more descriptive name for this test. How about 
`ProbabilisticClassifierSuite.testPredictMethods`? @jkbradley may have other 
suggestions too.





[GitHub] spark pull request #19065: [SPARK-21729][ML][TEST] Generic test for Probabil...

2017-08-28 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19065#discussion_r135653479
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
 ---
@@ -18,7 +18,10 @@
 package org.apache.spark.ml.classification
 
 import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
--- End diff --

DataFrame is an unused import, could be removed.





[GitHub] spark issue #14872: [SPARK-3162][MLlib][WIP] Add local tree training for dec...

2017-01-23 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/14872
  
No worries -- apologies for the delay on my end (I've been busy). I'll leave the 
branch up & try to contribute in other ways when I have the time!





[GitHub] spark pull request #14872: [SPARK-3162][MLlib][WIP] Add local tree training ...

2017-01-23 Thread smurching
Github user smurching closed the pull request at:

https://github.com/apache/spark/pull/14872





[GitHub] spark issue #14872: [SPARK-3162][MLlib][WIP] Add local tree training for dec...

2017-01-23 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/14872
  
Hi,

I've stopped working on this PR - I can go ahead and close it.





[GitHub] spark pull request #14872: [SPARK-3162][MLlib][WIP] Add local tree training ...

2016-09-27 Thread smurching
GitHub user smurching reopened a pull request:

https://github.com/apache/spark/pull/14872

[SPARK-3162][MLlib][WIP] Add local tree training for decision tree 
regressors

## What changes were proposed in this pull request?

Based on [Yggdrasil](https://github.com/fabuzaid21/yggdrasil), added local 
training of decision tree regressors.

Some classes/objects largely correspond to Yggdrasil classes/objects.
Specifically:
* class LocalDecisionTreeRegressor --> class YggdrasilRegressor
* object LocalDecisionTree --> object YggdrasilRegression
* object LocalDecisionTreeUtils --> object Yggdrasil

## How was this patch tested?

Added unit tests (in `ml/tree/impl/LocalTreeTrainingSuite.scala`) verifying 
that local & distributed training of a decision tree regressor produce the 
same tree.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark local-trees-pr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14872.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14872


commit acf5b3e29a346a0cb86f621269855a6a98a9a74e
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-08-29T23:51:33Z

Add local tree training for decision tree regressors

commit aa4fcc8d401385f38fe0cdfdb9fe39062c3a9f96
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-08-30T01:19:07Z

Fix setting of impurity values for leaf nodes to match values produced by
distributed Random Forest algorithm

commit f273fc6a4b5048ae577d03676def354dce5c87a7
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-08-31T07:01:26Z

WIP refactoring single-machine tree code

commit 5e61e3b29c236d27e0d655d15a48f2fe3e13d26a
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-01T01:22:48Z

Remove unused imports, remove array of single-node impurity aggregators

commit d2060fc460a97228a36bf81956cf8dd24c83106e
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-01T23:11:17Z

WIP

commit 634a3223374608d68018daac5500a429034bbc20
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-02T00:21:10Z

More work, tests still pass

commit eb7fde00e0db5aa5d04951f8f4a9cd62204f1609
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-02T17:02:06Z

WIP: Added tests for classes upon which local tree training is dependent. 
Some integration tests fail

commit b748f05e3eaa7d58b1ad86d269e0dda5f35ee885
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-02T17:37:31Z

WIP debugging

commit 297052242727e6693ccbacf89f44b3ff6db584f7
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-02T21:34:13Z

Consolidate checking for valid splits

commit 8d443ce38f958e7b83b502e614e01c824cb63c4b
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-02T21:52:47Z

Delete empty test suite

commit ee56ffe98756ed78cefbc3f782a471f04e80b256
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-09-02T22:49:38Z

Fix some style errors







[GitHub] spark pull request #14872: [SPARK-3162][MLlib][WIP] Add local tree training ...

2016-09-02 Thread smurching
Github user smurching closed the pull request at:

https://github.com/apache/spark/pull/14872





[GitHub] spark pull request #14872: Add local tree training for decision tree regress...

2016-08-29 Thread smurching
GitHub user smurching opened a pull request:

https://github.com/apache/spark/pull/14872

Add local tree training for decision tree regressors

## What changes were proposed in this pull request?

Based on [Yggdrasil](https://github.com/fabuzaid21/yggdrasil), added local 
training of decision tree regressors.

Some classes/objects largely correspond to Yggdrasil classes/objects.
Specifically:
* class LocalDecisionTreeRegressor --> class YggdrasilRegressor
* object LocalDecisionTree --> object YggdrasilRegression
* object LocalDecisionTreeUtils --> object Yggdrasil

## How was this patch tested?

Added unit tests (in `ml/tree/impl/LocalTreeTrainingSuite.scala`) verifying 
that local & distributed training of a decision tree regressor produce the 
same tree.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark local-trees-pr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14872.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14872


commit acf5b3e29a346a0cb86f621269855a6a98a9a74e
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-08-29T23:51:33Z

Add local tree training for decision tree regressors

commit aa4fcc8d401385f38fe0cdfdb9fe39062c3a9f96
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-08-30T01:19:07Z

Fix setting of impurity values for leaf nodes to match values produced by
distributed Random Forest algorithm







[GitHub] spark issue #13881: [SPARK-3723] [MLlib] Adding instrumentation to random fo...

2016-06-24 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/13881
  
Does it make sense to perform the instrumentation-related computations 
(i.e. updating the max/min nodes per group) only when the instrumentation 
argument to RandomForest.run (`instr`) is not `None`? The current 
implementation doesn't check for this.
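
A minimal, self-contained sketch of the kind of guard being suggested (`Instr` and `logStat` are made-up stand-ins, not Spark's real `Instrumentation` API):

```scala
object InstrumentationGuardExample {

  // Stand-in for an instrumentation sink.
  final case class Instr(name: String) {
    def logStat(key: String, value: Double): Unit = println(s"[$name] $key = $value")
  }

  def trainGroups(groups: Seq[Seq[Int]], instr: Option[Instr]): Unit = {
    // ... training work happens regardless of instrumentation ...
    // Only compute and log the group statistics when a sink was actually supplied.
    instr.foreach { i =>
      val sizes = groups.map(_.size.toDouble)
      i.logStat("minNodesPerGroup", sizes.min)
      i.logStat("maxNodesPerGroup", sizes.max)
      i.logStat("avgNodesPerGroup", sizes.sum / sizes.length)
    }
  }

  def main(args: Array[String]): Unit = {
    trainGroups(Seq(Seq(1, 2), Seq(3, 4, 5)), Some(Instr("rf")))  // logs three stats
    trainGroups(Seq(Seq(1, 2), Seq(3, 4, 5)), None)               // skips the bookkeeping
  }
}
```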





[GitHub] spark pull request #13881: [SPARK-3723] [MLlib] Adding instrumentation to ra...

2016-06-23 Thread smurching
GitHub user smurching opened a pull request:

https://github.com/apache/spark/pull/13881

[SPARK-3723] [MLlib] Adding instrumentation to random forests

## What changes were proposed in this pull request?

In RandomForest.run(), added instrumentation for the number of node groups, 
along with the min, max, and average number of nodes per group.

Also fixed a typo in BaggedPoint.scala documentation.


## How was this patch tested?

Tested by running RandomForestClassifierSuite, checking the test output 
manually to make sure instrumentation information was present and reasonable.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smurching/spark random-forest-instrumentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13881.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13881


commit 8f45533b9a5f7c3c1f46d0d15a9f1815fa6227d5
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-06-23T23:40:26Z

Fix typo in BaggedPoint.scala, add simple instrumentation to Random Forests

commit bd7d24d4f5a79eca6ff9629706c254beba74bc45
Author: Siddharth Murching <smurch...@databricks.com>
Date:   2016-06-24T00:40:02Z

Reorder instrumentation logging statements to look nicer



