[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39401757
  
@hirakendu Thanks a lot for the detailed comments and feedback. Yes, we 
have a responsibility to keep improving the trees going forward, so getting 
additional feedback is awesome.

The feedback on the current implementation in the 'Miscellaneous' 
section is mostly about renaming and minor refactoring of the code. I agree with 
most of it; some choices are personal preferences that we should discuss and 
resolve. We should address this soon, when we implement the better ```Impurity``` 
interface that we promised to deliver ASAP. 

I think the "Error" interface you described is very similar to what @mengxr 
proposed as well. We should discuss the naming convention. Although I don't 
feel strongly about "Impurity", "Error" might not be the best name for the 
classification scenario. I am open to better names and ready to be convinced 
otherwise. :-)

For 'General Design Notes', I have similar thoughts but I will wait for 
@etrain's comments since he has thought about it carefully. In general, I like 
@etrain 's MLI design for Model and Algorithm. I did not tie the current 
implementation to the existing traits yet since I wanted to have a broader 
conversation about it after the tree PR. It's straightforward to implement once 
we agree on the interfaces for mllib algorithms.

Finally, and most importantly, thanks a ton for performing such extensive 
tests on a massive dataset! The results are not too shabby. ;-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39396722
  
Thanks for noticing the typo, it's 500 million examples. Corrected it.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread etrain
Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39392123
  
Hi Hirakendu - thanks for all the detailed suggestions and information. I 
will reply to that separately.

One question - you say there are 500,000 examples and this equates to 90GB 
of raw data. If that's the case, this works out to ~200KB per example - is that 
right or are you off by an order of magnitude in either the number of features 
or the number of data points? Or are we throwing a bunch of data out before 
fitting?




[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39390821
  
Since it's a long review, I have put things together in one place 
[spark_mllib_tree_pr79_review.md](https://github.com/hirakendu/temp/blob/master/spark_mllib_tree_pr79_review/spark_mllib_tree_pr79_review.md)
 so that it's easy to track later.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39390268
  
In terms of running-time performance, here are some scalability results for 
a large-scale dataset. The results look satisfactory :).

The code was tested on a dataset for a binary classification problem. A 
regression tree with `Variance` (square loss) was trained because it is 
computationally more intensive. The dataset consists of about `500 million` 
training instances. There are 20 features, all numeric. Although there are 
categorical features in the dataset and the algorithm implementation supports 
them, they were not used since the main program doesn't expose options for them.

The dataset is about 90 GB in plain text format. It consists of 100 
part files, each about 900 MB. To _optimize_ the number of tasks to align with the 
number of workers, the task split size was chosen to be 160 MB, yielding 600 
tasks.

The model training experiments were done on a Yahoo internal Hadoop 
cluster. The CPUs are of type Intel Xeon 2.4 GHz. The Spark YARN adaptation was 
used to run on the Hadoop cluster. Both master-memory and worker-memory were set to 
`7500m`, and the cluster was under light to moderate load at the time of the 
experiments. The fraction of memory used for caching was `0.3`,
leading to about `2 GB` of memory per worker for caching. The numbers of 
workers used for the various experiments were `20, 30, 60, 100, 150, 300, 600`, 
chosen to evenly divide the 600 tasks. Note that with 20 and 30 workers, only `48%` and 
`78%` of the 90 GB of data could be cached in memory. A decision tree of depth 
10 was trained, and minor code changes were made to record the per-level 
training times. The training command (with additional JVM settings) used is

time \
SPARK_JAVA_OPTS="${SPARK_JAVA_OPTS} \
-Dspark.hadoop.mapred.min.split.size=167772160 \
-Dspark.hadoop.mapred.max.split.size=167772160 \
-Dspark.storage.memoryFraction=0.3" \
spark-class org.apache.spark.deploy.yarn.Client \
--queue ${QUEUE} \
--num-workers ${NUM_WORKERS} \
--worker-memory 7500m \
--worker-cores 1 \
--master-memory 7500m \
--jar ${JARS}/spark_mllib_tree.jar \
--class org.apache.spark.mllib.tree.DecisionTree \
--args yarn-standalone \
--args --algo --args Regression \
--args --trainDataDir --args ${DIST_WORK}/train_data_lp.txt \
--args --testDataDir --args ${DIST_WORK}/test_data_0p1pc_lp.txt \
--args --maxDepth --args 10 \
--args --impurity --args Variance \
--args --maxBins --args 100


The training times for each depth and each choice of the number of 
workers are in the attachment 
[workers_times.txt](https://raw.githubusercontent.com/hirakendu/temp/master/spark_mllib_tree_pr79_review/workers_times.txt).
The attached graphs demonstrate the scalability in terms of
cumulative training times for various depths and numbers of workers.


![](https://raw.githubusercontent.com/hirakendu/temp/master/spark_mllib_tree_pr79_review/workers_times.png)


![](https://raw.githubusercontent.com/hirakendu/temp/master/spark_mllib_tree_pr79_review/workers_speedups.png)


For obtaining the speed-ups, the training times are compared against those for 
60 workers, since the data could not be cached completely with 20 and 30 
workers. For all experiments, the requested resources
were fully allocated by the cluster, and all experiments ran to completion 
in their first and only run.
As we can see, the scaling is nearly linear at the higher depths 9 and 10, 
across the range of 60 to 600 workers, although the slope is less than 
1, as expected. For such loads, 60 to 100 workers is a reasonable amount of computing 
resource, with total training times of about 160 minutes and 100 minutes 
respectively. Overall, the performance is satisfactory, in particular when trees
of depth 4 or 5 are trained for boosting models. But clearly, there is room 
for improvement in the following sense: the fully cached dataset takes 
about 10 seconds for a count operation, whereas the training time for the first level, 
which involves a simple histogram computation of three error statistics, is roughly 
30 seconds.

The error performance in terms of RMSE was verified to be close to that of 
alternative implementations.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229944
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -0,0 +1,1150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
--- End diff --

A plan should be made to have a consistent hierarchy and organization of 
various algorithms in the MLLib package. A separate `tree` subpackage seems 
unnecessary.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229920
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -0,0 +1,1150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.util.control.Breaks._
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity, 
Variance}
+import org.apache.spark.mllib.tree.model._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * A class that implements a decision tree algorithm for classification 
and regression. It
+ * supports both continuous and categorical features.
+ * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+ * of algorithm (classification, regression, etc.), 
feature type (continuous,
+ * categorical), depth of the tree, quantile calculation 
strategy, etc.
+ */
+class DecisionTree private(val strategy: Strategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a decision tree model over an RDD
+   * @param input RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]] used as training data
+   * @return a DecisionTreeModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
+
+// Cache input RDD for speedup during multiple passes.
+input.cache()
+logDebug("algo = " + strategy.algo)
+
+// Find the splits and the corresponding bins (interval between the 
splits) using a sample
+// of the input data.
+val (splits, bins) = DecisionTree.findSplitsBins(input, strategy)
+logDebug("numSplits = " + bins(0).length)
+
+// depth of the decision tree
+val maxDepth = strategy.maxDepth
+// the max number of nodes possible given the depth of the tree
+val maxNumNodes = scala.math.pow(2, maxDepth).toInt - 1
+// Initialize an array to hold filters applied to points for each node.
+val filters = new Array[List[Filter]](maxNumNodes)
+// The filter at the top node is an empty list.
+filters(0) = List()
+// Initialize an array to hold parent impurity calculations for each 
node.
+val parentImpurities = new Array[Double](maxNumNodes)
+// dummy value for top node (updated during first split calculation)
+val nodes = new Array[Node](maxNumNodes)
+
+
+/*
+ * The main idea here is to perform level-wise training of the decision tree nodes, thus
+ * reducing the passes over the data from l to log2(l), where l is the total number of nodes.
+ * Each data sample is checked for validity w.r.t. each node at a given level -- i.e., the
+ * sample is only used for the split calculation at a node if it would have survived the
+ * filters of the parent nodes.
+ */
+
+// TODO: Convert for loop to while loop
+breakable {
+  for (level <- 0 until maxDepth) {
+
+logDebug("#")
+logDebug("level = " + level)
+logDebug("#")
+
+// Find best split for all nodes at a level.
+val splitsStatsForLevel = DecisionTree.findBestSplits(input, 
parentImpurities, strategy,
+  level, filters, splits, bins)
+
+for ((nodeSplitStats, index) <- 
splitsStatsForLevel.view.zipWithIndex) {
+ 

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229869
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Algo.scala ---
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+/**
+ * Enum to select the algorithm for the decision tree
+ */
+object Algo extends Enumeration {
--- End diff --

The various `Enumeration` classes in the `mllib.tree.configuration` package are 
neat. A uniform _design pattern_ for parameters and options should be used across 
MLLib and Spark, and this could be a start. Alternatively, if there is an 
existing pattern in use, it should be followed for the decision tree as well.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229453
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Algo.scala ---
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+/**
+ * Enum to select the algorithm for the decision tree
+ */
+object Algo extends Enumeration {
--- End diff --

The `Algorithm` enumeration seems redundant given `Impurity`, which implies 
the `Algorithm` anyway.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229365
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain.
+ */
+trait Impurity extends Serializable {
--- End diff --

`Impurity` should be renamed to `Error` or something more technical and 
familiar. Also see the earlier comments on the necessity and example design of 
a generic `Error` interface.
   
The `calculate` method can be renamed to something more descriptive like `error`.
   
For a generic interface, an additional `ErrorStats` trait and an 
`error(errorStats: ErrorStats)` method can be added. For example, `Variance`, or 
more aptly `SquareError`, would implement `case class SquareErrorStats(count: 
Long, mean: Double, meanSquare: Double)` and `error(errorStats) = 
errorStats.meanSquare - errorStats.mean * errorStats.mean`. Note that 
`ErrorStats` should have aggregation methods; the implementation for 
`SquareErrorStats` is easy to see.

The `Variance` class should be renamed to `SquareError`, `Entropy` to 
`EntropyError` or `KLDivergence`, `Gini` to `GiniError`.
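
A minimal sketch of this proposed design, under the naming suggestions above 
(all names here are illustrative, not the PR's actual API):

```scala
// Hypothetical sketch of the Error/ErrorStats interface suggested in the
// review. Names (Error, ErrorStats, SquareError, SquareErrorStats) are the
// reviewer's proposals, not code from the PR.

trait ErrorStats[S <: ErrorStats[S]] {
  // Aggregation across partitions or bins.
  def merge(other: S): S
}

trait Error[S <: ErrorStats[S]] extends Serializable {
  def error(stats: S): Double
}

// Sufficient statistics for square loss: count, mean, and mean of squares.
case class SquareErrorStats(count: Long, mean: Double, meanSquare: Double)
    extends ErrorStats[SquareErrorStats] {
  def merge(other: SquareErrorStats): SquareErrorStats = {
    val n = count + other.count
    SquareErrorStats(
      n,
      (count * mean + other.count * other.mean) / n,
      (count * meanSquare + other.count * other.meanSquare) / n)
  }
}

object SquareError extends Error[SquareErrorStats] {
  // Variance = E[X^2] - E[X]^2.
  def error(stats: SquareErrorStats): Double =
    stats.meanSquare - stats.mean * stats.mean
}
```

The merge operation is associative and commutative, which is what makes the 
statistics easy to aggregate in a distributed reduce.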





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229274
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/Bin.scala 
---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+
+/**
+ * Used for "binning" the features for faster best split calculation. For a continuous
+ * feature, a bin is determined by a low and a high "split". For a categorical feature,
+ * a bin is determined by a single label value (category).
+ * @param lowSplit signifying the lower threshold for the continuous 
feature to be
+ * accepted in the bin
+ * @param highSplit signifying the upper threshold for the continuous 
feature to be
+ * accepted in the bin
+ * @param featureType type of feature -- categorical or continuous
+ * @param category categorical label value accepted in the bin
+ */
+case class Bin(lowSplit: Split, highSplit: Split, featureType: 
FeatureType, category: Double)
--- End diff --

The `Bin` class can be simplified and some members renamed. The `lowSplit` and 
`highSplit` can be reduced to the single threshold corresponding to the left 
end of the bin range, which could be named `leftEnd` or `lowEnd`.

It's not clear this class is needed in the first place. For categorical 
variables, the value itself is the bin index, and for continuous variables, 
bins are simply defined by candidate thresholds, in turn defined by quantiles. 
For every feature id, one can maintain a list of categories or thresholds and 
be done. In that case, for continuous features, the position of the threshold 
is the bin index.
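
To illustrate, the bin-index lookup described above can be sketched as follows; 
the names are hypothetical, not the PR's API:

```scala
// Illustrative sketch: per feature we keep either the list of categories
// or the sorted candidate thresholds (quantiles), and compute bin indices
// directly, with no Bin class needed.
object BinIndex {

  // For a categorical feature, the value itself indexes the bin.
  def categorical(value: Double): Int = value.toInt

  // For a continuous feature, the bin is the position of the first
  // threshold that the value does not exceed.
  def continuous(value: Double, thresholds: Array[Double]): Int = {
    val i = java.util.Arrays.binarySearch(thresholds, value)
    if (i >= 0) i else -i - 1 // insertion point for in-between values
  }
}
```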





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229300
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala
 ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+/**
+ * Information gain statistics for each split
+ * @param gain information gain value
+ * @param impurity current node impurity
+ * @param leftImpurity left node impurity
+ * @param rightImpurity right node impurity
+ * @param predict predicted value
+ */
+class InformationGainStats(
--- End diff --

`InformationGainStats` and `Split` nicely separate the members of `Node`, 
but they can also be flattened and put at the top level. That would make storage 
and explanation slightly easier, albeit less structured.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229221
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+
+/**
+ * Stores all the configuration options for tree construction
+ * @param algo classification or regression
+ * @param impurity criterion used for information gain calculation
+ * @param maxDepth maximum depth of the tree
+ * @param maxBins maximum number of bins used for splitting features
+ * @param quantileCalculationStrategy algorithm for calculating quantiles
+ * @param categoricalFeaturesInfo A map storing information about the 
categorical variables and the
+ *number of discrete values they take. For 
example, an entry (n ->
+ *k) implies the feature n is categorical 
with k categories 0,
+ *1, 2, ... , k-1. It's important to note 
that features are
+ *zero-indexed.
+ */
+class Strategy (
--- End diff --

`Strategy` should be renamed to `Parameters`. Modeling and algorithm 
parameters can be separate, the latter being part of the model.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229187
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala 
---
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+
+/**
+ * Node in a decision tree
+ * @param id integer node id
+ * @param predict predicted value at the node
+ * @param isLeaf whether the node is a leaf
+ * @param split split to calculate left and right nodes
+ * @param leftNode  left child
+ * @param rightNode right child
+ * @param stats information gain stats
+ */
+class Node (
--- End diff --

`Node` should be a recursive/linked structure with references to the child 
nodes and the parent node (the latter allows easy traversal), so I don't see 
the need for `nodes: Array[Node]` as the model, nor for the `build` method in 
`Node`. The `DecisionTreeModel` should essentially be the root `Node`, with 
methods like `predict` etc. involving a recursive traversal over child nodes.

The method `predictIfleaf` in `Node` should be renamed to simply `predict`: 
it predicts regardless of whether the node is a leaf, recursively traversing 
until it hits a leaf. The prediction value should be renamed from `predict` to 
`prediction`, which would clear up the ambiguity with this `predict` method.

Putting things together: `Node` should be simple, with a `prediction`, and 
should be a recursive structure. `DecisionTreeModel` should be a wrapper around 
a root `Node` member, containing methods like `predict`, `save`, `explain`, 
`load`, etc., based on recursive traversal.
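
A minimal sketch of this recursive structure; the class and field names are 
illustrative assumptions, not the PR's actual types:

```scala
// Hypothetical sketch of the recursive Node / model wrapper described
// in the review; only a single continuous split per node is shown.
case class SimpleNode(
    prediction: Double,
    isLeaf: Boolean,
    feature: Int = -1,       // split feature index (internal nodes only)
    threshold: Double = 0.0, // split threshold (continuous features)
    left: Option[SimpleNode] = None,
    right: Option[SimpleNode] = None) {

  // Recursive traversal: descend until a leaf, then return its prediction.
  def predict(features: Array[Double]): Double =
    if (isLeaf) prediction
    else if (features(feature) <= threshold) left.get.predict(features)
    else right.get.predict(features)
}

// The model is essentially a wrapper around the root node.
class SimpleTreeModel(root: SimpleNode) {
  def predict(features: Array[Double]): Double = root.predict(features)
}
```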





[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229120
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/Split.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.mllib.tree.configuration.FeatureType.FeatureType
+
+/**
+ * Split applied to a feature
+ * @param feature feature index
+ * @param threshold threshold for continuous feature
+ * @param featureType type of feature -- categorical or continuous
+ * @param categories accepted values for categorical variables
+ */
+case class Split(
--- End diff --

The functionality of `Split` can be simplified. If I understand correctly, 
`Split` represents the left or right (low or high) branch of the parent 
node. Instead, it suffices to store a single branching condition per node, 
appropriately named `SplitPredicate` or `SplittingCondition`, consisting of 
the feature id, the feature type (continuous or categorical), the threshold, 
and the left- and right-branch categories.

Depending on the choice here, `Filter` may still be required, but I think 
it's redundant and we should exploit the recursive/linked structure of the 
tree, which we are doing anyway.
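
As a sketch (`SplitPredicate` and `FeatureKind` are placeholder names for this suggestion, not the PR's types), such a per-node condition could look like:

```scala
// Illustrative feature-type tag for the sketch below.
object FeatureKind extends Enumeration {
  val Continuous, Categorical = Value
}

// One branching condition per internal node, replacing separate
// left/right Split instances.
case class SplitPredicate(
    feature: Int,                             // feature index
    featureType: FeatureKind.Value,           // continuous or categorical
    threshold: Double = Double.NaN,           // used for continuous features
    leftCategories: Set[Double] = Set.empty,  // used for categorical features
    rightCategories: Set[Double] = Set.empty) {

  // A sample goes left if the condition holds, right otherwise.
  def goesLeft(features: Array[Double]): Boolean = featureType match {
    case FeatureKind.Continuous  => features(feature) <= threshold
    case FeatureKind.Categorical => leftCategories.contains(features(feature))
  }
}
```

The node then needs only this one predicate plus its child references to route a sample.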




2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11229035
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/Filter.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+/**
+ * Filter specifying a split and type of comparison to be applied on 
features
+ * @param split split specifying the feature index, type and threshold
+ * @param comparison integer specifying <,=,>
+ */
+case class Filter(split: Split, comparison: Int) {
--- End diff --

The `Filter` class is a nice abstraction of the branching conditions leading 
to the current node, but there are already references to the left and right 
child nodes, so I think it is redundant. If need be, a reference to the 
parent node as an `Option[Node]` suffices and is more useful. The 
functionality should be covered across the `Node` and `Split` classes.
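
For illustration, with a parent reference the chain of conditions that `Filter` records can be recovered by walking up the tree. This sketch uses a simplified stand-in node type (`SimpleNode` and its fields are hypothetical, not the PR's API):

```scala
object FilterFreeTraversal {
  // Simplified node: a split condition (None at the root if unsplit)
  // and an optional parent reference.
  case class SimpleNode(split: Option[String], parent: Option[SimpleNode])

  // Collect the split conditions of all ancestors, ordered root-first,
  // replacing a materialized List[Filter].
  def ancestorSplits(node: SimpleNode): List[String] =
    node.parent match {
      case Some(p) => ancestorSplits(p) ++ p.split.toList
      case None    => Nil
    }
}
```

The same information `Filter` stores is thus derivable on demand from the linked structure.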




2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11228983
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -0,0 +1,1150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.util.control.Breaks._
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity, 
Variance}
+import org.apache.spark.mllib.tree.model._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * A class that implements a decision tree algorithm for classification 
and regression. It
+ * supports both continuous and categorical features.
+ * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+ * of algorithm (classification, regression, etc.), 
feature type (continuous,
+ * categorical), depth of the tree, quantile calculation 
strategy, etc.
+ */
+class DecisionTree private(val strategy: Strategy) extends Serializable 
with Logging {
--- End diff --

The `DecisionTree` class and object should be reorganized and separated into 
`DecisionTreeModel` and `DecisionTreeAlgorithm`, with the `Strategy` and 
root `Node` as part of the `Model`, and `train` as part of the `Algorithm`.
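
The proposed split could look roughly like this sketch, where `Node` and `Strategy` stand for the PR's existing types, the bodies are elided, and the names `DecisionTreeModel` and `DecisionTreeAlgorithm` are this suggestion, not existing MLlib API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.model.Node

// Model: holds only what prediction needs — the root node and,
// optionally, the strategy used to build it.
class DecisionTreeModel(val topNode: Node, val strategy: Strategy)
    extends Serializable {
  // Recursive traversal from the root, as discussed in the Node comment.
  def predict(features: Array[Double]): Double = ???
}

// Algorithm: holds the training configuration and produces a Model.
class DecisionTreeAlgorithm(val strategy: Strategy) extends Serializable {
  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
    val root: Node = ???  // grow the tree as the current DecisionTree does
    new DecisionTreeModel(root, strategy)
  }
}
```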



2014-04-02 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11228901
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

Adding to the discussion on the need for a generic interface for 
`Impurity`, or more precisely `Error`, I believe we all see that it's good 
to have. Ideally I would have preferred a single `Error` trait that all 
error types, such as square loss or KL divergence, extend, but the consensus 
is that this negatively impacts performance.

In addition to performance-oriented implementations for specific loss 
functions, I would still recommend a generic `Error` interface and a generic 
implementation of the decision tree based on this interface. One possibility 
is to add a third method, `calculate(stats)`, or more precisely 
`error(errorStats: ErrorStats)`, to the `Error` interface. I am not sure it 
will solve the signature-collision problem, though, unless we keep only the 
one signature for generic error statistics.

For reference, see `trait LossStats[S <: LossStats[S]]` and `abstract class 
Loss[S <: LossStats[S]:Manifest]` in my previous PR, 
[https://github.com/apache/incubator-spark/pull/161/files](https://github.com/apache/incubator-spark/pull/161/files),
 which do exactly that: they provide interfaces for aggregatable error 
statistics and for calculating the error from those statistics. (On second 
thought, I feel `ErrorStats` and `Error` are better names.) Also see the 
generic implementation `class DecisionTreeAlgorithm[S <: 
LossStats[S]:Manifest]` and the implementations of specific error functions, 
`SquareLoss` and `EntropyLoss`.
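
A simplified sketch of such an interface pair (condensed from the shape described above; the exact signatures in that PR differ):

```scala
// Aggregatable error statistics: absorb samples, merge partial results.
trait LossStats[S <: LossStats[S]] extends Serializable {
  def add(label: Double): S   // absorb one sample's label
  def merge(other: S): S      // combine statistics from two partitions
}

// A loss function computes an error value from aggregated statistics.
abstract class Loss[S <: LossStats[S]: Manifest] extends Serializable {
  def zeroStats: S            // identity element for aggregation
  def error(stats: S): Double // error computed from the statistics
}

// Example: square loss from (count, sum, sumSquares) statistics.
case class SquareStats(count: Double, sum: Double, sumSquares: Double)
    extends LossStats[SquareStats] {
  def add(label: Double): SquareStats =
    SquareStats(count + 1, sum + label, sumSquares + label * label)
  def merge(o: SquareStats): SquareStats =
    SquareStats(count + o.count, sum + o.sum, sumSquares + o.sumSquares)
}

object SquareLoss extends Loss[SquareStats] {
  def zeroStats: SquareStats = SquareStats(0, 0, 0)
  // Sum of squared deviations from the mean: sumSquares - sum^2 / count.
  def error(s: SquareStats): Double =
    if (s.count == 0) 0.0 else s.sumSquares - s.sum * s.sum / s.count
}
```

A generic tree implementation then only aggregates `LossStats` per candidate split and calls `error` to score it, with specific losses plugged in.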




2014-04-02 Thread hirakendu
Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39385736
  
Thanks Manish, Evan, Ameet, Ram, and thanks Xiangrui, Matei. It's great to 
have this PR merged, but I think there is room for a lot of improvement, and I 
hope this thread is still open for comments. I have some detailed notes from my 
side, we can track them in a separate place if required.

Some design notes to start with:

The current design of classes and relationships is good, but I think it 
would be great if it were modified slightly to resemble the design of the 
other existing algorithms in MLlib and to extend the existing interfaces. 
For example, we can have a `DecisionTreeModel` that extends the existing 
`RegressionModel` interface, similar to the existing `RidgeRegressionModel` 
and `LinearRegressionModel`. Alternatively, we can have separate 
`ClassificationTreeModel` and `RegressionTreeModel` classes that extend the 
existing `ClassificationModel` and `RegressionModel` interfaces respectively.

Note that it is important to keep the `Model` as a class, so that we can 
later compose ensembles and other composite models consisting of multiple 
`Model` instances. Currently the decision tree `Node` is essentially the 
`Model`, although I would prefer a wrapper around it, explicitly called 
`DecisionTreeModel`, that implements methods like `predict`, `save`, 
`explain` and `load`, like those I remember coming across in MLI.

In the file `DecisionTree.scala`, the actual class can be renamed to 
`DecisionTreeAlgorithm`, extending an `Algorithm` interface that implements 
`train(samples)` and outputs a `RegressionModel`. Note that there is 
currently no `Algorithm` interface in MLlib, although there may be one in 
MLI; the closest is 
`abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]`.

The modeling `Strategy` should be renamed to `Parameters`. I would prefer a 
separation between model parameters, like depth and minimum gain, and 
algorithm parameters, like level-by-level training or caching. The line is 
blurred for some aspects, like the quantile strategy, but I am inclined to 
treat those as modeling parameters so they can be referred to later on. The 
rule of thumb is that different algorithm parameters should lead to the same 
model (up to randomization effects) for the same model parameters.

`Impurity` should be renamed to `Error` or something more technical. Also 
see my later comments on the need for a generic `Error` interface that 
allows easy adaptation of the algorithms to specific loss functions.

In general, MLlib and MLI should define proper interfaces for `Model`, 
`Algorithm`, `Parameters` and, importantly, `Loss` entities, at least for 
supervised machine learning algorithms, and all implementations should 
adhere to them. Surprisingly, neither MLlib nor MLI currently has a `Loss` 
or `Error` interface; the closest is the `Optimizer` interface.
There is also a need for portable model output formats, e.g., JSON, that 
can be used by other programs, possibly written in languages other than 
Scala and Java. We could also use an existing format like PMML (though I'm 
not sure how widely it's used). Lastly, there is a need to support standard 
data formats with optional delimiter parameters; I am sure this is a general 
need for Spark. I understand there has been significant effort toward 
standardization before; it would be good to know the current status.
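
One possible shape for such interfaces, as a sketch only (these traits are hypothetical; neither MLlib nor MLI defines them as of this PR):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// A trained model: everything needed for prediction, nothing else.
trait Model extends Serializable {
  def predict(features: Array[Double]): Double
}

// Marker for model parameters (depth, minimum gain, ...), kept separate
// from algorithm parameters (caching, level-by-level training, ...).
trait Parameters extends Serializable

// An algorithm consumes samples and parameters and produces a Model.
trait Algorithm[M <: Model, P <: Parameters] extends Serializable {
  def train(samples: RDD[LabeledPoint], params: P): M
}
```

Each supervised learner, trees included, would then be one `Algorithm` implementation producing one `Model` implementation.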



2014-04-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/79



2014-04-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39288342
  
Excellent, thanks a lot Manish, Ram and others! I've merged this in.



2014-04-01 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39287513
  
@mateiz This looks good to me and I will mark some APIs 
developer/experimental after it gets merged. Please help merge it.

Thanks @manishamde , @hirakendu , @etrain , @atalwalkar , and @harsha2010 
for the work!



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39169881
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13623/



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39169879
  
Merged build finished. All automated tests passed.



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39167685
  
 Merged build triggered. 



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39167697
  
Merged build started. 



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39159295
  
Merged build finished. All automated tests passed.



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39159296
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13614/



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39155725
  
Merged build started. Build is starting -or- tests failed to complete.



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39155715
  
 Merged build triggered. Build is starting -or- tests failed to complete.



2014-03-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39155337
  
@manishamde  Please merge the changes at 
https://github.com/manishamde/spark/pull/4



2014-03-31 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11142394
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+
+/**
+ * Stores all the configuration options for tree construction
+ * @param algo classification or regression
+ * @param impurity criterion used for information gain calculation
+ * @param maxDepth maximum depth of the tree
+ * @param maxBins maximum number of bins used for splitting features
+ * @param quantileCalculationStrategy algorithm for calculating quantiles
+ * @param categoricalFeaturesInfo A map storing information about the 
categorical variables and the
+ *number of discrete values they take. For 
example, an entry (n ->
+ *k) implies the feature n is categorical 
with k categories 0,
+ *1, 2, ... , k-1. It's important to note 
that features are
+ *zero-indexed.
+ */
+class Strategy  (
+val algo: Algo,
+val impurity: Impurity,
+val maxDepth: Int,
+val maxBins: Int = 100,
+val quantileCalculationStrategy: QuantileStrategy = Sort,
+val categoricalFeaturesInfo: Map[Int,Int] = 
Map[Int,Int]()) extends Serializable {
+
+  var numBins: Int  = Int.MinValue
--- End diff --

Will do. Thanks!



2014-03-31 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11142399
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -0,0 +1,431 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Variance}
+import org.apache.spark.mllib.tree.model.Filter
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+
+class DecisionTreeSuite extends FunSuite with BeforeAndAfterAll {
+
+  @transient private var sc: SparkContext = _
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  test("split and bin calculation"){
+val arr = DecisionTreeSuite.generateOrderedLabeledPointsWithLabel1()
+assert(arr.length == 1000)
+val rdd = sc.parallelize(arr)
+val strategy = new Strategy(Classification,Gini,3,100)
+val (splits, bins) = DecisionTree.findSplitsBins(rdd,strategy)
+assert(splits.length==2)
+assert(bins.length==2)
+assert(splits(0).length==99)
+assert(bins(0).length==100)
+  }
+
+  test("split and bin calculation for categorical variables"){
+val arr = DecisionTreeSuite.generateCategoricalDataPoints()
+assert(arr.length == 1000)
+val rdd = sc.parallelize(arr)
+val strategy = new 
Strategy(Classification,Gini,3,100,categoricalFeaturesInfo = Map(0 -> 2,
+  1-> 2))
+val (splits, bins) = DecisionTree.findSplitsBins(rdd,strategy)
+assert(splits.length==2)
+assert(bins.length==2)
+assert(splits(0).length==99)
+assert(bins(0).length==100)
+
+//Checking splits
+
+assert(splits(0)(0).feature == 0)
+assert(splits(0)(0).threshold == Double.MinValue)
+assert(splits(0)(0).featureType == Categorical)
+assert(splits(0)(0).categories.length == 1)
+assert(splits(0)(0).categories.contains(1.0))
+
+
+assert(splits(0)(1).feature == 0)
+assert(splits(0)(1).threshold == Double.MinValue)
+assert(splits(0)(1).featureType == Categorical)
+assert(splits(0)(1).categories.length == 2)
+assert(splits(0)(1).categories.contains(1.0))
+assert(splits(0)(1).categories.contains(0.0))
+
+assert(splits(0)(2) == null)
+
+assert(splits(1)(0).feature == 1)
+assert(splits(1)(0).threshold == Double.MinValue)
+assert(splits(1)(0).featureType == Categorical)
+assert(splits(1)(0).categories.length == 1)
+assert(splits(1)(0).categories.contains(0.0))
+
+
+assert(splits(1)(1).feature == 1)
+assert(splits(1)(1).threshold == Double.MinValue)
+assert(splits(1)(1).featureType == Categorical)
+assert(splits(1)(1).categories.length == 2)
+assert(splits(1)(1).categories.contains(1.0))
+assert(splits(1)(1).categories.contains(0.0))
+
+assert(splits(1)(2) == null)
+
+
+// Checks bins
+
+assert(bins(0)(0).category == 1.0)
+assert(bins(0)(0).lowSplit.categories.length == 0)
+assert(bins(0)(0).highSplit.categories.length == 1)
+assert(bins(0)(0).highSplit.categories.contains(1.0))
+
+assert(bins(0)(1).category == 0.0)
+assert(bins(0)(1).lowSplit.categories.length == 1)
+assert(bins(0)(1).lowSplit.categories.contains(1.0))
+assert(bins(0)(1).highSplit.categories.length == 2)
+assert(bins(0)(1).highSplit.categories.contains(1.0))
+assert(bins(0)


2014-03-31 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39155083
  
@mengxr I am not sure the merge of your PR on my repo went through. I might 
have closed the PR accidentally and I can't find a way to undo it. Could you 
please re-send your latest PR?



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39154884
  
Merged build finished. All automated tests passed.



2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39154885
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13609/



2014-03-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11141936
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -0,0 +1,431 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Variance}
+import org.apache.spark.mllib.tree.model.Filter
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+
+class DecisionTreeSuite extends FunSuite with BeforeAndAfterAll {
+
+  @transient private var sc: SparkContext = _
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  test("split and bin calculation"){
+val arr = DecisionTreeSuite.generateOrderedLabeledPointsWithLabel1()
+assert(arr.length == 1000)
+val rdd = sc.parallelize(arr)
+val strategy = new Strategy(Classification,Gini,3,100)
+val (splits, bins) = DecisionTree.findSplitsBins(rdd,strategy)
+assert(splits.length==2)
+assert(bins.length==2)
+assert(splits(0).length==99)
+assert(bins(0).length==100)
+  }
+
+  test("split and bin calculation for categorical variables"){
+val arr = DecisionTreeSuite.generateCategoricalDataPoints()
+assert(arr.length == 1000)
+val rdd = sc.parallelize(arr)
+val strategy = new Strategy(Classification,Gini,3,100,categoricalFeaturesInfo = Map(0 -> 2,
+  1-> 2))
+val (splits, bins) = DecisionTree.findSplitsBins(rdd,strategy)
+assert(splits.length==2)
+assert(bins.length==2)
+assert(splits(0).length==99)
+assert(bins(0).length==100)
+
+// Check splits
+
+assert(splits(0)(0).feature == 0)
+assert(splits(0)(0).threshold == Double.MinValue)
+assert(splits(0)(0).featureType == Categorical)
+assert(splits(0)(0).categories.length == 1)
+assert(splits(0)(0).categories.contains(1.0))
+
+
+assert(splits(0)(1).feature == 0)
+assert(splits(0)(1).threshold == Double.MinValue)
+assert(splits(0)(1).featureType == Categorical)
+assert(splits(0)(1).categories.length == 2)
+assert(splits(0)(1).categories.contains(1.0))
+assert(splits(0)(1).categories.contains(0.0))
+
+assert(splits(0)(2) == null)
+
+assert(splits(1)(0).feature == 1)
+assert(splits(1)(0).threshold == Double.MinValue)
+assert(splits(1)(0).featureType == Categorical)
+assert(splits(1)(0).categories.length == 1)
+assert(splits(1)(0).categories.contains(0.0))
+
+
+assert(splits(1)(1).feature == 1)
+assert(splits(1)(1).threshold == Double.MinValue)
+assert(splits(1)(1).featureType == Categorical)
+assert(splits(1)(1).categories.length == 2)
+assert(splits(1)(1).categories.contains(1.0))
+assert(splits(1)(1).categories.contains(0.0))
+
+assert(splits(1)(2) == null)
+
+
+// Check bins
+
+assert(bins(0)(0).category == 1.0)
+assert(bins(0)(0).lowSplit.categories.length == 0)
+assert(bins(0)(0).highSplit.categories.length == 1)
+assert(bins(0)(0).highSplit.categories.contains(1.0))
+
+assert(bins(0)(1).category == 0.0)
+assert(bins(0)(1).lowSplit.categories.length == 1)
+assert(bins(0)(1).lowSplit.categories.contains(1.0))
+assert(bins(0)(1).highSplit.categories.length == 2)
+assert(bins(0)(1).highSplit.categories.contains(1.0))
+assert(bins(0)(1).

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11141895
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -0,0 +1,47 @@
+
+package org.apache.spark.mllib.tree.configuration
+
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+
+/**
+ * Stores all the configuration options for tree construction
+ * @param algo classification or regression
+ * @param impurity criterion used for information gain calculation
+ * @param maxDepth maximum depth of the tree
+ * @param maxBins maximum number of bins used for splitting features
+ * @param quantileCalculationStrategy algorithm for calculating quantiles
+ * @param categoricalFeaturesInfo A map storing information about the categorical variables
+ *                                and the number of discrete values they take. For example,
+ *                                an entry (n -> k) implies feature n is categorical with k
+ *                                categories 0, 1, 2, ..., k-1. Note that features are
+ *                                zero-indexed.
+ */
+class Strategy  (
+val algo: Algo,
+val impurity: Impurity,
+val maxDepth: Int,
+val maxBins: Int = 100,
+val quantileCalculationStrategy: QuantileStrategy = Sort,
+val categoricalFeaturesInfo: Map[Int,Int] = Map[Int,Int]()) extends Serializable {
+
+  var numBins: Int  = Int.MinValue
--- End diff --

Need docs on `numBins` and explain what it means if it is set manually as 
in the test suite.
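For illustration, the interpretation of `categoricalFeaturesInfo` documented above can be sketched with a minimal standalone class (hypothetical names, not the Spark `Strategy` class):

```scala
// Minimal sketch illustrating how the categoricalFeaturesInfo map is read:
// an entry (n -> k) marks feature n as categorical with values 0, 1, ..., k-1;
// features absent from the map are treated as continuous.
case class StrategySketch(
    maxDepth: Int,
    maxBins: Int = 100,
    categoricalFeaturesInfo: Map[Int, Int] = Map.empty) {

  // A feature is categorical iff it has an entry in the map.
  def isCategorical(feature: Int): Boolean =
    categoricalFeaturesInfo.contains(feature)

  // Number of distinct values a categorical feature can take.
  def arity(feature: Int): Int = categoricalFeaturesInfo(feature)
}

// Feature 0 is binary categorical, feature 1 takes three values,
// and any other feature index is continuous.
val s = StrategySketch(maxDepth = 3, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 3))
```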




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39153779
  
@manishamde I sent some minor code style updates to your repo. Please take 
a look and merge it into this PR. Thanks!




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39150614
  
 Merged build triggered. Build is starting -or- tests failed to complete.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39150623
  
Merged build started. Build is starting -or- tests failed to complete.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39150548
  
Jenkins, retest this please.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-30 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11103459
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

@mengxr The generic interface you described is correct. However, implementing 
this generic interface and the corresponding concrete classes is not a minor 
code change: some assumptions in the bin aggregation code may need to be 
updated, and it also requires adding partition-wise impurity calculation and 
aggregation.

@mateiz As @mengxr noted, it's highly unlikely that a user will write their 
own ```Impurity``` implementation. It's mostly an internal API and could be 
addressed soon in a different PR.

I think we all agree (please correct me if I am wrong) the ```Impurity``` 
update belongs to a different PR. I can spend time on it immediately after this 
PR is accepted.

Is this the correct method of marking a method as unstable using the 
javadoc?
```ALPHA COMPONENT```
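For reference, a minimal standalone sketch (hypothetical names, independent of the Spark classes) of what implementations of the two `calculate` overloads typically compute: Gini impurity from the two class counts, and variance from the regression sufficient statistics (count, sum, sum of squares).

```scala
trait ImpuritySketch extends Serializable {
  def calculate(c0: Double, c1: Double): Double
  def calculate(count: Double, sum: Double, sumSquares: Double): Double
}

object GiniSketch extends ImpuritySketch {
  // Gini index for binary labels: 1 - p0^2 - p1^2
  def calculate(c0: Double, c1: Double): Double = {
    val total = c0 + c1
    if (total == 0) 0.0
    else {
      val p0 = c0 / total
      val p1 = c1 / total
      1.0 - p0 * p0 - p1 * p1
    }
  }
  def calculate(count: Double, sum: Double, sumSquares: Double): Double =
    throw new UnsupportedOperationException("Gini is classification-only")
}

object VarianceSketch extends ImpuritySketch {
  def calculate(c0: Double, c1: Double): Double =
    throw new UnsupportedOperationException("Variance is regression-only")
  // Population variance from sufficient statistics: E[x^2] - E[x]^2
  def calculate(count: Double, sum: Double, sumSquares: Double): Double =
    if (count == 0) 0.0
    else sumSquares / count - math.pow(sum / count, 2)
}
```

The unsupported-overload throws make concrete the awkwardness being discussed: each implementation meaningfully supports only one of the two signatures.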




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-28 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11090794
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

@mateiz A user needs an `Impurity` instance to construct `Strategy`, but 
very unlikely they need to call `calculate` directly or implement their own 
`Impurity`. I'm okay if we mark the `calculate` method unstable in another PR 
later.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-28 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11089359
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

BTW I'd also be okay updating this API in a later pull request before we 
release 1.0. It's fair game to change new APIs in that time window.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-28 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11077636
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

I'm just catching up on this, but is the problem that there will be other 
types of Impurity later that calculate different stats (not just variance)? In 
that case, maybe we can have Impurity be parameterized (Impurity[T]) where T is 
a type it accumulates over. However I'd also be okay with leaving this as is 
initially and marking the API unstable if this is an internal API. The question 
is how many users will call this directly.
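One way to read the `Impurity[T]` suggestion, sketched with hypothetical names (this is not the PR's code): parameterize the trait over the type of sufficient statistics it accumulates, so classification and regression impurities can each carry exactly the statistics they need.

```scala
// Hypothetical parameterized impurity: T is the sufficient-statistics type.
trait ImpurityT[T] extends Serializable {
  def zero: T                          // empty statistics
  def add(stats: T, label: Double): T  // fold one label into the statistics
  def merge(a: T, b: T): T             // combine partition-wise statistics
  def calculate(stats: T): Double      // impurity value from the statistics
}

// Regression impurity accumulating (count, sum, sumSquares).
object VarianceT extends ImpurityT[(Double, Double, Double)] {
  def zero = (0.0, 0.0, 0.0)
  def add(s: (Double, Double, Double), label: Double) =
    (s._1 + 1, s._2 + label, s._3 + label * label)
  def merge(a: (Double, Double, Double), b: (Double, Double, Double)) =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3)
  def calculate(s: (Double, Double, Double)): Double = {
    val (n, sum, sumSq) = s
    if (n == 0) 0.0 else sumSq / n - math.pow(sum / n, 2)
  }
}
```

The `merge` method is what the partition-wise aggregation mentioned earlier would use to combine statistics computed on different partitions.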




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-28 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11074467
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

@manishamde Sorry I misunderstood your proposal. I thought `Seq[Double]` 
contains the raw labels,  like
~~~
0 1 1 0 1 0 0 1
~~~
for classification and 
~~~
1.0 2.0 -3.0 5.0
~~~
for regression. You meant that `Seq[Double]` is `Seq(c0, c1)` for 
classification and `Seq(n, sum, sumSq)` for regression. I can see this would be 
a minor code change but it makes the interface more confusing, IMHO. If my 
understanding is correct, Impurity is applied to a sequence of labels. The code 
will have better separation if we use the following:
~~~
labels.foreach(impurityAccumulator.accumulate)
impurityAccumulator += anotherImpurityAccumulator
val impurity = impurityAccumulator.get
~~~
I think this is also a minor code change and it gives a cleaner public 
interface of `Impurity`. Maybe we can get some suggestions from others. @mateiz?
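Concretely, the three-line pseudocode above might look like the following mutable accumulator (hypothetical names, shown for variance; this is not the PR's code):

```scala
// Accumulator-style impurity: accumulate labels one at a time, merge
// accumulators across partitions with +=, then read the impurity with get.
class VarianceAccumulator extends Serializable {
  private var count = 0.0
  private var sum = 0.0
  private var sumSquares = 0.0

  // Fold one label into the running statistics.
  def accumulate(label: Double): Unit = {
    count += 1
    sum += label
    sumSquares += label * label
  }

  // Merge another partition's accumulator into this one.
  def +=(other: VarianceAccumulator): VarianceAccumulator = {
    count += other.count
    sum += other.sum
    sumSquares += other.sumSquares
    this
  }

  // Population variance from the accumulated statistics.
  def get: Double =
    if (count == 0) 0.0
    else sumSquares / count - math.pow(sum / count, 2)
}
```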




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11055502
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

@mengxr I am not clear what you meant by the ```accumulate``` and ```get``` 
methods. 

```calculate(Seq[Double])``` will not run into memory issues: the input 
sequence holds sufficient statistics rather than raw labels, so its length is 
2 for the classification implementations (Gini, Entropy) and 3 for the 
regression implementation (Variance). The ```calculate``` method is only used 
after bin aggregation and the subsequent bin-aggregate-to-split-aggregate 
conversion. Having a single method signature in the ```Impurity``` trait is 
just a minor code change.
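Under that reading, the single-signature variant would interpret the sequence as packed sufficient statistics, `Seq(c0, c1)` for classification and `Seq(count, sum, sumSquares)` for regression. A hedged sketch (hypothetical names, not the PR's code):

```scala
// Hypothetical single-signature impurity: the Seq carries sufficient
// statistics, not raw labels, so its length is fixed (2 for classification)
// regardless of how many data points were aggregated.
object GiniSeq {
  def calculate(stats: Seq[Double]): Double = stats match {
    case Seq(c0, c1) =>
      val total = c0 + c1
      if (total == 0) 0.0
      else 1.0 - math.pow(c0 / total, 2) - math.pow(c1 / total, 2)
    case _ =>
      throw new IllegalArgumentException("expected Seq(c0, c1)")
  }
}
```

The pattern match makes the trade-off visible: the trait gains a uniform signature, but each implementation must document and check the packing convention it expects.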




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38887410
  
Merged build finished. Some tests failed or tests have not started running 
yet.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38887411
  
Some tests failed or tests have not started running yet.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13536/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885625
  
 Merged build triggered. One or more automated tests failed.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885632
  
Merged build started. One or more automated tests failed.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885583
  
Merged build finished. One or more automated tests failed.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885584
  
One or more automated tests failed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-prb/2/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885516
  
Jenkins, this is okay to test. Jenkins, test this please.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885518
  
 Merged build triggered. One or more automated tests failed.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38885522
  
Merged build started. One or more automated tests failed.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11054196
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Trait for calculating information gain
+ */
+trait Impurity extends Serializable {
+
+  /**
+   * information calculation for binary classification
+   * @param c0 count of instances with label 0
+   * @param c1 count of instances with label 1
+   * @return information value
+   */
+  def calculate(c0 : Double, c1 : Double): Double
+
+  /**
+   * information calculation for regression
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value
+   */
+  def calculate(count: Double, sum: Double, sumSquares: Double): Double
--- End diff --

@manishamde I prefer `accumulate(Double)` and `get(): Double`. But if a 
single `calculate(Seq[Double])` won't run into memory issues, I'm okay with the 
change. Sorry for my late response!




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38881921
  
Can one of the admins verify this patch?




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-27 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11047729
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

@mengxr Are we in agreement here for the immediate fix for the PR? I think 
we all agree on the final version but wanted to make sure we have consensus for 
this PR before I implement your suggestion.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-26 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11001627
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

@mengxr Currently, the ```calculate``` method is used on bin-aggregated 
data after the ```aggregate``` operation, so this change won't require a 
partition-wise merge of impurities. It's just code restructuring, without 
modifying the implementation.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r11000162
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

@manishamde Do you need to merge two `Impurity` instances (from two 
partitions) then?




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-26 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10998536
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

@mengxr How about a simple temporary fix where we just have one 
```calculate``` method in the ```Impurity``` trait that takes a list of Doubles 
as a single parameter?

I would prefer to write the generic implementation we discussed, but that 
might take me a few more days and would delay the PR.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10965763
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

@manishamde As you mentioned, we can do partition-wise accumulation and 
then merge them without performance loss. It is okay to internally accumulate 
`sum` and `sumSquare` in `Variance` and use them to compute variance in this 
PR. But I think it is necessary to update the `Impurity` interface in this PR 
to make it more general, because it is a public trait.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-25 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10959264
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

I agree with @mengxr that the design needs to be made more generic for 
extensibility -- the signature collision is an issue we will encounter soon. As 
@mengxr mentioned, the impurity class should specify three methods: a) a method 
to calculate impurity (or an intermediate value) for a single label, b) the 
```binSeqOp``` method and c) the ```binCombOp``` method.

We will encounter a loss in performance if we calculate the stats per 
instance and then "merge" them, and a loss in precision (as @mengxr pointed out) 
if we calculate stats after accumulating large numbers. I wonder whether the 
sweet spot is calculating the stats (impurity) per partition, without a 
significant loss in either performance or precision.

The important question is whether this issue "blocks" this PR or belongs 
in a separate PR. I vote for postponing it to a future PR. We will soon 
encounter more loss functions during the ensemble implementations (GBT, for 
example), so it might be good to handle it then.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10957131
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

The major loss of precision is from `sumSquares - sum*sum/count`, where one 
large number is subtracted from another. Changing the interface to `(count, avg, 
avgSquare)` would help avoid overflow but does nothing for precision. I 
agree with @etrain that we can improve it in a future PR.
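A small numeric sketch of this cancellation (illustrative values, not MLlib code): with a large mean and a tiny spread, the naive formula loses essentially all significant digits, while an incremental Welford-style update recovers the exact variance.

```scala
// Two labels with true population variance 0.25 but a mean near 1e8.
val xs = Seq(1e8, 1e8 + 1.0)

// Naive: accumulate sum and sum of squares, then subtract.
// sumSq and sum*sum/count are both ~2e16, so their difference
// retains almost no correct bits.
val count = xs.size.toDouble
val sum = xs.sum
val sumSq = xs.map(x => x * x).sum
val naive = (sumSq - sum * sum / count) / count

// Welford: track the mean and the centered second moment incrementally.
var n = 0.0
var mean = 0.0
var m2 = 0.0
for (x <- xs) {
  n += 1.0
  val delta = x - mean
  mean += delta / n
  m2 += delta * (x - mean)
}
val stable = m2 / n
```

For this input, `stable` comes out to exactly 0.25, while `naive` is catastrophically off; note that `(count, avg, avgSquare)` alone would not fix this, which is the point above.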

The question is whether we should make `calculate(c0, c1)` and 
`calculate(count, sum, square)` public methods of `Impurity`. In either 
classification or regression, `Impurity` works like an accumulator. What we 
need to describe is how to process a label of type Double, how to merge two 
`Impurity` instances, and how to get the impurity from an instance, which is very 
similar to `StatCounter`. It is strange that Gini implements only the first 
method while Variance implements only the second. We probably need to 
reconsider the design here. For example, if we want to handle three classes in 
the future, we will run into a signature collision with 
`calculate(count, sum, squareSum)`.
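Sketched in code, such an accumulator (using Variance as the example; the class and method names are illustrative, not a proposed API) might look like:

```scala
// Sketch of an accumulator-style impurity, analogous to StatCounter:
// add labels one at a time, merge partial aggregates across partitions,
// and read off the impurity at the end.
class VarianceAggregator(
    var count: Double = 0.0,
    var sum: Double = 0.0,
    var sumSquares: Double = 0.0) extends Serializable {

  // Role of binSeqOp: fold one label into this partition's aggregate.
  def add(label: Double): this.type = {
    count += 1.0
    sum += label
    sumSquares += label * label
    this
  }

  // Role of binCombOp: merge the aggregate from another partition.
  def merge(other: VarianceAggregator): this.type = {
    count += other.count
    sum += other.sum
    sumSquares += other.sumSquares
    this
  }

  // Impurity (population variance) of all labels seen so far.
  def calculate(): Double =
    if (count == 0.0) 0.0 else (sumSquares - sum * sum / count) / count
}
```

Two "partitions" holding labels (1, 2) and (3, 4), merged, give a variance of 1.25.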




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-25 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10934747
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

I agree that overflow is an issue here (particularly for sumSquares), but I 
also agree with Manish/Hirakendu that this algorithm generates a tree in a 
reasonable amount of time precisely because we compute statistics per split and 
then merge them together.

I actually do think it makes sense to maintain "(count, average, 
averageSumSq)" for each partition in an overflow-friendly way and to compute the 
combination as a count-weighted average of both, as Hirakendu suggests. This 
will complicate the code but should solve the overflow problem and keep things 
pretty efficient.

That said, maybe this could be taken care of in a future PR as a bugfix, 
rather than in this one?




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-25 Thread hirakendu
Github user hirakendu commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10926671
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

I agree with Manish.

Numerical stability is the first thing that comes to mind on seeing a large 
`avg = sum/count` calculation. In practice, I haven't seen any significant 
difference in results, or any overflows, even with billion-sample datasets. 
Also, features in machine learning are typically normalized, so the dynamic 
range is small (bounded away from 0 and infinity).

We definitely cannot use the methods in DoubleRDDFunctions, because we want 
to calculate the variance of various splits, which requires the stats to be 
"aggregable". But we may be able to modify the APIs to use (count, avg, 
avgSquares) as the stats and make the calculations more stable. E.g., to merge 
the (count, avg) pairs of two parts, `(c1, a1)` and `(c2, a2)`, we would compute 
`(c1 + c2, a1 * (c1/(c1+c2)) + a2 * (c2/(c1+c2)))`. Not too keen on that 
change, but let me know if it works.





[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38529295
  
Merged build finished.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38529297
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13419/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38526805
  
Merged build started.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38526804
  
 Merged build triggered.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-24 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10916114
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

That's a nice observation. However, using the ```variance``` calculation 
in ```StatCounter``` might be slow, since the merge method recomputes ```n, mu, 
m2``` for each value. It also won't fit well with the ```binCombOp``` 
operation in the ```aggregate``` function. One could probably optimize the ```def 
merge(values: TraversableOnce[Double]): StatCounter``` method in the 
```Variance``` class with a batch or mini-batch update for both speed and 
precision, but that's a separate discussion.

I see your concern with computing ```sumSquares``` over a large fraction of 
the instances, and I think it will be best to leverage the ```def 
merge(other: StatCounter): StatCounter``` method. 

We can calculate a StatCounter per partition using ```count, sum, 
sumSquares``` and then merge during ```binCombOp``` for numerical stability. It 
won't be hard to implement. Let me know what you think.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10871626
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
--- End diff --

It is easy to lose precision or run into overflow in the computation of 
`sumSquares`. Is it only for computing the sample variance? If so, we can 
simplify this interface to accept the variance directly. We have a more stable 
implementation of variance computation in `DoubleRDDFunctions`.
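A sketch (mine, not from the PR) of the instability being pointed out: deriving the variance from `(count, sum, sumSquares)` subtracts two nearly equal quantities, while a Welford-style one-pass update, similar in spirit to the stable computation in `DoubleRDDFunctions`, avoids that cancellation.

```scala
// Sketch: why accumulating sumSquares can lose precision.
object VarianceSketch {
  // Naive: E[x^2] - E[x]^2, computed from running aggregates.
  // Catastrophic cancellation when the mean is large relative to the spread.
  def naiveVariance(xs: Seq[Double]): Double = {
    val count = xs.length.toDouble
    val sum = xs.sum
    val sumSquares = xs.map(x => x * x).sum
    sumSquares / count - math.pow(sum / count, 2)
  }

  // Welford's numerically stable one-pass algorithm.
  def stableVariance(xs: Seq[Double]): Double = {
    var n = 0L
    var mean = 0.0
    var m2 = 0.0
    for (x <- xs) {
      n += 1
      val delta = x - mean
      mean += delta / n
      m2 += delta * (x - mean)
    }
    m2 / n
  }

  def main(args: Array[String]): Unit = {
    // A large offset exposes the cancellation in the naive formula.
    val data = Seq(1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16)
    println(f"naive:  ${naiveVariance(data)}%.4f")
    println(f"stable: ${stableVariance(data)}%.4f") // true variance = 22.5
  }
}
```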




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10871549
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Variance.scala ---
@@ -0,0 +1,42 @@
+package org.apache.spark.mllib.tree.impurity
+
+import javax.naming.OperationNotSupportedException
--- End diff --

Should use `java.lang.UnsupportedOperationException` and organize imports.
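A minimal sketch of the suggested change (illustrative only; the actual `Variance` implementation is in the PR): `UnsupportedOperationException` lives in `java.lang`, so it needs no import, unlike the checked `javax.naming.OperationNotSupportedException` from the JNDI package.

```scala
// Sketch, assuming the Impurity trait's two calculate overloads.
object Variance {
  // The binary-classification overload is meaningless for a regression
  // impurity, so it fails fast with the idiomatic unchecked exception.
  def calculate(c0: Double, c1: Double): Double =
    throw new UnsupportedOperationException(
      "Variance.calculate for classification is not supported")

  // Regression overload: population variance from running aggregates.
  def calculate(count: Double, sum: Double, sumSquares: Double): Double = {
    val mean = sum / count
    sumSquares / count - mean * mean
  }
}
```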




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10871524
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -0,0 +1,48 @@
+package org.apache.spark.mllib.tree.impurity
+
+import javax.naming.OperationNotSupportedException
--- End diff --

Should use `java.lang.UnsupportedOperationException`.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10871491
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -0,0 +1,48 @@
+package org.apache.spark.mllib.tree.impurity
+
+import javax.naming.OperationNotSupportedException
+
+/**
+ * Class for calculating the 
[[http://en.wikipedia.org/wiki/Gini_coefficient Gini
+ * coefficient]] during binary classification
--- End diff --

This is the wiki page for `Gini coefficient`, which is different from `Gini 
impurity`. Should change to

http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity 
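For reference, a sketch (mine, not from the PR) of the quantity the corrected link describes: Gini impurity for a binary node, which measures how mixed the classes are, not income inequality as the Gini coefficient does.

```scala
// Gini impurity for binary classification: for class proportions p0 and p1,
// impurity = 1 - p0^2 - p1^2. Zero for a pure node, 0.5 when evenly mixed.
def giniImpurity(c0: Double, c1: Double): Double = {
  val total = c0 + c1
  if (total == 0) 0.0
  else {
    val p0 = c0 / total
    val p1 = c1 / total
    1.0 - p0 * p0 - p1 * p1
  }
}
// giniImpurity(50, 50) == 0.5  (maximally mixed)
// giniImpurity(100, 0) == 0.0  (pure node)
```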




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10871385
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -0,0 +1,1183 @@
+package org.apache.spark.mllib.tree
+
+import scala.util.control.Breaks._
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.tree.model._
+import org.apache.spark.{SparkContext, Logging}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.Split
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.impurity.{Variance, Entropy, Gini, 
Impurity}
+import java.util.Random
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * A class that implements a decision tree algorithm for classification 
and regression. It
+ * supports both continuous and categorical features.
+ * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+ * of algorithm (classification, regression, etc.), 
feature type (continuous,
+ * categorical),
+ * depth of the tree, quantile calculation strategy, etc.
+  */
+class DecisionTree private(val strategy: Strategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a decision tree model over an RDD
+   * @param input RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]] used as training data
+   *  for DecisionTree
+   * @return a DecisionTreeModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
+
+// Cache input RDD for speedup during multiple passes
+input.cache()
+logDebug("algo = " + strategy.algo)
+
+// Finding the splits and the corresponding bins (interval between the 
splits) using a sample
+// of the input data.
+val (splits, bins) = DecisionTree.findSplitsBins(input, strategy)
+logDebug("numSplits = " + bins(0).length)
+
+// Noting numBins for the input data
+strategy.numBins = bins(0).length
+
+// The depth of the decision tree
+val maxDepth = strategy.maxDepth
+// The max number of nodes possible given the depth of the tree
+val maxNumNodes = scala.math.pow(2, maxDepth).toInt - 1
+// Initializing an array to hold filters applied to points for each node
+val filters = new Array[List[Filter]](maxNumNodes)
+// The filter at the top node is an empty list
+filters(0) = List()
+// Initializing an array to hold parent impurity calculations for each 
node
+val parentImpurities = new Array[Double](maxNumNodes)
+// Dummy value for top node (updated during first split calculation)
+val nodes = new Array[Node](maxNumNodes)
+
+// The main idea here is to perform level-wise training of the 
decision tree nodes, thus
+// reducing the passes over the data from l to log2(l), where l is the 
total number of nodes.
+// Each data sample is checked for validity w.r.t. each node at a 
given level -- i.e.,
+// the sample is only used for the split calculation at a node if 
the sample would have
+// still survived the filters of the parent nodes.
+
+// TODO: Convert for loop to while loop
+breakable {
+  for (level <- 0 until maxDepth) {
+
+logDebug("#")
+logDebug("level = " + level)
+logDebug("#")
+
+// Find best split for all nodes at a level
+val splitsStatsForLevel = DecisionTree.
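The pass-count saving described in the quoted comment can be sketched as follows (illustrative only; it assumes, as in the quoted code, that a complete tree of depth d has at most 2^d - 1 nodes):

```scala
// Level-wise training visits the data once per level rather than once per
// node, so a tree of depth d needs d passes instead of up to 2^d - 1.
def passesPerStrategy(maxDepth: Int): (Int, Int) = {
  val numNodes = math.pow(2, maxDepth).toInt - 1
  val nodeWise = numNodes   // one pass over the data per node
  val levelWise = maxDepth  // one pass over the data per level
  (nodeWise, levelWise)
}
// passesPerStrategy(10) == (1023, 10)
```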

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38407632
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13378/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38407631
  
Merged build finished.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38407348
  
Merged build started.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38407347
  
 Merged build triggered.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10871098
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38377451
  
Merged build finished.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38377452
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13363/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38376590
  
Merged build started.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38376589
  
 Merged build triggered.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38376567
  
Merged build finished.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38376569
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13361/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10866028
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38375820
  
Merged build finished.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38375821
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13358/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38375816
  
Merged build started.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38375815
  
 Merged build triggered.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10865951
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10865926
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -0,0 +1,1183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.util.control.Breaks._
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.tree.model._
+import org.apache.spark.{SparkContext, Logging}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.Split
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+import org.apache.spark.mllib.tree.configuration.FeatureType._
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.impurity.{Variance, Entropy, Gini, Impurity}
+import java.util.Random
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * A class that implements a decision tree algorithm for classification and
+ * regression. It supports both continuous and categorical features.
+ * @param strategy The configuration parameters for the tree algorithm, which
+ *                 specify the type of algorithm (classification, regression,
+ *                 etc.), feature type (continuous, categorical), depth of the
+ *                 tree, quantile calculation strategy, etc.
+ */
+class DecisionTree private(val strategy: Strategy) extends Serializable with Logging {
+
+  /**
+   * Method to train a decision tree model over an RDD.
+   * @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   *              used as training data for DecisionTree
+   * @return a DecisionTreeModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
+
+// Cache input RDD for speedup during multiple passes
+input.cache()
+logDebug("algo = " + strategy.algo)
+
+// Find the splits and the corresponding bins (intervals between the splits)
+// using a sample of the input data.
+val (splits, bins) = DecisionTree.findSplitsBins(input, strategy)
+logDebug("numSplits = " + bins(0).length)
+
+// Note numBins for the input data
+strategy.numBins = bins(0).length
+
+// The depth of the decision tree
+val maxDepth = strategy.maxDepth
+// The maximum number of nodes possible given the depth of the tree
+val maxNumNodes = scala.math.pow(2, maxDepth).toInt - 1
+// Initialize an array to hold filters applied to points for each node
+val filters = new Array[List[Filter]](maxNumNodes)
+// The filter at the top node is an empty list
+filters(0) = List()
+// Initialize an array to hold parent impurity calculations for each node
+val parentImpurities = new Array[Double](maxNumNodes)
+// Dummy value for the top node (updated during the first split calculation)
+val nodes = new Array[Node](maxNumNodes)
+
+// The main idea here is to perform level-wise training of the decision tree
+// nodes, thus reducing the passes over the data from l to log2(l), where l
+// is the total number of nodes. Each data sample is checked for validity
+// w.r.t. each node at a given level -- i.e., the sample is only used for the
+// split calculation at a node if it would have survived the filters of the
+// parent nodes.
+
+// TODO: Convert for loop to while loop
+breakable {
+  for (level <- 0 until maxDepth) {
+
+logDebug("#")
+logDebug("level = " + level)
+logDebug("#")
+
+// Find best split for all nodes at a level
+val splitsStatsForLevel = DecisionTree.
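The "survived the filters of the parent nodes" check described in the comment above can be sketched as follows, using simplified stand-in types rather than the PR's actual `Filter` and `Split` classes:

```scala
// Hypothetical, simplified stand-ins for the PR's Split and Filter types.
case class SimpleSplit(feature: Int, threshold: Double)
// comparison < 0 means the point must fall on the left (<=) side of the split
case class SimpleFilter(split: SimpleSplit, comparison: Int)

object FilterCheck {
  // A sample contributes to split statistics at a node only if it satisfies
  // every filter accumulated along the path from the root to that node.
  def survivesFilters(features: Array[Double], filters: List[SimpleFilter]): Boolean =
    filters.forall { f =>
      val onLeft = features(f.split.feature) <= f.split.threshold
      if (f.comparison < 0) onLeft else !onLeft
    }

  def main(args: Array[String]): Unit = {
    val path = List(
      SimpleFilter(SimpleSplit(0, 5.0), -1), // went left at the root
      SimpleFilter(SimpleSplit(1, 2.0), 1)   // then right at its child
    )
    assert(survivesFilters(Array(4.0, 3.0), path))  // satisfies both filters
    assert(!survivesFilters(Array(6.0, 3.0), path)) // fails the root filter
  }
}
```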

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-22 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10865814
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
(same DecisionTree.scala excerpt as quoted above)

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-22 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10865809
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
(same DecisionTree.scala excerpt as quoted above)

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-22 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/79#discussion_r10865810
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
(same DecisionTree.scala excerpt as quoted above)

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-22 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-38375199
  
@mengxr Thanks a lot! LGTM. Merged.



