[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63777609
  
@davies We updated the `RandomForest` API in #3374. Now `RandomForest` 
returns a `RandomForestModel`. Could you rebase and update this PR? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63853606
  
@mengxr done.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63856051
  
  [Test build #23677 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23677/consoleFull)
 for   PR 3320 at commit 
[`e0df852`](https://github.com/apache/spark/commit/e0df852ab4f353b9f800fe5374195fee5a06aa52).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63869584
  
  [Test build #23677 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23677/consoleFull)
 for   PR 3320 at commit 
[`e0df852`](https://github.com/apache/spark/commit/e0df852ab4f353b9f800fe5374195fee5a06aa52).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63869599
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23677/





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20676259
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class RandomForestModel(JavaModelWrapper):
+
+Represents a random forest model.
+
+EXPERIMENTAL: This is an experimental API.
+  It will probably be modified in future.
+
+def predict(self, x):
+
+Predict values for a single data point or an RDD of points using
+the model trained.
+
+if isinstance(x, RDD):
+return self.call(predict, x.map(_convert_to_vector))
+
+else:
+return self.call(predict, _convert_to_vector(x))
+
+def numTrees(self):
+
+Get number of trees in forest.
+
+return self.call(numTrees)
+
+def totalNumNodes(self):
+
+Get total number of nodes, summed over all trees in the forest.
+
+return self.call(totalNumNodes)
+
+def __repr__(self):
+ Summary of model 
+return self._java_model.toString()
+
+def toDebugString(self):
+ Full model 
+return self._java_model.toDebugString()
+
+
+class RandomForest(object):
+
+Learning algorithm for a random forest model for classification or 
regression.
+
+EXPERIMENTAL: This is an experimental API.
+  It will probably be modified in future.
+
+
+supportedFeatureSubsetStrategies = (auto, all, sqrt, log2, 
onethird)
+
+@classmethod
+def _train(cls, data, type, numClasses, features, impurity, maxDepth, 
maxBins,
+   numTrees, featureSubsetStrategy, seed):
+first = data.first()
+assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
+if featureSubsetStrategy not in cls.supportedFeatureSubsetStrategies:
+raise ValueError("unsupported featureSubsetStrategy: %s" % featureSubsetStrategy)
+if seed is None:
+seed = random.randint(0, 1 << 30)
+model = callMLlibFunc("trainRandomForestModel", data, type, numClasses, features,
+  impurity, maxDepth, maxBins, numTrees, featureSubsetStrategy, seed)
+return RandomForestModel(model)
+
+@classmethod
+def trainClassifier(cls, data, numClassesForClassification, 
categoricalFeaturesInfo, numTrees,
+featureSubsetStrategy=auto, impurity=gini, 
maxDepth=4, maxBins=32,
+seed=None):
+
+Method to train a decision tree model for binary or multiclass
+classification.
+
+:param data: Training dataset: RDD of LabeledPoint. Labels should 
take
+   values {0, 1, ..., numClasses-1}.
+:param numClassesForClassification: number of classes for 
classification.
+:param categoricalFeaturesInfo: Map storing arity of categorical 
features.
+   E.g., an entry (n -> k) indicates that feature n is 
categorical
+   with k categories indexed from 0: {0, 1, ..., k-1}.
+:param numTrees: Number of trees in the random forest.
+:param featureSubsetStrategy: Number of features to consider for 
splits at
+   each node.
+   Supported: auto (default), all, sqrt, log2, 
onethird.
+   If auto is set, this parameter is set based on numTrees:
+   if numTrees == 1, set to all;
+   if numTrees > 1 (forest) set to sqrt for classification 
and to
--- End diff --

could just state default for classification
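
A minimal sketch of the "auto" rule that the quoted docstring describes, written as a standalone Python function. The helper name `resolve_feature_subset_strategy` and its `is_classification` flag are hypothetical and not part of this PR; in Spark the resolution is done on the JVM side, so this only illustrates the documented behaviour.

```python
# Strategy names listed in the docstring of the quoted diff.
SUPPORTED_STRATEGIES = ("auto", "all", "sqrt", "log2", "onethird")


def resolve_feature_subset_strategy(strategy, num_trees, is_classification):
    """Hypothetical helper: turn "auto" into a concrete strategy.

    Follows the docstring: a single tree uses all features; a forest uses
    sqrt for classification and onethird for regression.
    """
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError("unsupported featureSubsetStrategy: %s" % strategy)
    if strategy != "auto":
        # An explicit choice is passed through unchanged.
        return strategy
    if num_trees == 1:
        return "all"
    return "sqrt" if is_classification else "onethird"
```

Since `trainClassifier` only ever takes the classification branch, its docstring could state just the `sqrt` default, which appears to be what the comment above suggests.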





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20676265
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class RandomForestModel(JavaModelWrapper):
+
+Represents a random forest model.
+
+EXPERIMENTAL: This is an experimental API.
+  It will probably be modified in future.
+
+def predict(self, x):
+
+Predict values for a single data point or an RDD of points using
+the model trained.
+
+if isinstance(x, RDD):
+return self.call(predict, x.map(_convert_to_vector))
+
+else:
+return self.call(predict, _convert_to_vector(x))
+
+def numTrees(self):
+
+Get number of trees in forest.
+
+return self.call(numTrees)
+
+def totalNumNodes(self):
+
+Get total number of nodes, summed over all trees in the forest.
+
+return self.call(totalNumNodes)
+
+def __repr__(self):
+ Summary of model 
+return self._java_model.toString()
+
+def toDebugString(self):
+ Full model 
+return self._java_model.toDebugString()
+
+
+class RandomForest(object):
+
+Learning algorithm for a random forest model for classification or 
regression.
+
+EXPERIMENTAL: This is an experimental API.
+  It will probably be modified in future.
+
+
+supportedFeatureSubsetStrategies = (auto, all, sqrt, log2, 
onethird)
+
+@classmethod
+def _train(cls, data, type, numClasses, features, impurity, maxDepth, 
maxBins,
--- End diff --

type -> algo
features -> categoricalFeaturesInfo
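
A rough sketch of what the renamed parameters (`type` to `algo`, `features` to `categoricalFeaturesInfo`) could look like. `build_train_args` is a hypothetical stand-in for `_train` that needs no Spark context: it validates its inputs, fills in a default seed the way the quoted diff does, and returns the argument tuple instead of calling into the JVM.

```python
import random

SUPPORTED_STRATEGIES = ("auto", "all", "sqrt", "log2", "onethird")


def build_train_args(algo, numClasses, categoricalFeaturesInfo, impurity,
                     maxDepth, maxBins, numTrees, featureSubsetStrategy,
                     seed=None):
    """Hypothetical stand-in for _train, using the reviewer's names."""
    if algo not in ("classification", "regression"):
        raise ValueError("unsupported algo: %s" % algo)
    if featureSubsetStrategy not in SUPPORTED_STRATEGIES:
        raise ValueError("unsupported featureSubsetStrategy: %s"
                         % featureSubsetStrategy)
    if seed is None:
        # Same default as in the quoted diff: a random seed in [0, 2**30].
        seed = random.randint(0, 1 << 30)
    return (algo, numClasses, dict(categoricalFeaturesInfo), impurity,
            maxDepth, maxBins, numTrees, featureSubsetStrategy, seed)
```

Avoiding `type` as a parameter name also stops it from shadowing the Python builtin inside the function body.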





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20676263
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class RandomForestModel(JavaModelWrapper):
+
+Represents a random forest model.
+
+EXPERIMENTAL: This is an experimental API.
+  It will probably be modified in future.
+
+def predict(self, x):
+
+Predict values for a single data point or an RDD of points using
+the model trained.
+
+if isinstance(x, RDD):
+return self.call(predict, x.map(_convert_to_vector))
+
+else:
+return self.call(predict, _convert_to_vector(x))
+
+def numTrees(self):
+
+Get number of trees in forest.
+
+return self.call(numTrees)
+
+def totalNumNodes(self):
+
+Get total number of nodes, summed over all trees in the forest.
+
+return self.call(totalNumNodes)
+
+def __repr__(self):
+ Summary of model 
+return self._java_model.toString()
+
+def toDebugString(self):
+ Full model 
+return self._java_model.toDebugString()
+
+
+class RandomForest(object):
+
+Learning algorithm for a random forest model for classification or 
regression.
+
+EXPERIMENTAL: This is an experimental API.
+  It will probably be modified in future.
+
+
+supportedFeatureSubsetStrategies = (auto, all, sqrt, log2, 
onethird)
+
+@classmethod
+def _train(cls, data, type, numClasses, features, impurity, maxDepth, 
maxBins,
+   numTrees, featureSubsetStrategy, seed):
+first = data.first()
+assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
+if featureSubsetStrategy not in cls.supportedFeatureSubsetStrategies:
+raise ValueError("unsupported featureSubsetStrategy: %s" % featureSubsetStrategy)
+if seed is None:
+seed = random.randint(0, 1 << 30)
+model = callMLlibFunc("trainRandomForestModel", data, type, numClasses, features,
+  impurity, maxDepth, maxBins, numTrees, featureSubsetStrategy, seed)
+return RandomForestModel(model)
+
+@classmethod
+def trainClassifier(cls, data, numClassesForClassification, 
categoricalFeaturesInfo, numTrees,
+featureSubsetStrategy=auto, impurity=gini, 
maxDepth=4, maxBins=32,
+seed=None):
+
+Method to train a decision tree model for binary or multiclass
+classification.
+
+:param data: Training dataset: RDD of LabeledPoint. Labels should 
take
+   values {0, 1, ..., numClasses-1}.
+:param numClassesForClassification: number of classes for 
classification.
+:param categoricalFeaturesInfo: Map storing arity of categorical 
features.
+   E.g., an entry (n -> k) indicates that feature n is 
categorical
+   with k categories indexed from 0: {0, 1, ..., k-1}.
+:param numTrees: Number of trees in the random forest.
+:param featureSubsetStrategy: Number of features to consider for 
splits at
+   each node.
+   Supported: auto (default), all, sqrt, log2, 
onethird.
+   If auto is set, this parameter is set based on numTrees:
+   if numTrees == 1, set to all;
+   if numTrees > 1 (forest) set to sqrt for classification 
and to
+   onethird for regression.
+:param impurity: Criterion used for information gain calculation.
+   Supported values: gini (recommended) or entropy.
+:param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 
leaf node;
+   depth 1 means 1 internal node + 2 leaf nodes. (default: 4)
+:param maxBins: maximum number of bins used for splitting features
+   (default: 100)
+:param seed: Random seed for bootstrapping and choosing feature 
subsets.
+:return: RandomForestModel 

[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63877350
  
@davies Thanks for adding this API!  I made a few small comments.  Other 
than those, LGTM





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63880444
  
@jkbradley done.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63880964
  
  [Test build #23684 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23684/consoleFull)
 for   PR 3320 at commit 
[`8003dfc`](https://github.com/apache/spark/commit/8003dfc674fedeca520cfceaa6e48845cd5138be).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63881779
  
LGTM





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63893839
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23684/





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63893828
  
  [Test build #23684 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23684/consoleFull)
 for   PR 3320 at commit 
[`8003dfc`](https://github.com/apache/spark/commit/8003dfc674fedeca520cfceaa6e48845cd5138be).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class RandomForestModel(JavaModelWrapper):`
  * `class RandomForest(object):`






[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63899911
  
Merged into master and branch-1.2. Thanks!





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63900346
  
Thanks a lot @davies 





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread davies
Github user davies closed the pull request at:

https://github.com/apache/spark/pull/3320





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20591273
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

I started having second thoughts about this too.  I vote for having 
WeightedEnsembleModel be internal, and having it extended by each algorithm's 
particular model.  That will allow the most consistency with the new API.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-19 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20618709
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

Is this ready to go?





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-19 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20619421
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

Should we also use `RandomForestModel` in scala/java API?





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20619737
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, 
categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, 
minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

Yes, if we decide to go down this route of using a new model class per 
algo. I will defer this choice to @mengxr and @jkbradley since I am not 
well-versed with the new MLlib api to understand the tradeoffs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63364920
  
@JoshRosen corrected, thanks! I had these mistakes many times :-( 





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63371418
  
  [Test build #23487 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23487/consoleFull)
 for   PR 3320 at commit 
[`565d476`](https://github.com/apache/spark/commit/565d47627953bd5e420b81d48a9a80afe4e6f66b).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63371428
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23487/





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63390282
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23495/





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63390276
  
  [Test build #23495 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23495/consoleFull)
 for   PR 3320 at commit 
[`89a000f`](https://github.com/apache/spark/commit/89a000fd8e6e15c2ba83d702bbe2f294727f0a4d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class WeightedEnsembleModel(JavaModelWrapper):`
  * `class RandomForest(object):`






[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20479562
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

This class is more general than `RandomForestModel`. @manishamde @jkbradley 
Do we want `RandomForest` returning `RandomForestModel` that extends 
`WeightedEnsembleModel`, or simply rename `WeightedEnsembleModel` to 
`TreeEnsembleModel`? The implementation is firmly attached to trees.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20479607
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -465,6 +465,40 @@ class PythonMLLibAPI extends Serializable {
   }
 
   /**
+   * Java stub for Python mllib RandomForest.train().
+   * This stub returns a handle to the Java object instead of the content of the Java object.
+   * Extra care needs to be taken in the Python code to ensure it gets freed on exit;
+   * see the Py4J documentation.
+   */
+  def trainRandomForestModel(
+data: JavaRDD[LabeledPoint],
--- End diff --

4-space indentation





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20479768
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

@mengxr The idea was that ```WeightedEnsembleModel``` would also support 
non-tree-based weak learners for boosting. I don't have a strong 
preference either way. 

Maybe we could also remove the prefix ```Weighted``` from 
```WeightedEnsembleModel``` to keep the name simple.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20480563
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

`WeightedEnsembleModel` is under `mllib.tree`, and its public methods like 
`weakHypotheses`, `toString`, `toDebugTree`, `numWeakHypotheses`, and 
`totalNumNodes` are documented or implemented under the assumption of trees. 
I feel `TreeEnsembleModel` would be more appropriate here.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20481058
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

I have changed this to use `RandomForestModel` as the public interface and 
made this class internal, so we can rename it later (the name should also be 
changed in Scala).





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63410715
  
  [Test build #23527 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23527/consoleFull)
 for   PR 3320 at commit 
[`dae7fc0`](https://github.com/apache/spark/commit/dae7fc01d1df78e3d4f9a18b90ed553eff48edaa).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3320#discussion_r20481177
  
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
  model.predict(rdd).collect()
 [1.0, 0.0]
 
-return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo,
-   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+return cls._train(data, regression, 0, categoricalFeaturesInfo,
+  impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --

I am fine with the ```TreeEnsembleModel``` name with the understanding that 
we may need a new model class once we create a new namespace like 
```mllib.ensemble``` in the future. @jkbradley Thoughts?





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63411351
  
  [Test build #23528 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23528/consoleFull)
 for   PR 3320 at commit 
[`885abee`](https://github.com/apache/spark/commit/885abee042bb64771f53dab7814ed914a68b62a1).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63416979
  
  [Test build #23527 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23527/consoleFull)
 for   PR 3320 at commit 
[`dae7fc0`](https://github.com/apache/spark/commit/dae7fc01d1df78e3d4f9a18b90ed553eff48edaa).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63416983
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23527/
Test PASSed.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63418194
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23528/
Test PASSed.





[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3320#issuecomment-63418185
  
  [Test build #23528 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23528/consoleFull)
 for   PR 3320 at commit 
[`885abee`](https://github.com/apache/spark/commit/885abee042bb64771f53dab7814ed914a68b62a1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class WeightedEnsembleModel(JavaModelWrapper):`
  * `class RandomForestModel(WeightedEnsembleModel):`
  * `class RandomForest(object):`


