[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63777609 @davies We updated the `RandomForest` API in #3374 . Now `RandomForest` returns a `RandomForestModel`. Could you rebase and update this PR? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63853606 @mengxr done.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63856051 [Test build #23677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23677/consoleFull) for PR 3320 at commit [`e0df852`](https://github.com/apache/spark/commit/e0df852ab4f353b9f800fe5374195fee5a06aa52). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63869584 [Test build #23677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23677/consoleFull) for PR 3320 at commit [`e0df852`](https://github.com/apache/spark/commit/e0df852ab4f353b9f800fe5374195fee5a06aa52). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63869599 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23677/
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20676259
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class RandomForestModel(JavaModelWrapper):
+    """
+    Represents a random forest model.
+
+    EXPERIMENTAL: This is an experimental API.
+                  It will probably be modified in future.
+    """
+    def predict(self, x):
+        """
+        Predict values for a single data point or an RDD of points using
+        the model trained.
+        """
+        if isinstance(x, RDD):
+            return self.call("predict", x.map(_convert_to_vector))
+
+        else:
+            return self.call("predict", _convert_to_vector(x))
+
+    def numTrees(self):
+        """
+        Get number of trees in forest.
+        """
+        return self.call("numTrees")
+
+    def totalNumNodes(self):
+        """
+        Get total number of nodes, summed over all trees in the forest.
+        """
+        return self.call("totalNumNodes")
+
+    def __repr__(self):
+        """ Summary of model """
+        return self._java_model.toString()
+
+    def toDebugString(self):
+        """ Full model """
+        return self._java_model.toDebugString()
+
+
+class RandomForest(object):
+    """
+    Learning algorithm for a random forest model for classification or regression.
+
+    EXPERIMENTAL: This is an experimental API.
+                  It will probably be modified in future.
+    """
+
+    supportedFeatureSubsetStrategies = ("auto", "all", "sqrt", "log2", "onethird")
+
+    @classmethod
+    def _train(cls, data, type, numClasses, features, impurity, maxDepth, maxBins,
+               numTrees, featureSubsetStrategy, seed):
+        first = data.first()
+        assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
+        if featureSubsetStrategy not in cls.supportedFeatureSubsetStrategies:
+            raise ValueError("unsupported featureSubsetStrategy: %s" % featureSubsetStrategy)
+        if seed is None:
+            seed = random.randint(0, 1 << 30)
+        model = callMLlibFunc("trainRandomForestModel", data, type, numClasses, features,
+                              impurity, maxDepth, maxBins, numTrees, featureSubsetStrategy, seed)
+        return RandomForestModel(model)
+
+    @classmethod
+    def trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees,
+                        featureSubsetStrategy="auto", impurity="gini", maxDepth=4, maxBins=32,
+                        seed=None):
+        """
+        Method to train a decision tree model for binary or multiclass
+        classification.
+
+        :param data: Training dataset: RDD of LabeledPoint. Labels should take
+               values {0, 1, ..., numClasses-1}.
+        :param numClassesForClassification: number of classes for classification.
+        :param categoricalFeaturesInfo: Map storing arity of categorical features.
+               E.g., an entry (n -> k) indicates that feature n is categorical
+               with k categories indexed from 0: {0, 1, ..., k-1}.
+        :param numTrees: Number of trees in the random forest.
+        :param featureSubsetStrategy: Number of features to consider for splits at
+               each node.
+               Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+               If "auto" is set, this parameter is set based on numTrees:
+               if numTrees == 1, set to "all";
+               if numTrees > 1 (forest) set to "sqrt" for classification and to
--- End diff --
could just state default for classification
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20676265
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class RandomForestModel(JavaModelWrapper):
+    """
+    Represents a random forest model.
+
+    EXPERIMENTAL: This is an experimental API.
+                  It will probably be modified in future.
+    """
+    def predict(self, x):
+        """
+        Predict values for a single data point or an RDD of points using
+        the model trained.
+        """
+        if isinstance(x, RDD):
+            return self.call("predict", x.map(_convert_to_vector))
+
+        else:
+            return self.call("predict", _convert_to_vector(x))
+
+    def numTrees(self):
+        """
+        Get number of trees in forest.
+        """
+        return self.call("numTrees")
+
+    def totalNumNodes(self):
+        """
+        Get total number of nodes, summed over all trees in the forest.
+        """
+        return self.call("totalNumNodes")
+
+    def __repr__(self):
+        """ Summary of model """
+        return self._java_model.toString()
+
+    def toDebugString(self):
+        """ Full model """
+        return self._java_model.toDebugString()
+
+
+class RandomForest(object):
+    """
+    Learning algorithm for a random forest model for classification or regression.
+
+    EXPERIMENTAL: This is an experimental API.
+                  It will probably be modified in future.
+    """
+
+    supportedFeatureSubsetStrategies = ("auto", "all", "sqrt", "log2", "onethird")
+
+    @classmethod
+    def _train(cls, data, type, numClasses, features, impurity, maxDepth, maxBins,
--- End diff --
type --> algo
features --> categoricalFeaturesInfo
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20676263
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class RandomForestModel(JavaModelWrapper):
+    """
+    Represents a random forest model.
+
+    EXPERIMENTAL: This is an experimental API.
+                  It will probably be modified in future.
+    """
+    def predict(self, x):
+        """
+        Predict values for a single data point or an RDD of points using
+        the model trained.
+        """
+        if isinstance(x, RDD):
+            return self.call("predict", x.map(_convert_to_vector))
+
+        else:
+            return self.call("predict", _convert_to_vector(x))
+
+    def numTrees(self):
+        """
+        Get number of trees in forest.
+        """
+        return self.call("numTrees")
+
+    def totalNumNodes(self):
+        """
+        Get total number of nodes, summed over all trees in the forest.
+        """
+        return self.call("totalNumNodes")
+
+    def __repr__(self):
+        """ Summary of model """
+        return self._java_model.toString()
+
+    def toDebugString(self):
+        """ Full model """
+        return self._java_model.toDebugString()
+
+
+class RandomForest(object):
+    """
+    Learning algorithm for a random forest model for classification or regression.
+
+    EXPERIMENTAL: This is an experimental API.
+                  It will probably be modified in future.
+    """
+
+    supportedFeatureSubsetStrategies = ("auto", "all", "sqrt", "log2", "onethird")
+
+    @classmethod
+    def _train(cls, data, type, numClasses, features, impurity, maxDepth, maxBins,
+               numTrees, featureSubsetStrategy, seed):
+        first = data.first()
+        assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
+        if featureSubsetStrategy not in cls.supportedFeatureSubsetStrategies:
+            raise ValueError("unsupported featureSubsetStrategy: %s" % featureSubsetStrategy)
+        if seed is None:
+            seed = random.randint(0, 1 << 30)
+        model = callMLlibFunc("trainRandomForestModel", data, type, numClasses, features,
+                              impurity, maxDepth, maxBins, numTrees, featureSubsetStrategy, seed)
+        return RandomForestModel(model)
+
+    @classmethod
+    def trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees,
+                        featureSubsetStrategy="auto", impurity="gini", maxDepth=4, maxBins=32,
+                        seed=None):
+        """
+        Method to train a decision tree model for binary or multiclass
+        classification.
+
+        :param data: Training dataset: RDD of LabeledPoint. Labels should take
+               values {0, 1, ..., numClasses-1}.
+        :param numClassesForClassification: number of classes for classification.
+        :param categoricalFeaturesInfo: Map storing arity of categorical features.
+               E.g., an entry (n -> k) indicates that feature n is categorical
+               with k categories indexed from 0: {0, 1, ..., k-1}.
+        :param numTrees: Number of trees in the random forest.
+        :param featureSubsetStrategy: Number of features to consider for splits at
+               each node.
+               Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+               If "auto" is set, this parameter is set based on numTrees:
+               if numTrees == 1, set to "all";
+               if numTrees > 1 (forest) set to "sqrt" for classification and to
+               "onethird" for regression.
+        :param impurity: Criterion used for information gain calculation.
+               Supported values: "gini" (recommended) or "entropy".
+        :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node;
+               depth 1 means 1 internal node + 2 leaf nodes. (default: 4)
+        :param maxBins: maximum number of bins used for splitting features
+               (default: 100)
+        :param seed: Random seed for bootstrapping and choosing feature subsets.
+        :return: RandomForestModel
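The `featureSubsetStrategy` rules in the docstring quoted above can be read as a small decision function. The following is only a sketch of that reading, not the code in the PR: the helper name `resolve_feature_subset_strategy` and its `is_classification` flag are hypothetical, and in Spark the `"auto"` resolution may well happen on the Scala side rather than in Python.

```python
# Hypothetical helper illustrating the docstring's "auto" rule; not part of PySpark.
SUPPORTED_STRATEGIES = ("auto", "all", "sqrt", "log2", "onethird")


def resolve_feature_subset_strategy(strategy, num_trees, is_classification):
    """Return the concrete strategy that a given setting stands for."""
    if strategy not in SUPPORTED_STRATEGIES:
        # Same check and message as the _train method quoted in the diff.
        raise ValueError("unsupported featureSubsetStrategy: %s" % strategy)
    if strategy != "auto":
        # Explicit choices are passed through unchanged.
        return strategy
    if num_trees == 1:
        # A single tree considers every feature at each split.
        return "all"
    # A forest: sqrt for classification, onethird for regression.
    return "sqrt" if is_classification else "onethird"
```

For example, `resolve_feature_subset_strategy("auto", 10, True)` gives `"sqrt"`, while the same call with `is_classification=False` gives `"onethird"`.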
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63877350 @davies Thanks for adding this API! I made a few small comments. Other than those, LGTM
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63880444 @jkbradley done.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63880964 [Test build #23684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23684/consoleFull) for PR 3320 at commit [`8003dfc`](https://github.com/apache/spark/commit/8003dfc674fedeca520cfceaa6e48845cd5138be). * This patch merges cleanly.
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63881779 LGTM
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63893839 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23684/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63893828 [Test build #23684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23684/consoleFull) for PR 3320 at commit [`8003dfc`](https://github.com/apache/spark/commit/8003dfc674fedeca520cfceaa6e48845cd5138be). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class RandomForestModel(JavaModelWrapper):` * `class RandomForest(object):`
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63899911 Merged into master and branch-1.2. Thanks!
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63900346 Thanks a lot @davies
Github user davies closed the pull request at: https://github.com/apache/spark/pull/3320
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20591273
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --
I started having second thoughts about this too. I vote for having WeightedEnsembleModel be internal, and having it extended by each algorithm's particular model. That will allow the most consistency with the new API.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20618709
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --
Is this ready to go?
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20619421
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --
Should we also use `RandomForestModel` in the Scala/Java API?
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20619737
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --
Yes, if we decide to go down this route of using a new model class per algo. I will defer this choice to @mengxr and @jkbradley, since I am not well-versed enough in the new MLlib API to understand the tradeoffs.
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63364920 @JoshRosen corrected, thanks! I have made these mistakes many times :-(
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63371418 [Test build #23487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23487/consoleFull) for PR 3320 at commit [`565d476`](https://github.com/apache/spark/commit/565d47627953bd5e420b81d48a9a80afe4e6f66b). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63371428 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23487/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63390282 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23495/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63390276 [Test build #23495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23495/consoleFull) for PR 3320 at commit [`89a000f`](https://github.com/apache/spark/commit/89a000fd8e6e15c2ba83d702bbe2f294727f0a4d). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class WeightedEnsembleModel(JavaModelWrapper):` * `class RandomForest(object):`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20479562
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return DecisionTree._train(data, "regression", 0, categoricalFeaturesInfo,
-                                   impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+        return cls._train(data, "regression", 0, categoricalFeaturesInfo,
+                          impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
+
+
+class WeightedEnsembleModel(JavaModelWrapper):
--- End diff --
This class is more general than `RandomForestModel`. @manishamde @jkbradley Do we want `RandomForest` returning `RandomForestModel` that extends `WeightedEnsembleModel`, or simply rename `WeightedEnsembleModel` to `TreeEnsembleModel`? The implementation is firmly attached to trees.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20479607
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -465,6 +465,40 @@ class PythonMLLibAPI extends Serializable {
   }

   /**
+   * Java stub for Python mllib RandomForest.train().
+   * This stub returns a handle to the Java object instead of the content of the Java object.
+   * Extra care needs to be taken in the Python code to ensure it gets freed on exit;
+   * see the Py4J documentation.
+   */
+  def trainRandomForestModel(
+    data: JavaRDD[LabeledPoint],
--- End diff --
4-space indentation
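The Scaladoc in this hunk notes that the stub returns a handle to the Java object and that the Python side must make sure it gets freed. A minimal sketch of that idea follows; it assumes a gateway object exposing a `detach(java_object)` method in the style of Py4J's `JavaGateway`, and both `FakeGateway` and `HandleWrapper` are hypothetical names used only so the example runs without a JVM.

```python
class FakeGateway(object):
    """Stand-in for a Py4J gateway; records which handles were released."""

    def __init__(self):
        self.detached = []

    def detach(self, java_object):
        self.detached.append(java_object)


class HandleWrapper(object):
    """Owns a JVM-side handle and releases it exactly once."""

    def __init__(self, gateway, java_model):
        self._gateway = gateway
        self._java_model = java_model

    def close(self):
        # Idempotent: a second call finds no handle left to release.
        if self._java_model is not None:
            self._gateway.detach(self._java_model)
            self._java_model = None

    def __del__(self):
        # Best-effort cleanup when the Python wrapper is garbage collected.
        self.close()
```

Relying on `__del__` alone leaves the release time up to the garbage collector, which is why the sketch also offers an explicit `close()`.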
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20479768 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect() [1.0, 0.0] -return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo, - impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) +return cls._train(data, regression, 0, categoricalFeaturesInfo, + impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) + + +class WeightedEnsembleModel(JavaModelWrapper): --- End diff -- @mengxr The idea was that ```WeightedEnsembleModel``` would also support non-tree-based weak learners for boosting. I don't have a strong preference either way. Maybe we could also remove the ```Weighted``` prefix from ```WeightedEnsembleModel``` to keep the name simple.
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20480563 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect() [1.0, 0.0] -return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo, - impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) +return cls._train(data, regression, 0, categoricalFeaturesInfo, + impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) + + +class WeightedEnsembleModel(JavaModelWrapper): --- End diff -- `WeightedEnsembleModel` is under `mllib.tree` and its public methods like `weakHypotheses`, `toString`, `toDebugTree`, `numWeakHypotheses`, and `totalNumNodes` are documented or implemented under the assumption of trees. I feel `TreeEnsembleModel` would be more appropriate here.
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20481058 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect() [1.0, 0.0] -return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo, - impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) +return cls._train(data, regression, 0, categoricalFeaturesInfo, + impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) + + +class WeightedEnsembleModel(JavaModelWrapper): --- End diff -- I have changed this to use `RandomForestModel` as the public interface and made this class internal, so we can rename it later (the name in Scala should change as well).
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63410715 [Test build #23527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23527/consoleFull) for PR 3320 at commit [`dae7fc0`](https://github.com/apache/spark/commit/dae7fc01d1df78e3d4f9a18b90ed553eff48edaa). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20481177 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect() [1.0, 0.0] -return DecisionTree._train(data, regression, 0, categoricalFeaturesInfo, - impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) +return cls._train(data, regression, 0, categoricalFeaturesInfo, + impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain) + + +class WeightedEnsembleModel(JavaModelWrapper): --- End diff -- I am fine with the ```TreeEnsembleModel``` name with the understanding that we may need a new model class once we create a new namespace like ```mllib.ensemble``` in the future. @jkbradley Thoughts?
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63411351 [Test build #23528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23528/consoleFull) for PR 3320 at commit [`885abee`](https://github.com/apache/spark/commit/885abee042bb64771f53dab7814ed914a68b62a1). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63416979 [Test build #23527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23527/consoleFull) for PR 3320 at commit [`dae7fc0`](https://github.com/apache/spark/commit/dae7fc01d1df78e3d4f9a18b90ed553eff48edaa). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63416983 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23527/
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63418194 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23528/
[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63418185 [Test build #23528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23528/consoleFull) for PR 3320 at commit [`885abee`](https://github.com/apache/spark/commit/885abee042bb64771f53dab7814ed914a68b62a1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class WeightedEnsembleModel(JavaModelWrapper):` * `class RandomForestModel(WeightedEnsembleModel):` * `class RandomForest(object):`