http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-decision-tree.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-decision-tree.html b/site/docs/2.1.0/mllib-decision-tree.html index 1a3d865..991610e 100644 --- a/site/docs/2.1.0/mllib-decision-tree.html +++ b/site/docs/2.1.0/mllib-decision-tree.html @@ -307,23 +307,23 @@ <ul id="markdown-toc"> - <li><a href="#basic-algorithm" id="markdown-toc-basic-algorithm">Basic algorithm</a> <ul> - <li><a href="#node-impurity-and-information-gain" id="markdown-toc-node-impurity-and-information-gain">Node impurity and information gain</a></li> - <li><a href="#split-candidates" id="markdown-toc-split-candidates">Split candidates</a></li> - <li><a href="#stopping-rule" id="markdown-toc-stopping-rule">Stopping rule</a></li> + <li><a href="#basic-algorithm">Basic algorithm</a> <ul> + <li><a href="#node-impurity-and-information-gain">Node impurity and information gain</a></li> + <li><a href="#split-candidates">Split candidates</a></li> + <li><a href="#stopping-rule">Stopping rule</a></li> </ul> </li> - <li><a href="#usage-tips" id="markdown-toc-usage-tips">Usage tips</a> <ul> - <li><a href="#problem-specification-parameters" id="markdown-toc-problem-specification-parameters">Problem specification parameters</a></li> - <li><a href="#stopping-criteria" id="markdown-toc-stopping-criteria">Stopping criteria</a></li> - <li><a href="#tunable-parameters" id="markdown-toc-tunable-parameters">Tunable parameters</a></li> - <li><a href="#caching-and-checkpointing" id="markdown-toc-caching-and-checkpointing">Caching and checkpointing</a></li> + <li><a href="#usage-tips">Usage tips</a> <ul> + <li><a href="#problem-specification-parameters">Problem specification parameters</a></li> + <li><a href="#stopping-criteria">Stopping criteria</a></li> + <li><a href="#tunable-parameters">Tunable parameters</a></li> + <li><a 
href="#caching-and-checkpointing">Caching and checkpointing</a></li> </ul> </li> - <li><a href="#scaling" id="markdown-toc-scaling">Scaling</a></li> - <li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul> - <li><a href="#classification" id="markdown-toc-classification">Classification</a></li> - <li><a href="#regression" id="markdown-toc-regression">Regression</a></li> + <li><a href="#scaling">Scaling</a></li> + <li><a href="#examples">Examples</a> <ul> + <li><a href="#classification">Classification</a></li> + <li><a href="#regression">Regression</a></li> </ul> </li> </ul> @@ -548,7 +548,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree"><code>DecisionTree</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel"><code>DecisionTreeModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.DecisionTree</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.DecisionTree</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.model.DecisionTreeModel</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -588,7 +588,7 @@ maximum tree depth of 5. 
The test error is calculated to measure the algorithm a <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/tree/DecisionTree.html"><code>DecisionTree</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/tree/model/DecisionTreeModel.html"><code>DecisionTreeModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.Map</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -604,8 +604,8 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.tree.model.DecisionTreeModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span> -<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaDecisionTreeClassificationExample"</span><span class="o">);</span> -<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> +<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span 
class="s">"JavaDecisionTreeClassificationExample"</span><span class="o">);</span> +<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> <span class="c1">// Load and parse the data file.</span> <span class="n">String</span> <span class="n">datapath</span> <span class="o">=</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">;</span> @@ -657,30 +657,30 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree"><code>DecisionTree</code> Python docs</a> and <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel"><code>DecisionTreeModel</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">DecisionTree</span><span class="p">,</span> <span class="n">DecisionTreeModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">DecisionTree</span><span class="p">,</span> <span class="n">DecisionTreeModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data file into an RDD of LabeledPoint.</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> -<span class="c"># Split the 
data into training and test sets (30% held out for testing)</span> +<span class="c1"># Load and parse the data file into an RDD of LabeledPoint.</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s1">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> +<span class="c1"># Split the data into training and test sets (30% held out for testing)</span> <span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">testData</span><span class="p">)</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">])</span> -<span class="c"># Train a DecisionTree model.</span> -<span class="c"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> +<span class="c1"># Train a DecisionTree model.</span> +<span class="c1"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">DecisionTree</span><span class="o">.</span><span class="n">trainClassifier</span><span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">numClasses</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">categoricalFeaturesInfo</span><span class="o">=</span><span class="p">{},</span> - <span class="n">impurity</span><span class="o">=</span><span class="s">'gini'</span><span class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> + <span 
class="n">impurity</span><span class="o">=</span><span class="s1">'gini'</span><span class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> -<span class="c"># Evaluate model on test instances and compute test error</span> +<span class="c1"># Evaluate model on test instances and compute test error</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">))</span> <span class="n">labelsAndPredictions</span> <span class="o">=</span> <span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">lp</span><span class="p">:</span> <span class="n">lp</span><span class="o">.</span><span class="n">label</span><span class="p">)</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> <span class="n">testErr</span> <span class="o">=</span> <span class="n">labelsAndPredictions</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="n">v</span> <span class="o">!=</span> <span class="n">p</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="o">/</span> <span class="nb">float</span><span 
class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">count</span><span class="p">())</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Test Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testErr</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Learned classification tree model:'</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Test Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testErr</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Learned classification tree model:'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">toDebugString</span><span class="p">())</span> -<span class="c"># Save and load model</span> -<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myDecisionTreeClassificationModel"</span><span class="p">)</span> -<span class="n">sameModel</span> <span class="o">=</span> <span class="n">DecisionTreeModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myDecisionTreeClassificationModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myDecisionTreeClassificationModel"</span><span class="p">)</span> +<span class="n">sameModel</span> <span class="o">=</span> <span 
class="n">DecisionTreeModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myDecisionTreeClassificationModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.</small></div> </div> @@ -701,7 +701,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree"><code>DecisionTree</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel"><code>DecisionTreeModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.DecisionTree</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.DecisionTree</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.model.DecisionTreeModel</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -740,7 +740,7 @@ depth of 5. 
The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/tree/DecisionTree.html"><code>DecisionTree</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/tree/model/DecisionTreeModel.html"><code>DecisionTreeModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.Map</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -757,8 +757,8 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.tree.model.DecisionTreeModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span> -<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaDecisionTreeRegressionExample"</span><span class="o">);</span> -<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> +<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span 
class="s">"JavaDecisionTreeRegressionExample"</span><span class="o">);</span> +<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> <span class="c1">// Load and parse the data file.</span> <span class="n">String</span> <span class="n">datapath</span> <span class="o">=</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">;</span> @@ -814,31 +814,31 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree"><code>DecisionTree</code> Python docs</a> and <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel"><code>DecisionTreeModel</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">DecisionTree</span><span class="p">,</span> <span class="n">DecisionTreeModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">DecisionTree</span><span class="p">,</span> <span class="n">DecisionTreeModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data file into an RDD of LabeledPoint.</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> -<span class="c"># Split the data into 
training and test sets (30% held out for testing)</span> +<span class="c1"># Load and parse the data file into an RDD of LabeledPoint.</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s1">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> +<span class="c1"># Split the data into training and test sets (30% held out for testing)</span> <span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">testData</span><span class="p">)</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">])</span> -<span class="c"># Train a DecisionTree model.</span> -<span class="c"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> +<span class="c1"># Train a DecisionTree model.</span> +<span class="c1"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">DecisionTree</span><span class="o">.</span><span class="n">trainRegressor</span><span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">categoricalFeaturesInfo</span><span class="o">=</span><span class="p">{},</span> - <span class="n">impurity</span><span class="o">=</span><span class="s">'variance'</span><span class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> + <span class="n">impurity</span><span class="o">=</span><span class="s1">'variance'</span><span class="p">,</span> <span 
class="n">maxDepth</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> -<span class="c"># Evaluate model on test instances and compute test error</span> +<span class="c1"># Evaluate model on test instances and compute test error</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">))</span> <span class="n">labelsAndPredictions</span> <span class="o">=</span> <span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">lp</span><span class="p">:</span> <span class="n">lp</span><span class="o">.</span><span class="n">label</span><span class="p">)</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> <span class="n">testMSE</span> <span class="o">=</span> <span class="n">labelsAndPredictions</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span 
class="o">/</span>\ <span class="nb">float</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">count</span><span class="p">())</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Test Mean Squared Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testMSE</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Learned regression tree model:'</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Test Mean Squared Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testMSE</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Learned regression tree model:'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">toDebugString</span><span class="p">())</span> -<span class="c"># Save and load model</span> -<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myDecisionTreeRegressionModel"</span><span class="p">)</span> -<span class="n">sameModel</span> <span class="o">=</span> <span class="n">DecisionTreeModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myDecisionTreeRegressionModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myDecisionTreeRegressionModel"</span><span class="p">)</span> +<span 
class="n">sameModel</span> <span class="o">=</span> <span class="n">DecisionTreeModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myDecisionTreeRegressionModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/decision_tree_regression_example.py" in the Spark repo.</small></div> </div>
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-dimensionality-reduction.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-dimensionality-reduction.html b/site/docs/2.1.0/mllib-dimensionality-reduction.html index 239d2c1..0d67e32 100644 --- a/site/docs/2.1.0/mllib-dimensionality-reduction.html +++ b/site/docs/2.1.0/mllib-dimensionality-reduction.html @@ -331,12 +331,12 @@ <ul id="markdown-toc"> - <li><a href="#singular-value-decomposition-svd" id="markdown-toc-singular-value-decomposition-svd">Singular value decomposition (SVD)</a> <ul> - <li><a href="#performance" id="markdown-toc-performance">Performance</a></li> - <li><a href="#svd-example" id="markdown-toc-svd-example">SVD Example</a></li> + <li><a href="#singular-value-decomposition-svd">Singular value decomposition (SVD)</a> <ul> + <li><a href="#performance">Performance</a></li> + <li><a href="#svd-example">SVD Example</a></li> </ul> </li> - <li><a href="#principal-component-analysis-pca" id="markdown-toc-principal-component-analysis-pca">Principal component analysis (PCA)</a></li> + <li><a href="#principal-component-analysis-pca">Principal component analysis (PCA)</a></li> </ul> <p><a href="http://en.wikipedia.org/wiki/Dimensionality_reduction">Dimensionality reduction</a> is the process @@ -354,7 +354,7 @@ factorizes a matrix into three matrices: $U$, $\Sigma$, and $V$ such that</p> A = U \Sigma V^T, \]</code></p> -<p>where</p> +<p>where </p> <ul> <li>$U$ is an orthonormal matrix, whose columns are called left singular vectors,</li> @@ -396,13 +396,13 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.</li <h3 id="svd-example">SVD Example</h3> <p><code>spark.mllib</code> provides SVD functionality to row-oriented matrices, provided in the -<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.</p> +<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class. 
</p> <div class="codetabs"> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.linalg.SingularValueDecomposition"><code>SingularValueDecomposition</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrix</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrix</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.SingularValueDecomposition</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vector</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> @@ -431,7 +431,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.</li <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/linalg/SingularValueDecomposition.html"><code>SingularValueDecomposition</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.LinkedList</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.LinkedList</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaSparkContext</span><span class="o">;</span> @@ -450,10 +450,10 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.</li <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Vector</span><span class="o">></span> <span class="n">rows</span> <span class="o">=</span> <span class="n">jsc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span><span 
class="n">rowsList</span><span class="o">);</span> <span class="c1">// Create a RowMatrix from JavaRDD<Vector>.</span> -<span class="n">RowMatrix</span> <span class="n">mat</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">RowMatrix</span><span class="o">(</span><span class="n">rows</span><span class="o">.</span><span class="na">rdd</span><span class="o">());</span> +<span class="n">RowMatrix</span> <span class="n">mat</span> <span class="o">=</span> <span class="k">new</span> <span class="n">RowMatrix</span><span class="o">(</span><span class="n">rows</span><span class="o">.</span><span class="na">rdd</span><span class="o">());</span> <span class="c1">// Compute the top 3 singular values and corresponding singular vectors.</span> -<span class="n">SingularValueDecomposition</span><span class="o"><</span><span class="n">RowMatrix</span><span class="o">,</span> <span class="n">Matrix</span><span class="o">></span> <span class="n">svd</span> <span class="o">=</span> <span class="n">mat</span><span class="o">.</span><span class="na">computeSVD</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="kc">true</span><span class="o">,</span> <span class="mf">1.0</span><span class="n">E</span><span class="o">-</span><span class="mi">9</span><span class="n">d</span><span class="o">);</span> +<span class="n">SingularValueDecomposition</span><span class="o"><</span><span class="n">RowMatrix</span><span class="o">,</span> <span class="n">Matrix</span><span class="o">></span> <span class="n">svd</span> <span class="o">=</span> <span class="n">mat</span><span class="o">.</span><span class="na">computeSVD</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="kc">true</span><span class="o">,</span> <span class="mf">1.0E-9d</span><span class="o">);</span> <span class="n">RowMatrix</span> <span class="n">U</span> <span class="o">=</span> <span class="n">svd</span><span 
class="o">.</span><span class="na">U</span><span class="o">();</span> <span class="n">Vector</span> <span class="n">s</span> <span class="o">=</span> <span class="n">svd</span><span class="o">.</span><span class="na">s</span><span class="o">();</span> <span class="n">Matrix</span> <span class="n">V</span> <span class="o">=</span> <span class="n">svd</span><span class="o">.</span><span class="na">V</span><span class="o">();</span> @@ -489,7 +489,7 @@ and use them to project the vectors into a low-dimensional space.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix"><code>RowMatrix</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrix</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrix</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.distributed.RowMatrix</span> @@ -516,7 +516,7 @@ and use them to project the vectors into a low-dimensional space while keeping a <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.PCA"><code>PCA</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.PCA</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.PCA</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span> <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> @@ -547,7 +547,7 @@ The number of columns should be small, e.g, less than 1000.</p> 
<p>Refer to the <a href="api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html"><code>RowMatrix</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.LinkedList</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.LinkedList</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaSparkContext</span><span class="o">;</span> @@ -565,7 +565,7 @@ The number of columns should be small, e.g, less than 1000.</p> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Vector</span><span class="o">></span> <span class="n">rows</span> <span class="o">=</span> <span class="n">JavaSparkContext</span><span class="o">.</span><span class="na">fromSparkContext</span><span class="o">(</span><span class="n">sc</span><span class="o">).</span><span class="na">parallelize</span><span class="o">(</span><span class="n">rowsList</span><span class="o">);</span> <span class="c1">// Create a RowMatrix from JavaRDD<Vector>.</span> -<span class="n">RowMatrix</span> <span class="n">mat</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">RowMatrix</span><span class="o">(</span><span class="n">rows</span><span class="o">.</span><span class="na">rdd</span><span class="o">());</span> +<span class="n">RowMatrix</span> <span class="n">mat</span> <span class="o">=</span> <span class="k">new</span> <span class="n">RowMatrix</span><span class="o">(</span><span class="n">rows</span><span class="o">.</span><span class="na">rdd</span><span class="o">());</span> <span class="c1">// Compute the top 3 principal components.</span> <span class="n">Matrix</span> <span class="n">pc</span> <span class="o">=</span> <span 
class="n">mat</span><span class="o">.</span><span class="na">computePrincipalComponents</span><span class="o">(</span><span class="mi">3</span><span class="o">);</span> http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-ensembles.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-ensembles.html b/site/docs/2.1.0/mllib-ensembles.html index ab17ce5..604c546 100644 --- a/site/docs/2.1.0/mllib-ensembles.html +++ b/site/docs/2.1.0/mllib-ensembles.html @@ -307,33 +307,33 @@ <ul id="markdown-toc"> - <li><a href="#gradient-boosted-trees-vs-random-forests" id="markdown-toc-gradient-boosted-trees-vs-random-forests">Gradient-Boosted Trees vs. Random Forests</a></li> - <li><a href="#random-forests" id="markdown-toc-random-forests">Random Forests</a> <ul> - <li><a href="#basic-algorithm" id="markdown-toc-basic-algorithm">Basic algorithm</a> <ul> - <li><a href="#training" id="markdown-toc-training">Training</a></li> - <li><a href="#prediction" id="markdown-toc-prediction">Prediction</a></li> + <li><a href="#gradient-boosted-trees-vs-random-forests">Gradient-Boosted Trees vs. 
Random Forests</a></li> + <li><a href="#random-forests">Random Forests</a> <ul> + <li><a href="#basic-algorithm">Basic algorithm</a> <ul> + <li><a href="#training">Training</a></li> + <li><a href="#prediction">Prediction</a></li> </ul> </li> - <li><a href="#usage-tips" id="markdown-toc-usage-tips">Usage tips</a></li> - <li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul> - <li><a href="#classification" id="markdown-toc-classification">Classification</a></li> - <li><a href="#regression" id="markdown-toc-regression">Regression</a></li> + <li><a href="#usage-tips">Usage tips</a></li> + <li><a href="#examples">Examples</a> <ul> + <li><a href="#classification">Classification</a></li> + <li><a href="#regression">Regression</a></li> </ul> </li> </ul> </li> - <li><a href="#gradient-boosted-trees-gbts" id="markdown-toc-gradient-boosted-trees-gbts">Gradient-Boosted Trees (GBTs)</a> <ul> - <li><a href="#basic-algorithm-1" id="markdown-toc-basic-algorithm-1">Basic algorithm</a> <ul> - <li><a href="#losses" id="markdown-toc-losses">Losses</a></li> + <li><a href="#gradient-boosted-trees-gbts">Gradient-Boosted Trees (GBTs)</a> <ul> + <li><a href="#basic-algorithm-1">Basic algorithm</a> <ul> + <li><a href="#losses">Losses</a></li> </ul> </li> - <li><a href="#usage-tips-1" id="markdown-toc-usage-tips-1">Usage tips</a> <ul> - <li><a href="#validation-while-training" id="markdown-toc-validation-while-training">Validation while training</a></li> + <li><a href="#usage-tips-1">Usage tips</a> <ul> + <li><a href="#validation-while-training">Validation while training</a></li> </ul> </li> - <li><a href="#examples-1" id="markdown-toc-examples-1">Examples</a> <ul> - <li><a href="#classification-1" id="markdown-toc-classification-1">Classification</a></li> - <li><a href="#regression-1" id="markdown-toc-regression-1">Regression</a></li> + <li><a href="#examples-1">Examples</a> <ul> + <li><a href="#classification-1">Classification</a></li> + <li><a 
href="#regression-1">Regression</a></li> </ul> </li> </ul> @@ -450,7 +450,7 @@ The test error is calculated to measure the algorithm accuracy.</p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$"><code>RandomForest</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel"><code>RandomForestModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.RandomForest</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.RandomForest</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.model.RandomForestModel</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -492,7 +492,7 @@ The test error is calculated to measure the algorithm accuracy.</p> <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/tree/RandomForest.html"><code>RandomForest</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/tree/model/RandomForestModel.html"><code>RandomForestModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -507,8 +507,8 @@ The test error is calculated to measure the algorithm accuracy.</p> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.tree.model.RandomForestModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span> 
-<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaRandomForestClassificationExample"</span><span class="o">);</span> -<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> +<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaRandomForestClassificationExample"</span><span class="o">);</span> +<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> <span class="c1">// Load and parse the data file.</span> <span class="n">String</span> <span class="n">datapath</span> <span class="o">=</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">;</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">></span> <span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="na">loadLibSVMFile</span><span class="o">(</span><span class="n">jsc</span><span class="o">.</span><span class="na">sc</span><span class="o">(),</span> <span class="n">datapath</span><span class="o">).</span><span class="na">toJavaRDD</span><span class="o">();</span> @@ -561,33 +561,33 @@ The test error is calculated to measure the algorithm accuracy.</p> <div data-lang="python"> <p>Refer to the <a 
href="api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest"><code>RandomForest</code> Python docs</a> and <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel"><code>RandomForest</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">RandomForest</span><span class="p">,</span> <span class="n">RandomForestModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">RandomForest</span><span class="p">,</span> <span class="n">RandomForestModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data file into an RDD of LabeledPoint.</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> -<span class="c"># Split the data into training and test sets (30% held out for testing)</span> +<span class="c1"># Load and parse the data file into an RDD of LabeledPoint.</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s1">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> +<span class="c1"># Split the data into training and test sets (30% held out for testing)</span> <span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">testData</span><span class="p">)</span> <span 
class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">])</span> -<span class="c"># Train a RandomForest model.</span> -<span class="c"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> -<span class="c"># Note: Use larger numTrees in practice.</span> -<span class="c"># Setting featureSubsetStrategy="auto" lets the algorithm choose.</span> +<span class="c1"># Train a RandomForest model.</span> +<span class="c1"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> +<span class="c1"># Note: Use larger numTrees in practice.</span> +<span class="c1"># Setting featureSubsetStrategy="auto" lets the algorithm choose.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">RandomForest</span><span class="o">.</span><span class="n">trainClassifier</span><span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">numClasses</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">categoricalFeaturesInfo</span><span class="o">=</span><span class="p">{},</span> - <span class="n">numTrees</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">featureSubsetStrategy</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> - <span class="n">impurity</span><span class="o">=</span><span class="s">'gini'</span><span class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> + <span class="n">numTrees</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">featureSubsetStrategy</span><span 
class="o">=</span><span class="s2">"auto"</span><span class="p">,</span> + <span class="n">impurity</span><span class="o">=</span><span class="s1">'gini'</span><span class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> -<span class="c"># Evaluate model on test instances and compute test error</span> +<span class="c1"># Evaluate model on test instances and compute test error</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">))</span> <span class="n">labelsAndPredictions</span> <span class="o">=</span> <span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">lp</span><span class="p">:</span> <span class="n">lp</span><span class="o">.</span><span class="n">label</span><span class="p">)</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> <span class="n">testErr</span> <span class="o">=</span> <span class="n">labelsAndPredictions</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="n">v</span> <span class="o">!=</span> <span class="n">p</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span 
class="p">()</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">count</span><span class="p">())</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Test Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testErr</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Learned classification forest model:'</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Test Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testErr</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Learned classification forest model:'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">toDebugString</span><span class="p">())</span> -<span class="c"># Save and load model</span> -<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myRandomForestClassificationModel"</span><span class="p">)</span> -<span class="n">sameModel</span> <span class="o">=</span> <span class="n">RandomForestModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myRandomForestClassificationModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myRandomForestClassificationModel"</span><span 
class="p">)</span> +<span class="n">sameModel</span> <span class="o">=</span> <span class="n">RandomForestModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myRandomForestClassificationModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/random_forest_classification_example.py" in the Spark repo.</small></div> </div> @@ -608,7 +608,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$"><code>RandomForest</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel"><code>RandomForestModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.RandomForest</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.RandomForest</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.model.RandomForestModel</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -650,7 +650,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/tree/RandomForest.html"><code>RandomForest</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/tree/model/RandomForestModel.html"><code>RandomForestModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span 
class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.Map</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -667,8 +667,8 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.SparkConf</span><span class="o">;</span> -<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaRandomForestRegressionExample"</span><span class="o">);</span> -<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> +<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaRandomForestRegressionExample"</span><span class="o">);</span> +<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> <span class="c1">// Load and parse the data file.</span> <span class="n">String</span> <span class="n">datapath</span> <span class="o">=</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">;</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">></span> <span class="n">data</span> 
<span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="na">loadLibSVMFile</span><span class="o">(</span><span class="n">jsc</span><span class="o">.</span><span class="na">sc</span><span class="o">(),</span> <span class="n">datapath</span><span class="o">).</span><span class="na">toJavaRDD</span><span class="o">();</span> @@ -725,34 +725,34 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest"><code>RandomForest</code> Python docs</a> and <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel"><code>RandomForest</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">RandomForest</span><span class="p">,</span> <span class="n">RandomForestModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">RandomForest</span><span class="p">,</span> <span class="n">RandomForestModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data file into an RDD of LabeledPoint.</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> -<span class="c"># Split the data into training and test sets (30% held out for testing)</span> +<span class="c1"># Load and parse the data file into an RDD of LabeledPoint.</span> +<span class="n">data</span> <span class="o">=</span> <span 
class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s1">'data/mllib/sample_libsvm_data.txt'</span><span class="p">)</span> +<span class="c1"># Split the data into training and test sets (30% held out for testing)</span> <span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">testData</span><span class="p">)</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">])</span> -<span class="c"># Train a RandomForest model.</span> -<span class="c"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> -<span class="c"># Note: Use larger numTrees in practice.</span> -<span class="c"># Setting featureSubsetStrategy="auto" lets the algorithm choose.</span> +<span class="c1"># Train a RandomForest model.</span> +<span class="c1"># Empty categoricalFeaturesInfo indicates all features are continuous.</span> +<span class="c1"># Note: Use larger numTrees in practice.</span> +<span class="c1"># Setting featureSubsetStrategy="auto" lets the algorithm choose.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">RandomForest</span><span class="o">.</span><span class="n">trainRegressor</span><span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">categoricalFeaturesInfo</span><span class="o">=</span><span class="p">{},</span> - <span class="n">numTrees</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">featureSubsetStrategy</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> - <span class="n">impurity</span><span class="o">=</span><span class="s">'variance'</span><span 
class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> + <span class="n">numTrees</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">featureSubsetStrategy</span><span class="o">=</span><span class="s2">"auto"</span><span class="p">,</span> + <span class="n">impurity</span><span class="o">=</span><span class="s1">'variance'</span><span class="p">,</span> <span class="n">maxDepth</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">maxBins</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> -<span class="c"># Evaluate model on test instances and compute test error</span> +<span class="c1"># Evaluate model on test instances and compute test error</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">))</span> <span class="n">labelsAndPredictions</span> <span class="o">=</span> <span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">lp</span><span class="p">:</span> <span class="n">lp</span><span class="o">.</span><span class="n">label</span><span class="p">)</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> <span class="n">testMSE</span> <span class="o">=</span> <span class="n">labelsAndPredictions</span><span 
class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span>\ <span class="nb">float</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">count</span><span class="p">())</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Test Mean Squared Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testMSE</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Learned regression forest model:'</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Test Mean Squared Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testMSE</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Learned regression forest model:'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">toDebugString</span><span class="p">())</span> -<span class="c"># Save and load model</span> -<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myRandomForestRegressionModel"</span><span class="p">)</span> -<span class="n">sameModel</span> <span 
class="o">=</span> <span class="n">RandomForestModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myRandomForestRegressionModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myRandomForestRegressionModel"</span><span class="p">)</span> +<span class="n">sameModel</span> <span class="o">=</span> <span class="n">RandomForestModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myRandomForestRegressionModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/random_forest_regression_example.py" in the Spark repo.</small></div> </div> @@ -859,7 +859,7 @@ The test error is calculated to measure the algorithm accuracy.</p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees"><code>GradientBoostedTrees</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel"><code>GradientBoostedTreesModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.GradientBoostedTrees</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.GradientBoostedTrees</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.configuration.BoostingStrategy</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.model.GradientBoostedTreesModel</span> <span 
class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -901,7 +901,7 @@ The test error is calculated to measure the algorithm accuracy.</p> <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/tree/GradientBoostedTrees.html"><code>GradientBoostedTrees</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html"><code>GradientBoostedTreesModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.Map</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -918,9 +918,9 @@ The test error is calculated to measure the algorithm accuracy.</p> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.tree.model.GradientBoostedTreesModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span> -<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">()</span> +<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">()</span> <span class="o">.</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaGradientBoostedTreesClassificationExample"</span><span class="o">);</span> -<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span 
class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> +<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> <span class="c1">// Load and parse the data file.</span> <span class="n">String</span> <span class="n">datapath</span> <span class="o">=</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">;</span> @@ -972,32 +972,32 @@ The test error is calculated to measure the algorithm accuracy.</p> <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTrees"><code>GradientBoostedTrees</code> Python docs</a> and <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTreesModel"><code>GradientBoostedTreesModel</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">GradientBoostedTrees</span><span class="p">,</span> <span class="n">GradientBoostedTreesModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">GradientBoostedTrees</span><span class="p">,</span> <span class="n">GradientBoostedTreesModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data file.</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> -<span 
class="c"># Split the data into training and test sets (30% held out for testing)</span> +<span class="c1"># Load and parse the data file.</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> +<span class="c1"># Split the data into training and test sets (30% held out for testing)</span> <span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">testData</span><span class="p">)</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">])</span> -<span class="c"># Train a GradientBoostedTrees model.</span> -<span class="c"># Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.</span> -<span class="c"># (b) Use more iterations in practice.</span> +<span class="c1"># Train a GradientBoostedTrees model.</span> +<span class="c1"># Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.</span> +<span class="c1"># (b) Use more iterations in practice.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">GradientBoostedTrees</span><span class="o">.</span><span class="n">trainClassifier</span><span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">categoricalFeaturesInfo</span><span class="o">=</span><span class="p">{},</span> <span class="n">numIterations</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> -<span class="c"># Evaluate model on test instances and compute test error</span> +<span class="c1"># Evaluate model on test instances and compute test 
error</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">))</span> <span class="n">labelsAndPredictions</span> <span class="o">=</span> <span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">lp</span><span class="p">:</span> <span class="n">lp</span><span class="o">.</span><span class="n">label</span><span class="p">)</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> <span class="n">testErr</span> <span class="o">=</span> <span class="n">labelsAndPredictions</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="n">v</span> <span class="o">!=</span> <span class="n">p</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">count</span><span class="p">())</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Test Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testErr</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Learned classification GBT model:'</span><span 
class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Test Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testErr</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Learned classification GBT model:'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">toDebugString</span><span class="p">())</span> -<span class="c"># Save and load model</span> -<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myGradientBoostingClassificationModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myGradientBoostingClassificationModel"</span><span class="p">)</span> <span class="n">sameModel</span> <span class="o">=</span> <span class="n">GradientBoostedTreesModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> - <span class="s">"target/tmp/myGradientBoostingClassificationModel"</span><span class="p">)</span> + <span class="s2">"target/tmp/myGradientBoostingClassificationModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/gradient_boosting_classification_example.py" in the Spark repo.</small></div> </div> @@ -1018,7 +1018,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees"><code>GradientBoostedTrees</code> Scala 
docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel"><code>GradientBoostedTreesModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.GradientBoostedTrees</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.GradientBoostedTrees</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.configuration.BoostingStrategy</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.tree.model.GradientBoostedTreesModel</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -1059,7 +1059,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/tree/GradientBoostedTrees.html"><code>GradientBoostedTrees</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html"><code>GradientBoostedTreesModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.HashMap</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.Map</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -1077,9 +1077,9 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.tree.model.GradientBoostedTreesModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span> -<span 
class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">()</span> +<span class="n">SparkConf</span> <span class="n">sparkConf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">()</span> <span class="o">.</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"JavaGradientBoostedTreesRegressionExample"</span><span class="o">);</span> -<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> +<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">sparkConf</span><span class="o">);</span> <span class="c1">// Load and parse the data file.</span> <span class="n">String</span> <span class="n">datapath</span> <span class="o">=</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">;</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">></span> <span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="na">loadLibSVMFile</span><span class="o">(</span><span class="n">jsc</span><span class="o">.</span><span class="na">sc</span><span class="o">(),</span> <span class="n">datapath</span><span class="o">).</span><span class="na">toJavaRDD</span><span class="o">();</span> @@ -1135,32 +1135,32 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTrees"><code>GradientBoostedTrees</code> Python docs</a> and <a 
href="api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTreesModel"><code>GradientBoostedTreesModel</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">GradientBoostedTrees</span><span class="p">,</span> <span class="n">GradientBoostedTreesModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.tree</span> <span class="kn">import</span> <span class="n">GradientBoostedTrees</span><span class="p">,</span> <span class="n">GradientBoostedTreesModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data file.</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> -<span class="c"># Split the data into training and test sets (30% held out for testing)</span> +<span class="c1"># Load and parse the data file.</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> +<span class="c1"># Split the data into training and test sets (30% held out for testing)</span> <span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">testData</span><span class="p">)</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span 
class="p">([</span><span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">])</span> -<span class="c"># Train a GradientBoostedTrees model.</span> -<span class="c"># Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.</span> -<span class="c"># (b) Use more iterations in practice.</span> +<span class="c1"># Train a GradientBoostedTrees model.</span> +<span class="c1"># Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.</span> +<span class="c1"># (b) Use more iterations in practice.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">GradientBoostedTrees</span><span class="o">.</span><span class="n">trainRegressor</span><span class="p">(</span><span class="n">trainingData</span><span class="p">,</span> <span class="n">categoricalFeaturesInfo</span><span class="o">=</span><span class="p">{},</span> <span class="n">numIterations</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> -<span class="c"># Evaluate model on test instances and compute test error</span> +<span class="c1"># Evaluate model on test instances and compute test error</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">))</span> <span class="n">labelsAndPredictions</span> <span class="o">=</span> <span class="n">testData</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">lp</span><span class="p">:</span> <span class="n">lp</span><span class="o">.</span><span class="n">label</span><span 
class="p">)</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> <span class="n">testMSE</span> <span class="o">=</span> <span class="n">labelsAndPredictions</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span>\ <span class="nb">float</span><span class="p">(</span><span class="n">testData</span><span class="o">.</span><span class="n">count</span><span class="p">())</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Test Mean Squared Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testMSE</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">'Learned regression GBT model:'</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Test Mean Squared Error = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">testMSE</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s1">'Learned regression GBT model:'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">toDebugString</span><span class="p">())</span> -<span class="c"># Save and load model</span> -<span 
class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myGradientBoostingRegressionModel"</span><span class="p">)</span> -<span class="n">sameModel</span> <span class="o">=</span> <span class="n">GradientBoostedTreesModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myGradientBoostingRegressionModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myGradientBoostingRegressionModel"</span><span class="p">)</span> +<span class="n">sameModel</span> <span class="o">=</span> <span class="n">GradientBoostedTreesModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myGradientBoostingRegressionModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/gradient_boosting_regression_example.py" in the Spark repo.</small></div> </div>
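Note on the Python snippets in this diff: they compute the test error and the MSE with `lambda (v, p): ...`, a tuple-parameter form that is only valid in Python 2. The sketch below is not part of the commit. It restates the same two metrics in plain Python 3, using an ordinary list as a stand-in for the `labelsAndPredictions` RDD. The helper names `classification_error` and `mean_squared_error` and the sample data are hypothetical, chosen only for illustration.

```python
# Minimal sketch, assuming a non-empty iterable of (label, prediction) pairs.
# A plain list stands in for the labelsAndPredictions RDD from the docs.


def classification_error(labels_and_predictions):
    """Fraction of pairs whose prediction differs from the label."""
    pairs = list(labels_and_predictions)
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))


def mean_squared_error(labels_and_predictions):
    """Average of the squared (label - prediction) differences."""
    pairs = list(labels_and_predictions)
    squared = sum((label - pred) ** 2 for label, pred in pairs)
    return squared / float(len(pairs))


# Hypothetical sample: one of four predictions is wrong.
sample = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]

print("Test Error = " + str(classification_error(sample)))
print("Test Mean Squared Error = " + str(mean_squared_error(sample)))
```

On an actual RDD, the same Python 3 compatible logic can be written by indexing into the pair, for example `labelsAndPredictions.filter(lambda lp: lp[0] != lp[1])` in place of `lambda (v, p): v != p`.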