http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-clustering.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/ml-clustering.html b/site/docs/2.1.0/ml-clustering.html index e225281..df38605 100644 --- a/site/docs/2.1.0/ml-clustering.html +++ b/site/docs/2.1.0/ml-clustering.html @@ -313,21 +313,21 @@ about these algorithms.</p> <p><strong>Table of Contents</strong></p> <ul id="markdown-toc"> - <li><a href="#k-means" id="markdown-toc-k-means">K-means</a> <ul> - <li><a href="#input-columns" id="markdown-toc-input-columns">Input Columns</a></li> - <li><a href="#output-columns" id="markdown-toc-output-columns">Output Columns</a></li> - <li><a href="#example" id="markdown-toc-example">Example</a></li> + <li><a href="#k-means">K-means</a> <ul> + <li><a href="#input-columns">Input Columns</a></li> + <li><a href="#output-columns">Output Columns</a></li> + <li><a href="#example">Example</a></li> </ul> </li> - <li><a href="#latent-dirichlet-allocation-lda" id="markdown-toc-latent-dirichlet-allocation-lda">Latent Dirichlet allocation (LDA)</a></li> - <li><a href="#bisecting-k-means" id="markdown-toc-bisecting-k-means">Bisecting k-means</a> <ul> - <li><a href="#example-1" id="markdown-toc-example-1">Example</a></li> + <li><a href="#latent-dirichlet-allocation-lda">Latent Dirichlet allocation (LDA)</a></li> + <li><a href="#bisecting-k-means">Bisecting k-means</a> <ul> + <li><a href="#example-1">Example</a></li> </ul> </li> - <li><a href="#gaussian-mixture-model-gmm" id="markdown-toc-gaussian-mixture-model-gmm">Gaussian Mixture Model (GMM)</a> <ul> - <li><a href="#input-columns-1" id="markdown-toc-input-columns-1">Input Columns</a></li> - <li><a href="#output-columns-1" id="markdown-toc-output-columns-1">Output Columns</a></li> - <li><a href="#example-2" id="markdown-toc-example-2">Example</a></li> + <li><a href="#gaussian-mixture-model-gmm">Gaussian Mixture Model (GMM)</a> <ul> + <li><a 
href="#input-columns-1">Input Columns</a></li> + <li><a href="#output-columns-1">Output Columns</a></li> + <li><a href="#example-2">Example</a></li> </ul> </li> </ul> @@ -391,7 +391,7 @@ called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.ml.clustering.KMeans">Scala API docs</a> for more details.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.KMeans</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.KMeans</span> <span class="c1">// Loads data.</span> <span class="k">val</span> <span class="n">dataset</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">).</span><span class="n">load</span><span class="o">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="o">)</span> @@ -402,7 +402,7 @@ called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <span class="c1">// Evaluate clustering by computing Within Set Sum of Squared Errors.</span> <span class="k">val</span> <span class="nc">WSSSE</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">computeCost</span><span class="o">(</span><span class="n">dataset</span><span class="o">)</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Within Set Sum of Squared Errors = $WSSSE"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"Within Set Sum of Squared Errors = </span><span class="si">$WSSSE</span><span class="s">"</span><span class="o">)</span> <span class="c1">// Shows the result.</span> <span 
class="n">println</span><span class="o">(</span><span class="s">"Cluster Centers: "</span><span class="o">)</span> @@ -414,7 +414,7 @@ called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/ml/clustering/KMeans.html">Java API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.KMeansModel</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.KMeansModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.KMeans</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.ml.linalg.Vector</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span> @@ -424,7 +424,7 @@ called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">).</span><span class="na">load</span><span class="o">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="o">);</span> <span class="c1">// Trains a k-means model.</span> -<span class="n">KMeans</span> <span class="n">kmeans</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">KMeans</span><span class="o">().</span><span class="na">setK</span><span class="o">(</span><span class="mi">2</span><span class="o">).</span><span class="na">setSeed</span><span 
class="o">(</span><span class="mi">1L</span><span class="o">);</span> +<span class="n">KMeans</span> <span class="n">kmeans</span> <span class="o">=</span> <span class="k">new</span> <span class="n">KMeans</span><span class="o">().</span><span class="na">setK</span><span class="o">(</span><span class="mi">2</span><span class="o">).</span><span class="na">setSeed</span><span class="o">(</span><span class="mi">1L</span><span class="o">);</span> <span class="n">KMeansModel</span> <span class="n">model</span> <span class="o">=</span> <span class="n">kmeans</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">dataset</span><span class="o">);</span> <span class="c1">// Evaluate clustering by computing Within Set Sum of Squared Errors.</span> @@ -434,7 +434,7 @@ called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <span class="c1">// Shows the result.</span> <span class="n">Vector</span><span class="o">[]</span> <span class="n">centers</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="na">clusterCenters</span><span class="o">();</span> <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Cluster Centers: "</span><span class="o">);</span> -<span class="k">for</span> <span class="o">(</span><span class="n">Vector</span> <span class="nl">center:</span> <span class="n">centers</span><span class="o">)</span> <span class="o">{</span> +<span class="k">for</span> <span class="o">(</span><span class="n">Vector</span> <span class="n">center</span><span class="o">:</span> <span class="n">centers</span><span class="o">)</span> <span class="o">{</span> <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">center</span><span 
class="o">);</span> <span class="o">}</span> </pre></div> @@ -444,22 +444,22 @@ called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans">Python API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">KMeans</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">KMeans</span> -<span class="c"># Loads data.</span> -<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="p">)</span> +<span class="c1"># Loads data.</span> +<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"data/mllib/sample_kmeans_data.txt"</span><span class="p">)</span> -<span class="c"># Trains a k-means model.</span> +<span class="c1"># Trains a k-means model.</span> <span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">()</span><span class="o">.</span><span class="n">setK</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">setSeed</span><span 
class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">kmeans</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> -<span class="c"># Evaluate clustering by computing Within Set Sum of Squared Errors.</span> +<span class="c1"># Evaluate clustering by computing Within Set Sum of Squared Errors.</span> <span class="n">wssse</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">computeCost</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Within Set Sum of Squared Errors = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">wssse</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Within Set Sum of Squared Errors = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">wssse</span><span class="p">))</span> -<span class="c"># Shows the result.</span> +<span class="c1"># Shows the result.</span> <span class="n">centers</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">clusterCenters</span><span class="p">()</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Cluster Centers: "</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Cluster Centers: "</span><span class="p">)</span> <span class="k">for</span> <span class="n">center</span> <span class="ow">in</span> <span class="n">centers</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">center</span><span class="p">)</span> </pre></div> @@ -470,7 +470,7 @@ called <a 
href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmea <p>Refer to the <a href="api/R/spark.kmeans.html">R API docs</a> for more details.</p> - <div class="highlight"><pre><span class="c1"># Fit a k-means model with spark.kmeans</span> + <div class="highlight"><pre><span></span><span class="c1"># Fit a k-means model with spark.kmeans</span> irisDF <span class="o"><-</span> <span class="kp">suppressWarnings</span><span class="p">(</span>createDataFrame<span class="p">(</span>iris<span class="p">))</span> kmeansDF <span class="o"><-</span> irisDF kmeansTestDF <span class="o"><-</span> irisDF @@ -504,7 +504,7 @@ and generates a <code>LDAModel</code> as the base model. Expert users may cast a <p>Refer to the <a href="api/scala/index.html#org.apache.spark.ml.clustering.LDA">Scala API docs</a> for more details.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.LDA</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.LDA</span> <span class="c1">// Loads data.</span> <span class="k">val</span> <span class="n">dataset</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">)</span> @@ -516,8 +516,8 @@ and generates a <code>LDAModel</code> as the base model. 
Expert users may cast a <span class="k">val</span> <span class="n">ll</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">logLikelihood</span><span class="o">(</span><span class="n">dataset</span><span class="o">)</span> <span class="k">val</span> <span class="n">lp</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">logPerplexity</span><span class="o">(</span><span class="n">dataset</span><span class="o">)</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"The lower bound on the log likelihood of the entire corpus: $ll"</span><span class="o">)</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"The upper bound bound on perplexity: $lp"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"The lower bound on the log likelihood of the entire corpus: </span><span class="si">$ll</span><span class="s">"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"The upper bound bound on perplexity: </span><span class="si">$lp</span><span class="s">"</span><span class="o">)</span> <span class="c1">// Describe topics.</span> <span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">describeTopics</span><span class="o">(</span><span class="mi">3</span><span class="o">)</span> @@ -535,7 +535,7 @@ and generates a <code>LDAModel</code> as the base model. 
Expert users may cast a <p>Refer to the <a href="api/java/org/apache/spark/ml/clustering/LDA.html">Java API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.LDA</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.LDA</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.LDAModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Row</span><span class="o">;</span> @@ -546,7 +546,7 @@ and generates a <code>LDAModel</code> as the base model. Expert users may cast a <span class="o">.</span><span class="na">load</span><span class="o">(</span><span class="s">"data/mllib/sample_lda_libsvm_data.txt"</span><span class="o">);</span> <span class="c1">// Trains a LDA model.</span> -<span class="n">LDA</span> <span class="n">lda</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">LDA</span><span class="o">().</span><span class="na">setK</span><span class="o">(</span><span class="mi">10</span><span class="o">).</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span> +<span class="n">LDA</span> <span class="n">lda</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LDA</span><span class="o">().</span><span class="na">setK</span><span class="o">(</span><span class="mi">10</span><span class="o">).</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span> <span class="n">LDAModel</span> <span class="n">model</span> <span class="o">=</span> <span class="n">lda</span><span class="o">.</span><span class="na">fit</span><span 
class="o">(</span><span class="n">dataset</span><span class="o">);</span> <span class="kt">double</span> <span class="n">ll</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="na">logLikelihood</span><span class="o">(</span><span class="n">dataset</span><span class="o">);</span> @@ -570,26 +570,26 @@ and generates a <code>LDAModel</code> as the base model. Expert users may cast a <p>Refer to the <a href="api/python/pyspark.ml.html#pyspark.ml.clustering.LDA">Python API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">LDA</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">LDA</span> -<span class="c"># Loads data.</span> -<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">"data/mllib/sample_lda_libsvm_data.txt"</span><span class="p">)</span> +<span class="c1"># Loads data.</span> +<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"data/mllib/sample_lda_libsvm_data.txt"</span><span class="p">)</span> -<span class="c"># Trains a LDA model.</span> +<span class="c1"># Trains a LDA model.</span> <span class="n">lda</span> <span class="o">=</span> <span 
class="n">LDA</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">maxIter</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">lda</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> <span class="n">ll</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">logLikelihood</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> <span class="n">lp</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">logPerplexity</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> -<span class="k">print</span><span class="p">(</span><span class="s">"The lower bound on the log likelihood of the entire corpus: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">ll</span><span class="p">))</span> -<span class="k">print</span><span class="p">(</span><span class="s">"The upper bound bound on perplexity: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">lp</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"The lower bound on the log likelihood of the entire corpus: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">ll</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"The upper bound bound on perplexity: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">lp</span><span class="p">))</span> -<span class="c"># Describe topics.</span> +<span class="c1"># Describe topics.</span> 
<span class="n">topics</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">describeTopics</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> -<span class="k">print</span><span class="p">(</span><span class="s">"The topics described by their top-weighted terms:"</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"The topics described by their top-weighted terms:"</span><span class="p">)</span> <span class="n">topics</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">truncate</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> -<span class="c"># Shows the result</span> +<span class="c1"># Shows the result</span> <span class="n">transformed</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> <span class="n">transformed</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">truncate</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> </pre></div> @@ -600,7 +600,7 @@ and generates a <code>LDAModel</code> as the base model. 
Expert users may cast a <p>Refer to the <a href="api/R/spark.lda.html">R API docs</a> for more details.</p> - <div class="highlight"><pre><span class="c1"># Load training data</span> + <div class="highlight"><pre><span></span><span class="c1"># Load training data</span> df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"data/mllib/sample_lda_libsvm_data.txt"</span><span class="p">,</span> <span class="kn">source</span> <span class="o">=</span> <span class="s">"libsvm"</span><span class="p">)</span> training <span class="o"><-</span> df test <span class="o"><-</span> df @@ -641,7 +641,7 @@ moves down the hierarchy.</p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans">Scala API docs</a> for more details.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.BisectingKMeans</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.BisectingKMeans</span> <span class="c1">// Loads data.</span> <span class="k">val</span> <span class="n">dataset</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">).</span><span class="n">load</span><span class="o">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="o">)</span> @@ -652,7 +652,7 @@ moves down the hierarchy.</p> <span class="c1">// Evaluate clustering.</span> <span class="k">val</span> <span class="n">cost</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">computeCost</span><span class="o">(</span><span class="n">dataset</span><span class="o">)</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Within 
Set Sum of Squared Errors = $cost"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"Within Set Sum of Squared Errors = </span><span class="si">$cost</span><span class="s">"</span><span class="o">)</span> <span class="c1">// Shows the result.</span> <span class="n">println</span><span class="o">(</span><span class="s">"Cluster Centers: "</span><span class="o">)</span> @@ -665,7 +665,7 @@ moves down the hierarchy.</p> <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/ml/clustering/BisectingKMeans.html">Java API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.BisectingKMeans</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.BisectingKMeans</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.BisectingKMeansModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.ml.linalg.Vector</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span> @@ -675,7 +675,7 @@ moves down the hierarchy.</p> <span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">).</span><span class="na">load</span><span class="o">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="o">);</span> <span class="c1">// Trains a bisecting k-means model.</span> -<span class="n">BisectingKMeans</span> <span class="n">bkm</span> <span 
class="o">=</span> <span class="k">new</span> <span class="nf">BisectingKMeans</span><span class="o">().</span><span class="na">setK</span><span class="o">(</span><span class="mi">2</span><span class="o">).</span><span class="na">setSeed</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span> +<span class="n">BisectingKMeans</span> <span class="n">bkm</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BisectingKMeans</span><span class="o">().</span><span class="na">setK</span><span class="o">(</span><span class="mi">2</span><span class="o">).</span><span class="na">setSeed</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span> <span class="n">BisectingKMeansModel</span> <span class="n">model</span> <span class="o">=</span> <span class="n">bkm</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">dataset</span><span class="o">);</span> <span class="c1">// Evaluate clustering.</span> @@ -695,21 +695,21 @@ moves down the hierarchy.</p> <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans">Python API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">BisectingKMeans</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">BisectingKMeans</span> -<span class="c"># Loads data.</span> -<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span 
class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="p">)</span> +<span class="c1"># Loads data.</span> +<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"data/mllib/sample_kmeans_data.txt"</span><span class="p">)</span> -<span class="c"># Trains a bisecting k-means model.</span> +<span class="c1"># Trains a bisecting k-means model.</span> <span class="n">bkm</span> <span class="o">=</span> <span class="n">BisectingKMeans</span><span class="p">()</span><span class="o">.</span><span class="n">setK</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">setSeed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">bkm</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> -<span class="c"># Evaluate clustering.</span> +<span class="c1"># Evaluate clustering.</span> <span class="n">cost</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">computeCost</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Within Set Sum of Squared Errors = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">cost</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Within Set Sum of Squared Errors = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span 
class="n">cost</span><span class="p">))</span> -<span class="c"># Shows the result.</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Cluster Centers: "</span><span class="p">)</span> +<span class="c1"># Shows the result.</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Cluster Centers: "</span><span class="p">)</span> <span class="n">centers</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">clusterCenters</span><span class="p">()</span> <span class="k">for</span> <span class="n">center</span> <span class="ow">in</span> <span class="n">centers</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">center</span><span class="p">)</span> @@ -784,7 +784,7 @@ model.</p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture">Scala API docs</a> for more details.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.GaussianMixture</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.clustering.GaussianMixture</span> <span class="c1">// Loads data</span> <span class="k">val</span> <span class="n">dataset</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">).</span><span class="n">load</span><span class="o">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="o">)</span> @@ -796,8 +796,8 @@ model.</p> <span class="c1">// output parameters of mixture model model</span> <span class="k">for</span> <span class="o">(</span><span class="n">i</span> <span class="k"><-</span> <span class="mi">0</span> <span class="n">until</span> <span 
class="n">model</span><span class="o">.</span><span class="n">getK</span><span class="o">)</span> <span class="o">{</span> - <span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Gaussian $i:\nweight=${model.weights(i)}\n"</span> <span class="o">+</span> - <span class="n">s</span><span class="s">"mu=${model.gaussians(i).mean}\nsigma=\n${model.gaussians(i).cov}\n"</span><span class="o">)</span> + <span class="n">println</span><span class="o">(</span><span class="s">s"Gaussian </span><span class="si">$i</span><span class="s">:\nweight=</span><span class="si">${</span><span class="n">model</span><span class="o">.</span><span class="n">weights</span><span class="o">(</span><span class="n">i</span><span class="o">)</span><span class="si">}</span><span class="s">\n"</span> <span class="o">+</span> + <span class="s">s"mu=</span><span class="si">${</span><span class="n">model</span><span class="o">.</span><span class="n">gaussians</span><span class="o">(</span><span class="n">i</span><span class="o">).</span><span class="n">mean</span><span class="si">}</span><span class="s">\nsigma=\n</span><span class="si">${</span><span class="n">model</span><span class="o">.</span><span class="n">gaussians</span><span class="o">(</span><span class="n">i</span><span class="o">).</span><span class="n">cov</span><span class="si">}</span><span class="s">\n"</span><span class="o">)</span> <span class="o">}</span> </pre></div> <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala" in the Spark repo.</small></div> @@ -806,7 +806,7 @@ model.</p> <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/ml/clustering/GaussianMixture.html">Java API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.GaussianMixture</span><span class="o">;</span> + <div 
class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.GaussianMixture</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.ml.clustering.GaussianMixtureModel</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Row</span><span class="o">;</span> @@ -815,7 +815,7 @@ model.</p> <span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">format</span><span class="o">(</span><span class="s">"libsvm"</span><span class="o">).</span><span class="na">load</span><span class="o">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="o">);</span> <span class="c1">// Trains a GaussianMixture model</span> -<span class="n">GaussianMixture</span> <span class="n">gmm</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">GaussianMixture</span><span class="o">()</span> +<span class="n">GaussianMixture</span> <span class="n">gmm</span> <span class="o">=</span> <span class="k">new</span> <span class="n">GaussianMixture</span><span class="o">()</span> <span class="o">.</span><span class="na">setK</span><span class="o">(</span><span class="mi">2</span><span class="o">);</span> <span class="n">GaussianMixtureModel</span> <span class="n">model</span> <span class="o">=</span> <span class="n">gmm</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">dataset</span><span class="o">);</span> @@ -831,15 +831,15 @@ model.</p> <div data-lang="python"> <p>Refer to the <a 
href="api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture">Python API docs</a> for more details.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">GaussianMixture</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">GaussianMixture</span> -<span class="c"># loads data</span> -<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="p">)</span> +<span class="c1"># loads data</span> +<span class="n">dataset</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"libsvm"</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"data/mllib/sample_kmeans_data.txt"</span><span class="p">)</span> <span class="n">gmm</span> <span class="o">=</span> <span class="n">GaussianMixture</span><span class="p">()</span><span class="o">.</span><span class="n">setK</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">setSeed</span><span class="p">(</span><span class="mi">538009335</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">gmm</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> -<span 
class="k">print</span><span class="p">(</span><span class="s">"Gaussians shown as a DataFrame: "</span><span class="p">)</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Gaussians shown as a DataFrame: "</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">gaussiansDF</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">truncate</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/ml/gaussian_mixture_example.py" in the Spark repo.</small></div> @@ -849,7 +849,7 @@ model.</p> <p>Refer to the <a href="api/R/spark.gaussianMixture.html">R API docs</a> for more details.</p> - <div class="highlight"><pre><span class="c1"># Load training data</span> + <div class="highlight"><pre><span></span><span class="c1"># Load training data</span> df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"data/mllib/sample_kmeans_data.txt"</span><span class="p">,</span> <span class="kn">source</span> <span class="o">=</span> <span class="s">"libsvm"</span><span class="p">)</span> training <span class="o"><-</span> df test <span class="o"><-</span> df
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-collaborative-filtering.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/ml-collaborative-filtering.html b/site/docs/2.1.0/ml-collaborative-filtering.html index 1f63418..91e5bed 100644 --- a/site/docs/2.1.0/ml-collaborative-filtering.html +++ b/site/docs/2.1.0/ml-collaborative-filtering.html @@ -307,12 +307,12 @@ <ul id="markdown-toc"> - <li><a href="#collaborative-filtering" id="markdown-toc-collaborative-filtering">Collaborative filtering</a> <ul> - <li><a href="#explicit-vs-implicit-feedback" id="markdown-toc-explicit-vs-implicit-feedback">Explicit vs. implicit feedback</a></li> - <li><a href="#scaling-of-the-regularization-parameter" id="markdown-toc-scaling-of-the-regularization-parameter">Scaling of the regularization parameter</a></li> + <li><a href="#collaborative-filtering">Collaborative filtering</a> <ul> + <li><a href="#explicit-vs-implicit-feedback">Explicit vs. implicit feedback</a></li> + <li><a href="#scaling-of-the-regularization-parameter">Scaling of the regularization parameter</a></li> </ul> </li> - <li><a href="#examples" id="markdown-toc-examples">Examples</a></li> + <li><a href="#examples">Examples</a></li> </ul> <h2 id="collaborative-filtering">Collaborative filtering</h2> @@ -341,7 +341,7 @@ following parameters:</p> <p><strong>Note:</strong> The DataFrame-based API for ALS currently only supports integers for user and item ids. Other numeric types are supported for the user and item id columns, -but the ids must be within the integer value range.</p> +but the ids must be within the integer value range. </p> <h3 id="explicit-vs-implicit-feedback">Explicit vs. 
implicit feedback</h3> @@ -385,7 +385,7 @@ rating prediction.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.ml.recommendation.ALS"><code>ALS</code> Scala docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.evaluation.RegressionEvaluator</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.evaluation.RegressionEvaluator</span> <span class="k">import</span> <span class="nn">org.apache.spark.ml.recommendation.ALS</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">Rating</span><span class="o">(</span><span class="n">userId</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">movieId</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">rating</span><span class="k">:</span> <span class="kt">Float</span><span class="o">,</span> <span class="n">timestamp</span><span class="k">:</span> <span class="kt">Long</span><span class="o">)</span> @@ -417,7 +417,7 @@ for more details on the API.</p> <span class="o">.</span><span class="n">setLabelCol</span><span class="o">(</span><span class="s">"rating"</span><span class="o">)</span> <span class="o">.</span><span class="n">setPredictionCol</span><span class="o">(</span><span class="s">"prediction"</span><span class="o">)</span> <span class="k">val</span> <span class="n">rmse</span> <span class="k">=</span> <span class="n">evaluator</span><span class="o">.</span><span class="n">evaluate</span><span class="o">(</span><span class="n">predictions</span><span class="o">)</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Root-mean-square error = $rmse"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"Root-mean-square error = </span><span 
class="si">$rmse</span><span class="s">"</span><span class="o">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala" in the Spark repo.</small></div> @@ -425,13 +425,13 @@ for more details on the API.</p> inferred from other signals), you can set <code>implicitPrefs</code> to <code>true</code> to get better results:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">als</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ALS</span><span class="o">()</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="k">val</span> <span class="n">als</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ALS</span><span class="o">()</span> <span class="o">.</span><span class="n">setMaxIter</span><span class="o">(</span><span class="mi">5</span><span class="o">)</span> <span class="o">.</span><span class="n">setRegParam</span><span class="o">(</span><span class="mf">0.01</span><span class="o">)</span> <span class="o">.</span><span class="n">setImplicitPrefs</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">.</span><span class="n">setUserCol</span><span class="o">(</span><span class="s">"userId"</span><span class="o">)</span> <span class="o">.</span><span class="n">setItemCol</span><span class="o">(</span><span class="s">"movieId"</span><span class="o">)</span> - <span class="o">.</span><span class="n">setRatingCol</span><span class="o">(</span><span class="s">"rating"</span><span class="o">)</span></code></pre></div> + <span class="o">.</span><span class="n">setRatingCol</span><span class="o">(</span><span class="s">"rating"</span><span class="o">)</span></code></pre></figure> </div> @@ -448,7 +448,7 @@ rating prediction.</p> <p>Refer to the <a 
href="api/java/org/apache/spark/ml/recommendation/ALS.html"><code>ALS</code> Java docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.io.Serializable</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.io.Serializable</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.Function</span><span class="o">;</span> @@ -490,13 +490,13 @@ for more details on the API.</p> <span class="kd">public</span> <span class="kd">static</span> <span class="n">Rating</span> <span class="nf">parseRating</span><span class="o">(</span><span class="n">String</span> <span class="n">str</span><span class="o">)</span> <span class="o">{</span> <span class="n">String</span><span class="o">[]</span> <span class="n">fields</span> <span class="o">=</span> <span class="n">str</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">"::"</span><span class="o">);</span> <span class="k">if</span> <span class="o">(</span><span class="n">fields</span><span class="o">.</span><span class="na">length</span> <span class="o">!=</span> <span class="mi">4</span><span class="o">)</span> <span class="o">{</span> - <span class="k">throw</span> <span class="k">new</span> <span class="nf">IllegalArgumentException</span><span class="o">(</span><span class="s">"Each line must contain 4 fields"</span><span class="o">);</span> + <span class="k">throw</span> <span class="k">new</span> <span class="n">IllegalArgumentException</span><span class="o">(</span><span class="s">"Each line must contain 4 fields"</span><span class="o">);</span> <span class="o">}</span> <span class="kt">int</span> <span class="n">userId</span> <span class="o">=</span> <span 
class="n">Integer</span><span class="o">.</span><span class="na">parseInt</span><span class="o">(</span><span class="n">fields</span><span class="o">[</span><span class="mi">0</span><span class="o">]);</span> <span class="kt">int</span> <span class="n">movieId</span> <span class="o">=</span> <span class="n">Integer</span><span class="o">.</span><span class="na">parseInt</span><span class="o">(</span><span class="n">fields</span><span class="o">[</span><span class="mi">1</span><span class="o">]);</span> <span class="kt">float</span> <span class="n">rating</span> <span class="o">=</span> <span class="n">Float</span><span class="o">.</span><span class="na">parseFloat</span><span class="o">(</span><span class="n">fields</span><span class="o">[</span><span class="mi">2</span><span class="o">]);</span> <span class="kt">long</span> <span class="n">timestamp</span> <span class="o">=</span> <span class="n">Long</span><span class="o">.</span><span class="na">parseLong</span><span class="o">(</span><span class="n">fields</span><span class="o">[</span><span class="mi">3</span><span class="o">]);</span> - <span class="k">return</span> <span class="k">new</span> <span class="nf">Rating</span><span class="o">(</span><span class="n">userId</span><span class="o">,</span> <span class="n">movieId</span><span class="o">,</span> <span class="n">rating</span><span class="o">,</span> <span class="n">timestamp</span><span class="o">);</span> + <span class="k">return</span> <span class="k">new</span> <span class="n">Rating</span><span class="o">(</span><span class="n">userId</span><span class="o">,</span> <span class="n">movieId</span><span class="o">,</span> <span class="n">rating</span><span class="o">,</span> <span class="n">timestamp</span><span class="o">);</span> <span class="o">}</span> <span class="o">}</span> @@ -513,7 +513,7 @@ for more details on the API.</p> <span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span 
class="n">test</span> <span class="o">=</span> <span class="n">splits</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span> <span class="c1">// Build the recommendation model using ALS on the training data</span> -<span class="n">ALS</span> <span class="n">als</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ALS</span><span class="o">()</span> +<span class="n">ALS</span> <span class="n">als</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ALS</span><span class="o">()</span> <span class="o">.</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">5</span><span class="o">)</span> <span class="o">.</span><span class="na">setRegParam</span><span class="o">(</span><span class="mf">0.01</span><span class="o">)</span> <span class="o">.</span><span class="na">setUserCol</span><span class="o">(</span><span class="s">"userId"</span><span class="o">)</span> @@ -524,7 +524,7 @@ for more details on the API.</p> <span class="c1">// Evaluate the model by computing the RMSE on the test data</span> <span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="na">transform</span><span class="o">(</span><span class="n">test</span><span class="o">);</span> -<span class="n">RegressionEvaluator</span> <span class="n">evaluator</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">RegressionEvaluator</span><span class="o">()</span> +<span class="n">RegressionEvaluator</span> <span class="n">evaluator</span> <span class="o">=</span> <span class="k">new</span> <span class="n">RegressionEvaluator</span><span class="o">()</span> <span class="o">.</span><span class="na">setMetricName</span><span class="o">(</span><span class="s">"rmse"</span><span class="o">)</span> <span class="o">.</span><span 
class="na">setLabelCol</span><span class="o">(</span><span class="s">"rating"</span><span class="o">)</span> <span class="o">.</span><span class="na">setPredictionCol</span><span class="o">(</span><span class="s">"prediction"</span><span class="o">);</span> @@ -537,13 +537,13 @@ for more details on the API.</p> inferred from other signals), you can set <code>implicitPrefs</code> to <code>true</code> to get better results:</p> - <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">ALS</span> <span class="n">als</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ALS</span><span class="o">()</span> + <figure class="highlight"><pre><code class="language-java" data-lang="java"><span></span><span class="n">ALS</span> <span class="n">als</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ALS</span><span class="o">()</span> <span class="o">.</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">5</span><span class="o">)</span> <span class="o">.</span><span class="na">setRegParam</span><span class="o">(</span><span class="mf">0.01</span><span class="o">)</span> <span class="o">.</span><span class="na">setImplicitPrefs</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">.</span><span class="na">setUserCol</span><span class="o">(</span><span class="s">"userId"</span><span class="o">)</span> <span class="o">.</span><span class="na">setItemCol</span><span class="o">(</span><span class="s">"movieId"</span><span class="o">)</span> - <span class="o">.</span><span class="na">setRatingCol</span><span class="o">(</span><span class="s">"rating"</span><span class="o">);</span></code></pre></div> + <span class="o">.</span><span class="na">setRatingCol</span><span class="o">(</span><span class="s">"rating"</span><span class="o">);</span></code></pre></figure> </div> @@ -560,27 +560,27 @@ rating prediction.</p> <p>Refer to the <a 
href="api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS"><code>ALS</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml.evaluation</span> <span class="kn">import</span> <span class="n">RegressionEvaluator</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml.evaluation</span> <span class="kn">import</span> <span class="n">RegressionEvaluator</span> <span class="kn">from</span> <span class="nn">pyspark.ml.recommendation</span> <span class="kn">import</span> <span class="n">ALS</span> <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span> -<span class="n">lines</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="s">"data/mllib/als/sample_movielens_ratings.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span> -<span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">"::"</span><span class="p">))</span> +<span class="n">lines</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="s2">"data/mllib/als/sample_movielens_ratings.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span> +<span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span 
class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"::"</span><span class="p">))</span> <span class="n">ratingsRDD</span> <span class="o">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">userId</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">movieId</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">rating</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">2</span><span class="p">]),</span> <span class="n">timestamp</span><span class="o">=</span><span class="nb">long</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">3</span><span class="p">])))</span> <span class="n">ratings</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">ratingsRDD</span><span class="p">)</span> <span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span> <span class="o">=</span> <span class="n">ratings</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span 
class="mf">0.8</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">])</span> -<span class="c"># Build the recommendation model using ALS on the training data</span> -<span class="n">als</span> <span class="o">=</span> <span class="n">ALS</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">regParam</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">userCol</span><span class="o">=</span><span class="s">"userId"</span><span class="p">,</span> <span class="n">itemCol</span><span class="o">=</span><span class="s">"movieId"</span><span class="p">,</span> <span class="n">ratingCol</span><span class="o">=</span><span class="s">"rating"</span><span class="p">)</span> +<span class="c1"># Build the recommendation model using ALS on the training data</span> +<span class="n">als</span> <span class="o">=</span> <span class="n">ALS</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">regParam</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">userCol</span><span class="o">=</span><span class="s2">"userId"</span><span class="p">,</span> <span class="n">itemCol</span><span class="o">=</span><span class="s2">"movieId"</span><span class="p">,</span> <span class="n">ratingCol</span><span class="o">=</span><span class="s2">"rating"</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">als</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span> -<span class="c"># Evaluate the model by computing the RMSE on the test data</span> +<span class="c1"># Evaluate the model by computing the RMSE on the test data</span> <span class="n">predictions</span> <span 
class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="p">)</span> -<span class="n">evaluator</span> <span class="o">=</span> <span class="n">RegressionEvaluator</span><span class="p">(</span><span class="n">metricName</span><span class="o">=</span><span class="s">"rmse"</span><span class="p">,</span> <span class="n">labelCol</span><span class="o">=</span><span class="s">"rating"</span><span class="p">,</span> - <span class="n">predictionCol</span><span class="o">=</span><span class="s">"prediction"</span><span class="p">)</span> +<span class="n">evaluator</span> <span class="o">=</span> <span class="n">RegressionEvaluator</span><span class="p">(</span><span class="n">metricName</span><span class="o">=</span><span class="s2">"rmse"</span><span class="p">,</span> <span class="n">labelCol</span><span class="o">=</span><span class="s2">"rating"</span><span class="p">,</span> + <span class="n">predictionCol</span><span class="o">=</span><span class="s2">"prediction"</span><span class="p">)</span> <span class="n">rmse</span> <span class="o">=</span> <span class="n">evaluator</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Root-mean-square error = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">rmse</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Root-mean-square error = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">rmse</span><span class="p">))</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/ml/als_example.py" in the Spark repo.</small></div> @@ -588,8 +588,8 @@ for more details on the API.</p> 
inferred from other signals), you can set <code>implicitPrefs</code> to <code>True</code> to get better results:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">als</span> <span class="o">=</span> <span class="n">ALS</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">regParam</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">implicitPrefs</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> - <span class="n">userCol</span><span class="o">=</span><span class="s">"userId"</span><span class="p">,</span> <span class="n">itemCol</span><span class="o">=</span><span class="s">"movieId"</span><span class="p">,</span> <span class="n">ratingCol</span><span class="o">=</span><span class="s">"rating"</span><span class="p">)</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="n">als</span> <span class="o">=</span> <span class="n">ALS</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">regParam</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">implicitPrefs</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> + <span class="n">userCol</span><span class="o">=</span><span class="s2">"userId"</span><span class="p">,</span> <span class="n">itemCol</span><span class="o">=</span><span class="s2">"movieId"</span><span class="p">,</span> <span class="n">ratingCol</span><span class="o">=</span><span class="s2">"rating"</span><span class="p">)</span></code></pre></figure> </div> @@ -597,7 +597,7 @@ better results:</p> <p>Refer to the <a href="api/R/spark.als.html">R API docs</a> for more details.</p> - <div 
class="highlight"><pre><span class="c1"># Load training data</span> + <div class="highlight"><pre><span></span><span class="c1"># Load training data</span> data <span class="o"><-</span> <span class="kt">list</span><span class="p">(</span><span class="kt">list</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">4.0</span><span class="p">),</span> <span class="kt">list</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2.0</span><span class="p">),</span> <span class="kt">list</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">3.0</span><span class="p">),</span> <span class="kt">list</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">4.0</span><span class="p">),</span> <span class="kt">list</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">1.0</span><span class="p">),</span> <span class="kt">list</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">5.0</span><span class="p">))</span> df <span class="o"><-</span> createDataFrame<span class="p">(</span>data<span class="p">,</span> <span class="kt">c</span><span class="p">(</span><span class="s">"userId"</span><span class="p">,</span> <span class="s">"movieId"</span><span class="p">,</span> <span class="s">"rating"</span><span class="p">))</span>