http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-feature-extraction.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-feature-extraction.html b/site/docs/2.1.0/mllib-feature-extraction.html index 4726b37..f8cd98e 100644 --- a/site/docs/2.1.0/mllib-feature-extraction.html +++ b/site/docs/2.1.0/mllib-feature-extraction.html @@ -307,32 +307,32 @@ <ul id="markdown-toc"> - <li><a href="#tf-idf" id="markdown-toc-tf-idf">TF-IDF</a></li> - <li><a href="#word2vec" id="markdown-toc-word2vec">Word2Vec</a> <ul> - <li><a href="#model" id="markdown-toc-model">Model</a></li> - <li><a href="#example" id="markdown-toc-example">Example</a></li> + <li><a href="#tf-idf">TF-IDF</a></li> + <li><a href="#word2vec">Word2Vec</a> <ul> + <li><a href="#model">Model</a></li> + <li><a href="#example">Example</a></li> </ul> </li> - <li><a href="#standardscaler" id="markdown-toc-standardscaler">StandardScaler</a> <ul> - <li><a href="#model-fitting" id="markdown-toc-model-fitting">Model Fitting</a></li> - <li><a href="#example-1" id="markdown-toc-example-1">Example</a></li> + <li><a href="#standardscaler">StandardScaler</a> <ul> + <li><a href="#model-fitting">Model Fitting</a></li> + <li><a href="#example-1">Example</a></li> </ul> </li> - <li><a href="#normalizer" id="markdown-toc-normalizer">Normalizer</a> <ul> - <li><a href="#example-2" id="markdown-toc-example-2">Example</a></li> + <li><a href="#normalizer">Normalizer</a> <ul> + <li><a href="#example-2">Example</a></li> </ul> </li> - <li><a href="#chisqselector" id="markdown-toc-chisqselector">ChiSqSelector</a> <ul> - <li><a href="#model-fitting-1" id="markdown-toc-model-fitting-1">Model Fitting</a></li> - <li><a href="#example-3" id="markdown-toc-example-3">Example</a></li> + <li><a href="#chisqselector">ChiSqSelector</a> <ul> + <li><a href="#model-fitting-1">Model Fitting</a></li> + <li><a href="#example-3">Example</a></li> </ul> </li> - 
<li><a href="#elementwiseproduct" id="markdown-toc-elementwiseproduct">ElementwiseProduct</a> <ul> - <li><a href="#example-4" id="markdown-toc-example-4">Example</a></li> + <li><a href="#elementwiseproduct">ElementwiseProduct</a> <ul> + <li><a href="#example-4">Example</a></li> </ul> </li> - <li><a href="#pca" id="markdown-toc-pca">PCA</a> <ul> - <li><a href="#example-5" id="markdown-toc-example-5">Example</a></li> + <li><a href="#pca">PCA</a> <ul> + <li><a href="#example-5">Example</a></li> </ul> </li> </ul> @@ -390,7 +390,7 @@ Each record could be an iterable of strings or other types.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.HashingTF"><code>HashingTF</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.</span><span class="o">{</span><span class="nc">HashingTF</span><span class="o">,</span> <span class="nc">IDF</span><span class="o">}</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.</span><span class="o">{</span><span class="nc">HashingTF</span><span class="o">,</span> <span class="nc">IDF</span><span class="o">}</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vector</span> <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> @@ -424,24 +424,24 @@ Each record could be an iterable of strings or other types.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF"><code>HashingTF</code> Python docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">HashingTF</span><span class="p">,</span> <span class="n">IDF</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span 
class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">HashingTF</span><span class="p">,</span> <span class="n">IDF</span> -<span class="c"># Load documents (one per line).</span> -<span class="n">documents</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"data/mllib/kmeans_data.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">))</span> +<span class="c1"># Load documents (one per line).</span> +<span class="n">documents</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"data/mllib/kmeans_data.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))</span> <span class="n">hashingTF</span> <span class="o">=</span> <span class="n">HashingTF</span><span class="p">()</span> <span class="n">tf</span> <span class="o">=</span> <span class="n">hashingTF</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span> -<span class="c"># While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:</span> -<span class="c"># First to compute the IDF vector and second to scale the term frequencies by IDF.</span> +<span class="c1"># While applying HashingTF only needs a single 
pass to the data, applying IDF needs two passes:</span> +<span class="c1"># First to compute the IDF vector and second to scale the term frequencies by IDF.</span> <span class="n">tf</span><span class="o">.</span><span class="n">cache</span><span class="p">()</span> <span class="n">idf</span> <span class="o">=</span> <span class="n">IDF</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">tf</span><span class="p">)</span> <span class="n">tfidf</span> <span class="o">=</span> <span class="n">idf</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">tf</span><span class="p">)</span> -<span class="c"># spark.mllib's IDF implementation provides an option for ignoring terms</span> -<span class="c"># which occur in less than a minimum number of documents.</span> -<span class="c"># In such cases, the IDF for these terms is set to 0.</span> -<span class="c"># This feature can be used by passing the minDocFreq value to the IDF constructor.</span> +<span class="c1"># spark.mllib's IDF implementation provides an option for ignoring terms</span> +<span class="c1"># which occur in less than a minimum number of documents.</span> +<span class="c1"># In such cases, the IDF for these terms is set to 0.</span> +<span class="c1"># This feature can be used by passing the minDocFreq value to the IDF constructor.</span> <span class="n">idfIgnore</span> <span class="o">=</span> <span class="n">IDF</span><span class="p">(</span><span class="n">minDocFreq</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">tf</span><span class="p">)</span> <span class="n">tfidfIgnore</span> <span class="o">=</span> <span class="n">idfIgnore</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">tf</span><span class="p">)</span> 
</pre></div> @@ -467,7 +467,7 @@ skip-gram model is to maximize the average log-likelihood <code>\[ \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t) \]</code> -where $k$ is the size of the training window.</p> +where $k$ is the size of the training window. </p> <p>In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are vector representations of $w$ as word and context respectively. The probability of correctly @@ -475,7 +475,7 @@ predicting word $w_i$ given word $w_j$ is determined by the softmax model, which <code>\[ p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})} \]</code> -where $V$ is the vocabulary size.</p> +where $V$ is the vocabulary size. </p> <p>The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can be easily in order of millions. To speed up training of Word2Vec, @@ -488,13 +488,13 @@ $O(\log(V))$</p> construct a <code>Word2Vec</code> instance and then fit a <code>Word2VecModel</code> with the input data. Finally, we display the top 40 synonyms of the specified word. To run the example, first download the <a href="http://mattmahoney.net/dc/text8.zip">text8</a> data and extract it to your preferred directory. -Here we assume the extracted file is <code>text8</code> and in same directory as you run the spark shell.</p> +Here we assume the extracted file is <code>text8</code> and in same directory as you run the spark shell. 
</p> <div class="codetabs"> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.Word2Vec"><code>Word2Vec</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.</span><span class="o">{</span><span class="nc">Word2Vec</span><span class="o">,</span> <span class="nc">Word2VecModel</span><span class="o">}</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.</span><span class="o">{</span><span class="nc">Word2Vec</span><span class="o">,</span> <span class="nc">Word2VecModel</span><span class="o">}</span> <span class="k">val</span> <span class="n">input</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"data/mllib/sample_lda_data.txt"</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">toSeq</span><span class="o">)</span> @@ -505,7 +505,7 @@ Here we assume the extracted file is <code>text8</code> and in same directory as <span class="k">val</span> <span class="n">synonyms</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">findSynonyms</span><span class="o">(</span><span class="s">"1"</span><span class="o">,</span> <span class="mi">5</span><span class="o">)</span> <span class="k">for</span><span class="o">((</span><span class="n">synonym</span><span class="o">,</span> <span class="n">cosineSimilarity</span><span class="o">)</span> <span class="k"><-</span> <span class="n">synonyms</span><span class="o">)</span> <span class="o">{</span> - <span 
class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"$synonym $cosineSimilarity"</span><span class="o">)</span> + <span class="n">println</span><span class="o">(</span><span class="s">s"</span><span class="si">$synonym</span><span class="s"> </span><span class="si">$cosineSimilarity</span><span class="s">"</span><span class="o">)</span> <span class="o">}</span> <span class="c1">// Save and load model</span> @@ -517,17 +517,17 @@ Here we assume the extracted file is <code>text8</code> and in same directory as <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.feature.Word2Vec"><code>Word2Vec</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">Word2Vec</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">Word2Vec</span> -<span class="n">inp</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"data/mllib/sample_lda_data.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">))</span> +<span class="n">inp</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"data/mllib/sample_lda_data.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span 
class="p">:</span> <span class="n">row</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))</span> <span class="n">word2vec</span> <span class="o">=</span> <span class="n">Word2Vec</span><span class="p">()</span> <span class="n">model</span> <span class="o">=</span> <span class="n">word2vec</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span> -<span class="n">synonyms</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">findSynonyms</span><span class="p">(</span><span class="s">'1'</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> +<span class="n">synonyms</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">findSynonyms</span><span class="p">(</span><span class="s1">'1'</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">cosine_distance</span> <span class="ow">in</span> <span class="n">synonyms</span><span class="p">:</span> - <span class="k">print</span><span class="p">(</span><span class="s">"{}: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">cosine_distance</span><span class="p">))</span> + <span class="k">print</span><span class="p">(</span><span class="s2">"{}: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">cosine_distance</span><span class="p">))</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/word2vec_example.py" in the Spark repo.</small></div> </div> @@ -576,7 +576,7 @@ so that the new features have unit 
standard deviation and/or zero mean.</p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler"><code>StandardScaler</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.</span><span class="o">{</span><span class="nc">StandardScaler</span><span class="o">,</span> <span class="nc">StandardScalerModel</span><span class="o">}</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.</span><span class="o">{</span><span class="nc">StandardScaler</span><span class="o">,</span> <span class="nc">StandardScalerModel</span><span class="o">}</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -599,21 +599,21 @@ so that the new features have unit standard deviation and/or zero mean.</p> <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.feature.StandardScaler"><code>StandardScaler</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">StandardScaler</span><span class="p">,</span> <span class="n">StandardScalerModel</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">StandardScaler</span><span class="p">,</span> <span class="n">StandardScalerModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Vectors</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span 
class="n">MLUtils</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> <span class="n">label</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">label</span><span class="p">)</span> <span class="n">features</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">)</span> <span class="n">scaler1</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">features</span><span class="p">)</span> <span class="n">scaler2</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">(</span><span class="n">withMean</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">withStd</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">features</span><span 
class="p">)</span> -<span class="c"># data1 will be unit variance.</span> +<span class="c1"># data1 will be unit variance.</span> <span class="n">data1</span> <span class="o">=</span> <span class="n">label</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">scaler1</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">features</span><span class="p">))</span> -<span class="c"># data2 will be unit variance and zero mean.</span> +<span class="c1"># data2 will be unit variance and zero mean.</span> <span class="n">data2</span> <span class="o">=</span> <span class="n">label</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">scaler2</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">features</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">toArray</span><span class="p">()))))</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/standard_scaler_example.py" in the Spark repo.</small></div> @@ -648,7 +648,7 @@ with $L^2$ norm, and $L^\infty$ norm.</p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.Normalizer"><code>Normalizer</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.Normalizer</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.Normalizer</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> 
<span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="nc">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">)</span> @@ -668,20 +668,20 @@ with $L^2$ norm, and $L^\infty$ norm.</p> <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.feature.Normalizer"><code>Normalizer</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">Normalizer</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">Normalizer</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"data/mllib/sample_libsvm_data.txt"</span><span class="p">)</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">label</span><span 
class="p">)</span> <span class="n">features</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="p">)</span> <span class="n">normalizer1</span> <span class="o">=</span> <span class="n">Normalizer</span><span class="p">()</span> -<span class="n">normalizer2</span> <span class="o">=</span> <span class="n">Normalizer</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">))</span> +<span class="n">normalizer2</span> <span class="o">=</span> <span class="n">Normalizer</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s2">"inf"</span><span class="p">))</span> -<span class="c"># Each sample in data1 will be normalized using $L^2$ norm.</span> +<span class="c1"># Each sample in data1 will be normalized using $L^2$ norm.</span> <span class="n">data1</span> <span class="o">=</span> <span class="n">labels</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">normalizer1</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">features</span><span class="p">))</span> -<span class="c"># Each sample in data2 will be normalized using $L^\infty$ norm.</span> +<span class="c1"># Each sample in data2 will be normalized using $L^\infty$ norm.</span> <span class="n">data2</span> <span class="o">=</span> <span class="n">labels</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">normalizer2</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span 
class="n">features</span><span class="p">))</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/normalizer_example.py" in the Spark repo.</small></div> @@ -730,7 +730,7 @@ an <code>RDD[Vector]</code> to produce a reduced <code>RDD[Vector]</code>.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector"><code>ChiSqSelector</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.ChiSqSelector</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.ChiSqSelector</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> @@ -759,7 +759,7 @@ for details on the API.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/feature/ChiSqSelector.html"><code>ChiSqSelector</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.Function</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.feature.ChiSqSelector</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.feature.ChiSqSelectorModel</span><span class="o">;</span> @@ -780,13 +780,13 @@ for details on the API.</p> <span class="k">for</span> <span class="o">(</span><span 
class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">lp</span><span class="o">.</span><span class="na">features</span><span class="o">().</span><span class="na">size</span><span class="o">();</span> <span class="o">++</span><span class="n">i</span><span class="o">)</span> <span class="o">{</span> <span class="n">discretizedFeatures</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">=</span> <span class="n">Math</span><span class="o">.</span><span class="na">floor</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">features</span><span class="o">().</span><span class="na">apply</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">/</span> <span class="mi">16</span><span class="o">);</span> <span class="o">}</span> - <span class="k">return</span> <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">label</span><span class="o">(),</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="n">discretizedFeatures</span><span class="o">));</span> + <span class="k">return</span> <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">label</span><span class="o">(),</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="n">discretizedFeatures</span><span class="o">));</span> <span class="o">}</span> <span class="o">}</span> <span class="o">);</span> <span class="c1">// Create ChiSqSelector that will select top 50 of 692 features</span> -<span class="n">ChiSqSelector</span> <span 
class="n">selector</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ChiSqSelector</span><span class="o">(</span><span class="mi">50</span><span class="o">);</span> +<span class="n">ChiSqSelector</span> <span class="n">selector</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ChiSqSelector</span><span class="o">(</span><span class="mi">50</span><span class="o">);</span> <span class="c1">// Create ChiSqSelector model (selecting features)</span> <span class="kd">final</span> <span class="n">ChiSqSelectorModel</span> <span class="n">transformer</span> <span class="o">=</span> <span class="n">selector</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">discretizedData</span><span class="o">.</span><span class="na">rdd</span><span class="o">());</span> <span class="c1">// Filter the top 50 features from each feature vector</span> @@ -794,7 +794,7 @@ for details on the API.</p> <span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">,</span> <span class="n">LabeledPoint</span><span class="o">>()</span> <span class="o">{</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="n">LabeledPoint</span> <span class="nf">call</span><span class="o">(</span><span class="n">LabeledPoint</span> <span class="n">lp</span><span class="o">)</span> <span class="o">{</span> - <span class="k">return</span> <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">label</span><span class="o">(),</span> <span class="n">transformer</span><span class="o">.</span><span class="na">transform</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">features</span><span class="o">()));</span> + <span class="k">return</span> <span class="k">new</span> <span 
class="n">LabeledPoint</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">label</span><span class="o">(),</span> <span class="n">transformer</span><span class="o">.</span><span class="na">transform</span><span class="o">(</span><span class="n">lp</span><span class="o">.</span><span class="na">features</span><span class="o">()));</span> <span class="o">}</span> <span class="o">}</span> <span class="o">);</span> @@ -845,7 +845,7 @@ v_N <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct"><code>ElementwiseProduct</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.ElementwiseProduct</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.ElementwiseProduct</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="c1">// Create some vector data; also works for sparse vectors</span> @@ -864,7 +864,7 @@ v_N <div data-lang="java"> <p>Refer to the <a href="api/java/org/apache/spark/mllib/feature/ElementwiseProduct.html"><code>ElementwiseProduct</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.Function</span><span class="o">;</span> @@ -876,7 +876,7 @@ v_N <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Vector</span><span class="o">></span> <span class="n">data</span> <span 
class="o">=</span> <span class="n">jsc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span><span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">),</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">4.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">,</span> <span class="mf">6.0</span><span class="o">)));</span> <span class="n">Vector</span> <span class="n">transformingVector</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">0.0</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">);</span> -<span class="kd">final</span> <span class="n">ElementwiseProduct</span> <span class="n">transformer</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ElementwiseProduct</span><span class="o">(</span><span class="n">transformingVector</span><span class="o">);</span> +<span class="kd">final</span> <span class="n">ElementwiseProduct</span> <span class="n">transformer</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ElementwiseProduct</span><span class="o">(</span><span class="n">transformingVector</span><span class="o">);</span> <span class="c1">// Batch transform and per-row transform give the same results:</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Vector</span><span class="o">></span> <span class="n">transformedData</span> <span class="o">=</span> <span 
class="n">transformer</span><span class="o">.</span><span class="na">transform</span><span class="o">(</span><span class="n">data</span><span class="o">);</span> @@ -895,19 +895,19 @@ v_N <div data-lang="python"> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.feature.ElementwiseProduct"><code>ElementwiseProduct</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">ElementwiseProduct</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.feature</span> <span class="kn">import</span> <span class="n">ElementwiseProduct</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Vectors</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"data/mllib/kmeans_data.txt"</span><span class="p">)</span> -<span class="n">parsedData</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">[</span><span class="nb">float</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)])</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"data/mllib/kmeans_data.txt"</span><span class="p">)</span> +<span class="n">parsedData</span> 
<span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">[</span><span class="nb">float</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">)])</span> -<span class="c"># Create weight vector.</span> +<span class="c1"># Create weight vector.</span> <span class="n">transformingVector</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">([</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">])</span> <span class="n">transformer</span> <span class="o">=</span> <span class="n">ElementwiseProduct</span><span class="p">(</span><span class="n">transformingVector</span><span class="p">)</span> -<span class="c"># Batch transform</span> +<span class="c1"># Batch transform</span> <span class="n">transformedData</span> <span class="o">=</span> <span class="n">transformer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">parsedData</span><span class="p">)</span> -<span class="c"># Single-row transform</span> +<span class="c1"># Single-row transform</span> <span class="n">transformedData2</span> <span class="o">=</span> <span class="n">transformer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">parsedData</span><span class="o">.</span><span class="n">first</span><span class="p">())</span> </pre></div> <div><small>Find full example code at 
"examples/src/main/python/mllib/elementwise_product_example.py" in the Spark repo.</small></div> @@ -929,7 +929,7 @@ for calculation a <a href="mllib-linear-methods.html">Linear Regression</a></p> <div data-lang="scala"> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.feature.PCA"><code>PCA</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.PCA</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.PCA</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.</span><span class="o">{</span><span class="nc">LabeledPoint</span><span class="o">,</span> <span class="nc">LinearRegressionWithSGD</span><span class="o">}</span>
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-frequent-pattern-mining.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-frequent-pattern-mining.html b/site/docs/2.1.0/mllib-frequent-pattern-mining.html index 47ed977..a9b76b5 100644 --- a/site/docs/2.1.0/mllib-frequent-pattern-mining.html +++ b/site/docs/2.1.0/mllib-frequent-pattern-mining.html @@ -389,7 +389,7 @@ details) from <code>transactions</code>.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth"><code>FPGrowth</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.fpm.FPGrowth</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.fpm.FPGrowth</span> <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"data/mllib/sample_fpgrowth.txt"</span><span class="o">)</span> @@ -432,7 +432,7 @@ details) from <code>transactions</code>.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/fpm/FPGrowth.html"><code>FPGrowth</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.List</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> @@ -453,12 +453,12 @@ details) from 
<code>transactions</code>.</p> <span class="o">}</span> <span class="o">);</span> -<span class="n">FPGrowth</span> <span class="n">fpg</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">FPGrowth</span><span class="o">()</span> +<span class="n">FPGrowth</span> <span class="n">fpg</span> <span class="o">=</span> <span class="k">new</span> <span class="n">FPGrowth</span><span class="o">()</span> <span class="o">.</span><span class="na">setMinSupport</span><span class="o">(</span><span class="mf">0.2</span><span class="o">)</span> <span class="o">.</span><span class="na">setNumPartitions</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span> <span class="n">FPGrowthModel</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">model</span> <span class="o">=</span> <span class="n">fpg</span><span class="o">.</span><span class="na">run</span><span class="o">(</span><span class="n">transactions</span><span class="o">);</span> -<span class="k">for</span> <span class="o">(</span><span class="n">FPGrowth</span><span class="o">.</span><span class="na">FreqItemset</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="nl">itemset:</span> <span class="n">model</span><span class="o">.</span><span class="na">freqItemsets</span><span class="o">().</span><span class="na">toJavaRDD</span><span class="o">().</span><span class="na">collect</span><span class="o">())</span> <span class="o">{</span> +<span class="k">for</span> <span class="o">(</span><span class="n">FPGrowth</span><span class="o">.</span><span class="na">FreqItemset</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">itemset</span><span class="o">:</span> <span class="n">model</span><span class="o">.</span><span class="na">freqItemsets</span><span class="o">().</span><span class="na">toJavaRDD</span><span class="o">().</span><span 
class="na">collect</span><span class="o">())</span> <span class="o">{</span> <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"["</span> <span class="o">+</span> <span class="n">itemset</span><span class="o">.</span><span class="na">javaItems</span><span class="o">()</span> <span class="o">+</span> <span class="s">"], "</span> <span class="o">+</span> <span class="n">itemset</span><span class="o">.</span><span class="na">freq</span><span class="o">());</span> <span class="o">}</span> @@ -484,10 +484,10 @@ that stores the frequent itemsets with their frequencies.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowth"><code>FPGrowth</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.fpm</span> <span class="kn">import</span> <span class="n">FPGrowth</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.fpm</span> <span class="kn">import</span> <span class="n">FPGrowth</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"data/mllib/sample_fpgrowth.txt"</span><span class="p">)</span> -<span class="n">transactions</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span 
class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"data/mllib/sample_fpgrowth.txt"</span><span class="p">)</span> +<span class="n">transactions</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span> <span class="n">model</span> <span class="o">=</span> <span class="n">FPGrowth</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">transactions</span><span class="p">,</span> <span class="n">minSupport</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">numPartitions</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">freqItemsets</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span> <span class="k">for</span> <span class="n">fi</span> <span class="ow">in</span> <span class="n">result</span><span class="p">:</span> @@ -509,7 +509,7 @@ that have a single item as the consequent.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/fpm/AssociationRules.html"><code>AssociationRules</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.fpm.AssociationRules</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.fpm.AssociationRules</span> <span class="k">import</span> <span 
class="nn">org.apache.spark.mllib.fpm.FPGrowth.FreqItemset</span> <span class="k">val</span> <span class="n">freqItemsets</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span><span class="nc">Seq</span><span class="o">(</span> @@ -539,7 +539,7 @@ that have a single item as the consequent.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/fpm/AssociationRules.html"><code>AssociationRules</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaSparkContext</span><span class="o">;</span> @@ -553,7 +553,7 @@ that have a single item as the consequent.</p> <span class="k">new</span> <span class="n">FreqItemset</span><span class="o"><</span><span class="n">String</span><span class="o">>(</span><span class="k">new</span> <span class="n">String</span><span class="o">[]</span> <span class="o">{</span><span class="s">"a"</span><span class="o">,</span> <span class="s">"b"</span><span class="o">},</span> <span class="mi">12L</span><span class="o">)</span> <span class="o">));</span> -<span class="n">AssociationRules</span> <span class="n">arules</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">AssociationRules</span><span class="o">()</span> +<span class="n">AssociationRules</span> <span class="n">arules</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AssociationRules</span><span class="o">()</span> <span class="o">.</span><span class="na">setMinConfidence</span><span 
class="o">(</span><span class="mf">0.8</span><span class="o">);</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">AssociationRules</span><span class="o">.</span><span class="na">Rule</span><span class="o"><</span><span class="n">String</span><span class="o">>></span> <span class="n">results</span> <span class="o">=</span> <span class="n">arules</span><span class="o">.</span><span class="na">run</span><span class="o">(</span><span class="n">freqItemsets</span><span class="o">);</span> @@ -611,7 +611,7 @@ that stores the frequent sequences with their frequencies.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan"><code>PrefixSpan</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel"><code>PrefixSpanModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.fpm.PrefixSpan</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.fpm.PrefixSpan</span> <span class="k">val</span> <span class="n">sequences</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span><span class="nc">Seq</span><span class="o">(</span> <span class="nc">Array</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">),</span> <span class="nc">Array</span><span class="o">(</span><span class="mi">3</span><span class="o">)),</span> @@ -643,7 +643,7 @@ that stores the frequent sequences with their frequencies.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/fpm/PrefixSpan.html"><code>PrefixSpan</code> Java docs</a> and <a href="api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html"><code>PrefixSpanModel</code> Java docs</a> 
for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">java.util.List</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.fpm.PrefixSpan</span><span class="o">;</span> @@ -655,11 +655,11 @@ that stores the frequent sequences with their frequencies.</p> <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span><span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">),</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span><span class="mi">5</span><span class="o">)),</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span><span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span><span class="mi">6</span><span class="o">))</span> <span class="o">),</span> <span class="mi">2</span><span class="o">);</span> -<span class="n">PrefixSpan</span> <span class="n">prefixSpan</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">PrefixSpan</span><span class="o">()</span> +<span class="n">PrefixSpan</span> <span class="n">prefixSpan</span> <span class="o">=</span> <span class="k">new</span> <span class="n">PrefixSpan</span><span class="o">()</span> <span class="o">.</span><span class="na">setMinSupport</span><span class="o">(</span><span class="mf">0.5</span><span class="o">)</span> <span class="o">.</span><span class="na">setMaxPatternLength</span><span class="o">(</span><span 
class="mi">5</span><span class="o">);</span> <span class="n">PrefixSpanModel</span><span class="o"><</span><span class="n">Integer</span><span class="o">></span> <span class="n">model</span> <span class="o">=</span> <span class="n">prefixSpan</span><span class="o">.</span><span class="na">run</span><span class="o">(</span><span class="n">sequences</span><span class="o">);</span> -<span class="k">for</span> <span class="o">(</span><span class="n">PrefixSpan</span><span class="o">.</span><span class="na">FreqSequence</span><span class="o"><</span><span class="n">Integer</span><span class="o">></span> <span class="nl">freqSeq:</span> <span class="n">model</span><span class="o">.</span><span class="na">freqSequences</span><span class="o">().</span><span class="na">toJavaRDD</span><span class="o">().</span><span class="na">collect</span><span class="o">())</span> <span class="o">{</span> +<span class="k">for</span> <span class="o">(</span><span class="n">PrefixSpan</span><span class="o">.</span><span class="na">FreqSequence</span><span class="o"><</span><span class="n">Integer</span><span class="o">></span> <span class="n">freqSeq</span><span class="o">:</span> <span class="n">model</span><span class="o">.</span><span class="na">freqSequences</span><span class="o">().</span><span class="na">toJavaRDD</span><span class="o">().</span><span class="na">collect</span><span class="o">())</span> <span class="o">{</span> <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">freqSeq</span><span class="o">.</span><span class="na">javaSequence</span><span class="o">()</span> <span class="o">+</span> <span class="s">", "</span> <span class="o">+</span> <span class="n">freqSeq</span><span class="o">.</span><span class="na">freq</span><span class="o">());</span> <span class="o">}</span> </pre></div> 
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-isotonic-regression.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-isotonic-regression.html b/site/docs/2.1.0/mllib-isotonic-regression.html index aa7edb3..78bbaba 100644 --- a/site/docs/2.1.0/mllib-isotonic-regression.html +++ b/site/docs/2.1.0/mllib-isotonic-regression.html @@ -365,7 +365,7 @@ labels and real labels in the test set.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegression"><code>IsotonicRegression</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegressionModel"><code>IsotonicRegressionModel</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.</span><span class="o">{</span><span class="nc">IsotonicRegression</span><span class="o">,</span> <span class="nc">IsotonicRegressionModel</span><span class="o">}</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.</span><span class="o">{</span><span class="nc">IsotonicRegression</span><span class="o">,</span> <span class="nc">IsotonicRegressionModel</span><span class="o">}</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="nc">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> @@ -409,7 +409,7 @@ labels and real labels in the test set.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/regression/IsotonicRegression.html"><code>IsotonicRegression</code> Java docs</a> and <a 
href="api/java/org/apache/spark/mllib/regression/IsotonicRegressionModel.html"><code>IsotonicRegressionModel</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple3</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.Function</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.PairFunction</span><span class="o">;</span> @@ -429,8 +429,8 @@ labels and real labels in the test set.</p> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Tuple3</span><span class="o"><</span><span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">>></span> <span class="n">parsedData</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="na">map</span><span class="o">(</span> <span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">,</span> <span class="n">Tuple3</span><span class="o"><</span><span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">>>()</span> <span class="o">{</span> <span class="kd">public</span> <span class="n">Tuple3</span><span class="o"><</span><span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">></span> <span class="nf">call</span><span class="o">(</span><span class="n">LabeledPoint</span> <span 
class="n">point</span><span class="o">)</span> <span class="o">{</span> - <span class="k">return</span> <span class="k">new</span> <span class="n">Tuple3</span><span class="o"><>(</span><span class="k">new</span> <span class="nf">Double</span><span class="o">(</span><span class="n">point</span><span class="o">.</span><span class="na">label</span><span class="o">()),</span> - <span class="k">new</span> <span class="nf">Double</span><span class="o">(</span><span class="n">point</span><span class="o">.</span><span class="na">features</span><span class="o">().</span><span class="na">apply</span><span class="o">(</span><span class="mi">0</span><span class="o">)),</span> <span class="mf">1.0</span><span class="o">);</span> + <span class="k">return</span> <span class="k">new</span> <span class="n">Tuple3</span><span class="o"><>(</span><span class="k">new</span> <span class="n">Double</span><span class="o">(</span><span class="n">point</span><span class="o">.</span><span class="na">label</span><span class="o">()),</span> + <span class="k">new</span> <span class="n">Double</span><span class="o">(</span><span class="n">point</span><span class="o">.</span><span class="na">features</span><span class="o">().</span><span class="na">apply</span><span class="o">(</span><span class="mi">0</span><span class="o">)),</span> <span class="mf">1.0</span><span class="o">);</span> <span class="o">}</span> <span class="o">}</span> <span class="o">);</span> @@ -444,7 +444,7 @@ labels and real labels in the test set.</p> <span class="c1">// Create isotonic regression model from training data.</span> <span class="c1">// Isotonic parameter defaults to true so it is only shown for demonstration</span> <span class="kd">final</span> <span class="n">IsotonicRegressionModel</span> <span class="n">model</span> <span class="o">=</span> - <span class="k">new</span> <span class="nf">IsotonicRegression</span><span class="o">().</span><span class="na">setIsotonic</span><span class="o">(</span><span 
class="kc">true</span><span class="o">).</span><span class="na">run</span><span class="o">(</span><span class="n">training</span><span class="o">);</span> + <span class="k">new</span> <span class="n">IsotonicRegression</span><span class="o">().</span><span class="na">setIsotonic</span><span class="o">(</span><span class="kc">true</span><span class="o">).</span><span class="na">run</span><span class="o">(</span><span class="n">training</span><span class="o">);</span> <span class="c1">// Create tuples of predicted and real labels.</span> <span class="n">JavaPairRDD</span><span class="o"><</span><span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">></span> <span class="n">predictionAndLabel</span> <span class="o">=</span> <span class="n">test</span><span class="o">.</span><span class="na">mapToPair</span><span class="o">(</span> @@ -458,7 +458,7 @@ labels and real labels in the test set.</p> <span class="o">);</span> <span class="c1">// Calculate mean squared error between predicted and real labels.</span> -<span class="n">Double</span> <span class="n">meanSquaredError</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaDoubleRDD</span><span class="o">(</span><span class="n">predictionAndLabel</span><span class="o">.</span><span class="na">map</span><span class="o">(</span> +<span class="n">Double</span> <span class="n">meanSquaredError</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaDoubleRDD</span><span class="o">(</span><span class="n">predictionAndLabel</span><span class="o">.</span><span class="na">map</span><span class="o">(</span> <span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">Tuple2</span><span class="o"><</span><span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">>,</span> <span class="n">Object</span><span class="o">>()</span> <span class="o">{</span> <span 
class="nd">@Override</span> <span class="kd">public</span> <span class="n">Object</span> <span class="nf">call</span><span class="o">(</span><span class="n">Tuple2</span><span class="o"><</span><span class="n">Double</span><span class="o">,</span> <span class="n">Double</span><span class="o">></span> <span class="n">pl</span><span class="o">)</span> <span class="o">{</span> @@ -483,36 +483,36 @@ labels and real labels in the test set.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegression"><code>IsotonicRegression</code> Python docs</a> and <a href="api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegressionModel"><code>IsotonicRegressionModel</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">math</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">math</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.regression</span> <span class="kn">import</span> <span class="n">LabeledPoint</span><span class="p">,</span> <span class="n">IsotonicRegression</span><span class="p">,</span> <span class="n">IsotonicRegressionModel</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span> -<span class="c"># Load and parse the data</span> +<span class="c1"># Load and parse the data</span> <span class="k">def</span> <span class="nf">parsePoint</span><span class="p">(</span><span class="n">labeledData</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">labeledData</span><span class="o">.</span><span class="n">label</span><span class="p">,</span> <span class="n">labeledData</span><span class="o">.</span><span class="n">features</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mf">1.0</span><span 
class="p">)</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"data/mllib/sample_isotonic_regression_libsvm_data.txt"</span><span class="p">)</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"data/mllib/sample_isotonic_regression_libsvm_data.txt"</span><span class="p">)</span> -<span class="c"># Create label, feature, weight tuples from input data with weight set to default value 1.0.</span> +<span class="c1"># Create label, feature, weight tuples from input data with weight set to default value 1.0.</span> <span class="n">parsedData</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">parsePoint</span><span class="p">)</span> -<span class="c"># Split data into training (60%) and test (40%) sets.</span> +<span class="c1"># Split data into training (60%) and test (40%) sets.</span> <span class="n">training</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">parsedData</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.6</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">],</span> <span class="mi">11</span><span class="p">)</span> -<span class="c"># Create isotonic regression model from training data.</span> -<span class="c"># Isotonic parameter defaults to true so it is only shown for demonstration</span> +<span class="c1"># Create isotonic regression model from training data.</span> +<span class="c1"># Isotonic parameter defaults to true so it is only shown for 
demonstration</span> <span class="n">model</span> <span class="o">=</span> <span class="n">IsotonicRegression</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">training</span><span class="p">)</span> -<span class="c"># Create tuples of predicted and real labels.</span> +<span class="c1"># Create tuples of predicted and real labels.</span> <span class="n">predictionAndLabel</span> <span class="o">=</span> <span class="n">test</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> -<span class="c"># Calculate mean squared error between predicted and real labels.</span> +<span class="c1"># Calculate mean squared error between predicted and real labels.</span> <span class="n">meanSquaredError</span> <span class="o">=</span> <span class="n">predictionAndLabel</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">pl</span><span class="p">:</span> <span class="n">math</span><span class="o">.</span><span class="n">pow</span><span class="p">((</span><span class="n">pl</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">pl</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="mi">2</span><span class="p">))</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Mean Squared Error = "</span> <span class="o">+</span> <span 
class="nb">str</span><span class="p">(</span><span class="n">meanSquaredError</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Mean Squared Error = "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">meanSquaredError</span><span class="p">))</span> -<span class="c"># Save and load model</span> -<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myIsotonicRegressionModel"</span><span class="p">)</span> -<span class="n">sameModel</span> <span class="o">=</span> <span class="n">IsotonicRegressionModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">"target/tmp/myIsotonicRegressionModel"</span><span class="p">)</span> +<span class="c1"># Save and load model</span> +<span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myIsotonicRegressionModel"</span><span class="p">)</span> +<span class="n">sameModel</span> <span class="o">=</span> <span class="n">IsotonicRegressionModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s2">"target/tmp/myIsotonicRegressionModel"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/isotonic_regression_example.py" in the Spark repo.</small></div> </div>