http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/quick-start.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/quick-start.html b/site/docs/2.1.0/quick-start.html index 76e67e1..9d5fad7 100644 --- a/site/docs/2.1.0/quick-start.html +++ b/site/docs/2.1.0/quick-start.html @@ -129,14 +129,14 @@ <ul id="markdown-toc"> - <li><a href="#interactive-analysis-with-the-spark-shell" id="markdown-toc-interactive-analysis-with-the-spark-shell">Interactive Analysis with the Spark Shell</a> <ul> - <li><a href="#basics" id="markdown-toc-basics">Basics</a></li> - <li><a href="#more-on-rdd-operations" id="markdown-toc-more-on-rdd-operations">More on RDD Operations</a></li> - <li><a href="#caching" id="markdown-toc-caching">Caching</a></li> + <li><a href="#interactive-analysis-with-the-spark-shell">Interactive Analysis with the Spark Shell</a> <ul> + <li><a href="#basics">Basics</a></li> + <li><a href="#more-on-rdd-operations">More on RDD Operations</a></li> + <li><a href="#caching">Caching</a></li> </ul> </li> - <li><a href="#self-contained-applications" id="markdown-toc-self-contained-applications">Self-Contained Applications</a></li> - <li><a href="#where-to-go-from-here" id="markdown-toc-where-to-go-from-here">Where to Go from Here</a></li> + <li><a href="#self-contained-applications">Self-Contained Applications</a></li> + <li><a href="#where-to-go-from-here">Where to Go from Here</a></li> </ul> <p>This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s @@ -164,26 +164,26 @@ or Python. Start it by running the following in the Spark directory:</p> <p>Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. 
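Besides loading files through sc.textFile, an RDD can also be built from a local collection with sc.parallelize. A minimal sketch, assuming a spark-shell session where the SparkContext sc is already defined as in the tutorial:

    scala> val nums = sc.parallelize(1 to 100)  // distribute a local Range as an RDD
    scala> nums.sum()                           // action: returns 5050.0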
Let’s make a new RDD from the text of the README file in the Spark source directory:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="k">val</span> <span class="n">textFile</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"README.md"</span><span class="o">)</span> -<span class="n">textFile</span><span class="k">:</span> <span class="kt">org.apache.spark.rdd.RDD</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">README</span><span class="o">.</span><span class="n">md</span> <span class="nc">MapPartitionsRDD</span><span class="o">[</span><span class="err">1</span><span class="o">]</span> <span class="n">at</span> <span class="n">textFile</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">25</span></code></pre></div> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="k">val</span> <span class="n">textFile</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"README.md"</span><span class="o">)</span> +<span class="n">textFile</span><span class="k">:</span> <span class="kt">org.apache.spark.rdd.RDD</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">README</span><span class="o">.</span><span class="n">md</span> <span class="nc">MapPartitionsRDD</span><span class="o">[</span><span class="err">1</span><span class="o">]</span> <span class="n">at</span> <span class="n">textFile</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">25</span></code></pre></figure> <p>RDDs have <em><a href="programming-guide.html#actions">actions</a></em>, which return values, and <em><a href="programming-guide.html#transformations">transformations</a></em>, which return pointers to new RDDs. 
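Transformations such as textFile and filter are lazy: they only record lineage, and no data is read or computed until an action runs. A rough illustration, again assuming the spark-shell session above:

    scala> val upper = textFile.map(line => line.toUpperCase)  // nothing executes yet
    scala> upper.count()                                       // the action triggers the actual job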
Let’s start with a few actions:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">count</span><span class="o">()</span> <span class="c1">// Number of items in this RDD</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">count</span><span class="o">()</span> <span class="c1">// Number of items in this RDD</span> <span class="n">res0</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">126</span> <span class="c1">// May be different from yours as README.md will change over time, similar to other outputs</span> <span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">first</span><span class="o">()</span> <span class="c1">// First item in this RDD</span> -<span class="n">res1</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="k">#</span> <span class="nc">Apache</span> <span class="nc">Spark</span></code></pre></div> +<span class="n">res1</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="k">#</span> <span class="nc">Apache</span> <span class="nc">Spark</span></code></pre></figure> <p>Now let’s use a transformation. We will use the <a href="programming-guide.html#transformations"><code>filter</code></a> transformation to return a new RDD with a subset of the items in the file.</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="k">val</span> <span class="n">linesWithSpark</span> <span class="k">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">contains</span><span class="o">(</span><span class="s">"Spark"</span><span class="o">))</span> -<span class="n">linesWithSpark</span><span class="k">:</span> <span class="kt">org.apache.spark.rdd.RDD</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">MapPartitionsRDD</span><span class="o">[</span><span class="err">2</span><span class="o">]</span> <span class="n">at</span> <span class="n">filter</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">27</span></code></pre></div> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="k">val</span> <span class="n">linesWithSpark</span> <span class="k">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">contains</span><span class="o">(</span><span class="s">"Spark"</span><span class="o">))</span> +<span class="n">linesWithSpark</span><span class="k">:</span> <span class="kt">org.apache.spark.rdd.RDD</span><span class="o">[</span><span class="kt">String</span><span 
class="o">]</span> <span class="k">=</span> <span class="nc">MapPartitionsRDD</span><span class="o">[</span><span class="err">2</span><span class="o">]</span> <span class="n">at</span> <span class="n">filter</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">27</span></code></pre></figure> <p>We can chain together transformations and actions:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">contains</span><span class="o">(</span><span class="s">"Spark"</span><span class="o">)).</span><span class="n">count</span><span class="o">()</span> <span class="c1">// How many lines contain "Spark"?</span> -<span class="n">res3</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span></code></pre></div> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">contains</span><span class="o">(</span><span class="s">"Spark"</span><span class="o">)).</span><span class="n">count</span><span class="o">()</span> <span class="c1">// How many lines contain "Spark"?</span> +<span class="n">res3</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span></code></pre></figure> </div> <div data-lang="python"> @@ -193,24 +193,24 @@ or Python. Start it by running the following in the Spark directory:</p> <p>Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">textFile</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"README.md"</span><span class="p">)</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">textFile</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"README.md"</span><span class="p">)</span></code></pre></figure> <p>RDDs have <em><a href="programming-guide.html#actions">actions</a></em>, which return values, and <em><a href="programming-guide.html#transformations">transformations</a></em>, which return pointers to new RDDs. 
Let’s start with a few actions:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="c"># Number of items in this RDD</span> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="c1"># Number of items in this RDD</span> <span class="mi">126</span> -<span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">first</span><span class="p">()</span> <span class="c"># First item in this RDD</span> -<span class="s">u'# Apache Spark'</span></code></pre></div> +<span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">first</span><span class="p">()</span> <span class="c1"># First item in this RDD</span> +<span class="sa">u</span><span class="s1">'# Apache Spark'</span></code></pre></figure> <p>Now let’s use a transformation. We will use the <a href="programming-guide.html#transformations"><code>filter</code></a> transformation to return a new RDD with a subset of the items in the file.</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">linesWithSpark</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="s">"Spark"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">)</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">linesWithSpark</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="s2">"Spark"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">)</span></code></pre></figure> <p>We can chain together transformations and actions:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="s">"Spark"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="c"># How many lines contain "Spark"?</span> -<span class="mi">15</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="s2">"Spark"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="c1"># How many lines contain "Spark"?</span> +<span 
class="mi">15</span></code></pre></figure> </div> </div> @@ -221,38 +221,38 @@ or Python. Start it by running the following in the Spark directory:</p> <div class="codetabs"> <div data-lang="scala"> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">size</span><span class="o">).</span><span class="n">reduce</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span class="k">if</span> <span class="o">(</span><span class="n">a</span> <span class="o">></span> <span class="n">b</span><span class="o">)</span> <span class="n">a</span> <span class="k">else</span> <span class="n">b</span><span class="o">)</span> -<span class="n">res4</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span></code></pre></div> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">size</span><span class="o">).</span><span class="n">reduce</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span class="k">if</span> <span class="o">(</span><span class="n">a</span> <span class="o">></span> <span class="n">b</span><span class="o">)</span> <span class="n">a</span> <span class="k">else</span> <span class="n">b</span><span class="o">)</span> +<span class="n">res4</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span></code></pre></figure> <p>This first maps a line to an integer value, creating a new RDD. <code>reduce</code> is called on that RDD to find the largest line count. The arguments to <code>map</code> and <code>reduce</code> are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. 
We’ll use <code>Math.max()</code> function to make this code easier to understand:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="k">import</span> <span class="nn">java.lang.Math</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="k">import</span> <span class="nn">java.lang.Math</span> <span class="k">import</span> <span class="nn">java.lang.Math</span> <span class="n">scala</span><span class="o">></span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">size</span><span class="o">).</span><span class="n">reduce</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span class="nc">Math</span><span class="o">.</span><span class="n">max</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">))</span> -<span class="n">res5</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">15</span></code></pre></div> +<span class="n">res5</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">15</span></code></pre></figure> <p>One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="k">val</span> <span class="n">wordCounts</span> <span class="k">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">flatMap</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">)).</span><span class="n">map</span><span class="o">(</span><span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">)).</span><span class="n">reduceByKey</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span> -<span class="n">wordCounts</span><span class="k">:</span> <span class="kt">org.apache.spark.rdd.RDD</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)]</span> <span class="k">=</span> <span class="nc">ShuffledRDD</span><span class="o">[</span><span class="err">8</span><span class="o">]</span> <span class="n">at</span> <span class="n">reduceByKey</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">28</span></code></pre></div> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> 
<span class="k">val</span> <span class="n">wordCounts</span> <span class="k">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">flatMap</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">)).</span><span class="n">map</span><span class="o">(</span><span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">)).</span><span class="n">reduceByKey</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span cla ss="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span> +<span class="n">wordCounts</span><span class="k">:</span> <span class="kt">org.apache.spark.rdd.RDD</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)]</span> <span class="k">=</span> <span class="nc">ShuffledRDD</span><span class="o">[</span><span class="err">8</span><span class="o">]</span> <span class="n">at</span> <span class="n">reduceByKey</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">28</span></code></pre></figure> <p>Here, we combined the <a href="programming-guide.html#transformations"><code>flatMap</code></a>, <a href="programming-guide.html#transformations"><code>map</code></a>, and <a href="programming-guide.html#transformations"><code>reduceByKey</code></a> transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. 
To collect the word counts in our shell, we can use the <a href="programming-guide.html#actions"><code>collect</code></a> action:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="n">wordCounts</span><span class="o">.</span><span class="n">collect</span><span class="o">()</span> -<span class="n">res6</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)]</span> <span class="k">=</span> <span class="nc">Array</span><span class="o">((</span><span class="n">means</span><span class="o">,</span><span class="mi">1</span><span class="o">),</span> <span class="o">(</span><span class="n">under</span><span class="o">,</span><span class="mi">2</span><span class="o">),</span> <span class="o">(</span><span class="k">this</span><span class="o">,</span><span class="mi">3</span><span class="o">),</span> <span class="o">(</span><span class="nc">Because</span><span class="o">,</span><span class="mi">1</span><span class="o">),</span> <span class="o">(</span><span class="nc">Python</span><span class="o">,</span><span class="mi">2</span><span class="o">),</span> <span class="o">(</span><span class="n">agree</span><span class="o">,</span><span class="mi">1</span><span class="o">),</span> <span class="o">(</span><span class="n">cluster</span><span class="o">.,</span><span class="mi">1</span><span class="o">),</span> <span class="o">...)</span></code></pre></div> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="n">wordCounts</span><span class="o">.</span><span class="n">collect</span><span class="o">()</span> +<span class="n">res6</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)]</span> <span class="k">=</span> <span class="nc">Array</span><span class="o">((</span><span class="n">means</span><span class="o">,</span><span class="mi">1</span><span class="o">),</span> <span class="o">(</span><span class="n">under</span><span class="o">,</span><span class="mi">2</span><span class="o">),</span> <span class="o">(</span><span class="k">this</span><span class="o">,</span><span class="mi">3</span><span class="o">),</span> <span class="o">(</span><span class="nc">Because</span><span class="o">,</span><span class="mi">1</span><span class="o">),</span> <span class="o">(</span><span class="nc">Python</span><span class="o">,</span><span class="mi">2</span><span class="o">),</span> <span class="o">(</span><span class="n">agree</span><span class="o">,</span><span class="mi">1</span><span class="o">),</span> <span class="o">(</span><span class="n">cluster</span><span class="o">.,</span><span class="mi">1</span><span class="o">),</span> <span class="o">...)</span></code></pre></figure> </div> <div data-lang="python"> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">()))</span><span class="o">.</span><span
class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="k">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">></span> <span class="n">b</span><span class="p">)</span> <span class="k">else</span> <span class="n">b</span><span class="p">)</span> -<span class="mi">15</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">()))</span><span class="o">.</span><span class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="k">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">></span> <span class="n">b</span><span class="p">)</span> <span class="k">else</span> <span class="n">b</span><span class="p">)</span> +<span class="mi">15</span></code></pre></figure> <p>This first maps a line to an integer value, creating a new RDD. <code>reduce</code> is called on that RDD to find the largest line count. The arguments to <code>map</code> and <code>reduce</code> are Python <a href="https://docs.python.org/2/reference/expressions.html#lambda">anonymous functions (lambdas)</a>, but we can also pass any top-level Python function we want. 
For example, we’ll define a <code>max</code> function to make this code easier to understand:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="k">def</span> <span class="nf">max</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="k">def</span> <span class="nf">max</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span> <span class="o">...</span> <span class="k">if</span> <span class="n">a</span> <span class="o">></span> <span class="n">b</span><span class="p">:</span> <span class="o">...</span> <span class="k">return</span> <span class="n">a</span> <span class="o">...</span> <span class="k">else</span><span class="p">:</span> @@ -260,16 +260,16 @@ For example, we’ll define a <code>max</code> function to make this code ea <span class="o">...</span> <span class="o">>>></span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">()))</span><span class="o">.</span><span class="n">reduce</span><span class="p">(</span><span class="nb">max</span><span class="p">)</span> -<span class="mi">15</span></code></pre></div> +<span class="mi">15</span></code></pre></figure> <p>One common data flow pattern is MapReduce, as popularized by Hadoop. 
Spark can implement MapReduce flows easily:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">wordCounts</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">flatMap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">())</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">word</span><span class="p">:</span> <span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span><span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span><span class="p">)</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">wordCounts</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">flatMap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">())</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">word</span><span class="p">:</span> <span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span><span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span><span class="p">)</span></code></pre></figure> <p>Here, we combined the <a href="programming-guide.html#transformations"><code>flatMap</code></a>, <a href="programming-guide.html#transformations"><code>map</code></a>, and <a href="programming-guide.html#transformations"><code>reduceByKey</code></a> transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. 
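In either language, reduceByKey is generally preferred over groupByKey for aggregations because it combines values within each partition before shuffling. A Scala sketch of the contrast, where words is a hypothetical RDD[String] standing in for the flatMap output:

    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)             // combines map-side, less data shuffled
    val slower = words.map(word => (word, 1)).groupByKey().mapValues(_.sum)  // ships every pair to the shuffle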
To collect the word counts in our shell, we can use the <a href="programming-guide.html#actions"><code>collect</code></a> action:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">wordCounts</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span> -<span class="p">[(</span><span class="s">u'and'</span><span class="p">,</span> <span class="mi">9</span><span class="p">),</span> <span class="p">(</span><span class="s">u'A'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">u'webpage'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">u'README'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">u'Note'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">u'"local"'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">u'variable'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="o">...</span><span class="p">]</span></code></pre></div> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">wordCounts</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span> +<span class="p">[(</span><span class="sa">u</span><span class="s1">'and'</span><span class="p">,</span> <span class="mi">9</span><span class="p">),</span> <span class="p">(</span><span class="sa">u</span><span class="s1">'A'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="sa">u</span><span class="s1">'webpage'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="sa">u</span><span class="s1">'README'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="sa">u</span><span class="s1">'Note'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="sa">u</span><span class="s1">'"local"'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="sa">u</span><span class="s1">'variable'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="o">...</span><span class="p">]</span></code></pre></figure> </div> </div> @@ -280,14 +280,14 @@ For example, we’ll define a <code>max</code> function to make this code ea <div class="codetabs"> <div data-lang="scala"> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">cache</span><span class="o">()</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">scala</span><span class="o">></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">cache</span><span class="o">()</span> <span class="n">res7</span><span class="k">:</span> <span class="kt">linesWithSpark.</span><span class="k">type</span> <span
class="o">=</span> <span class="nc">MapPartitionsRDD</span><span class="o">[</span><span class="err">2</span><span class="o">]</span> <span class="n">at</span> <span class="n">filter</span> <span class="n">at</span> <span class="o"><</span><span class="n">console</span><span class="k">>:</span><span class="mi">27</span> <span class="n">scala</span><span class="o">></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">count</span><span class="o">()</span> <span class="n">res8</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span> <span class="n">scala</span><span class="o">></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">count</span><span class="o">()</span> -<span class="n">res9</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span></code></pre></div> +<span class="n">res9</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">15</span></code></pre></figure> <p>It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across @@ -297,13 +297,13 @@ a cluster, as described in the <a href="programming-guide.html#initializing-spar </div> <div data-lang="python"> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">cache</span><span class="p">()</span> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="o">>>></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">cache</span><span class="p">()</span> <span class="o">>>></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="mi">15</span> <span class="o">>>></span> <span class="n">linesWithSpark</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> -<span class="mi">15</span></code></pre></div> +<span class="mi">15</span></code></pre></figure> <p>It may seem silly to use Spark to explore and cache a 100-line text file. 
The interesting part is that these same functions can be used on very large data sets, even when they are striped across @@ -323,7 +323,7 @@ simple application in Scala (with sbt), Java (with Maven), and Python.</p> <p>We’ll create a very simple Spark application in Scala–so simple, in fact, that it’s named <code>SimpleApp.scala</code>:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="cm">/* SimpleApp.scala */</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="cm">/* SimpleApp.scala */</span> <span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span> <span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span> <span class="k">import</span> <span class="nn">org.apache.spark.SparkConf</span> @@ -336,10 +336,10 @@ named <code>SimpleApp.scala</code>:</p> <span class="k">val</span> <span class="n">logData</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="n">logFile</span><span class="o">,</span> <span class="mi">2</span><span class="o">).</span><span class="n">cache</span><span class="o">()</span> <span class="k">val</span> <span class="n">numAs</span> <span class="k">=</span> <span class="n">logData</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">contains</span><span class="o">(</span><span class="s">"a"</span><span class="o">)).</span><span class="n">count</span><span class="o">()</span> <span class="k">val</span> <span class="n">numBs</span> <span class="k">=</span> <span class="n">logData</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">contains</span><span class="o">(</span><span class="s">"b"</span><span class="o">)).</span><span class="n">count</span><span class="o">()</span> - <span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Lines with a: $numAs, Lines with b: $numBs"</span><span class="o">)</span> + <span class="n">println</span><span class="o">(</span><span class="s">s"Lines with a: </span><span class="si">$numAs</span><span class="s">, Lines with b: </span><span class="si">$numBs</span><span class="s">"</span><span class="o">)</span> <span class="n">sc</span><span class="o">.</span><span class="n">stop</span><span class="o">()</span> <span class="o">}</span> -<span class="o">}</span></code></pre></div> +<span class="o">}</span></code></pre></figure> <p>Note that applications should define a <code>main()</code> method instead of extending <code>scala.App</code>. Subclasses of <code>scala.App</code> may not work correctly.</p> @@ -352,26 +352,26 @@ we initialize a SparkContext as part of the program.</p> <p>We pass the SparkContext constructor a <a href="api/scala/index.html#org.apache.spark.SparkConf">SparkConf</a> object which contains information about our -application.</p> +application. </p> <p>Our application depends on the Spark API, so we’ll also include an sbt configuration file, <code>simple.sbt</code>, which explains that Spark is a dependency. 
This file also adds a repository that Spark depends on:</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">name</span> <span class="o">:=</span> <span class="s">"Simple Project"</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">name</span> <span class="o">:=</span> <span class="s">"Simple Project"</span> <span class="n">version</span> <span class="o">:=</span> <span class="s">"1.0"</span> <span class="n">scalaVersion</span> <span class="o">:=</span> <span class="s">"2.11.7"</span> -<span class="n">libraryDependencies</span> <span class="o">+=</span> <span class="s">"org.apache.spark"</span> <span class="o">%%</span> <span class="s">"spark-core"</span> <span class="o">%</span> <span class="s">"2.1.0"</span></code></pre></div> +<span class="n">libraryDependencies</span> <span class="o">+=</span> <span class="s">"org.apache.spark"</span> <span class="o">%%</span> <span class="s">"spark-core"</span> <span class="o">%</span> <span class="s">"2.1.0"</span></code></pre></figure> <p>For sbt to work correctly, we’ll need to layout <code>SimpleApp.scala</code> and <code>simple.sbt</code> according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the <code>spark-submit</code> script to run our program.</p> - <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Your directory layout should look like this</span> -<span class="nv">$ </span>find . + <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="c1"># Your directory layout should look like this</span> +$ find . . ./simple.sbt ./src @@ -379,18 +379,18 @@ containing the application’s code, then use the <code>spark-submit</code> ./src/main/scala ./src/main/scala/SimpleApp.scala -<span class="c"># Package a jar containing your application</span> -<span class="nv">$ </span>sbt package +<span class="c1"># Package a jar containing your application</span> +$ sbt package ... <span class="o">[</span>info<span class="o">]</span> Packaging <span class="o">{</span>..<span class="o">}</span>/<span class="o">{</span>..<span class="o">}</span>/target/scala-2.11/simple-project_2.11-1.0.jar -<span class="c"># Use spark-submit to run your application</span> -<span class="nv">$ </span>YOUR_SPARK_HOME/bin/spark-submit <span class="se">\</span> +<span class="c1"># Use spark-submit to run your application</span> +$ YOUR_SPARK_HOME/bin/spark-submit <span class="se">\</span> --class <span class="s2">"SimpleApp"</span> <span class="se">\</span> - --master <span class="nb">local</span><span class="o">[</span>4<span class="o">]</span> <span class="se">\</span> + --master local<span class="o">[</span><span class="m">4</span><span class="o">]</span> <span class="se">\</span> target/scala-2.11/simple-project_2.11-1.0.jar ... 
-Lines with a: 46, Lines with b: 23</code></pre></div> +Lines with a: <span class="m">46</span>, Lines with b: <span class="m">23</span></code></pre></figure> </div> <div data-lang="java"> @@ -398,7 +398,7 @@ Lines with a: 46, Lines with b: 23</code></pre></div> <p>We’ll create a very simple Spark application, <code>SimpleApp.java</code>:</p> - <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="cm">/* SimpleApp.java */</span> + <figure class="highlight"><pre><code class="language-java" data-lang="java"><span></span><span class="cm">/* SimpleApp.java */</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.*</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.SparkConf</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.Function</span><span class="o">;</span> @@ -406,8 +406,8 @@ Lines with a: 46, Lines with b: 23</code></pre></div> <span class="kd">public</span> <span class="kd">class</span> <span class="nc">SimpleApp</span> <span class="o">{</span> <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span> <span class="n">String</span> <span class="n">logFile</span> <span class="o">=</span> <span class="s">"YOUR_SPARK_HOME/README.md"</span><span class="o">;</span> <span class="c1">// Should be some file on your system</span> - <span class="n">SparkConf</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"Simple Application"</span><span class="o">);</span> - <span class="n">JavaSparkContext</span> <span class="n">sc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">JavaSparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">);</span> + <span class="n">SparkConf</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SparkConf</span><span class="o">().</span><span class="na">setAppName</span><span class="o">(</span><span class="s">"Simple Application"</span><span class="o">);</span> + <span class="n">JavaSparkContext</span> <span class="n">sc</span> <span class="o">=</span> <span class="k">new</span> <span class="n">JavaSparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">);</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">logData</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">textFile</span><span class="o">(</span><span class="n">logFile</span><span class="o">).</span><span class="na">cache</span><span class="o">();</span> <span class="kt">long</span> <span class="n">numAs</span> <span class="o">=</span> <span class="n">logData</span><span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Boolean</span><span class="o">>()</span> <span class="o">{</span> @@ -422,7 +422,7 @@ Lines with a: 46, Lines with b: 
23</code></pre></div> <span class="n">sc</span><span class="o">.</span><span class="na">stop</span><span class="o">();</span> <span class="o">}</span> -<span class="o">}</span></code></pre></div> +<span class="o">}</span></code></pre></figure> <p>This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. @@ -435,7 +435,7 @@ that extend <code>spark.api.java.function.Function</code>. The <p>To build the program, we also write a Maven <code>pom.xml</code> file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.</p> - <div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><project></span> + <figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span></span><span class="nt"><project></span> <span class="nt"><groupId></span>edu.berkeley<span class="nt"></groupId></span> <span class="nt"><artifactId></span>simple-project<span class="nt"></artifactId></span> <span class="nt"><modelVersion></span>4.0.0<span class="nt"></modelVersion></span> @@ -449,31 +449,31 @@ Note that Spark artifacts are tagged with a Scala version.</p> <span class="nt"><version></span>2.1.0<span class="nt"></version></span> <span class="nt"></dependency></span> <span class="nt"></dependencies></span> -<span class="nt"></project></span></code></pre></div> +<span class="nt"></project></span></code></pre></figure> <p>We lay out these files according to the canonical Maven directory structure:</p> - <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>find . + <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>$ find . ./pom.xml ./src ./src/main ./src/main/java -./src/main/java/SimpleApp.java</code></pre></div> +./src/main/java/SimpleApp.java</code></pre></figure> <p>Now, we can package the application using Maven and execute it with <code>./bin/spark-submit</code>.</p> - <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Package a JAR containing your application</span> -<span class="nv">$ </span>mvn package + <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="c1"># Package a JAR containing your application</span> +$ mvn package ... <span class="o">[</span>INFO<span class="o">]</span> Building jar: <span class="o">{</span>..<span class="o">}</span>/<span class="o">{</span>..<span class="o">}</span>/target/simple-project-1.0.jar -<span class="c"># Use spark-submit to run your application</span> -<span class="nv">$ </span>YOUR_SPARK_HOME/bin/spark-submit <span class="se">\</span> +<span class="c1"># Use spark-submit to run your application</span> +$ YOUR_SPARK_HOME/bin/spark-submit <span class="se">\</span> --class <span class="s2">"SimpleApp"</span> <span class="se">\</span> - --master <span class="nb">local</span><span class="o">[</span>4<span class="o">]</span> <span class="se">\</span> + --master local<span class="o">[</span><span class="m">4</span><span class="o">]</span> <span class="se">\</span> target/simple-project-1.0.jar ... 
-Lines with a: 46, Lines with b: 23</code></pre></div> +Lines with a: <span class="m">46</span>, Lines with b: <span class="m">23</span></code></pre></figure> </div> <div data-lang="python"> @@ -482,19 +482,19 @@ Lines with a: 46, Lines with b: 23</code></pre></div> <p>As an example, we’ll create a simple Spark application, <code>SimpleApp.py</code>:</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="sd">"""SimpleApp.py"""</span> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="sd">"""SimpleApp.py"""</span> <span class="kn">from</span> <span class="nn">pyspark</span> <span class="kn">import</span> <span class="n">SparkContext</span> -<span class="n">logFile</span> <span class="o">=</span> <span class="s">"YOUR_SPARK_HOME/README.md"</span> <span class="c"># Should be some file on your system</span> -<span class="n">sc</span> <span class="o">=</span> <span class="n">SparkContext</span><span class="p">(</span><span class="s">"local"</span><span class="p">,</span> <span class="s">"Simple App"</span><span class="p">)</span> +<span class="n">logFile</span> <span class="o">=</span> <span class="s2">"YOUR_SPARK_HOME/README.md"</span> <span class="c1"># Should be some file on your system</span> +<span class="n">sc</span> <span class="o">=</span> <span class="n">SparkContext</span><span class="p">(</span><span class="s2">"local"</span><span class="p">,</span> <span class="s2">"Simple App"</span><span class="p">)</span> <span class="n">logData</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="n">logFile</span><span class="p">)</span><span class="o">.</span><span class="n">cache</span><span class="p">()</span> -<span class="n">numAs</span> <span class="o">=</span> <span class="n">logData</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="s">'a'</span> <span class="ow">in</span> <span class="n">s</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> -<span class="n">numBs</span> <span class="o">=</span> <span class="n">logData</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="s">'b'</span> <span class="ow">in</span> <span class="n">s</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> +<span class="n">numAs</span> <span class="o">=</span> <span class="n">logData</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="s1">'a'</span> <span class="ow">in</span> <span class="n">s</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> +<span class="n">numBs</span> <span class="o">=</span> <span class="n">logData</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="s1">'b'</span> <span class="ow">in</span> <span class="n">s</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> -<span class="k">print</span><span 
class="p">(</span><span class="s">"Lines with a: </span><span class="si">%i</span><span class="s">, lines with b: </span><span class="si">%i</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">numAs</span><span class="p">,</span> <span class="n">numBs</span><span class="p">))</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Lines with a: </span><span class="si">%i</span><span class="s2">, lines with b: </span><span class="si">%i</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">numAs</span><span class="p">,</span> <span class="n">numBs</span><span class="p">))</span> -<span class="n">sc</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span></code></pre></div> +<span class="n">sc</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span></code></pre></figure> <p>This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. @@ -509,12 +509,12 @@ dependencies to <code>spark-submit</code> through its <code>--py-files</code> ar <p>We can run this application using the <code>bin/spark-submit</code> script:</p> - <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Use spark-submit to run your application</span> -<span class="nv">$ </span>YOUR_SPARK_HOME/bin/spark-submit <span class="se">\</span> - --master <span class="nb">local</span><span class="o">[</span>4<span class="o">]</span> <span class="se">\</span> + <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="c1"># Use spark-submit to run your application</span> +$ YOUR_SPARK_HOME/bin/spark-submit <span class="se">\</span> + --master local<span class="o">[</span><span class="m">4</span><span class="o">]</span> <span class="se">\</span> SimpleApp.py ... -Lines with a: 46, Lines with b: 23</code></pre></div> +Lines with a: <span class="m">46</span>, Lines with b: <span class="m">23</span></code></pre></figure> </div> </div> @@ -534,14 +534,14 @@ or see “Programming Guides” menu for other components.</li> You can run them as follows:</li> </ul> -<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># For Scala and Java, use run-example:</span> +<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="c1"># For Scala and Java, use run-example:</span> ./bin/run-example SparkPi -<span class="c"># For Python examples, use spark-submit directly:</span> +<span class="c1"># For Python examples, use spark-submit directly:</span> ./bin/spark-submit examples/src/main/python/pi.py -<span class="c"># For R examples, use spark-submit directly:</span> -./bin/spark-submit examples/src/main/r/dataframe.R</code></pre></div> +<span class="c1"># For R examples, use spark-submit directly:</span> +./bin/spark-submit examples/src/main/r/dataframe.R</code></pre></figure>
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/running-on-mesos.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/running-on-mesos.html b/site/docs/2.1.0/running-on-mesos.html index 198f53c..aec6fe8 100644 --- a/site/docs/2.1.0/running-on-mesos.html +++ b/site/docs/2.1.0/running-on-mesos.html @@ -127,33 +127,33 @@ <ul id="markdown-toc"> - <li><a href="#how-it-works" id="markdown-toc-how-it-works">How it Works</a></li> - <li><a href="#installing-mesos" id="markdown-toc-installing-mesos">Installing Mesos</a> <ul> - <li><a href="#from-source" id="markdown-toc-from-source">From Source</a></li> - <li><a href="#third-party-packages" id="markdown-toc-third-party-packages">Third-Party Packages</a></li> - <li><a href="#verification" id="markdown-toc-verification">Verification</a></li> + <li><a href="#how-it-works">How it Works</a></li> + <li><a href="#installing-mesos">Installing Mesos</a> <ul> + <li><a href="#from-source">From Source</a></li> + <li><a href="#third-party-packages">Third-Party Packages</a></li> + <li><a href="#verification">Verification</a></li> </ul> </li> - <li><a href="#connecting-spark-to-mesos" id="markdown-toc-connecting-spark-to-mesos">Connecting Spark to Mesos</a> <ul> - <li><a href="#uploading-spark-package" id="markdown-toc-uploading-spark-package">Uploading Spark Package</a></li> - <li><a href="#using-a-mesos-master-url" id="markdown-toc-using-a-mesos-master-url">Using a Mesos Master URL</a></li> - <li><a href="#client-mode" id="markdown-toc-client-mode">Client Mode</a></li> - <li><a href="#cluster-mode" id="markdown-toc-cluster-mode">Cluster mode</a></li> + <li><a href="#connecting-spark-to-mesos">Connecting Spark to Mesos</a> <ul> + <li><a href="#uploading-spark-package">Uploading Spark Package</a></li> + <li><a href="#using-a-mesos-master-url">Using a Mesos Master URL</a></li> + <li><a href="#client-mode">Client Mode</a></li> + <li><a href="#cluster-mode">Cluster mode</a></li> </ul> </li> - <li><a href="#mesos-run-modes" id="markdown-toc-mesos-run-modes">Mesos Run Modes</a> <ul> - <li><a href="#coarse-grained" id="markdown-toc-coarse-grained">Coarse-Grained</a></li> - <li><a href="#fine-grained-deprecated" id="markdown-toc-fine-grained-deprecated">Fine-Grained (deprecated)</a></li> + <li><a href="#mesos-run-modes">Mesos Run Modes</a> <ul> + <li><a href="#coarse-grained">Coarse-Grained</a></li> + <li><a href="#fine-grained-deprecated">Fine-Grained (deprecated)</a></li> </ul> </li> - <li><a href="#mesos-docker-support" id="markdown-toc-mesos-docker-support">Mesos Docker Support</a></li> - <li><a href="#running-alongside-hadoop" id="markdown-toc-running-alongside-hadoop">Running Alongside Hadoop</a></li> - <li><a href="#dynamic-resource-allocation-with-mesos" id="markdown-toc-dynamic-resource-allocation-with-mesos">Dynamic Resource Allocation with Mesos</a></li> - <li><a href="#configuration" id="markdown-toc-configuration">Configuration</a> <ul> - <li><a href="#spark-properties" id="markdown-toc-spark-properties">Spark Properties</a></li> + <li><a href="#mesos-docker-support">Mesos Docker Support</a></li> + <li><a href="#running-alongside-hadoop">Running Alongside Hadoop</a></li> + <li><a href="#dynamic-resource-allocation-with-mesos">Dynamic Resource Allocation with Mesos</a></li> + <li><a href="#configuration">Configuration</a> <ul> + <li><a href="#spark-properties">Spark Properties</a></li> </ul> </li> - <li><a href="#troubleshooting-and-debugging" 
id="markdown-toc-troubleshooting-and-debugging">Troubleshooting and Debugging</a></li> + <li><a href="#troubleshooting-and-debugging">Troubleshooting and Debugging</a></li> </ul> <p>Spark can run on hardware clusters managed by <a href="http://mesos.apache.org/">Apache Mesos</a>.</p> @@ -289,11 +289,11 @@ instructions above. On Mac OS X, the library is called <code>libmesos.dylib</cod <p>Now when starting a Spark application against the cluster, pass a <code>mesos://</code> URL as the master when creating a <code>SparkContext</code>. For example:</p> -<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">()</span> +<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="k">val</span> <span class="n">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">()</span> <span class="o">.</span><span class="n">setMaster</span><span class="o">(</span><span class="s">"mesos://HOST:5050"</span><span class="o">)</span> <span class="o">.</span><span class="n">setAppName</span><span class="o">(</span><span class="s">"My app"</span><span class="o">)</span> <span class="o">.</span><span class="n">set</span><span class="o">(</span><span class="s">"spark.executor.uri"</span><span class="o">,</span> <span class="s">"<path to spark-2.1.0.tar.gz uploaded above>"</span><span class="o">)</span> -<span class="k">val</span> <span class="n">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span></code></pre></div> +<span class="k">val</span> <span class="n">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span></code></pre></figure> <p>(You can also use <a href="submitting-applications.html"><code>spark-submit</code></a> and configure <code>spark.executor.uri</code> in the <a href="configuration.html#loading-default-configurations">conf/spark-defaults.conf</a> file.)</p> @@ -301,7 +301,7 @@ in the <a href="configuration.html#loading-default-configurations">conf/spark-de <p>When running a shell, the <code>spark.executor.uri</code> parameter is inherited from <code>SPARK_EXECUTOR_URI</code>, so it does not need to be redundantly passed in as a system property.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">./bin/spark-shell --master mesos://host:5050</code></pre></div> +<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>./bin/spark-shell --master mesos://host:5050</code></pre></figure> <h2 id="cluster-mode">Cluster mode</h2> @@ -322,7 +322,7 @@ Spark cluster Web UI.</p> <p>For example:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">./bin/spark-submit <span class="se">\</span> +<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span>./bin/spark-submit <span class="se">\</span> --class org.apache.spark.examples.SparkPi <span class="se">\</span> --master mesos://207.184.161.138:7077 <span class="se">\</span> --deploy-mode cluster <span class="se">\</span> @@ -330,7 +330,7 @@ Spark cluster Web UI.</p> --executor-memory 20G <span class="se">\</span> --total-executor-cores <span 
class="m">100</span> <span class="se">\</span> http://path/to/examples.jar <span class="se">\</span> - 1000</code></pre></div> + <span class="m">1000</span></code></pre></figure> <p>Note that jars or python files that are passed to spark-submit should be URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically upload local jars.</p> @@ -404,13 +404,13 @@ terminate when they’re idle.</p> <p>To run in fine-grained mode, set the <code>spark.mesos.coarse</code> property to false in your <a href="configuration.html#spark-properties">SparkConf</a>:</p> -<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">conf</span><span class="o">.</span><span class="n">set</span><span class="o">(</span><span class="s">"spark.mesos.coarse"</span><span class="o">,</span> <span class="s">"false"</span><span class="o">)</span></code></pre></div> +<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">conf</span><span class="o">.</span><span class="n">set</span><span class="o">(</span><span class="s">"spark.mesos.coarse"</span><span class="o">,</span> <span class="s">"false"</span><span class="o">)</span></code></pre></figure> <p>You may also make use of <code>spark.mesos.constraints</code> to set attribute-based constraints on Mesos resource offers. By default, all resource offers will be accepted.</p> -<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">conf</span><span class="o">.</span><span class="n">set</span><span class="o">(</span><span class="s">"spark.mesos.constraints"</span><span class="o">,</span> <span class="s">"os:centos7;us-east-1:false"</span><span class="o">)</span></code></pre></div> +<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="n">conf</span><span class="o">.</span><span class="n">set</span><span class="o">(</span><span class="s">"spark.mesos.constraints"</span><span class="o">,</span> <span class="s">"os:centos7;us-east-1:false"</span><span class="o">)</span></code></pre></figure> <p>For example, Let’s say <code>spark.mesos.constraints</code> is set to <code>os:centos7;us-east-1:false</code>, then the resource offers will be checked to see if they meet both these constraints and only then will be accepted to start new executors.</p> http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/running-on-yarn.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/running-on-yarn.html b/site/docs/2.1.0/running-on-yarn.html index 1235965..5a3a8f5 100644 --- a/site/docs/2.1.0/running-on-yarn.html +++ b/site/docs/2.1.0/running-on-yarn.html @@ -629,9 +629,8 @@ includes a URI of the metadata store in <code>"hive.metastore.uris</code>, and the tokens needed to access these clusters must be explicitly requested at launch time. This is done by listing them in the <code>spark.yarn.access.namenodes</code> property.</p> -<p><code> -spark.yarn.access.namenodes hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/ -</code></p> +<pre><code>spark.yarn.access.namenodes hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/ +</code></pre> <p>Spark supports integrating with other security-aware services through Java Services mechanism (see <code>java.util.ServiceLoader</code>). 
To do that, implementations of <code>org.apache.spark.deploy.yarn.security.ServiceCredentialProvider</code> @@ -656,7 +655,7 @@ pre-packaged distribution.</li> then set <code>yarn.nodemanager.aux-services.spark_shuffle.class</code> to <code>org.apache.spark.network.yarn.YarnShuffleService</code>.</li> <li>Increase <code>NodeManager's</code> heap size by setting <code>YARN_HEAPSIZE</code> (1000 by default) in <code>etc/hadoop/yarn-env.sh</code> -to avoid garbage collection issues during shuffle.</li> +to avoid garbage collection issues during shuffle. </li> <li>Restart all <code>NodeManager</code>s in your cluster.</li> </ol> @@ -704,10 +703,9 @@ the Spark configuration must be set to disable token collection for the services <p>The Spark configuration must include the lines:</p> -<p><code> -spark.yarn.security.tokens.hive.enabled false +<pre><code>spark.yarn.security.tokens.hive.enabled false spark.yarn.security.tokens.hbase.enabled false -</code></p> +</code></pre> <p>The configuration option <code>spark.yarn.access.namenodes</code> must be unset.</p> @@ -717,24 +715,21 @@ spark.yarn.security.tokens.hbase.enabled false enable extra logging of Kerberos operations in Hadoop by setting the <code>HADOOP_JAAS_DEBUG</code> environment variable.</p> -<p><code>bash -export HADOOP_JAAS_DEBUG=true -</code></p> +<pre><code class="language-bash">export HADOOP_JAAS_DEBUG=true +</code></pre> <p>The JDK classes can be configured to enable extra logging of their Kerberos and SPNEGO/REST authentication via the system properties <code>sun.security.krb5.debug</code> and <code>sun.security.spnego.debug=true</code></p> -<p><code> --Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true -</code></p> +<pre><code>-Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true +</code></pre> <p>All these options can be enabled in the Application Master:</p> -<p><code> -spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG true +<pre><code>spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG true spark.yarn.am.extraJavaOptions -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true -</code></p> +</code></pre> <p>Finally, if the log level for <code>org.apache.spark.deploy.yarn.Client</code> is set to <code>DEBUG</code>, the log will include a list of all tokens obtained, and their expiry details</p> http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/spark-standalone.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/spark-standalone.html b/site/docs/2.1.0/spark-standalone.html index 1caf5e4..4900198 100644 --- a/site/docs/2.1.0/spark-standalone.html +++ b/site/docs/2.1.0/spark-standalone.html @@ -127,18 +127,18 @@ <ul id="markdown-toc"> - <li><a href="#installing-spark-standalone-to-a-cluster" id="markdown-toc-installing-spark-standalone-to-a-cluster">Installing Spark Standalone to a Cluster</a></li> - <li><a href="#starting-a-cluster-manually" id="markdown-toc-starting-a-cluster-manually">Starting a Cluster Manually</a></li> - <li><a href="#cluster-launch-scripts" id="markdown-toc-cluster-launch-scripts">Cluster Launch Scripts</a></li> - <li><a href="#connecting-an-application-to-the-cluster" id="markdown-toc-connecting-an-application-to-the-cluster">Connecting an Application to the Cluster</a></li> - <li><a href="#launching-spark-applications" id="markdown-toc-launching-spark-applications">Launching Spark Applications</a></li> - <li><a href="#resource-scheduling" id="markdown-toc-resource-scheduling">Resource Scheduling</a></li> - 
<li><a href="#monitoring-and-logging" id="markdown-toc-monitoring-and-logging">Monitoring and Logging</a></li> - <li><a href="#running-alongside-hadoop" id="markdown-toc-running-alongside-hadoop">Running Alongside Hadoop</a></li> - <li><a href="#configuring-ports-for-network-security" id="markdown-toc-configuring-ports-for-network-security">Configuring Ports for Network Security</a></li> - <li><a href="#high-availability" id="markdown-toc-high-availability">High Availability</a> <ul> - <li><a href="#standby-masters-with-zookeeper" id="markdown-toc-standby-masters-with-zookeeper">Standby Masters with ZooKeeper</a></li> - <li><a href="#single-node-recovery-with-local-file-system" id="markdown-toc-single-node-recovery-with-local-file-system">Single-Node Recovery with Local File System</a></li> + <li><a href="#installing-spark-standalone-to-a-cluster">Installing Spark Standalone to a Cluster</a></li> + <li><a href="#starting-a-cluster-manually">Starting a Cluster Manually</a></li> + <li><a href="#cluster-launch-scripts">Cluster Launch Scripts</a></li> + <li><a href="#connecting-an-application-to-the-cluster">Connecting an Application to the Cluster</a></li> + <li><a href="#launching-spark-applications">Launching Spark Applications</a></li> + <li><a href="#resource-scheduling">Resource Scheduling</a></li> + <li><a href="#monitoring-and-logging">Monitoring and Logging</a></li> + <li><a href="#running-alongside-hadoop">Running Alongside Hadoop</a></li> + <li><a href="#configuring-ports-for-network-security">Configuring Ports for Network Security</a></li> + <li><a href="#high-availability">High Availability</a> <ul> + <li><a href="#standby-masters-with-zookeeper">Standby Masters with ZooKeeper</a></li> + <li><a href="#single-node-recovery-with-local-file-system">Single-Node Recovery with Local File System</a></li> </ul> </li> </ul> @@ -446,17 +446,17 @@ By default, it will acquire <em>all</em> cores in the cluster, which only makes application at a time. You can cap the number of cores by setting <code>spark.cores.max</code> in your <a href="configuration.html#spark-properties">SparkConf</a>. 
For example:</p> -<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">()</span> +<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="k">val</span> <span class="n">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">()</span> <span class="o">.</span><span class="n">setMaster</span><span class="o">(...)</span> <span class="o">.</span><span class="n">setAppName</span><span class="o">(...)</span> <span class="o">.</span><span class="n">set</span><span class="o">(</span><span class="s">"spark.cores.max"</span><span class="o">,</span> <span class="s">"10"</span><span class="o">)</span> -<span class="k">val</span> <span class="n">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span></code></pre></div> +<span class="k">val</span> <span class="n">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span></code></pre></figure> <p>In addition, you can configure <code>spark.deploy.defaultCores</code> on the cluster master process to change the default for applications that don’t set <code>spark.cores.max</code> to something less than infinite. Do this by adding the following to <code>conf/spark-env.sh</code>:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">SPARK_MASTER_OPTS</span><span class="o">=</span><span class="s2">"-Dspark.deploy.defaultCores=<value>"</span></code></pre></div> +<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span></span><span class="nb">export</span> <span class="nv">SPARK_MASTER_OPTS</span><span class="o">=</span><span class="s2">"-Dspark.deploy.defaultCores=<value>"</span></code></pre></figure> <p>This is useful on shared clusters where users might not have configured a maximum number of cores individually.</p>
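Putting the standalone scheduling settings above together, a minimal sketch of an application that caps its own core usage might look like this (the master URL, application name, and 10-core cap are illustrative, not taken from the commit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap this application at 10 cores on a standalone cluster. Applications
    // that leave spark.cores.max unset fall back to the cluster-wide default
    // configured via -Dspark.deploy.defaultCores in SPARK_MASTER_OPTS.
    val conf = new SparkConf()
      .setMaster("spark://master.example.com:7077") // illustrative standalone master URL
      .setAppName("My app")
      .set("spark.cores.max", "10")
    val sc = new SparkContext(conf)
    // ... run jobs ...
    sc.stop()

Setting the cap per application, with spark.deploy.defaultCores as a backstop, keeps any one job from acquiring every core on a shared cluster.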