Regenerate website
Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/6b76c3f4 Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/6b76c3f4 Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/6b76c3f4 Branch: refs/heads/asf-site Commit: 6b76c3f4e0b15236d433909732453e95b635ce87 Parents: 7a9190c Author: Davor Bonaci <da...@google.com> Authored: Fri Dec 9 12:49:30 2016 -0800 Committer: Davor Bonaci <da...@google.com> Committed: Fri Dec 9 12:49:30 2016 -0800 ---------------------------------------------------------------------- content/documentation/runners/spark/index.html | 157 +++++++++++++++++++- 1 file changed, 156 insertions(+), 1 deletion(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/6b76c3f4/content/documentation/runners/spark/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/runners/spark/index.html b/content/documentation/runners/spark/index.html index 521cf5d..23960e0 100644 --- a/content/documentation/runners/spark/index.html +++ b/content/documentation/runners/spark/index.html @@ -146,7 +146,162 @@ <div class="row"> <h1 id="using-the-apache-spark-runner">Using the Apache Spark Runner</h1> -<p>This page is under construction (<a href="https://issues.apache.org/jira/browse/BEAM-507">BEAM-507</a>).</p> +<p>The Apache Spark Runner can be used to execute Beam pipelines using <a href="http://spark.apache.org/">Apache Spark</a>. 
+The Spark Runner can execute Spark pipelines just like a native Spark application; deploying a self-contained application for local mode, running on Spark’s Standalone RM, or using YARN or Mesos.</p> + +<p>The Spark Runner executes Beam pipelines on top of Apache Spark, providing:</p> + +<ul> + <li>Batch and streaming (and combined) pipelines.</li> + <li>The same fault-tolerance <a href="http://spark.apache.org/docs/1.6.3/streaming-programming-guide.html#fault-tolerance-semantics">guarantees</a> as provided by RDDs and DStreams.</li> + <li>The same <a href="http://spark.apache.org/docs/1.6.3/security.html">security</a> features Spark provides.</li> + <li>Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.</li> + <li>Native support for Beam side-inputs via Spark’s Broadcast variables.</li> +</ul> + +<p>The <a href="/documentation/runners/capability-matrix/">Beam Capability Matrix</a> documents the currently supported capabilities of the Spark Runner.</p> + +<p><em><strong>Note:</strong></em> <em>support for the Beam Model in streaming is currently experimental; follow the <a href="/get-started/support/">mailing list</a> for status updates.</em></p> + +<h2 id="spark-runner-prerequisites-and-setup">Spark Runner prerequisites and setup</h2> + +<p>The Spark runner currently supports Spark’s 1.6 branch, and more specifically any version greater than 1.6.0.</p> + +<p>You can add a dependency on the latest version of the Spark runner by adding to your pom.xml the following:</p> +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="o"><</span><span class="n">dependency</span><span class="o">></span> + <span class="o"><</span><span class="n">groupId</span><span class="o">></span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">beam</span><span class="o"></</span><span class="n">groupId</span><span class="o">></span> + 
<span class="o"><</span><span class="n">artifactId</span><span class="o">></span><span class="n">beam</span><span class="o">-</span><span class="n">runners</span><span class="o">-</span><span class="n">spark</span><span class="o"></</span><span class="n">artifactId</span><span class="o">></span> + <span class="o"><</span><span class="n">version</span><span class="o">></span><span class="mf">0.3</span><span class="o">.</span><span class="mi">0</span><span class="o">-</span><span class="n">incubating</span><span class="o"></</span><span class="n">version</span><span class="o">></span> +<span class="o"></</span><span class="n">dependency</span><span class="o">></span> +</code></pre> +</div> + +<h3 id="deploying-spark-with-your-application">Deploying Spark with your application</h3> + +<p>In some cases, such as running in local mode/Standalone, your (self-contained) application would be required to pack Spark by explicitly adding the following dependencies in your pom.xml:</p> +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="o"><</span><span class="n">dependency</span><span class="o">></span> + <span class="o"><</span><span class="n">groupId</span><span class="o">></span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o"></</span><span class="n">groupId</span><span class="o">></span> + <span class="o"><</span><span class="n">artifactId</span><span class="o">></span><span class="n">spark</span><span class="o">-</span><span class="n">core_2</span><span class="o">.</span><span class="mi">10</span><span class="o"></</span><span class="n">artifactId</span><span class="o">></span> + <span class="o"><</span><span class="n">version</span><span class="o">></span><span class="err">$</span><span class="o">{</span><span class="n">spark</span><span class="o">.</span><span class="na">version</span><span class="o">}</</span><span 
class="n">version</span><span class="o">></span> +<span class="o"></</span><span class="n">dependency</span><span class="o">></span> + +<span class="o"><</span><span class="n">dependency</span><span class="o">></span> + <span class="o"><</span><span class="n">groupId</span><span class="o">></span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o"></</span><span class="n">groupId</span><span class="o">></span> + <span class="o"><</span><span class="n">artifactId</span><span class="o">></span><span class="n">spark</span><span class="o">-</span><span class="n">streaming_2</span><span class="o">.</span><span class="mi">10</span><span class="o"></</span><span class="n">artifactId</span><span class="o">></span> + <span class="o"><</span><span class="n">version</span><span class="o">></span><span class="err">$</span><span class="o">{</span><span class="n">spark</span><span class="o">.</span><span class="na">version</span><span class="o">}</</span><span class="n">version</span><span class="o">></span> +<span class="o"></</span><span class="n">dependency</span><span class="o">></span> +</code></pre> +</div> + +<p>And shading the application jar using the maven shade plugin:</p> +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="o"><</span><span class="n">plugin</span><span class="o">></span> + <span class="o"><</span><span class="n">groupId</span><span class="o">></span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">maven</span><span class="o">.</span><span class="na">plugins</span><span class="o"></</span><span class="n">groupId</span><span class="o">></span> + <span class="o"><</span><span class="n">artifactId</span><span class="o">></span><span class="n">maven</span><span class="o">-</span><span class="n">shade</span><span class="o">-</span><span 
class="n">plugin</span><span class="o"></</span><span class="n">artifactId</span><span class="o">></span> + <span class="o"><</span><span class="n">configuration</span><span class="o">></span> + <span class="o"><</span><span class="n">createDependencyReducedPom</span><span class="o">></span><span class="kc">false</span><span class="o"></</span><span class="n">createDependencyReducedPom</span><span class="o">></span> + <span class="o"><</span><span class="n">filters</span><span class="o">></span> + <span class="o"><</span><span class="n">filter</span><span class="o">></span> + <span class="o"><</span><span class="n">artifact</span><span class="o">>*:*</</span><span class="n">artifact</span><span class="o">></span> + <span class="o"><</span><span class="n">excludes</span><span class="o">></span> + <span class="o"><</span><span class="n">exclude</span><span class="o">></span><span class="n">META</span><span class="o">-</span><span class="n">INF</span><span class="o">/*.</span><span class="na">SF</span><span class="o"></</span><span class="n">exclude</span><span class="o">></span> + <span class="o"><</span><span class="n">exclude</span><span class="o">></span><span class="n">META</span><span class="o">-</span><span class="n">INF</span><span class="o">/*.</span><span class="na">DSA</span><span class="o"></</span><span class="n">exclude</span><span class="o">></span> + <span class="o"><</span><span class="n">exclude</span><span class="o">></span><span class="n">META</span><span class="o">-</span><span class="n">INF</span><span class="o">/*.</span><span class="na">RSA</span><span class="o"></</span><span class="n">exclude</span><span class="o">></span> + <span class="o"></</span><span class="n">excludes</span><span class="o">></span> + <span class="o"></</span><span class="n">filter</span><span class="o">></span> + <span class="o"></</span><span class="n">filters</span><span class="o">></span> + <span class="o"></</span><span class="n">configuration</span><span 
class="o">></span> + <span class="o"><</span><span class="n">executions</span><span class="o">></span> + <span class="o"><</span><span class="n">execution</span><span class="o">></span> + <span class="o"><</span><span class="n">phase</span><span class="o">></span><span class="kn">package</span><span class="o"></</span><span class="n">phase</span><span class="o">></span> + <span class="o"><</span><span class="n">goals</span><span class="o">></span> + <span class="o"><</span><span class="n">goal</span><span class="o">></span><span class="n">shade</span><span class="o"></</span><span class="n">goal</span><span class="o">></span> + <span class="o"></</span><span class="n">goals</span><span class="o">></span> + <span class="o"><</span><span class="n">configuration</span><span class="o">></span> + <span class="o"><</span><span class="n">shadedArtifactAttached</span><span class="o">></span><span class="kc">true</span><span class="o"></</span><span class="n">shadedArtifactAttached</span><span class="o">></span> + <span class="o"><</span><span class="n">shadedClassifierName</span><span class="o">></span><span class="n">shaded</span><span class="o"></</span><span class="n">shadedClassifierName</span><span class="o">></span> + <span class="o"></</span><span class="n">configuration</span><span class="o">></span> + <span class="o"></</span><span class="n">execution</span><span class="o">></span> + <span class="o"></</span><span class="n">executions</span><span class="o">></span> +<span class="o"></</span><span class="n">plugin</span><span class="o">></span> +</code></pre> +</div> + +<p>After running <code>mvn package</code>, run <code>ls target</code> and you should see (assuming your artifactId is <code class="highlighter-rouge">beam-examples</code> and the version is <code class="highlighter-rouge">1.0.0</code>):</p> +<div class="highlighter-rouge"><pre class="highlight"><code>beam-examples-1.0.0-shaded.jar +</code></pre> +</div> + +<p>To run against a Standalone cluster 
simply run:</p> +<div class="highlighter-rouge"><pre class="highlight"><code>spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner +</code></pre> +</div> + +<h3 id="running-on-a-pre-deployed-spark-cluster">Running on a pre-deployed Spark cluster</h3> + +<p>Deploying your Beam pipeline on a cluster that already has a Spark deployment (Spark classes are available on the container classpath) does not require any additional dependencies. +For more details on the different deployment modes see: <a href="http://spark.apache.org/docs/latest/spark-standalone.html">Standalone</a>, <a href="http://spark.apache.org/docs/latest/running-on-yarn.html">YARN</a>, or <a href="http://spark.apache.org/docs/latest/running-on-mesos.html">Mesos</a>.</p> + +<h2 id="pipeline-options-for-the-spark-runner">Pipeline options for the Spark Runner</h2> + +<p>When executing your pipeline with the Spark Runner, you should consider the following pipeline options.</p> + +<table class="table table-bordered"> +<tr> + <th>Field</th> + <th>Description</th> + <th>Default Value</th> +</tr> +<tr> + <td><code>runner</code></td> + <td>The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.</td> + <td>Set to <code>SparkRunner</code> to run using Spark.</td> +</tr> +<tr> + <td><code>sparkMaster</code></td> + <td>The URL of the Spark Master. This is the equivalent of setting <code>SparkConf#setMaster(String)</code> and can either be <code>local[x]</code> to run locally with x cores, <code>spark://host:port</code> to connect to a Spark Standalone cluster, <code>mesos://host:port</code> to connect to a Mesos cluster, or <code>yarn</code> to connect to a YARN cluster.</td> + <td><code>local[4]</code></td> +</tr> +<tr> + <td><code>storageLevel</code></td> + <td>The <code>StorageLevel</code> to use when caching RDDs in batch pipelines. 
The Spark Runner automatically caches RDDs that are evaluated repeatedly. This is a batch-only property as streaming pipelines in Beam are stateful, which requires Spark DStream's <code>StorageLevel</code> to be <code>MEMORY_ONLY</code>.</td> + <td><code>MEMORY_ONLY</code></td> +</tr> +<tr> + <td><code>batchIntervalMillis</code></td> + <td>The <code>StreamingContext</code>'s <code>batchDuration</code>, which sets Spark's batch interval.</td> + <td><code>1000</code></td> +</tr> +<tr> + <td><code>enableSparkMetricSinks</code></td> + <td>Enable reporting metrics to Spark's metrics Sinks.</td> + <td><code>true</code></td> +</tr> +</table> + +<h2 id="additional-notes">Additional notes</h2> + +<h3 id="using-spark-submit">Using spark-submit</h3> + +<p>When submitting a Spark application to a cluster, it is common (and recommended) to use the <code>spark-submit</code> script that is provided with the Spark installation. +The <code>PipelineOptions</code> described above do not replace <code>spark-submit</code>; they complement it. +Any of the options above can be passed as one of the <code>application-arguments</code>, and setting <code>--master</code> takes precedence. +For more on how to use <code>spark-submit</code> in general, see the Spark <a href="http://spark.apache.org/docs/1.6.3/submitting-applications.html#launching-applications-with-spark-submit">documentation</a>.</p> + +<h3 id="monitoring-your-job">Monitoring your job</h3> + +<p>You can monitor a running Spark job using the Spark <a href="http://spark.apache.org/docs/1.6.3/monitoring.html#web-interfaces">Web Interfaces</a>. By default, this is available at port <code class="highlighter-rouge">4040</code> on the driver node. If you run Spark on your local machine that would be <code class="highlighter-rouge">http://localhost:4040</code>. +Spark also has a history server to <a href="http://spark.apache.org/docs/1.6.3/monitoring.html#viewing-after-the-fact">view after the fact</a>. 
+Metrics are also available via the <a href="http://spark.apache.org/docs/1.6.3/monitoring.html#rest-api">REST API</a>. +Spark provides a <a href="http://spark.apache.org/docs/1.6.3/monitoring.html#metrics">metrics system</a> that allows reporting Spark metrics to a variety of Sinks. The Spark runner reports user-defined Beam Aggregators using this same metrics system and currently supports <code>GraphiteSink</code> and <code>CSVSink</code>. Adding support for other Sinks that Spark supports is straightforward.</p> + +<h3 id="streaming-execution">Streaming Execution</h3> + +<p>If your pipeline uses an <code>UnboundedSource</code>, the Spark Runner will automatically set streaming mode. Forcing streaming mode is mostly used for testing and is not recommended.</p> + +<h3 id="using-a-provided-sparkcontext-and-streaminglisteners">Using a provided SparkContext and StreamingListeners</h3> + +<p>If you would like to execute your Spark job with a provided <code>SparkContext</code>, such as when using the <a href="https://github.com/spark-jobserver/spark-jobserver">spark-jobserver</a>, or use <code>StreamingListeners</code>, you can’t use <code>SparkPipelineOptions</code> (the context or a listener cannot be passed as a command-line argument anyway). +Instead, you should use <code>SparkContextOptions</code>, which can only be used programmatically and is not a common <code>PipelineOptions</code> implementation.</p> </div>