This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit f5fde017245530c5d2c4e23edd6f81ab0c8f450a
Author: Mergebot <merge...@apache.org>
AuthorDate: Thu Aug 9 16:56:32 2018 +0000

    Prepare repository for deployment.
---
 content/documentation/runners/apex/index.html | 34 ++++++++++++++-------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/content/documentation/runners/apex/index.html b/content/documentation/runners/apex/index.html
index 71e95fc..4aaf9f1 100644
--- a/content/documentation/runners/apex/index.html
+++ b/content/documentation/runners/apex/index.html
@@ -167,8 +167,7 @@

 <ul class="nav">
  <li><a href="#apex-runner-prerequisites">Apex Runner prerequisites</a></li>
- <li><a href="#running-wordcount-using-apex-runner">Running wordcount using Apex Runner</a></li>
- <li><a href="#checking-output">Checking output</a></li>
+ <li><a href="#running-wordcount-with-apex">Running wordcount with Apex</a></li>
  <li><a href="#montoring-progress-of-your-job">Montoring progress of your job</a></li>
 </ul>

@@ -195,6 +194,9 @@ limitations under the License.

 <p><a href="http://apex.apache.org/">Apache Apex</a> is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing.</p>

+<p>The following instructions are for running Beam pipelines with Apex on a YARN cluster.
+They are not required for Apex in embedded mode (see <a href="/get-started/quickstart-java/">quickstart</a>).</p>
+
 <h2 id="apex-runner-prerequisites">Apex Runner prerequisites</h2>

 <p>You may set up your own Hadoop cluster. Beam does not require anything extra to launch the pipelines on YARN.
@@ -203,21 +205,20 @@ The Apex CLI can be <a href="http://apex.apache.org/docs/apex/apex_development_s
 obtained as <a href="http://www.atrato.io/blog/2017/04/08/apache-apex-cli/">binary build</a>.
 For more download options see <a href="http://apex.apache.org/downloads.html">distribution information on the Apache Apex website</a>.</p>

-<h2 id="running-wordcount-using-apex-runner">Running wordcount using Apex Runner</h2>
+<h2 id="running-wordcount-with-apex">Running wordcount with Apex</h2>

-<p>Put data for processing into HDFS:</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>hdfs dfs -mkdir -p /tmp/input/
-hdfs dfs -put pom.xml /tmp/input/
+<p>Typically the build environment is separate from the target YARN cluster. In such case, it is necessary to build a fat jar that will include all dependencies. Ensure that <code class="highlighter-rouge">hadoop.version</code> in <code class="highlighter-rouge">pom.xml</code> matches the version of your YARN cluster and then build the jar file:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>mvn package -Papex-runner
 </code></pre>
 </div>

-<p>The output directory should not exist on HDFS:</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>hdfs dfs -rm -r -f /tmp/output/
+<p>Copy the resulting <code class="highlighter-rouge">target/word-count-beam-bundled-0.1.jar</code> to the cluster and submit the application using:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>java -cp word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount --inputFile=/etc/profile --output=/tmp/counts --embeddedExecution=false --configFile=beam-runners-apex.properties --runner=ApexRunner
 </code></pre>
 </div>

-<p>Run the wordcount example (<em>example project needs to be modified to include HDFS file provider</em>)</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/pom.xml --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner
+<p>If the build environment is setup as cluster client, it is possible to run the example directly:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/etc/profile --output=/tmp/counts --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner
 </code></pre>
 </div>

@@ -233,12 +234,8 @@ apex.application.*.operator.*.attr.TIMEOUT_WINDOW_COUNT=1200
 </code></pre>
 </div>

-<h2 id="checking-output">Checking output</h2>
-
-<p>Check the output of the pipeline in the HDFS location.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>hdfs dfs -ls /tmp/output/
-</code></pre>
-</div>
+<p>This example uses local files. To use a distributed file system (HDFS, S3 etc.),
+it is necessary to augment the build to include the respective file system provider.</p>

 <h2 id="montoring-progress-of-your-job">Montoring progress of your job</h2>

@@ -249,6 +246,11 @@ apex.application.*.operator.*.attr.TIMEOUT_WINDOW_COUNT=1200
 <li>Apex command-line interface: <a href="http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands">Using the Apex CLI to get running application information</a>.</li>
 </ul>

+<p>Check the output of the pipeline:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>ls /tmp/counts*
+</code></pre>
+</div>
+
 </div>
 </div>
 <!--