Added: incubator/samza/site/learn/documentation/0.7.0/comparisons/spark-streaming.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/spark-streaming.html?rev=1612998&view=auto ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/comparisons/spark-streaming.html (added) +++ incubator/samza/site/learn/documentation/0.7.0/comparisons/spark-streaming.html Thu Jul 24 05:05:00 2014 @@ -0,0 +1,240 @@ +<!DOCTYPE html> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Samza - Spark Streaming</title> + <link href='/css/ropa-sans.css' rel='stylesheet' type='text/css'/> + <link href="/css/bootstrap.min.css" rel="stylesheet"/> + <link href="/css/font-awesome.min.css" rel="stylesheet"/> + <link href="/css/main.css" rel="stylesheet"/> + <link href="/css/syntax.css" rel="stylesheet"/> + <link rel="icon" type="image/png" href="/img/samza-icon.png"> + </head> + <body> + <div class="wrapper"> + <div class="wrapper-content"> + + <div class="masthead"> + <div class="container"> + <div class="masthead-logo"> + <a href="/" class="logo">samza</a> + </div> + <div class="masthead-icons"> + <div class="pull-right"> + <a href="/startup/download"><i class="fa fa-arrow-circle-o-down masthead-icon"></i></a> + <a href="https://git-wip-us.apache.org/repos/asf?p=incubator-samza.git;a=tree" target="_blank"><i class="fa fa-code masthead-icon" style="font-weight: bold;"></i></a> + <a href="https://twitter.com/samzastream" target="_blank"><i class="fa fa-twitter masthead-icon"></i></a> + </div> + </div> + </div><!-- /.container --> + </div> + + <div class="container"> + <div class="menu"> + <h1><i class="fa fa-rocket"></i> Getting Started</h1> + <ul> + <li><a href="/startup/hello-samza/0.7.0">Hello Samza</a></li> + <li><a href="/startup/download">Download</a></li> + </ul> + + <h1><i class="fa fa-book"></i> Learn</h1> + <ul> + <li><a href="/learn/documentation/0.7.0">Documentation</a></li> + <li><a href="/learn/tutorials/0.7.0">Tutorials</a></li> + <li><a href="http://wiki.apache.org/samza/FAQ">FAQ</a></li> + <li><a href="http://wiki.apache.org/samza">Wiki</a></li> + <li><a href="http://wiki.apache.org/samza/PapersAndTalks">Papers & Talks</a></li> + <li><a href="http://blogs.apache.org/samza">Blog</a></li> + </ul> + + <h1><i class="fa fa-comments"></i> Community</h1> + <ul> + <li><a href="/community/mailing-lists.html">Mailing Lists</a></li> + <li><a href="/community/irc.html">IRC</a></li> + 
<li><a href="https://issues.apache.org/jira/browse/SAMZA">Bugs</a></li> + <li><a href="http://wiki.apache.org/samza/PoweredBy">Powered by</a></li> + <li><a href="http://wiki.apache.org/samza/Ecosystem">Ecosystem</a></li> + <li><a href="/community/committers.html">Committers</a></li> + </ul> + + <h1><i class="fa fa-code"></i> Contribute</h1> + <ul> + <li><a href="/contribute/rules.html">Rules</a></li> + <li><a href="/contribute/coding-guide.html">Coding Guide</a></li> + <li><a href="/contribute/projects.html">Projects</a></li> + <li><a href="/contribute/seps.html">SEPs</a></li> + <li><a href="/contribute/code.html">Code</a></li> + <li><a href="https://reviews.apache.org/groups/samza">Review Board</a></li> + <li><a href="https://builds.apache.org/">Unit Tests</a></li> + <li><a href="/contribute/disclaimer.html">Disclaimer</a></li> + </ul> + </div> + + <div class="content"> + <!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<h2>Spark Streaming</h2> + +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. 
+ The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<p><em>People generally want to know how similar systems compare. We’ve done our best to fairly contrast the feature sets of Samza with other systems. But we aren’t experts in these frameworks, and we are, of course, totally biased. If we have goofed anything, please let us know and we will correct it.</em></p> + +<p><a href="http://spark.apache.org/docs/latest/streaming-programming-guide.html">Spark Streaming</a> is a stream processing system that uses the core <a href="http://spark.apache.org/">Apache Spark</a> API. Both Samza and Spark Streaming provide data consistency, fault tolerance, a programming API, etc. Spark’s approach to streaming is different from Samza’s. Samza processes messages as they are received, while Spark Streaming treats streaming as a series of deterministic batch operations. Spark Streaming groups the stream into batches of a fixed duration (such as 1 second). Each batch is represented as a Resilient Distributed Dataset (<a href="http://www.cs.berkeley.edu/%7Ematei/papers/2012/nsdi_spark.pdf">RDD</a>). A never-ending sequence of these RDDs is called a Discretized Stream (<a href="http://www.cs.berkeley.edu/%7Ematei/papers/2012/hotcloud_spark_streaming.pdf">DStream</a>).</p> + +<h3 id="overview-of-spark-streaming">Overview of Spark Streaming</h3> + +<p>Before going into the comparison, here is a brief overview of a Spark Streaming application. 
If you are already familiar with Spark Streaming, you may skip this part. There are two main parts of a Spark Streaming application: data receiving and data processing.</p> + +<ul> +<li>Data receiving is accomplished by a <a href="https://spark.apache.org/docs/latest/streaming-custom-receivers.html">receiver</a>, which receives data and stores it in Spark (though not in an RDD at this point).</li> +<li>Data processing transfers the data stored in Spark into the DStream. You can then apply the two kinds of <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#operations">operations</a> – transformations and output operations – to the DStream. The operations available on a DStream are a little different from those for a general Spark RDD, because of the streaming environment.</li> +</ul> + +<p>Here is an overview of Spark Streaming’s <a href="https://spark.apache.org/docs/latest/cluster-overview.html">deployment</a>. Spark has a SparkContext (in Spark Streaming, it’s called a <a href="https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.StreamingContext">StreamingContext</a>) object in the driver program. The SparkContext talks to the cluster manager (e.g. YARN, Mesos), which then allocates resources (that is, executors) for the Spark application. Executors run tasks sent by the SparkContext (<a href="http://spark.apache.org/docs/latest/cluster-overview.html#components">read more</a>). In YARN’s context, one executor is equivalent to one container, and tasks are what run in the containers. The driver program runs on the client machine that submits the job (<a href="https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn">client mode</a>) or in the application master (<a href="https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn">cluster mode</a>). Both data receiving and data processing are tasks for executors. 
One receiver (which receives one input stream) is a long-running task. Processing consists of many tasks, and all the tasks are sent to the available executors.</p> + +<h3 id="ordering-and-guarantees">Ordering and Guarantees</h3> + +<p>Spark Streaming guarantees ordered processing of batches in a DStream. Since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important in Spark Streaming. Spark Streaming does not guarantee at-least-once or at-most-once messaging semantics, because in some situations it may lose data when the driver program fails (see <a href="#fault-tolerance">fault-tolerance</a>). In addition, because Spark Streaming requires transformation operations to be deterministic, it is unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm.</p> + +<p>Samza guarantees that messages are processed in the order they appear within a partition of a stream. Samza also allows you to define a deterministic ordering of messages between partitions using a <a href="../container/streams.html">MessageChooser</a>. It provides an at-least-once message delivery guarantee, and it does not require operations to be deterministic.</p> + +<h3 id="state-management">State Management</h3> + +<p>Spark Streaming provides a state DStream, which keeps the state for each key, and a transformation operation called <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations">updateStateByKey</a> to mutate state. Every time updateStateByKey is applied, you get a new state DStream where all of the state has been updated by applying the function passed to updateStateByKey. This transformation can serve as a basic key-value store, though it has a few drawbacks:</p> + +<ul> +<li>you can only apply DStream operations to your state, because essentially it’s a DStream.</li> +<li>it does not provide any key-value access to the data. 
If you want to access a certain key’s value, you need to iterate over the whole DStream.</li> +<li>it is inefficient when the state is large, because every time a new batch is processed, Spark Streaming consumes the entire state DStream to update the relevant keys and values.</li> +</ul> + +<p>Spark Streaming periodically writes intermediate data of stateful operations (updateStateByKey and window-based operations) to HDFS. In the case of updateStateByKey, the entire state RDD is written to HDFS after every checkpointing interval. As we mentioned in the <em><a href="../container/state-management.html#in-memory-state-with-checkpointing">in-memory state with checkpointing</a></em> section, writing the entire state to durable storage is very expensive when the state becomes large.</p> + +<p>Samza uses an embedded key-value store for <a href="../container/state-management.html#local-state-in-samza">state management</a>. This store is replicated as it’s mutated, and supports very high read and write throughput. It also gives you a lot of flexibility to decide what kind of state you want to maintain. What’s more, you can plug in other <a href="../container/state-management.html#other-storage-engines">storage engines</a>, which enables great flexibility in the stream processing algorithms you can use. A good comparison of the different approaches to managing state can be found <a href="../container/state-management.html#approaches-to-managing-task-state">here</a>.</p> + +<p>One of the common use cases in state management is a <a href="../container/state-management.html#stream-stream-join">stream-stream join</a>. Though Spark Streaming has a <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations">join</a> operation, this operation only joins two batches that are in the same time interval. It does not handle the case where events in the two streams are mismatched, i.e. fall into different batch intervals. 
Storing the mismatched events with Spark Streaming’s updateStateByKey approach is also limited: if the number of mismatched events is large, the state becomes large, which makes Spark Streaming inefficient. Samza does not have this limitation.</p> + +<h3 id="partitioning-and-parallelism">Partitioning and Parallelism</h3> + +<p>Spark Streaming’s parallelism is achieved by splitting the job into small tasks and sending them to executors. There are two types of <a href="http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving">parallelism in Spark Streaming</a>: parallelism in receiving the stream and parallelism in processing the stream. On the receiving side, one input DStream creates one receiver, and one receiver receives one input stream of data and runs as a long-running task. In order to parallelize receiving, you can split one input stream into multiple input streams based on some criteria (e.g. if you are receiving a Kafka stream with several partitions, you may split the stream by partition). You can then create multiple input DStreams (and thus multiple receivers) for these streams, and the receivers will run as multiple tasks. Accordingly, you should provide enough resources by increasing the number of cores per executor or by bringing up more executors. You can then combine all the input DStreams into one DStream during processing, if necessary. On the processing side, since a DStream is a continuous sequence of RDDs, parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey and reduceByWindow (see <a href="https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism">here</a>).</p> + +<p>Samza’s parallelism is achieved by splitting processing into independent <a href="../api/overview.html">tasks</a> which can be parallelized. You can run multiple tasks in one container, or only one task per container. 
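</p>

<p>As an illustrative sketch (the container count here is hypothetical; the property name is the one used in Samza's configuration), the number of containers is simply job configuration:</p>

```jproperties
# Run this job's tasks in 4 YARN containers. With e.g. 8 input stream
# partitions (8 tasks), each container runs 2 tasks; raising the count
# towards the number of partitions approaches one task per container.
yarn.container.count=4
```

<p>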
Which you choose depends on your workload and latency requirements. For example, if you want to quickly <a href="../jobs/reprocessing.html">reprocess a stream</a>, you may increase the number of containers until there is one task per container. It is important to note that one container uses only <a href="../container/event-loop.html">one thread</a>, which maps to exactly one CPU. This design simplifies resource management and isolation between jobs.</p> + +<h3 id="buffering-&-latency">Buffering & Latency</h3> + +<p>Spark Streaming is essentially a sequence of small batch processes. With a fast execution engine, it can reach latencies as low as one second (from their <a href="http://www.cs.berkeley.edu/%7Ematei/papers/2012/hotcloud_spark_streaming.pdf">paper</a>). If processing is slower than receiving, data is queued as DStreams in memory and the queue keeps growing. In order to run a healthy Spark Streaming application, the system should be <a href="http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning">tuned</a> until processing is as fast as receiving.</p> + +<p>Samza jobs can have latency in the low milliseconds when running with Apache Kafka. It takes a different approach to buffering: the buffering mechanism depends on the input and output system. For example, when using <a href="http://kafka.apache.org/">Kafka</a> as the input and output system, data is actually buffered to disk. 
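</p>

<p>For example (a sketch; the topic name is hypothetical, the property names are the ones used elsewhere in this documentation), wiring a job up to Kafka for input and output is just configuration:</p>

```jproperties
# Define a system called "kafka", backed by a Kafka cluster
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
# Consume the "page-views" topic from that system; output sent to "kafka"
# streams is buffered on the Kafka brokers' disks, not in the job's memory
task.inputs=kafka.page-views
```

<p>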
This design decision, by sacrificing a little latency, allows the buffer to absorb a large backlog of messages when a job has fallen behind in its processing.</p> + +<h3 id="fault-tolerance">Fault-tolerance</h3> + +<p>There are two kinds of failures in both Spark Streaming and Samza: the failure of a worker node (running executors) in Spark Streaming, which is equivalent to a container failure in Samza, and the failure of the driver node (running the driver program), which is equivalent to an application master (AM) failure in Samza.</p> + +<p>When a worker node fails in Spark Streaming, it will be restarted by the cluster manager. When a container fails in Samza, the application master will work with YARN to start a new container.</p> + +<p>When a driver node fails in Spark Streaming, Spark’s <a href="http://spark.apache.org/docs/latest/spark-standalone.html">standalone cluster mode</a> will restart it automatically. This is currently not supported on YARN or Mesos, where you will need other mechanisms to restart the driver node automatically. Spark Streaming can use its checkpoint in HDFS to recreate the StreamingContext. When the AM fails in Samza, YARN handles restarting the AM, and Samza restarts all the containers when the AM restarts.</p> + +<p>In terms of data loss, there is a difference between Spark Streaming and Samza. If the input comes from an active streaming system, such as Flume or Kafka, Spark Streaming may lose data if a failure happens when the data has been received but not yet replicated to other nodes (see also <a href="https://issues.apache.org/jira/browse/SPARK-1647">SPARK-1647</a>). Samza does not lose data on failure because it uses <a href="../container/checkpointing.html">checkpointing</a>: it stores the offset of the latest processed message and only commits the checkpoint after the data has been processed, so it does not have the data loss scenario that Spark Streaming has. If a container fails, it reads from the latest checkpoint. 
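</p>

<p>A minimal sketch of enabling Kafka-backed checkpointing and tuning the commit interval (property names as in the Samza configuration docs; the interval value is just an example):</p>

```jproperties
# Store checkpoints in the system named "kafka"
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Commit a checkpoint every 10 seconds (the default is 60000 ms); a smaller
# interval shrinks the window of messages reprocessed after a failure
task.commit.ms=10000
```

<p>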
When a Samza job recovers from a failure, it’s possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval.</p> + +<h3 id="deployment-&-execution">Deployment & Execution</h3> + +<p>Spark has a SparkContext object to talk with cluster managers, which then allocate resources for the application. Currently Spark supports three types of cluster managers: <a href="http://spark.apache.org/docs/latest/spark-standalone.html">Spark standalone</a>, <a href="http://mesos.apache.org/">Apache Mesos</a> and <a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a>. Besides these, Spark has a script for launching on <a href="http://spark.apache.org/docs/latest/ec2-scripts.html">Amazon EC2</a>.</p> + +<p>Samza currently supports only YARN and local execution.</p> + +<h3 id="isolation">Isolation</h3> + +<p>Spark Streaming and Samza provide the same level of isolation. Spark Streaming depends on cluster managers (e.g. Mesos or YARN) and Samza depends on YARN to provide processor isolation. Different applications run in different JVMs, and data cannot be shared among applications unless it is written to external storage. Since Samza provides out-of-the-box Kafka integration, it is very easy to reuse the output of other Samza jobs (see <a href="../introduction/concepts.html#dataflow-graphs">here</a>).</p> + +<h3 id="language-support">Language Support</h3> + +<p>Spark Streaming is written in Java and Scala and provides Scala, Java, and Python APIs. Samza is also written in Java and Scala, and has a Java API.</p> + +<h3 id="workflow">Workflow</h3> + +<p>In Spark Streaming, you build an entire processing graph with a DSL API and deploy that entire graph as one unit. 
The communication between the nodes in that graph (in the form of DStreams) is provided by the framework. That is similar to Storm. Samza is totally different – each job is just a message-at-a-time processor, and there is no framework support for topologies. The output of a processing task always needs to go back to a message broker (e.g. Kafka).</p> + +<p>A positive consequence of Samza’s design is that a job’s output can be consumed by multiple unrelated jobs, potentially run by different teams, and those jobs are isolated from each other through Kafka’s buffering. That is not the case with Storm’s and Spark Streaming’s framework-internal streams.</p> + +<p>Although a Storm/Spark Streaming job could in principle write its output to a message broker, the framework doesn’t really make this easy. It seems that Storm/Spark aren’t intended to be used in a way where one topology’s output is another topology’s input. By contrast, in Samza, that mode of usage is standard.</p> + +<h3 id="maturity">Maturity</h3> + +<p>Spark has an active user and developer community, and recently released version 1.0.0. It has a list of companies that use it on its <a href="https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark">Powered by</a> page. Since Spark contains Spark Streaming, Spark SQL, MLlib, GraphX and Bagel, it’s tough to tell what portion of companies on the list are actually using Spark Streaming, and not just Spark.</p> + +<p>Samza is still young, having just released version 0.7.0, but it has a responsive community and is being developed actively. That said, it is built on solid systems such as YARN and Kafka. Samza is heavily used at LinkedIn and we hope others will find it useful as well.</p> + +<h2 id="api-overview-»"><a href="../api/overview.html">API Overview »</a></h2> + + + </div> + </div> + + </div><!-- /.wrapper-content --> + </div><!-- /.wrapper --> + + <div class="footer"> + <div class="container"> + <!-- nothing for now. 
--> + </div> + </div> + + <!-- Google Analytics --> + <script> + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + + ga('create', 'UA-43122768-1', 'apache.org'); + ga('send', 'pageview'); + + </script> + </body> +</html>
Modified: incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/comparisons/storm.html Thu Jul 24 05:05:00 2014 @@ -227,7 +227,7 @@ <p>Samza’s serialization and data model are both pluggable. We are not terribly opinionated about which approach is best.</p> -<h2 id="api-overview-»"><a href="../api/overview.html">API Overview »</a></h2> +<h2 id="spark-streaming-»"><a href="spark-streaming.html">Spark Streaming »</a></h2> </div> Modified: incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/checkpointing.html Thu Jul 24 05:05:00 2014 @@ -145,7 +145,7 @@ <p>For checkpoints to be effective, they need to be written somewhere where they will survive faults. Samza allows you to write checkpoints to the file system (using FileSystemCheckpointManager), but that doesn’t help if the machine fails and the container needs to be restarted on another machine. The most common configuration is to use Kafka for checkpointing. 
You can enable this with the following job configuration:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># The name of your job determines the name under which checkpoints will be stored</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># The name of your job determines the name under which checkpoints will be stored</span> <span class="na">job.name</span><span class="o">=</span><span class="s">example-job</span> <span class="c"># Define a system called "kafka" for consuming and producing to a Kafka cluster</span> @@ -162,7 +162,7 @@ <p>Sometimes it can be useful to use checkpoints only for some input streams, but not for others. In this case, you can tell Samza to ignore any checkpointed offsets for a particular stream name:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Ignore any checkpoints for the topic "my-special-topic"</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Ignore any checkpoints for the topic "my-special-topic"</span> <span class="na">systems.kafka.streams.my-special-topic.samza.reset.offset</span><span class="o">=</span><span class="s">true</span> <span class="c"># Always start consuming "my-special-topic" at the oldest available offset</span> @@ -208,12 +208,12 @@ <p>To inspect a job’s latest checkpoint, you need to specify your job’s config file, so that the tool knows which job it is dealing with:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">samza-example/target/bin/checkpoint-tool.sh <span class="se">\</span> +<div class="highlight"><pre><code class="bash">samza-example/target/bin/checkpoint-tool.sh <span class="se">\</span> --config-path<span class="o">=</span>file:///path/to/job/config.properties</code></pre></div> <p>This command prints out the latest checkpoint in a properties file format. 
You can save the output to a file, and edit it as you wish. For example, to jump back to the oldest possible point in time, you can set all the offsets to 0. Then you can feed that properties file back into checkpoint-tool.sh and save the modified checkpoint:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">samza-example/target/bin/checkpoint-tool.sh <span class="se">\</span> +<div class="highlight"><pre><code class="bash">samza-example/target/bin/checkpoint-tool.sh <span class="se">\</span> --config-path<span class="o">=</span>file:///path/to/job/config.properties <span class="se">\</span> --new-offsets<span class="o">=</span>file:///path/to/new/offsets.properties</code></pre></div> Modified: incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/event-loop.html Thu Jul 24 05:05:00 2014 @@ -153,7 +153,7 @@ <p>You can then tell Samza to use your lifecycle listener with the following properties in your job configuration:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Define a listener called "my-listener" by giving the factory class name</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Define a listener called "my-listener" by giving the factory class name</span> <span class="na">task.lifecycle.listener.my-listener.class</span><span class="o">=</span><span class="s">com.example.foo.MyListenerFactory</span> <span class="c"># Enable it in this job (multiple listeners can be separated by commas)</span> Modified: 
incubator/samza/site/learn/documentation/0.7.0/container/jmx.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/jmx.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/jmx.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/jmx.html Thu Jul 24 05:05:00 2014 @@ -127,7 +127,7 @@ <p>You can tell Samza to publish its internal <a href="metrics.html">metrics</a>, and any custom metrics you define, as JMX MBeans. To enable this, set the following properties in your job configuration:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Define a Samza metrics reporter called "jmx", which publishes to JMX</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Define a Samza metrics reporter called "jmx", which publishes to JMX</span> <span class="na">metrics.reporter.jmx.class</span><span class="o">=</span><span class="s">org.apache.samza.metrics.reporter.JmxReporterFactory</span> <span class="c"># Use it (if you have multiple reporters defined, separate them with commas)</span> Modified: incubator/samza/site/learn/documentation/0.7.0/container/metrics.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/metrics.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/metrics.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/metrics.html Thu Jul 24 05:05:00 2014 @@ -129,7 +129,7 @@ <p>To set up your job to publish metrics to Kafka, you can use the following configuration:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Define a 
metrics reporter called "snapshot", which publishes metrics</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Define a metrics reporter called "snapshot", which publishes metrics</span> <span class="c"># every 60 seconds.</span> <span class="na">metrics.reporters</span><span class="o">=</span><span class="s">snapshot</span> <span class="na">metrics.reporter.snapshot.class</span><span class="o">=</span><span class="s">org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory</span> @@ -144,7 +144,7 @@ <p>With this configuration, the job automatically sends several JSON-encoded messages to the “metrics” topic in Kafka every 60 seconds. The messages look something like this:</p> -<div class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span> +<div class="highlight"><pre><code class="json"><span class="p">{</span> <span class="nt">"header"</span><span class="p">:</span> <span class="p">{</span> <span class="nt">"container-name"</span><span class="p">:</span> <span class="s2">"samza-container-0"</span><span class="p">,</span> <span class="nt">"host"</span><span class="p">:</span> <span class="s2">"samza-grid-1234.example.com"</span><span class="p">,</span> @@ -177,7 +177,7 @@ <p>You can register your custom metrics through a <a href="../api/javadocs/org/apache/samza/metrics/MetricsRegistry.html">MetricsRegistry</a>. Your stream task needs to implement <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a>, so that you can get the metrics registry from the <a href="../api/javadocs/org/apache/samza/task/TaskContext.html">TaskContext</a>. 
This simple example shows how to count the number of messages processed by your task:</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">MyJavaStreamTask</span> <span class="kd">implements</span> <span class="n">StreamTask</span><span class="o">,</span> <span class="n">InitableTask</span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">MyJavaStreamTask</span> <span class="kd">implements</span> <span class="n">StreamTask</span><span class="o">,</span> <span class="n">InitableTask</span> <span class="o">{</span> <span class="kd">private</span> <span class="n">Counter</span> <span class="n">messageCount</span><span class="o">;</span> <span class="kd">public</span> <span class="kt">void</span> <span class="nf">init</span><span class="o">(</span><span class="n">Config</span> <span class="n">config</span><span class="o">,</span> <span class="n">TaskContext</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span> Modified: incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/samza-container.html Thu Jul 24 05:05:00 2014 @@ -144,7 +144,7 @@ <p>When the container starts, it creates instances of the <a href="../api/overview.html">task class</a> that you’ve written. 
If the task class implements the <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a> interface, the SamzaContainer will also call the init() method.</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="cm">/** Implement this if you want a callback when your task starts up. */</span> +<div class="highlight"><pre><code class="java"><span class="cm">/** Implement this if you want a callback when your task starts up. */</span> <span class="kd">public</span> <span class="kd">interface</span> <span class="nc">InitableTask</span> <span class="o">{</span> <span class="kt">void</span> <span class="nf">init</span><span class="o">(</span><span class="n">Config</span> <span class="n">config</span><span class="o">,</span> <span class="n">TaskContext</span> <span class="n">context</span><span class="o">);</span> <span class="o">}</span></code></pre></div> Modified: incubator/samza/site/learn/documentation/0.7.0/container/serialization.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/serialization.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/serialization.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/serialization.html Thu Jul 24 05:05:00 2014 @@ -133,7 +133,7 @@ <p>You can use whatever makes sense for your job; Samza doesn’t impose any particular data model or serialization scheme on you. However, the cleanest solution is usually to use Samza’s serde layer. 
The following configuration example shows how to use it.</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Define a system called "kafka"</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Define a system called "kafka"</span> <span class="na">systems.kafka.samza.factory</span><span class="o">=</span><span class="s">org.apache.samza.system.kafka.KafkaSystemFactory</span> <span class="c"># The job is going to consume a topic called "PageViewEvent" from the "kafka" system</span> Modified: incubator/samza/site/learn/documentation/0.7.0/container/state-management.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/state-management.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/state-management.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/state-management.html Thu Jul 24 05:05:00 2014 @@ -246,7 +246,7 @@ <p>To use a key-value store in your job, add the following to your job config:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Use the key-value store implementation for a store called "my-store"</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Use the key-value store implementation for a store called "my-store"</span> <span class="na">stores.my-store.factory</span><span class="o">=</span><span class="s">org.apache.samza.storage.kv.KeyValueStorageEngineFactory</span> <span class="c"># Use the Kafka topic "my-store-changelog" as the changelog stream for this store.</span> @@ -263,7 +263,7 @@ <p>Here is a simple example that writes every incoming message to the store:</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> 
<span class="kd">class</span> <span class="nc">MyStatefulTask</span> <span class="kd">implements</span> <span class="n">StreamTask</span><span class="o">,</span> <span class="n">InitableTask</span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">MyStatefulTask</span> <span class="kd">implements</span> <span class="n">StreamTask</span><span class="o">,</span> <span class="n">InitableTask</span> <span class="o">{</span> <span class="kd">private</span> <span class="n">KeyValueStore</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">String</span><span class="o">></span> <span class="n">store</span><span class="o">;</span> <span class="kd">public</span> <span class="kt">void</span> <span class="nf">init</span><span class="o">(</span><span class="n">Config</span> <span class="n">config</span><span class="o">,</span> <span class="n">TaskContext</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span> @@ -279,13 +279,13 @@ <p>Here is the complete key-value store API:</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">KeyValueStore</span><span class="o"><</span><span class="n">K</span><span class="o">,</span> <span class="n">V</span><span class="o">></span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">KeyValueStore</span><span class="o"><</span><span class="n">K</span><span class="o">,</span> <span class="n">V</span><span class="o">></span> <span class="o">{</span> <span class="n">V</span> <span class="nf">get</span><span class="o">(</span><span class="n">K</span> <span class="n">key</span><span class="o">);</span> <span class="kt">void</span> <span class="nf">put</span><span 
class="o">(</span><span class="n">K</span> <span class="n">key</span><span class="o">,</span> <span class="n">V</span> <span class="n">value</span><span class="o">);</span> <span class="kt">void</span> <span class="nf">putAll</span><span class="o">(</span><span class="n">List</span><span class="o"><</span><span class="n">Entry</span><span class="o"><</span><span class="n">K</span><span class="o">,</span><span class="n">V</span><span class="o">>></span> <span class="n">entries</span><span class="o">);</span> <span class="kt">void</span> <span class="nf">delete</span><span class="o">(</span><span class="n">K</span> <span class="n">key</span><span class="o">);</span> - <span class="n">KeyValueIterator</span><span class="o"><</span><span class="n">K</span><span class="o">,</span><span class="n">V</span><span class="o">></span> <span class="nf">range</span><span class="o">(</span><span class="n">K</span> <span class="n">from</span><span class="o">,</span> <span class="n">K</span> <span class="n">to</span><span class="o">);</span> - <span class="n">KeyValueIterator</span><span class="o"><</span><span class="n">K</span><span class="o">,</span><span class="n">V</span><span class="o">></span> <span class="nf">all</span><span class="o">();</span> + <span class="n">KeyValueIterator</span><span class="o"><</span><span class="n">K</span><span class="o">,</span><span class="n">V</span><span class="o">></span> <span class="n">range</span><span class="o">(</span><span class="n">K</span> <span class="n">from</span><span class="o">,</span> <span class="n">K</span> <span class="n">to</span><span class="o">);</span> + <span class="n">KeyValueIterator</span><span class="o"><</span><span class="n">K</span><span class="o">,</span><span class="n">V</span><span class="o">></span> <span class="n">all</span><span class="o">();</span> <span class="o">}</span></code></pre></div> <p>Additional configuration properties for the key-value store are documented in the <a 
href="../jobs/configuration-table.html#keyvalue">configuration reference</a>.</p> Modified: incubator/samza/site/learn/documentation/0.7.0/container/streams.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/streams.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/streams.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/streams.html Thu Jul 24 05:05:00 2014 @@ -125,7 +125,7 @@ <p>The <a href="samza-container.html">samza container</a> reads and writes messages using the <a href="../api/javadocs/org/apache/samza/system/SystemConsumer.html">SystemConsumer</a> and <a href="../api/javadocs/org/apache/samza/system/SystemProducer.html">SystemProducer</a> interfaces. You can integrate any message broker with Samza by implementing these two interfaces.</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">SystemConsumer</span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">SystemConsumer</span> <span class="o">{</span> <span class="kt">void</span> <span class="nf">start</span><span class="o">();</span> <span class="kt">void</span> <span class="nf">stop</span><span class="o">();</span> @@ -185,7 +185,7 @@ <p>To plug in your own message chooser, you need to implement the <a href="../api/javadocs/org/apache/samza/system/chooser/MessageChooserFactory.html">MessageChooserFactory</a> interface, and set the “task.chooser.class” configuration to the fully-qualified class name of your implementation:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">task.chooser.class</span><span class="o">=</span><span 
class="s">com.example.samza.YourMessageChooserFactory</span></code></pre></div> +<div class="highlight"><pre><code class="jproperties"><span class="na">task.chooser.class</span><span class="o">=</span><span class="s">com.example.samza.YourMessageChooserFactory</span></code></pre></div> <h4 id="prioritizing-input-streams">Prioritizing input streams</h4> @@ -193,7 +193,7 @@ <p>Samza provides a mechanism to prioritize one stream over another by setting this configuration parameter: systems.<system>.streams.<stream>.samza.priority=<number>. For example:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">systems.kafka.streams.my-real-time-stream.samza.priority</span><span class="o">=</span><span class="s">2</span> +<div class="highlight"><pre><code class="jproperties"><span class="na">systems.kafka.streams.my-real-time-stream.samza.priority</span><span class="o">=</span><span class="s">2</span> <span class="na">systems.kafka.streams.my-batch-stream.samza.priority</span><span class="o">=</span><span class="s">1</span></code></pre></div> <p>This declares that my-real-time-stream’s messages should be processed with higher priority than my-batch-stream’s messages. If my-real-time-stream has any messages available, they are processed first. 
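As a standalone illustration of this priority rule (plain Java collections, not Samza's actual chooser implementation; the stream names are the hypothetical ones from the config above), the "always pick the non-empty stream with the highest priority" behavior can be sketched like this:

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class Main {
    public static void main(String[] args) {
        // Two hypothetical streams with the priorities configured above:
        // a higher number means a higher priority.
        Map<String, Integer> priority = Map.of(
            "my-real-time-stream", 2,
            "my-batch-stream", 1);

        Map<String, Queue<String>> pending = new LinkedHashMap<>();
        pending.put("my-real-time-stream", new ArrayDeque<>(List.of("rt-1", "rt-2")));
        pending.put("my-batch-stream", new ArrayDeque<>(List.of("batch-1")));

        // Repeatedly pick the next message from the non-empty stream
        // with the highest priority.
        StringBuilder order = new StringBuilder();
        while (pending.values().stream().anyMatch(q -> !q.isEmpty())) {
            String chosen = pending.entrySet().stream()
                .filter(e -> !e.getValue().isEmpty())
                .max(Comparator.comparingInt(e -> priority.get(e.getKey())))
                .get().getKey();
            order.append(pending.get(chosen).poll()).append(' ');
        }
        // Real-time messages drain first; the batch message is only
        // processed once no real-time messages are waiting.
        System.out.println(order.toString().trim());
    }
}
```

Running this prints the real-time messages before the batch message, which is exactly the ordering the priority configuration is meant to produce.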
Only if there are no messages currently waiting on my-real-time-stream does the Samza job continue processing my-batch-stream.</p> @@ -212,7 +212,7 @@ <p>To configure a stream called “my-bootstrap-stream” to be a fully-consumed bootstrap stream, use the following settings:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">systems.kafka.streams.my-bootstrap-stream.samza.bootstrap</span><span class="o">=</span><span class="s">true</span> +<div class="highlight"><pre><code class="jproperties"><span class="na">systems.kafka.streams.my-bootstrap-stream.samza.bootstrap</span><span class="o">=</span><span class="s">true</span> <span class="na">systems.kafka.streams.my-bootstrap-stream.samza.reset.offset</span><span class="o">=</span><span class="s">true</span> <span class="na">systems.kafka.streams.my-bootstrap-stream.samza.offset.default</span><span class="o">=</span><span class="s">oldest</span></code></pre></div> @@ -226,7 +226,7 @@ <p>For example, if you want to read 100 messages in a row from each stream partition (regardless of the MessageChooser), you can use this configuration parameter:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">task.consumer.batch.size</span><span class="o">=</span><span class="s">100</span></code></pre></div> +<div class="highlight"><pre><code class="jproperties"><span class="na">task.consumer.batch.size</span><span class="o">=</span><span class="s">100</span></code></pre></div> <p>With this setting, Samza tries to read a message from the most recently used <a href="../api/javadocs/org/apache/samza/system/SystemStreamPartition.html">SystemStreamPartition</a>. This behavior continues either until no more messages are available for that SystemStreamPartition, or until the batch size has been reached. When that happens, Samza defers to the MessageChooser to determine the next message to process.
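A toy model of this batching loop (plain Java, not Samza's implementation; the round-robin "chooser" here is just a stand-in for whatever MessageChooser is configured):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class Main {
    public static void main(String[] args) {
        final int batchSize = 2; // stand-in for task.consumer.batch.size
        // Two hypothetical partitions with pending messages.
        List<Queue<String>> partitions = new ArrayList<>();
        partitions.add(new ArrayDeque<>(List.of("p0-a", "p0-b", "p0-c")));
        partitions.add(new ArrayDeque<>(List.of("p1-a", "p1-b")));

        List<String> processed = new ArrayList<>();
        int current = 0; // most recently used partition
        int inBatch = 0; // messages taken from it in the current batch
        while (partitions.stream().anyMatch(q -> !q.isEmpty())) {
            if (inBatch >= batchSize || partitions.get(current).isEmpty()) {
                // Batch exhausted (or partition drained): defer to the
                // "chooser", here simply the next non-empty partition
                // in round-robin order.
                do {
                    current = (current + 1) % partitions.size();
                } while (partitions.get(current).isEmpty());
                inBatch = 0;
            }
            processed.add(partitions.get(current).poll());
            inBatch++;
        }
        // Messages come out in runs of up to batchSize per partition.
        System.out.println(String.join(" ", processed));
    }
}
```

The output shows runs of two messages from each partition before the chooser is consulted again, mirroring the behavior described for task.consumer.batch.size.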
It then tries to continue consuming from the chosen message’s SystemStreamPartition until the batch size is reached.</p> Modified: incubator/samza/site/learn/documentation/0.7.0/container/windowing.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/container/windowing.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/container/windowing.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/container/windowing.html Thu Jul 24 05:05:00 2014 @@ -127,14 +127,14 @@ <p>Samza’s <em>windowing</em> feature provides a way for tasks to do something at regular time intervals, for example once per minute. To enable windowing, you just need to set one property in your job configuration:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Call the window() method every 60 seconds</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Call the window() method every 60 seconds</span> <span class="na">task.window.ms</span><span class="o">=</span><span class="s">60000</span></code></pre></div> <p>Next, your stream task needs to implement the <a href="../api/javadocs/org/apache/samza/task/WindowableTask.html">WindowableTask</a> interface.
This interface defines a window() method which is called by Samza in the regular interval that you configured.</p> <p>For example, this is how you would implement a basic per-minute event counter:</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">EventCounterTask</span> <span class="kd">implements</span> <span class="n">StreamTask</span><span class="o">,</span> <span class="n">WindowableTask</span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">EventCounterTask</span> <span class="kd">implements</span> <span class="n">StreamTask</span><span class="o">,</span> <span class="n">WindowableTask</span> <span class="o">{</span> <span class="kd">public</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">SystemStream</span> <span class="n">OUTPUT_STREAM</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SystemStream</span><span class="o">(</span><span class="s">"kafka"</span><span class="o">,</span> <span class="s">"events-per-minute"</span><span class="o">);</span> @@ -149,7 +149,7 @@ <span class="kd">public</span> <span class="kt">void</span> <span class="nf">window</span><span class="o">(</span><span class="n">MessageCollector</span> <span class="n">collector</span><span class="o">,</span> <span class="n">TaskCoordinator</span> <span class="n">coordinator</span><span class="o">)</span> <span class="o">{</span> - <span class="n">collector</span><span class="o">.</span><span class="na">send</span><span class="o">(</span><span class="k">new</span> <span class="nf">OutgoingMessageEnvelope</span><span class="o">(</span><span class="n">OUTPUT_STREAM</span><span class="o">,</span> <span class="n">eventsSeen</span><span class="o">));</span> + <span class="n">collector</span><span class="o">.</span><span 
class="na">send</span><span class="o">(</span><span class="k">new</span> <span class="n">OutgoingMessageEnvelope</span><span class="o">(</span><span class="n">OUTPUT_STREAM</span><span class="o">,</span> <span class="n">eventsSeen</span><span class="o">));</span> <span class="n">eventsSeen</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="o">}</span> <span class="o">}</span></code></pre></div> Modified: incubator/samza/site/learn/documentation/0.7.0/index.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/index.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/index.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/index.html Thu Jul 24 05:05:00 2014 @@ -137,6 +137,7 @@ <li><a href="comparisons/introduction.html">Introduction</a></li> <li><a href="comparisons/mupd8.html">MUPD8</a></li> <li><a href="comparisons/storm.html">Storm</a></li> + <li><a href="comparisons/spark-streaming.html">Spark Streaming</a></li> <!-- TODO comparisons pages <li><a href="comparisons/aurora.html">Aurora</a></li> <li><a href="comparisons/jms.html">JMS</a></li> Modified: incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/introduction/architecture.html Thu Jul 24 05:05:00 2014 @@ -203,7 +203,7 @@ <p>Let’s take a look at a real example: suppose we want to count the number of page views. 
In SQL, you would write something like:</p> -<div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">PageViewEvent</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">user_id</span></code></pre></div> +<div class="highlight"><pre><code class="sql"><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">PageViewEvent</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">user_id</span></code></pre></div> <p>Although Samza doesn’t support SQL right now, the idea is the same. Two jobs are required to calculate this query: one to group messages by user ID, and the other to do the counting.</p> Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/configuration-table.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/configuration-table.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/jobs/configuration-table.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/jobs/configuration-table.html Thu Jul 24 05:05:00 2014 @@ -357,6 +357,26 @@ </tr> <tr> + <td class="property" id="task-drop-deserialization-errors">task.drop.deserialization.errors</td> + <td class="default"></td> + <td class="description"> + This property defines how the system handles a deserialization failure. If set to true, the system will + skip the error messages and keep running.
If set to false, the system will throw exceptions and fail the container. Default + is false. + </td> + </tr> + + <tr> + <td class="property" id="task-drop-serialization-errors">task.drop.serialization.errors</td> + <td class="default"></td> + <td class="description"> + This property defines how the system handles a serialization failure. If set to true, the system will + drop the error messages and keep running. If set to false, the system will throw exceptions and fail the container. Default + is false. + </td> + </tr> + + <tr> <th colspan="3" class="section" id="streams"><a href="../container/streams.html">Systems (input and output streams)</a></th> </tr> Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/jobs/configuration.html Thu Jul 24 05:05:00 2014 @@ -125,7 +125,7 @@ <p>All Samza jobs have a configuration file that defines the job.
A very basic configuration file looks like this:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="c"># Job</span> +<div class="highlight"><pre><code class="jproperties"><span class="c"># Job</span> <span class="na">job.factory.class</span><span class="o">=</span><span class="s">samza.job.local.LocalJobFactory</span> <span class="na">job.name</span><span class="o">=</span><span class="s">hello-world</span> Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/jobs/job-runner.html Thu Jul 24 05:05:00 2014 @@ -125,13 +125,13 @@ <p>Samza jobs are started using a script called run-job.sh.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">samza-example/target/bin/run-job.sh <span class="se">\</span> +<div class="highlight"><pre><code class="bash">samza-example/target/bin/run-job.sh <span class="se">\</span> --config-factory<span class="o">=</span>samza.config.factories.PropertiesConfigFactory <span class="se">\</span> --config-path<span class="o">=</span>file://<span class="nv">$PWD</span>/config/hello-world.properties</code></pre></div> <p>You provide two parameters to the run-job.sh script. One is the config location, and the other is a factory class that is used to read your configuration file. The run-job.sh script is actually executing a Samza class called JobRunner. 
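For reference, the two error-handling properties added in the configuration-table change earlier (task.drop.deserialization.errors and task.drop.serialization.errors) would be set in a job's properties file in the same style as the basic configuration shown above. Only the property names come from the table rows; the values here are illustrative:

```properties
# Keep the container running past bad messages instead of failing
# the container (both properties default to false).
task.drop.deserialization.errors=true
task.drop.serialization.errors=true
```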
The JobRunner uses your ConfigFactory to get a Config object from the config path.</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">ConfigFactory</span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">ConfigFactory</span> <span class="o">{</span> <span class="n">Config</span> <span class="nf">getConfig</span><span class="o">(</span><span class="n">URI</span> <span class="n">configUri</span><span class="o">);</span> <span class="o">}</span></code></pre></div> @@ -139,7 +139,7 @@ <p>Once the JobRunner gets your configuration, it gives your configuration to the StreamJobFactory class defined by the “job.factory” property. Samza ships with two job factory implementations: LocalJobFactory and YarnJobFactory. The StreamJobFactory’s responsibility is to give the JobRunner a job that it can run.</p> -<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">StreamJob</span> <span class="o">{</span> +<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">StreamJob</span> <span class="o">{</span> <span class="n">StreamJob</span> <span class="nf">submit</span><span class="o">();</span> <span class="n">StreamJob</span> <span class="nf">kill</span><span class="o">();</span> Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/logging.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/logging.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/jobs/logging.html (original) +++ 
incubator/samza/site/learn/documentation/0.7.0/jobs/logging.html Thu Jul 24 05:05:00 2014 @@ -129,7 +129,7 @@ <p>The <a href="/startup/hello-samza/0.7.0">hello-samza</a> project shows how to use <a href="http://logging.apache.org/log4j/1.2/">log4j</a> with Samza. To turn on log4j logging, you just need to make sure slf4j-log4j12 is in your SamzaContainer’s classpath. In Maven, this can be done by adding the following dependency to your Samza package project.</p> -<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><dependency></span> +<div class="highlight"><pre><code class="xml"><span class="nt"><dependency></span> <span class="nt"><groupId></span>org.slf4j<span class="nt"></groupId></span> <span class="nt"><artifactId></span>slf4j-log4j12<span class="nt"></artifactId></span> <span class="nt"><scope></span>runtime<span class="nt"></scope></span> @@ -142,15 +142,15 @@ <p>Samza’s <a href="packaging.html">run-class.sh</a> script will automatically set the following setting if log4j.xml exists in your <a href="packaging.html">Samza package’s</a> lib directory.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">-Dlog4j.configuration<span class="o">=</span>file:<span class="nv">$base_dir</span>/lib/log4j.xml</code></pre></div> +<div class="highlight"><pre><code class="bash">-Dlog4j.configuration<span class="o">=</span>file:<span class="nv">$base_dir</span>/lib/log4j.xml</code></pre></div> <p>The <a href="packaging.html">run-class.sh</a> script will also set the following Java system properties:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">-Dsamza.log.dir<span class="o">=</span><span class="nv">$SAMZA_LOG_DIR</span> -Dsamza.container.name<span class="o">=</span><span class="nv">$SAMZA_CONTAINER_NAME</span><span class="o">=</span></code></pre></div> +<div class="highlight"><pre><code class="bash">-Dsamza.log.dir<span class="o">=</span><span class="nv">$SAMZA_LOG_DIR</span> 
-Dsamza.container.name<span class="o">=</span><span class="nv">$SAMZA_CONTAINER_NAME</span><span class="o">=</span></code></pre></div> <p>These settings are very useful if you’re using a file-based appender. For example, you can use a daily rolling appender by configuring log4j.xml like this:</p> -<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><appender</span> <span class="na">name=</span><span class="s">"RollingAppender"</span> <span class="na">class=</span><span class="s">"org.apache.log4j.DailyRollingFileAppender"</span><span class="nt">></span> +<div class="highlight"><pre><code class="xml"><span class="nt"><appender</span> <span class="na">name=</span><span class="s">"RollingAppender"</span> <span class="na">class=</span><span class="s">"org.apache.log4j.DailyRollingFileAppender"</span><span class="nt">></span> <span class="nt"><param</span> <span class="na">name=</span><span class="s">"File"</span> <span class="na">value=</span><span class="s">"${samza.log.dir}/${samza.container.name}.log"</span> <span class="nt">/></span> <span class="nt"><param</span> <span class="na">name=</span><span class="s">"DatePattern"</span> <span class="na">value=</span><span class="s">"'.'yyyy-MM-dd"</span> <span class="nt">/></span> <span class="nt"><layout</span> <span class="na">class=</span><span class="s">"org.apache.log4j.PatternLayout"</span><span class="nt">></span> @@ -170,7 +170,7 @@ <p>Samza will automatically set the following garbage collection logging setting, and will output it to <code>$SAMZA_LOG_DIR/gc.log</code>.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">-XX:+PrintGCDateStamps -Xloggc:<span class="nv">$SAMZA_LOG_DIR</span>/gc.log</code></pre></div> +<div class="highlight"><pre><code class="bash">-XX:+PrintGCDateStamps -Xloggc:<span class="nv">$SAMZA_LOG_DIR</span>/gc.log</code></pre></div> <h4 id="rotation">Rotation</h4> Modified:
incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html Thu Jul 24 05:05:00 2014 @@ -106,6 +106,23 @@ <h2>Reprocessing previously processed data</h2> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + <p>From time to time you may want to deploy a new version of your Samza job that computes results differently. Perhaps you fixed a bug or introduced a new feature. For example, say you have a Samza job that classifies messages as spam or not-spam, using a machine learning model that you train offline. Periodically you want to deploy an updated version of your Samza job which includes the latest classification model.</p> <p>When you start up a new version of your job, a question arises: what do you want to do with messages that were previously processed with the old version of your job? 
The answer depends on the behavior you want:</p> Modified: incubator/samza/site/learn/documentation/0.7.0/jobs/yarn-jobs.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/yarn-jobs.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/jobs/yarn-jobs.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/jobs/yarn-jobs.html Thu Jul 24 05:05:00 2014 @@ -127,7 +127,7 @@ <p>If you want to use YARN to run your Samza job, you’ll also need to define the location of your Samza job’s package. For example, you might say:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">yarn.package.path</span><span class="o">=</span><span class="s">http://my.http.server/jobs/ingraphs-package-0.0.55.tgz</span></code></pre></div> +<div class="highlight"><pre><code class="jproperties"><span class="na">yarn.package.path</span><span class="o">=</span><span class="s">http://my.http.server/jobs/ingraphs-package-0.0.55.tgz</span></code></pre></div> <p>This .tgz file follows the conventions outlined on the <a href="packaging.html">Packaging</a> page (it has bin/run-am.sh and bin/run-container.sh). YARN NodeManagers will take responsibility for downloading this .tgz file on the appropriate machines, and untarring it.
From there, YARN will execute run-am.sh or run-container.sh for the Samza Application Master and SamzaContainer, respectively.</p> Modified: incubator/samza/site/learn/documentation/0.7.0/operations/kafka.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/operations/kafka.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/documentation/0.7.0/operations/kafka.html (original) +++ incubator/samza/site/learn/documentation/0.7.0/operations/kafka.html Thu Jul 24 05:05:00 2014 @@ -133,7 +133,7 @@ <p>Kafka brokers should be configured to automatically create topics. Without this, it’s going to be very cumbersome to run Samza jobs, since jobs will write to arbitrary (and sometimes new) topics.</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">auto.create.topics.enable</span><span class="o">=</span><span class="s">true</span></code></pre></div> +<div class="highlight"><pre><code class="jproperties"><span class="na">auto.create.topics.enable</span><span class="o">=</span><span class="s">true</span></code></pre></div> </div> Modified: incubator/samza/site/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html (original) +++ incubator/samza/site/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html Thu Jul 24 05:05:00 2014 @@ -125,49 +125,19 @@ <p>This tutorial uses <a href="../../../startup/hello-samza/0.7.0/">hello-samza</a> to illustrate how to run a Samza job if you want to publish the Samza job’s .tar.gz package to HDFS.</p> -<h3 id="build-a-new-samza-job-package">Build
a new Samza job package</h3> - -<p>Build a new Samza job package to include the hadoop-hdfs-version.jar.</p> - -<ul> -<li>Add dependency statement in pom.xml of samza-job-package</li> -</ul> - -<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><dependency></span> - <span class="nt"><groupId></span>org.apache.hadoop<span class="nt"></groupId></span> - <span class="nt"><artifactId></span>hadoop-hdfs<span class="nt"></artifactId></span> - <span class="nt"><version></span>2.2.0<span class="nt"></version></span> -<span class="nt"></dependency></span></code></pre></div> - -<ul> -<li>Add the following code to src/main/assembly/src.xml in samza-job-package.</li> -</ul> - -<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><include></span>org.apache.hadoop:hadoop-hdfs<span class="nt"></include></span></code></pre></div> - -<ul> -<li>Create .tar.gz package</li> -</ul> - -<div class="highlight"><pre><code class="language-bash" data-lang="bash">mvn clean pacakge</code></pre></div> - -<ul> -<li>Make sure hadoop-common-version.jar has the same version as your hadoop-hdfs-version.jar. Otherwise, you may still have errors.</li> -</ul> - <h3 id="upload-the-package">Upload the package</h3> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop fs -put ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz /path/for/tgz</code></pre></div> +<div class="highlight"><pre><code class="bash">hadoop fs -put ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz /path/for/tgz</code></pre></div> <h3 id="add-hdfs-configuration">Add HDFS configuration</h3> -<p>Put the hdfs-site.xml file of your cluster into ~/.samza/conf directory. (The same place as the yarn-site.xml)</p> +<p>Put the hdfs-site.xml file of your cluster into ~/.samza/conf directory (The same place as the yarn-site.xml). 
If you have set HADOOP_CONF_DIR, also put hdfs-site.xml in that configuration directory if it is not already there.</p> <h3 id="change-properties-file">Change properties file</h3> <p>Change the yarn.package.path in the properties file to your HDFS location.</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">yarn.package.path</span><span class="o">=</span><span class="s">hdfs://<hdfs name node ip>:<hdfs name node port>/path/to/tgz</span></code></pre></div> +<div class="highlight"><pre><code class="jproperties"><span class="na">yarn.package.path</span><span class="o">=</span><span class="s">hdfs://<hdfs name node ip>:<hdfs name node port>/path/to/tgz</span></code></pre></div> <p>Then you should be able to run the Samza job as described in <a href="../../../startup/hello-samza/0.7.0/">hello-samza</a>.</p> Modified: incubator/samza/site/learn/tutorials/0.7.0/remote-debugging-samza.html URL: http://svn.apache.org/viewvc/incubator/samza/site/learn/tutorials/0.7.0/remote-debugging-samza.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/tutorials/0.7.0/remote-debugging-samza.html (original) +++ incubator/samza/site/learn/tutorials/0.7.0/remote-debugging-samza.html Thu Jul 24 05:05:00 2014 @@ -129,22 +129,22 @@ <p>Start by checking out Samza, so we have access to the source.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git</code></pre></div> +<div class="highlight"><pre><code class="bash">git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git</code></pre></div> <p>Next, grab hello-samza.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">git clone git://git.apache.org/incubator-samza-hello-samza.git</code></pre></div> +<div class="highlight"><pre><code class="bash">git
clone git://git.apache.org/incubator-samza-hello-samza.git</code></pre></div> <h3 id="setup-the-environment">Setup the Environment</h3> <p>Now, let’s setup the Eclipse project files.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">cd </span>incubator-samza +<div class="highlight"><pre><code class="bash"><span class="nb">cd </span>incubator-samza ./gradlew eclipse</code></pre></div> <p>Let’s also release Samza to Maven’s local repository, so hello-samza has access to the JARs that it needs.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">./gradlew -PscalaVersion<span class="o">=</span>2.9.2 clean publishToMavenLocal</code></pre></div> +<div class="highlight"><pre><code class="bash">./gradlew -PscalaVersion<span class="o">=</span>2.9.2 clean publishToMavenLocal</code></pre></div> <p>Next, open Eclipse, and import the Samza source code into your workspace: “File” > “Import” > “Existing Projects into Workspace” > “Browse”. Select ‘incubator-samza’ folder, and hit ‘finish’.</p> @@ -152,7 +152,7 @@ <p>Now, go back to the hello-samza project, and edit ./samza-job-package/src/main/config/wikipedia-feed.properties to add the following line:</p> -<div class="highlight"><pre><code class="language-jproperties" data-lang="jproperties"><span class="na">task.opts</span><span class="o">=</span><span class="s">-agentlib:jdwp=transport=dt_socket,address=localhost:9009,server=y,suspend=y</span></code></pre></div> +<div class="highlight"><pre><code class="jproperties"><span class="na">task.opts</span><span class="o">=</span><span class="s">-agentlib:jdwp=transport=dt_socket,address=localhost:9009,server=y,suspend=y</span></code></pre></div> <p>The <a href="../../documentation/0.7.0/jobs/configuration-table.html">task.opts</a> configuration parameter is a way to override Java parameters at runtime for your Samza containers. 
In this example, we’re setting the agentlib parameter to enable remote debugging on localhost, port 9009. In a more realistic environment, you might also set Java heap settings (-Xmx, -Xms, etc.), as well as garbage collection and logging settings.</p> @@ -162,15 +162,15 @@ <p>Now that the Samza job has been set up to enable remote debugging when a Samza container starts, let’s start ZooKeeper, Kafka, and YARN.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">bin/grid</code></pre></div> +<div class="highlight"><pre><code class="bash">bin/grid</code></pre></div> <p>If you get a complaint that JAVA_HOME is not set, then you’ll need to set it. This can be done on OSX by running:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="k">$(</span>/usr/libexec/java_home<span class="k">)</span></code></pre></div> +<div class="highlight"><pre><code class="bash"><span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="k">$(</span>/usr/libexec/java_home<span class="k">)</span></code></pre></div> <p>Once the grid starts, you can start the wikipedia-feed Samza job.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">mvn clean package +<div class="highlight"><pre><code class="bash">mvn clean package mkdir -p deploy/samza tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza deploy/samza/bin/run-job.sh --config-factory<span class="o">=</span>org.apache.samza.config.factories.PropertiesConfigFactory --config-path<span class="o">=</span>file://<span class="nv">$PWD</span>/deploy/samza/config/wikipedia-feed.properties</code></pre></div> Modified: incubator/samza/site/learn/tutorials/0.7.0/run-hello-samza-without-internet.html URL:
http://svn.apache.org/viewvc/incubator/samza/site/learn/tutorials/0.7.0/run-hello-samza-without-internet.html?rev=1612998&r1=1612997&r2=1612998&view=diff ============================================================================== --- incubator/samza/site/learn/tutorials/0.7.0/run-hello-samza-without-internet.html (original) +++ incubator/samza/site/learn/tutorials/0.7.0/run-hello-samza-without-internet.html Thu Jul 24 05:05:00 2014 @@ -129,7 +129,7 @@ <p>Check that you can reach irc.wikimedia.org. Sometimes the firewall in your company blocks this service.</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">telnet irc.wikimedia.org 6667</code></pre></div> +<div class="highlight"><pre><code class="bash">telnet irc.wikimedia.org 6667</code></pre></div> <p>You should see something like this:</p> <div class="highlight"><pre><code class="language-text" data-lang="text">Trying 208.80.152.178... @@ -146,15 +146,15 @@ NOTICE AUTH :*** Found your hostname <p>We provide an alternative way to get Wikipedia feed data.
Instead of running</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">deploy/samza/bin/run-job.sh --config-factory<span class="o">=</span>org.apache.samza.config.factories.PropertiesConfigFactory --config-path<span class="o">=</span>file://<span class="nv">$PWD</span>/deploy/samza/config/wikipedia-feed.properties</code></pre></div> +<div class="highlight"><pre><code class="bash">deploy/samza/bin/run-job.sh --config-factory<span class="o">=</span>org.apache.samza.config.factories.PropertiesConfigFactory --config-path<span class="o">=</span>file://<span class="nv">$PWD</span>/deploy/samza/config/wikipedia-feed.properties</code></pre></div> <p>You will run</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">bin/produce-wikipedia-raw-data.sh</code></pre></div> +<div class="highlight"><pre><code class="bash">bin/produce-wikipedia-raw-data.sh</code></pre></div> <p>This script will read Wikipedia feed data from a local file and produce it to the Kafka broker. By default, it produces to localhost:9092 as the Kafka broker and uses localhost:2181 as the ZooKeeper address.
You can override them:</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">bin/produce-wikipedia-raw-data.sh -b yourKafkaBrokerAddress -z yourZookeeperAddress</code></pre></div> +<div class="highlight"><pre><code class="bash">bin/produce-wikipedia-raw-data.sh -b yourKafkaBrokerAddress -z yourZookeeperAddress</code></pre></div> <p>Now you can go back to the Generate Wikipedia Statistics section in <a href="../../../startup/hello-samza/0.7.0/">Hello Samza</a> and follow the remaining steps.</p> @@ -162,7 +162,7 @@ NOTICE AUTH :*** Found your hostname <p>The goal of</p> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">deploy/samza/bin/run-job.sh --config-factory<span class="o">=</span>org.apache.samza.config.factories.PropertiesConfigFactory --config-path<span class="o">=</span>file://<span class="nv">$PWD</span>/deploy/samza/config/wikipedia-feed.properties</code></pre></div> +<div class="highlight"><pre><code class="bash">deploy/samza/bin/run-job.sh --config-factory<span class="o">=</span>org.apache.samza.config.factories.PropertiesConfigFactory --config-path<span class="o">=</span>file://<span class="nv">$PWD</span>/deploy/samza/config/wikipedia-feed.properties</code></pre></div> <p>is to deploy a Samza job that listens to the Wikipedia API, receives the feed in real time, and produces it to the Kafka topic wikipedia-raw. The alternative in this tutorial reads a local Wikipedia feed file in an infinite loop and produces the data to the same topic. The follow-up job, wikipedia-parser, gets its data from the Kafka topic wikipedia-raw, so as long as that topic contains correct data, the rest of the pipeline works. All Samza jobs are connected through Kafka and do not depend on each other directly.</p>
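The replay approach described above (read a local feed file, produce each line to Kafka, repeat forever) can be sketched as a short shell snippet. This is an illustrative sketch only, not the contents of the real bin/produce-wikipedia-raw-data.sh; the `replay_once` helper and the `/tmp/feed.txt` path are hypothetical placeholders, and the actual Kafka producer call is reduced to a comment:

```shell
# Sketch of the replay loop, assuming a hypothetical feed file.
# replay_once prints each line of the given feed file; the real script
# would pipe these lines to a Kafka producer for the wikipedia-raw topic,
# and wrap the call in `while true; do ...; done` so the feed repeats.
replay_once() {
  while IFS= read -r line; do
    printf '%s\n' "$line"   # here: send $line to the Kafka producer
  done < "$1"
}

# Demonstrate with a tiny stand-in feed file.
printf 'event1\nevent2\n' > /tmp/feed.txt
replay_once /tmp/feed.txt
```

Because the downstream wikipedia-parser job only consumes from the wikipedia-raw topic, any producer that fills that topic with well-formed events, looped replay included, keeps the rest of the pipeline working.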
