Repository: samza
Updated Branches:
  refs/heads/0.14.0 c7150cf2d -> ef506a3c5


SAMZA-1536; Adding docs for Kinesis consumer

Author: Aditya Toomula <atoom...@atoomula-ld1.linkedin.biz>

Reviewers: Jagadish <jagad...@apache.org>

Closes #384 from atoomula/kinesis-docs

(cherry picked from commit ed8dad54bf04c3f81c2d686786722333325ca432)
Signed-off-by: Jagadish <jvenkatra...@linkedin.com>


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/ef506a3c
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/ef506a3c
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/ef506a3c

Branch: refs/heads/0.14.0
Commit: ef506a3c5efcbf1126c377bea4d3325a976d292f
Parents: c7150cf
Author: Aditya Toomula <atoom...@atoomula-ld1.linkedin.biz>
Authored: Thu Dec 14 15:33:29 2017 -0800
Committer: Jagadish <jvenkatra...@linkedin.com>
Committed: Thu Dec 14 15:33:46 2017 -0800

----------------------------------------------------------------------
 .../documentation/versioned/aws/kinesis.md      | 104 +++++++++++++++++++
 docs/learn/documentation/versioned/index.html   |   6 ++
 .../versioned/jobs/configuration-table.html     |  45 +++++++-
 3 files changed, 153 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/ef506a3c/docs/learn/documentation/versioned/aws/kinesis.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/aws/kinesis.md 
b/docs/learn/documentation/versioned/aws/kinesis.md
new file mode 100644
index 0000000..a4be3dd
--- /dev/null
+++ b/docs/learn/documentation/versioned/aws/kinesis.md
@@ -0,0 +1,104 @@
+---
+layout: page
+title: Connecting to Kinesis
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+You can configure your Samza jobs to process data from [AWS 
Kinesis](https://aws.amazon.com/kinesis/data-streams), Amazon's data streaming 
service. A `Kinesis data stream` is similar to a Kafka topic and can have 
multiple partitions. Each message consumed from a Kinesis data stream is an 
instance of 
[Record](http://docs.aws.amazon.com/goto/WebAPI/kinesis-2013-12-02/Record).
+
+### Consuming from Kinesis:
+
+Samza's 
[KinesisSystemConsumer](https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisSystemConsumer.java)
 wraps the Record into a 
[KinesisIncomingMessageEnvelope](https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisIncomingMessageEnvelope.java).
 The key of the message is set to partition key of the Record. The message is 
obtained from the Record body.
+
+To configure Samza to consume from Kinesis streams:
+
+{% highlight jproperties %}
+# define a kinesis system factory with your identifier. eg: kinesis-system
+systems.kinesis-system.samza.factory=org.apache.samza.system.eventhub.KinesisSystemFactory
+
+# kinesis system consumer works with only AllSspToSingleTaskGrouperFactory
+job.systemstreampartition.grouper.factory=org.apache.samza.container.grouper.stream.AllSspToSingleTaskGrouperFactory
+
+# define your streams
+task.inputs=kinesis-system.input0
+
+# define required properties for your streams
+systems.kinesis-system.streams.input0.aws.region=YOUR-STREAM-REGION
+systems.kinesis-system.streams.input0.aws.accessKey=YOUR-ACCESS_KEY
+sensitive.systems.kinesis-system.streams.input0.aws.secretKey=YOUR-SECRET-KEY
+{% endhighlight %}
+
+The tuple required to access the Kinesis data stream must be provided, namely 
the fields `YOUR-STREAM-REGION`, `YOUR-ACCESS-KEY`, `YOUR-SECRET-KEY`.
+
+#### Advanced Configuration:
+
+##### AWS Client Configs:
+
+You can configure any [AWS client 
config](http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html)
 with the prefix `system.system-name.aws.clientConfig.*`
+{% highlight jproperties %}
+system.system-name.aws.clientConfig.CONFIG-NAME=CONFIG-VALUE
+{% endhighlight %}
+
+As an example, to set a proxy host and proxy port for the AWS Client:
+{% highlight jproperties %}
+systems.system-name.aws.clientConfig.ProxyHost=my-proxy-host.com
+systems.system-name.aws.clientConfig.ProxyPort=my-proxy-port
+{% endhighlight %}
+
+##### KCL Configs:
+
+Similarly, you can set any [Kinesis Client Library 
config](https://github.com/awslabs/amazon-kinesis-client/blob/master/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/KinesisClientLibConfiguration.java)
 for a stream by configuring it under 
`systems.system-name.streams.stream-name.aws.kcl.*`
+{% highlight jproperties %}
+systems.system-name.streams.stream-name.aws.kcl.CONFIG-NAME=CONFIG-VALUE
+{% endhighlight %}
+
+As an example, to reset the checkpoint and set the starting position for a 
stream:
+{% highlight jproperties %}
+systems.kinesis-system.streams.input0.aws.kcl.TableName=my-app-table-name
+# set the starting position to either TRIM_HORIZON (oldest) or LATEST (latest)
+systems.kinesis-system.streams.input0.aws.kcl.InitialPositionInStream=my-start-position
+{% endhighlight %}
+
+#### Limitations
+
+The following limitations apply for Samza jobs consuming from Kinesis streams 
using the Samza consumer:
+* Stateful processing (eg: windows or joins) is not supported on Kinesis 
streams. However, you can accomplish this by chaining two Samza jobs where the 
first job reads from Kinesis and sends to Kafka while the second job processes 
the data from Kafka.
+* Kinesis streams cannot be configured as 
[bootstrap](https://samza.apache.org/learn/documentation/latest/container/streams.html)
 or 
[broadcast](https://samza.apache.org/learn/documentation/latest/container/samza-container.html)
 streams.
+* Kinesis streams must be used with the 
[AllSspToSingleTaskGrouperFactory](https://github.com/apache/samza/blob/master/samza-core/src/main/java/org/apache/samza/container/grouper/stream/AllSspToSingleTaskGrouperFactory.java).
 No other grouper is supported.
+* A Samza job that consumes from Kinesis cannot consume from any other input 
source. However, you can send your results to any destination (eg: Kafka, 
EventHubs), and have another Samza job consume them.
+
+### Producing to Kinesis:
+
+The KinesisSystemProducer for Samza is not yet implemented.
+
+### How to configure Samza job to consume from Kinesis data stream ?
+
+This tutorial uses [hello 
samza](../../../startup/hello-samza/{{site.version}}/) to illustrate running a 
Samza job on Yarn that consumes from Kinesis. We will use the 
[KinesisHelloSamza](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/kinesis/KinesisHelloSamza.java)
 example.
+
+#### Update properties file
+
+Update the following properties in the kinesis-hello-samza.properties file:
+
+{% highlight jproperties %}
+task.inputs=kinesis.<kinesis-stream>
+systems.kinesis.streams.<kinesis-stream>.aws.region=<kinesis-stream-region>
+systems.kinesis.streams.<kinesis-stream>.aws.accessKey=<your-access-key>
+sensitive.systems.kinesis.streams.<kinesis-stream>.aws.region=<your-secret-key>
+{% endhighlight %}
+
+Now, you are ready to run your Samza application on Yarn as described 
[here](../../../startup/hello-samza/{{site.version}}/). Check the log file for 
messages read from your Kinesis stream.

http://git-wip-us.apache.org/repos/asf/samza/blob/ef506a3c/docs/learn/documentation/versioned/index.html
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/index.html 
b/docs/learn/documentation/versioned/index.html
index fa8e1e3..9b24c9c 100644
--- a/docs/learn/documentation/versioned/index.html
+++ b/docs/learn/documentation/versioned/index.html
@@ -100,6 +100,12 @@ title: Documentation
   <li><a href="azure/eventhubs.html">Eventhubs Consumer/Producer</a></li>
 </ul>
 
+<h4>AWS</h4>
+
+<ul class="documentation-list">
+  <li><a href="aws/kinesis.html">Kinesis Consumer/Producer</a></li>
+</ul>
+
 <h4>Operations</h4>
 
 <ul class="documentation-list">

http://git-wip-us.apache.org/repos/asf/samza/blob/ef506a3c/docs/learn/documentation/versioned/jobs/configuration-table.html
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/jobs/configuration-table.html 
b/docs/learn/documentation/versioned/jobs/configuration-table.html
index 6666bb1..49886ce 100644
--- a/docs/learn/documentation/versioned/jobs/configuration-table.html
+++ b/docs/learn/documentation/versioned/jobs/configuration-table.html
@@ -2170,7 +2170,7 @@
                 </tr>
 
                 <tr>
-                    <th colspan="3" class="section" 
id="hdfs-system-producer"><a href="../hadoop/producer.html">Writing to 
HDFS</a></th>
+                    <th colspan="3" class="section" 
id="hdfs-system-producer"><a href="../hdfs/producer.html">Writing to 
HDFS</a></th>
                 </tr>
 
                 <tr>
@@ -2210,7 +2210,7 @@
                 </tr>
 
                 <tr>
-                    <th colspan="3" class="section" 
id="hdfs-system-consumer"><a href="../hadoop/consumer.html">Reading from 
HDFS</a></th>
+                    <th colspan="3" class="section" 
id="hdfs-system-consumer"><a href="../hdfs/consumer.html">Reading from 
HDFS</a></th>
                 </tr>
 
                 <tr>
@@ -2335,6 +2335,47 @@
                         Consumer only config. Per partition capacity of the 
eventhubs consumer buffer - the blocking queue used for storing messages. 
Larger buffer capacity typically leads to better throughput but consumes more 
memory.
                     </td>
                 </tr>
+
+                <tr>
+                    <th colspan="3" class="section" id="kinesis">
+                        Using <a 
href="https://aws.amazon.com/kinesis/";>Kinesis</a> for input streams<br>
+                        <span class="subtitle">
+                            (This section applies if you have set
+                            <a href="#systems-samza-factory" 
class="property">systems.*.samza.factory</a>
+                            <code>= 
org.apache.samza.system.kinesis.KinesisSystemFactory</code>)
+                        </span>
+                    </th>
+                </tr>
+
+                <tr>
+                    <td class="property" 
id="kinesis-stream-region">systems.<span 
class="system">system-name</span>.<br>streams.<span 
class="stream">stream-name</span>.<br>aws.region</td>
+                    <td class="default"></td>
+                    <td class="description">Region of the associated <span 
class="stream">stream-name</span>. Required to access the Kinesis data 
stream.</td>
+                </tr>
+
+                <tr>
+                    <td class="property" 
id="kinesis-stream-access-key">systems.<span 
class="system">system-name</span>.<br>streams.<span 
class="stream">stream-name</span>.<br>aws.accessKey</td>
+                    <td class="default"></td>
+                    <td class="description">AccessKey of the associated <span 
class="stream">stream-name</span>. Required to access Kinesis data stream.</td>
+                </tr>
+
+                <tr>
+                    <td class="property" 
id="kinesis-stream-secret-key">systems.<span 
class="system">system-name</span>.<br>streams.<span 
class="stream">stream-name</span>.<br>aws.secretKey</td>
+                    <td class="default"></td>
+                    <td class="description">SecretKey of the associated <span 
class="stream">stream-name</span>. Required to access the Kinesis data 
stream.</td>
+                </tr>
+
+                <tr>
+                    <td class="property" 
id="kinesis-stream-aws-kcl-configs">systems.<span 
class="system">system-name</span>.<br>streams.<span 
class="stream">stream-name</span>.<br>aws.kcl.*</td>
+                    <td class="default"></td>
+                    <td class="description"><a 
href="https://github.com/awslabs/amazon-kinesis-client/blob/master/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/KinesisClientLibConfiguration.java";>AWS
 Kinesis Client Library configuration</a> associated with the <span 
class="stream">stream-name</span>.</td>
+                </tr>
+
+                <tr>
+                    <td class="property" 
id="kinesis-system-aws-client-configs">systems.<span 
class="system">system-name</span>.<br>aws.clientConfig.*</td>
+                    <td class="default"></td>
+                    <td class="description"><a 
href="http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html";>AWS
 ClientConfiguration</a> associated with the <span 
class="system">system-name</span>.</td>
+                </tr>
             </tbody>
         </table>
     </body>

Reply via email to