kafka-site git commit: Resync 0.10.1 for 0.10.1.0 release

jgus Wed, 19 Oct 2016 15:35:52 -0700

Repository: kafka-site
Updated Branches:
  refs/heads/asf-site 230d32e67 -> 9bad5a3b7



Resync 0.10.1 for 0.10.1.0 release


Project: http://git-wip-us.apache.org/repos/asf/kafka-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/kafka-site/commit/9bad5a3b
Tree: http://git-wip-us.apache.org/repos/asf/kafka-site/tree/9bad5a3b
Diff: http://git-wip-us.apache.org/repos/asf/kafka-site/diff/9bad5a3b

Branch: refs/heads/asf-site
Commit: 9bad5a3b71b30aec71a8bfddaebb379dccca18ad
Parents: 230d32e
Author: Jason Gustafson <ja...@confluent.io>
Authored: Wed Oct 19 15:35:07 2016 -0700
Committer: Jason Gustafson <ja...@confluent.io>
Committed: Wed Oct 19 15:35:07 2016 -0700

----------------------------------------------------------------------
 0101/ops.html        | 94 +++++++++++++++++++++++++++++++++++++++++++++++
 0101/quickstart.html |  2 -
 0101/streams.html    | 45 +++++++++++++++++++++--
 0101/upgrade.html    |  4 --
 4 files changed, 136 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kafka-site/blob/9bad5a3b/0101/ops.html
----------------------------------------------------------------------
diff --git a/0101/ops.html b/0101/ops.html
index ed0c153..c26f0cb 100644
--- a/0101/ops.html
+++ b/0101/ops.html
@@ -346,6 +346,100 @@ Topic:foo PartitionCount:1        ReplicationFactor:3     
Configs:
        Topic: foo      Partition: 0    Leader: 5       Replicas: 5,6,7 Isr: 
5,6,7
 </pre>
 
+<h4><a id="rep-throttle" href="#rep-throttle">Limiting Bandwidth Usage during 
Data Migration</a></h4>
+Kafka lets you apply a throttle to replication traffic, setting an upper bound 
on the bandwidth used to move replicas from machine to machine. This is useful 
when rebalancing a cluster, bootstrapping a new broker or adding or removing 
brokers, as it limits the impact these data-intensive operations will have on 
users.
+<p></p>
+There are two interfaces that can be used to engage a throttle. The simplest, 
and safest, is to apply a throttle when invoking the 
kafka-reassign-partitions.sh, but kafka-configs.sh can also be used to view and 
alter the throttle values directly.
+<p></p>
+So for example, if you were to execute a rebalance, with the below command, it 
would move partitions at no more than 50MB/s.
+<pre>$ bin/kafka-reassign-partitions.sh --zookeeper myhost:2181--execute 
--reassignment-json-file bigger-cluster.json âthrottle 50000000</pre>
+When you execute this script you will see the throttle engage:
+<pre>
+The throttle limit was set to 50000000 B/s
+Successfully started reassignment of partitions.</pre>
+<p>Should you wish to alter the throttle, during a rebalance, say to increase 
the throughput so it completes quicker, you can do this by re-running the 
execute command passing the same reassignment-json-file:</p>
+<pre>$ bin/kafka-reassign-partitions.sh --zookeeper localhost:2181  --execute 
--reassignment-json-file bigger-cluster.json --throttle 700000000
+There is an existing assignment running.
+The throttle limit was set to 700000000 B/s</pre>
+
+<p>Once the rebalance completes the administrator can check the status of the 
rebalance using the --verify option.
+    If the rebalance has completed, the throttle will be removed via the 
--verify command. It is important that
+    administrators remove the throttle in a timely manner once rebalancing 
completes by running the command with
+    the --verify option. Failure to do so could cause regular replication 
traffic to be throttled. </p>
+<p>When the --verify option is executed, and the reassignment has completed, 
the script will confirm that the throttle was removed:</p>
+
+<pre>$ bin/kafka-reassign-partitions.sh --zookeeper localhost:2181  --verify 
--reassignment-json-file bigger-cluster.json
+Status of partition reassignment:
+Reassignment of partition [my-topic,1] completed successfully
+Reassignment of partition [mytopic,0] completed successfully
+Throttle was removed.</pre>
+
+<p>The administrator can also validate the assigned configs using the 
kafka-configs.sh. There are two pairs of throttle
+    configuration used to manage the throttling process. The throttle value 
itself. This is configured, at a broker
+    level, using the dynamic properties: </p>
+
+<pre>leader.replication.throttled.rate
+follower.replication.throttled.rate</pre>
+
+<p>There is also an enumerated set of throttled replicas: </p>
+
+<pre>leader.replication.throttled.replicas
+follower.replication.throttled.replicas</pre>
+
+<p>Which are configured per topic. All four config values are automatically 
assigned by kafka-reassign-partitions.sh
+    (discussed below). </p>
+<p>To view the throttle limit configuration:</p>
+
+<pre>$ bin/kafka-configs.sh --describe --zookeeper localhost:2181 
--entity-type brokers
+Configs for brokers '2' are 
leader.replication.throttled.rate=700000000,follower.replication.throttled.rate=700000000
+Configs for brokers '1' are 
leader.replication.throttled.rate=700000000,follower.replication.throttled.rate=700000000</pre>
+
+<p>This shows the throttle applied to both leader and follower side of the 
replication protocol. By default both sides
+    are assigned the same throttled throughput value. </p>
+
+<p>To view the list of throttled replicas:</p>
+
+<pre>$ bin/kafka-configs.sh --describe --zookeeper localhost:2181 
--entity-type topics
+Configs for topic 'my-topic' are 
leader.replication.throttled.replicas=1:102,0:101,
+    follower.replication.throttled.replicas=1:101,0:102</pre>
+
+<p>Here we see the leader throttle is applied to partition 1 on broker 102 and 
partition 0 on broker 101. Likewise the
+    follower throttle is applied to partition 1 on
+    broker 101 and partition 0 on broker 102. </p>
+
+<p>By default kafka-reassign-partitions.sh will apply the leader throttle to 
all replicas that exist before the
+    rebalance, any one of which might be leader.
+    It will apply the follower throttle to all move destinations. So if there 
is a partition with replicas on brokers
+    101,102, being reassigned to 102,103, a leader throttle,
+    for that partition, would be applied to 101,102 and a follower throttle 
would be applied to 103 only. </p>
+
+
+<p>If required, you can also use the --alter switch on kafka-configs.sh to 
alter the throttle configurations manually.
+</p>
+
+<h5>Safe usage of throttled replication</h5>
+
+<p>Some care should be taken when using throttled replication. In 
particular:</p>
+
+<p><i>(1) Throttle Removal:</i></p>
+The throttle should be removed in a timely manner once reassignment completes 
(by running kafka-reassign-partitions
+âverify).
+
+<p><i>(2) Ensuring Progress:</i></p>
+<p>If the throttle is set too low, in comparison to the incoming write rate, 
it is possible for replication to not
+    make progress. This occurs when:</p>
+<pre>max(BytesInPerSec) > throttle</pre>
+<p>
+    Where BytesInPerSec is the metric that monitors the write throughput of 
producers into each broker. </p>
+<p>The administrator can monitor whether replication is making progress, 
during the rebalance, using the metric:</p>
+
+<pre>kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)</pre>
+
+<p>The lag should constantly decrease during replication.  If the metric does 
not decrease the administrator should
+    increase the
+    throttle throughput as described above. </p>
+
+
 <h4><a id="quotas" href="#quotas">Setting quotas</a></h4>
 Quotas overrides and defaults may be configured at (user, client-id), user or 
client-id levels as described <a href="#design_quotas">here</a>.
 By default, clients receive an unlimited quota.

http://git-wip-us.apache.org/repos/asf/kafka-site/blob/9bad5a3b/0101/quickstart.html
----------------------------------------------------------------------
diff --git a/0101/quickstart.html b/0101/quickstart.html
index 7a77692..2e83cbb 100644
--- a/0101/quickstart.html
+++ b/0101/quickstart.html
@@ -15,8 +15,6 @@
  limitations under the License.
 -->
 
-<h3><a id="quickstart" href="#quickstart">1.3 Quick Start</a></h3>
-
 <p>
 This tutorial assumes you are starting fresh and have no existing Kafka or 
ZooKeeper data.
 Since Kafka console scripts are different for Unix-based and Windows 
platforms, on Windows platforms use <code>bin\windows\</code> instead of 
<code>bin/</code>, and change the script extension to <code>.bat</code>.

http://git-wip-us.apache.org/repos/asf/kafka-site/blob/9bad5a3b/0101/streams.html
----------------------------------------------------------------------
diff --git a/0101/streams.html b/0101/streams.html
index 9c21ec4..74620ec 100644
--- a/0101/streams.html
+++ b/0101/streams.html
@@ -51,7 +51,7 @@ We first summarize the key concepts of Kafka Streams.
 
 <ul>
     <li>A <b>stream</b> is the most important abstraction provided by Kafka 
Streams: it represents an unbounded, continuously updating data set. A stream 
is an ordered, replayable, and fault-tolerant sequence of immutable data 
records, where a <b>data record</b> is defined as a key-value pair.</li>
-    <li>A stream processing application written in Kafka Streams defines its 
computational logic through one or more <b>processor topologies</b>, where a 
processor topology is a graph of stream processors (nodes) that are connected 
by streams (edges).</li>
+    <li>A <b>stream processing application</b> written in Kafka Streams 
defines its computational logic through one or more <b>processor 
topologies</b>, where a processor topology is a graph of stream processors 
(nodes) that are connected by streams (edges).</li>
     <li>A <b>stream processor</b> is a node in the processor topology; it 
represents a processing step to transform data in streams by receiving one 
input record at a time from its upstream processors in the topology, applying 
its operation to it, and may subsequently producing one or more output records 
to its downstream processors.</li>
 </ul>
 
@@ -74,8 +74,11 @@ Common notions of time in streams are:
 <ul>
     <li><b>Event time</b> - The point in time when an event or data record 
occurred, i.e. was originally created "at the source".</li>
     <li><b>Processing time</b> - The point in time when the event or data 
record happens to be processed by the stream processing application, i.e. when 
the record is being consumed. The processing time may be milliseconds, hours, 
or days etc. later than the original event time.</li>
+    <li><b>Ingestion time</b> - The point in time when an event or data record 
is stored in a topic partition by a Kafka broker. The difference to event time 
is that this ingestion timestamp is generated when the record is appended to 
the target topic by the Kafka broker, not when the record is created "at the 
source". The difference to processing time is that processing time is when the 
stream processing application processes the record. For example, if a record is 
never processed, there is no notion of processing time for it, but it still has 
an ingestion time.
 </ul>
-
+<p>
+The choice between event-time and ingestion-time is actually done through the 
configuration of Kafka (not Kafka Streams): From Kafka 0.10.x onwards, 
timestamps are automatically embedded into Kafka messages. Depending on Kafka's 
configuration these timestamps represent event-time or ingestion-time. The 
respective Kafka configuration setting can be specified on the broker level or 
per topic. The default timestamp extractor in Kafka Streams will retrieve these 
embedded timestamps as-is. Hence, the effective time semantics of your 
application depend on the effective Kafka configuration for these embedded 
timestamps.
+</p>
 <p>
 Kafka Streams assigns a <b>timestamp</b> to every data record
 via the <code>TimestampExtractor</code> interface.
@@ -87,6 +90,15 @@ per-record timestamps describe the progress of a stream with 
regards to time (al
 are leveraged by time-dependent operations such as joins.
 </p>
 
+<p>
+  Finally, whenever a Kafka Streams application writes records to Kafka, then 
it will also assign timestamps to these new records. The way the timestamps are 
assigned depends on the context:
+  <ul>
+    <li> When new output records are generated via processing some input 
record, for example, <code>context.forward()</code> triggered in the 
<code>process()</code> function call, output record timestamps are inherited 
from input record timestamps directly.</li>
+    <li> When new output records are generated via periodic functions such as 
<code>punctuate()</code>, the output record timestamp is defined as the current 
internal time (obtained through <code>context.timestamp()</code>) of the stream 
task.</li>
+    <li> For aggregations, the timestamp of a resulting aggregate update 
record will be that of the latest arrived input record that triggered the 
update.</li>
+  </ul>
+</p>
+
 <h5><a id="streams_state" href="#streams_state">States</a></h5>
 
 <p>
@@ -102,6 +114,9 @@ Every task in Kafka Streams embeds one or more state stores 
that can be accessed
 These state stores can either be a persistent key-value store, an in-memory 
hashmap, or another convenient data structure.
 Kafka Streams offers fault-tolerance and automatic recovery for local state 
stores.
 </p>
+<p>
+  Kafka Streams allows direct read-only queries of the state stores by 
methods, threads, processes or applications external to the stream processing 
application that created the state stores. This is provided through a feature 
called <b>Interactive Queries</b>. All stores are named and Interactive Queries 
exposes only the read operations of the underlying implementation. 
+</p>
 <br>
 <p>
 As we have mentioned above, the computational logic of a Kafka Streams 
application is defined as a <a href="#streams_topology">processor topology</a>.
@@ -246,7 +261,12 @@ In the next section we present another way to build the 
processor topology: the
 To build a processor topology using the Streams DSL, developers can apply the 
<code>KStreamBuilder</code> class, which is extended from the 
<code>TopologyBuilder</code>.
 A simple example is included with the source code for Kafka in the 
<code>streams/examples</code> package. The rest of this section will walk
 through some code to demonstrate the key steps in creating a topology using 
the Streams DSL, but we recommend developers to read the full example source
-codes for details.
+codes for details. 
+
+<h5><a id="streams_kstream_ktable" href="#streams_kstream_ktable">KStream and 
KTable</a></h5>
+The DSL uses two main abstractions. A <b>KStream</b> is an abstraction of a 
record stream, where each data record represents a self-contained datum in the 
unbounded data set. A <b>KTable</b> is an abstraction of a changelog stream, 
where each data record represents an update. More precisely, the value in a 
data record is considered to be an update of the last value for the same record 
key, if any (if a corresponding key doesn't exist yet, the update will be 
considered a create). To illustrate the difference between KStreams and 
KTables, letâs imagine the following two data records are being sent to the 
stream: <code>("alice", 1) --> ("alice", 3)</code>. If these records a KStream 
and the stream processing application were to sum the values it would return 
<code>4</code>. If these records were a KTable, the return would be 
<code>3</code>, since the last record would be considered as an update.
+
+
 
 <h5><a id="streams_dsl_source" href="#streams_dsl_source">Create Source 
Streams from Kafka</a></h5>
 
@@ -263,6 +283,25 @@ from a single topic).
     KTable<String, GenericRecord> source2 = builder.table("topic3", 
"stateStoreName");
 </pre>
 
+<h5><a id="streams_dsl_windowing" href="#streams_dsl_windowing">Windowing a 
stream</a></h5>
+A stream processor may need to divide data records into time buckets, i.e. to 
<b>window</b> the stream by time. This is usually needed for join and 
aggregation operations, etc. Kafka Streams currently defines the following 
types of windows:
+<ul>
+  <li><b>Hopping time windows</b> are windows based on time intervals. They 
model fixed-sized, (possibly) overlapping windows. A hopping window is defined 
by two properties: the window's size and its advance interval (aka "hop"). The 
advance interval specifies by how much a window moves forward relative to the 
previous one. For example, you can configure a hopping window with a size 5 
minutes and an advance interval of 1 minute. Since hopping windows can overlap 
a data record may belong to more than one such windows.</li>
+  <li><b>Tumbling time windows</b> are a special case of hopping time windows 
and, like the latter, are windows based on time intervals. They model 
fixed-size, non-overlapping, gap-less windows. A tumbling window is defined by 
a single property: the window's size. A tumbling window is a hopping window 
whose window size is equal to its advance interval. Since tumbling windows 
never overlap, a data record will belong to one and only one window.</li>
+  <li><b>Sliding windows</b> model a fixed-size window that slides 
continuously over the time axis; here, two data records are said to be included 
in the same window if the difference of their timestamps is within the window 
size. Thus, sliding windows are not aligned to the epoch, but on the data 
record timestamps. In Kafka Streams, sliding windows are used only for join 
operations, and can be specified through the <code>JoinWindows</code> 
class.</li>
+  </ul>
+
+<h5><a id="streams_dsl_joins" href="#streams_dsl_joins">Joins</a></h5>
+A <b>join</b> operation merges two streams based on the keys of their data 
records, and yields a new stream. A join over record streams usually needs to 
be performed on a windowing basis because otherwise the number of records that 
must be maintained for performing the join may grow indefinitely. In Kafka 
Streams, you may perform the following join operations:
+<ul>
+  <li><b>KStream-to-KStream Joins</b> are always windowed joins, since 
otherwise the memory and state required to compute the join would grow 
infinitely in size. Here, a newly received record from one of the streams is 
joined with the other stream's records within the specified window interval to 
produce one result for each matching pair based on user-provided 
<code>ValueJoiner</code>. A new <code>KStream</code> instance representing the 
result stream of the join is returned from this operator.</li>
+  
+  <li><b>KTable-to-KTable Joins</b> are join operations designed to be 
consistent with the ones in relational databases. Here, both changelog streams 
are materialized into local state stores first. When a new record is received 
from one of the streams, it is joined with the other stream's materialized 
state stores to produce one result for each matching pair based on 
user-provided ValueJoiner. A new <code>KTable</code> instance representing the 
result stream of the join, which is also a changelog stream of the represented 
table, is returned from this operator.</li>
+  <li><b>KStream-to-KTable Joins</b> allow you to perform table lookups 
against a changelog stream (<code>KTable</code>) upon receiving a new record 
from another record stream (KStream). An example use case would be to enrich a 
stream of user activities (<code>KStream</code>) with the latest user profile 
information (<code>KTable</code>). Only records received from the record stream 
will trigger the join and produce results via <code>ValueJoiner</code>, not 
vice versa (i.e., records received from the changelog stream will be used only 
to update the materialized state store). A new <code>KStream</code> instance 
representing the result stream of the join is returned from this operator.</li>
+  </ul>
+
+Depending on the operands the following join operations are supported: 
<b>inner joins</b>, <b>outer joins</b> and <b>left joins</b>. Their semantics 
are similar to the corresponding operators in relational databases.
+a
 <h5><a id="streams_dsl_transform" href="#streams_dsl_transform">Transform a 
stream</a></h5>
 
 <p>

http://git-wip-us.apache.org/repos/asf/kafka-site/blob/9bad5a3b/0101/upgrade.html
----------------------------------------------------------------------
diff --git a/0101/upgrade.html b/0101/upgrade.html
index 05b55e0..e9fef1f 100644
--- a/0101/upgrade.html
+++ b/0101/upgrade.html
@@ -15,10 +15,6 @@
  limitations under the License.
 -->
 
-
-
-<h3><a id="upgrade" href="#upgrade">1.5 Upgrading From Previous 
Versions</a></h3>
-
 <h4><a id="upgrade_10_1" href="#upgrade_10_1">Upgrading from 0.8.x, 0.9.x or 
0.10.0.X to 0.10.1.0</a></h4>
 0.10.1.0 has wire protocol changes. By following the recommended rolling 
upgrade plan below, you guarantee no downtime during the upgrade.
 However, please notice the <a href="#upgrade_10_1_breaking">Potential breaking 
changes in 0.10.1.0</a> before upgrade.

kafka-site git commit: Resync 0.10.1 for 0.10.1.0 release

Reply via email to