ryannedolan commented on a change in pull request #324:
URL: https://github.com/apache/kafka-site/pull/324#discussion_r563180323
##########
File path: 27/ops.html
##########
@@ -553,7 +539,558 @@ <h3 class="anchor-heading"><a id="datacenters"
class="anchor-link"></a><a href="
<p>
It is generally <i>not</i> advisable to run a <i>single</i> Kafka cluster
that spans multiple datacenters over a high-latency link. This will incur very
high replication latency both for Kafka writes and ZooKeeper writes, and
neither Kafka nor ZooKeeper will remain available in all locations if the
network between locations is unavailable.
- <h3 class="anchor-heading"><a id="config" class="anchor-link"></a><a
href="#config">6.3 Kafka Configuration</a></h3>
+ <h3 class="anchor-heading"><a id="georeplication" class="anchor-link"></a><a
href="#georeplication">6.3 Geo-Replication (Cross-Cluster Data
Mirroring)</a></h3>
+
+ <h4 class="anchor-heading"><a id="georeplication-overview"
class="anchor-link"></a><a href="#georeplication-overview">Geo-Replication
Overview</a></h4>
+
+ <p>
+ Kafka administrators can define data flows that cross the boundaries of
individual Kafka clusters, data centers, or geo-regions. Such event streaming
setups are often needed for organizational, technical, or legal requirements.
Common scenarios include:
+ </p>
+
+ <ul>
+ <li>Geo-replication</li>
+ <li>Disaster recovery</li>
+ <li>Feeding edge clusters into a central, aggregate cluster</li>
+ <li>Physical isolation of clusters (such as production vs. testing)</li>
+ <li>Cloud migration or hybrid cloud deployments</li>
+ <li>Legal and compliance requirements</li>
+ </ul>
+
+ <p>
+ Administrators can set up such inter-cluster data flows with Kafka's
MirrorMaker (version 2), a tool to replicate data between different Kafka
environments in a streaming manner. MirrorMaker is built on top of the Kafka
Connect framework and supports features such as:
+ </p>
+
+ <ul>
+ <li>Replicates topics (data plus configurations)</li>
+ <li>Replicates consumer groups including offsets to migrate applications
between clusters</li>
+ <li>Replicates ACLs</li>
+ <li>Preserves partitioning</li>
+ <li>Automatically detects new topics and partitions</li>
+ <li>Provides a wide range of metrics, such as end-to-end replication
latency across multiple data centers/clusters</li>
+ <li>Fault-tolerant and horizontally scalable operations</li>
+ </ul>
+
+ <p>
+ <em>Note: Geo-replication with MirrorMaker replicates data across Kafka
clusters. This inter-cluster replication is different from Kafka's <a
href="#replication">intra-cluster replication</a>, which replicates data within
the same Kafka cluster.</em>
+ </p>
+
+ <h4 class="anchor-heading"><a id="georeplication-flows"
class="anchor-link"></a><a href="#georeplication-flows">What Are Replication
Flows</a></h4>
+
+ <p>
+ With MirrorMaker, Kafka administrators can replicate topics, topic
configurations, consumer groups and their offsets, and ACLs from one or more
source Kafka clusters to one or more target Kafka clusters, i.e., across
cluster environments. In a nutshell, MirrorMaker consumes data from the source
cluster with source connectors, and then replicates the data by producing to
the target cluster with sink connectors.
Review comment:
"with sink connectors" is not true at the moment, since I don't think we
have a sink connector yet. And even when we do, it would usually be sufficient
to use a source _or_ a sink connector. There are certainly cases where this
sentence is true, but I think it's misleading as a general statement.
Maybe "In a nutshell, MirrorMaker uses Connectors to consume from source
clusters and produce to target clusters" or something like that.
+ </p>
+
+ <p>
+ These directional flows from source to target clusters are called
replication flows. They are defined with the format
<code>{source_cluster}->{target_cluster}</code> in the MirrorMaker
configuration file as described later. Administrators can create complex
replication topologies based on these flows.
+ </p>
+
+ <p>
+ Here are some example patterns:
+ </p>
+
+ <ul>
+ <li>Active/Active high availability deployments: <code>A->B,
B->A</code></li>
+ <li>Active/Passive or Active/Standby high availability deployments:
<code>A->B</code></li>
+ <li>Aggregation (e.g., from many clusters to one): <code>A->K, B->K,
C->K</code></li>
+ <li>Fan-out (e.g., from one to many clusters): <code>K->A, K->B,
K->C</code></li>
+ <li>Forwarding: <code>A->B, B->C, C->D</code></li>
+ </ul>
+
+ <p>
+ By default, a flow replicates all topics and consumer groups. However,
each replication flow can be configured independently. For instance, you can
define that only specific topics or consumer groups are replicated from the
source cluster to the target cluster.
+ </p>
+
+ <p>
+ Here is a first example of how to configure data replication from a
<code>primary</code> cluster to a <code>secondary</code> cluster (an
active/passive setup):
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># Basic settings
+clusters = primary, secondary
+primary.bootstrap.servers = broker3-primary:9092
+secondary.bootstrap.servers = broker5-secondary:9092
+
+# Define replication flows
+primary->secondary.enable = true
+primary->secondary.topics = foobar-topic, quux-.*
+</code></pre>
+
+
+ <h4 class="anchor-heading"><a id="georeplication-mirrormaker"
class="anchor-link"></a><a href="#georeplication-mirrormaker">Configuring
Geo-Replication</a></h4>
+
+ <p>
+ The following sections describe how to configure and run a dedicated
MirrorMaker cluster. If you want to run MirrorMaker within an existing Kafka
Connect cluster or other supported deployment setups, please refer to <a
href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0">KIP-382:
MirrorMaker 2.0</a> and be aware that the names of configuration settings may
vary between deployment modes.
+ </p>
+
+ <p>
+ Beyond what's covered in the following sections, further examples and
information on configuration settings are available at:
+ </p>
+
+ <ul>
+ <li><a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java">MirrorMakerConfig</a>,
<a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorConnectorConfig.java">MirrorConnectorConfig</a></li>
+ <li><a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/DefaultTopicFilter.java">DefaultTopicFilter</a>
for topics, <a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/DefaultGroupFilter.java">DefaultGroupFilter</a>
for consumer groups</li>
+ <li>Example configuration settings in <a
href="https://github.com/apache/kafka/blob/trunk/config/connect-mirror-maker.properties">connect-mirror-maker.properties</a>,
<a
href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0">KIP-382:
MirrorMaker 2.0</a></li>
+ </ul>
+
+ <h5 class="anchor-heading"><a id="georeplication-config-syntax"
class="anchor-link"></a><a href="#georeplication-config-syntax">Configuration
File Syntax</a></h5>
+
+ <p>
+ The MirrorMaker configuration file is typically named
<code>connect-mirror-maker.properties</code>. You can configure a variety of
components in this file:
+ </p>
+
+ <ul>
+ <li>MirrorMaker settings: global settings including cluster definitions
(aliases), plus custom settings per replication flow</li>
+ <li>Kafka Connect and connector settings</li>
+ <li>Kafka producer, consumer, and admin client settings</li>
+ </ul>
+
+ <p>
+ Example: Define MirrorMaker settings (explained in more detail later).
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># Global settings
+# Define cluster aliases
+clusters = us-west, us-east
+us-west.bootstrap.servers = broker3-west:9092
+us-east.bootstrap.servers = broker5-east:9092
+
+# Replicate all topics by default
+topics = .*
+
+# Specific replication flow settings (here: flow from us-west to us-east)
+us-west->us-east.enable = true
+# Override the default above
+us-west->us-east.topics = foo.*, bar.*
+</code></pre>
+
+ <p>
+ MirrorMaker is based on the Kafka Connect framework. Any Kafka Connect,
source connector, and sink connector settings as described in the <a
href="#connectconfigs">documentation chapter on Kafka Connect</a> can be used
directly in the MirrorMaker configuration, without having to change or prefix
the name of the configuration setting.
+ </p>
+
+ <p>
+ Example: Define custom Kafka Connect settings to be used by MirrorMaker.
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># Setting Kafka Connect defaults for MirrorMaker
+tasks.max = 5
+</code></pre>
+
+ <p>
+ Most of the default Kafka Connect settings work well for MirrorMaker
out-of-the-box, with the exception of <code>tasks.max</code>. In order to
evenly distribute the workload across more than one MirrorMaker process, it is
recommended to set <code>tasks.max</code> to at least <code>2</code>
(preferably higher) depending on the available hardware resources and the total
number of topic-partitions to be replicated.
+ </p>
+
+ <p>
+ You can further customize MirrorMaker's Kafka Connect settings <em>per
source or target cluster</em> (more precisely, you can specify Kafka Connect
worker-level configuration settings "per connector"). Use the format of
<code>{cluster}.{config_name}</code> in the MirrorMaker configuration file.
+ </p>
+
+ <p>
+ Example: Define custom connector settings for the <code>us-west</code>
cluster.
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># us-west custom settings
+us-west.offset.storage.topic = my-mirrormaker-offsets
+</code></pre>
+
+ <p>
+ MirrorMaker internally uses the Kafka producer, consumer, and admin
clients. Custom settings for these clients are often needed. To override the
defaults, use the following format in the MirrorMaker configuration file:
+ </p>
+
+ <ul>
+ <li><code>{source}.consumer.{consumer_config_name}</code></li>
+ <li><code>{target}.producer.{producer_config_name}</code></li>
+ <li><code>{source_or_target}.admin.{admin_config_name}</code></li>
+ </ul>
+
+ <p>
+ Example: Define custom producer, consumer, admin client settings.
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># us-west cluster (from which to consume)
+us-west.consumer.isolation.level = read_committed
+us-west.admin.bootstrap.servers = broker57-primary:9092
+
+# us-east cluster (to which to produce)
+us-east.producer.compression.type = gzip
+us-east.producer.buffer.memory = 32768
+us-east.admin.bootstrap.servers = broker8-secondary:9092
+</code></pre>
+
+ <h5 class="anchor-heading"><a id="georeplication-flow-create"
class="anchor-link"></a><a href="#georeplication-flow-create">Creating and
Enabling Replication Flows</a></h5>
+
+ <p>
+ To define a replication flow, you must first define the respective source
and target Kafka clusters in the MirrorMaker configuration file.
+ </p>
+
+ <ul>
+ <li><code>clusters</code> (required): comma-separated list of Kafka
cluster "aliases"</li>
+ <li><code>{clusterAlias}.bootstrap.servers</code> (required): connection
information for the specific cluster; comma-separated list of "bootstrap" Kafka
brokers</li>
+ </ul>
+
+ <p>
+ Example: Define two cluster aliases <code>primary</code> and
<code>secondary</code>, including their connection information.
+ </p>
+
+<pre class="line-numbers"><code class="language-text">clusters = primary, secondary
+primary.bootstrap.servers = broker10-primary:9092,broker11-primary:9092
+secondary.bootstrap.servers = broker5-secondary:9092,broker6-secondary:9092
+</code></pre>
+
+ <p>
+ Secondly, you must explicitly enable individual replication flows with
<code>{source}->{target}.enable = true</code> as needed. Remember that flows
are directional: if you need two-way (bidirectional) replication, you must
enable flows in both directions.
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># Enable replication from primary to secondary
+primary->secondary.enable = true
+</code></pre>
+
+ <p>
+ By default, a replication flow will replicate all but a few special topics
and consumer groups from the source cluster to the target cluster, and
automatically detect any newly created topics and groups. The names of
replicated topics in the target cluster will be prefixed with the name of the
source cluster (see section further below). For example, the topic
<code>foo</code> in the source cluster <code>us-west</code> would be replicated
to a topic named <code>us-west.foo</code> in the target cluster
<code>us-east</code>.
+ </p>
+
+ <p>
+ The subsequent sections explain how to customize this basic setup
according to your needs.
+ </p>
+
+ <h5 class="anchor-heading"><a id="georeplication-flow-configure"
class="anchor-link"></a><a href="#georeplication-flow-configure">Configuring
Replication Flows</a></h5>
+
+ <p>
+The configuration of a replication flow is a combination of top-level default
settings (e.g., <code>topics</code>), on top of which flow-specific settings,
if any, are applied (e.g., <code>us-west->us-east.topics</code>). To change the
top-level defaults, add the respective top-level setting to the MirrorMaker
configuration file. To override the defaults for a specific replication flow
only, use the syntax format <code>{source}->{target}.{config.name}</code>.
+ </p>
+
+ <p>
+ The most important settings are:
+ </p>
+
+ <ul>
+ <li><code>topics</code>: list of topics or a regular expression that
defines which topics in the source cluster to replicate (default: <code>topics
= .*</code>)
+ <li><code>topics.exclude</code>: list of topics or a regular expression to
subsequently exclude topics that were matched by the <code>topics</code>
setting (default: <code>topics.exclude = .*[\-\.]internal, .*\.replica,
__.*</code>)
+ <li><code>groups</code>: list of consumer groups or a regular expression that defines
which consumer groups in the source cluster to replicate (default: <code>groups
= .*</code>)
+ <li><code>groups.exclude</code>: list of consumer groups or a regular expression to
subsequently exclude consumer groups that were matched by the
<code>groups</code> setting (default: <code>groups.exclude =
console-consumer-.*, connect-.*, __.*</code>)
+ <li><code>{source}->{target}.enable</code>: set to <code>true</code> to
enable the replication flow (default: <code>false</code>)
+ </ul>
+
+ <p>
+ Example:
+ </p>
+
+<pre class="line-numbers"><code class="language-text"># Custom top-level defaults that apply to all replication flows
+topics = .*
+groups = consumer-group1, consumer-group2
+
+# Don't forget to enable a flow!
+us-west->us-east.enable = true
+
+# Custom settings for specific replication flows
+us-west->us-east.topics = foo.*
+us-west->us-east.groups = bar.*
+us-west->us-east.emit.heartbeats.enabled = false
+</code></pre>
+
+ <p>
+ Additional configuration settings are supported, some of which are listed
below. In most cases, you can leave these settings at their default values. See
<a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java">MirrorMakerConfig</a>
and <a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorConnectorConfig.java">MirrorConnectorConfig</a>
for further details.
+ </p>
+
+ <ul>
+ <li><code>refresh.topics.enabled</code>: whether to check for new topics
in the source cluster periodically (default: true)
+ <li><code>refresh.topics.interval.seconds</code>: frequency of checking
for new topics in the source cluster; lower values than the default may lead to
performance degradation (default: 600, every ten minutes)
+ <li><code>refresh.groups.enabled</code>: whether to check for new consumer
groups in the source cluster periodically (default: true)
+ <li><code>refresh.groups.interval.seconds</code>: frequency of checking
for new consumer groups in the source cluster; lower values than the default
may lead to performance degradation (default: 600, every ten minutes)
+ <li><code>sync.topic.configs.enabled</code>: whether to replicate topic
configurations from the source cluster (default: true)
+ <li><code>sync.topic.acls.enabled</code>: whether to sync ACLs from the
source cluster (default: true)
+ <li><code>emit.heartbeats.enabled</code>: whether to emit heartbeats
periodically (default: true)
+ <li><code>emit.heartbeats.interval.seconds</code>: frequency at which
heartbeats are emitted (default: 5, every five seconds)
+ <li><code>heartbeats.topic.replication.factor</code>: replication factor
of MirrorMaker's internal heartbeat topics (default: 3)
+ <li><code>emit.checkpoints.enabled</code>: whether to emit MirrorMaker's
consumer offsets periodically (default: true)
+ <li><code>emit.checkpoints.interval.seconds</code>: frequency at which
checkpoints are emitted (default: 60, every minute)
+ <li><code>checkpoints.topic.replication.factor</code>: replication factor
of MirrorMaker's internal checkpoints topics (default: 3)
+ <li><code>sync.group.offsets.enabled</code>: whether to periodically write
the translated offsets of replicated consumer groups (in the source cluster) to
the <code>__consumer_offsets</code> topic in the target cluster, as long as no
active consumers in that group are connected to the target cluster (default: false)
+ <li><code>sync.group.offsets.interval.seconds</code>: frequency at which
consumer group offsets are synced (default: 60, every minute)
+ <li><code>offset-syncs.topic.replication.factor</code>: replication factor
of MirrorMaker's internal offset-sync topics (default: 3)
+ </ul>
+
+ <h5 class="anchor-heading"><a id="georeplication-flow-secure"
class="anchor-link"></a><a href="#georeplication-flow-secure">Securing
Replication Flows</a></h5>
+
+ <p>
+ MirrorMaker supports the same <a href="#connectconfigs">security settings
as Kafka Connect</a>, so please refer to the linked section for further
information.
+ </p>
+
+ <p>
+ Example: Encrypt communication between MirrorMaker and the
<code>us-east</code> cluster.
+ </p>
+
+<pre class="line-numbers"><code
class="language-text">us-east.security.protocol=SSL
+us-east.ssl.truststore.location=/path/to/truststore.jks
+us-east.ssl.truststore.password=my-secret-password
+us-east.ssl.keystore.location=/path/to/keystore.jks
+us-east.ssl.keystore.password=my-secret-password
+us-east.ssl.key.password=my-secret-password
+</code></pre>
+
+ <h5 class="anchor-heading"><a id="georeplication-topic-naming"
class="anchor-link"></a><a href="#georeplication-topic-naming">Custom Naming of
Replicated Topics in Target Clusters</a></h5>
+
+ <p>
+ Replicated topics in a target cluster—sometimes called <em>remote</em>
topics—are renamed according to a replication policy. MirrorMaker uses this
policy to ensure that events (aka records, messages) from different clusters
are not written to the same topic-partition. By default as per <a
href="https://github.com/apache/kafka/blob/trunk/connect/mirror-client/src/main/java/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.java">DefaultReplicationPolicy</a>,
the names of replicated topics in the target clusters have the format
<code>{source}.{source_topic_name}</code>:
Review comment:
I think "records" is more prevalent in Kafka docs vs "events". Maybe
verify that and stick with whatever the rest of the docs use.
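
One rough way to do that verification is to tally the candidate terms across
the docs. The sketch below is only an illustration: the default directory name
`27` is taken from the file path above and assumes a local kafka-site checkout,
the term list is my own choice, and the tag stripping is deliberately crude
(it is not a real HTML parser).

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Candidate terms to compare; this list is an assumption, adjust as needed.
TERMS = ("record", "records", "event", "events", "message", "messages")


def count_terms(html, terms=TERMS):
    """Count whole-word, case-insensitive occurrences of each term."""
    # Crude tag removal: good enough for a rough tally, not for exact numbers.
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in terms)


def count_in_dir(docs_dir):
    """Sum the term counts over every .html file below docs_dir."""
    total = Counter()
    for path in Path(docs_dir).rglob("*.html"):
        total += count_terms(path.read_text(encoding="utf-8", errors="replace"))
    return total


if __name__ == "__main__":
    # Hypothetical default: the "27" docs directory of a kafka-site checkout.
    docs_dir = sys.argv[1] if len(sys.argv) > 1 else "27"
    for term, n in count_in_dir(docs_dir).most_common():
        print(f"{term}\t{n}")
```

Run it from the root of the checkout and pass a different directory as the
first argument to compare other doc versions. The counts only indicate which
term is more common overall; they say nothing about which one fits a given
sentence.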