[GitHub] ivankelly commented on a change in pull request #1466: Topic compaction documentation

GitBox Fri, 25 May 2018 02:48:35 -0700

ivankelly commented on a change in pull request #1466: Topic compaction documentation
URL: https://github.com/apache/incubator-pulsar/pull/1466#discussion_r190843622


 ##########
 File path: site/docs/latest/getting-started/ConceptsAndArchitecture.md
 ##########
 @@ -522,18 +541,55 @@ while (true) {
 To create a reader that will read from the latest available message:
 
 ```java
-MessageId id = MessageId.latest;
-Reader reader = pulsarClient.createReader(topic, id, new ReaderConfiguration());
+Reader<byte[]> reader = pulsarClient.newReader()
+    .topic(topic)
+    .startMessageId(MessageId.latest)
+    .create();
 ```
 
 To create a reader that will read from some message between earliest and latest:
 
 ```java
 byte[] msgIdBytes = // Some byte array
 MessageId id = MessageId.fromByteArray(msgIdBytes);
-Reader reader = pulsarClient.createReader(topic, id, new ReaderConfiguration());
+Reader<byte[]> reader = pulsarClient.newReader()
+    .topic(topic)
+    .startMessageId(id)
+    .create();
 ```
 
+## Topic compaction {#compaction}
+
+Pulsar was built with highly scalable [persistent storage](#persistent-storage) of message data as a primary objective. Pulsar {% popover topics %} enable you to persistently store as many unacknowledged messages as you need while preserving message ordering. By default, Pulsar stores *all* unacknowledged/unprocessed messages produced on a topic. Accumulating many unacknowledged messages on a topic is necessary for many Pulsar use cases but it can also be very time intensive for Pulsar {% popover consumers %} to "rewind" through the entire log of messages.
+
+{% include admonition.html type="success" content="For a more practical guide to topic compaction, see the [Topic compaction cookbook](../../cookbooks/compaction)." %}
+
+For some use cases, however, consumers don't need a complete "image" of the topic log. They may only need a few values to construct a more "shallow" image of the log, perhaps even just the most recent value. For these kinds of use cases Pulsar offers **topic compaction**. When you run compaction on a topic, Pulsar goes through a topic's backlog and removes messages that are *obscured* by later messages, i.e. it goes through the topic on a per-key basis and leaves only the most recent message associated with that key.
+
+Pulsar's topic compaction feature:
+
+* Can help preserve disk space and allow for much more efficient "rewind" of topic logs
+* Applies only to [persistent topics](#persistent-storage)
+* Is triggered manually via the command line. See the [Topic compaction cookbook](../../cookbooks/compaction)
+* Is conceptually and operationally distinct from [retention and expiry](#message-retention-and-expiry)
+
+{% include admonition.html type="info" title="Topic compaction example: the stock ticker"
+   content="An example use case for a compacted Pulsar topic would be a stock ticker topic. On a stock ticker topic, each message bears a timestamped dollar value for stocks for purchase (with the message key holding the stock symbol, e.g. `AAPL` or `GOOG`). With a stock ticker you may care only about the most recent value(s) of the stock and have no interest in historical data (i.e. you don't need to construct a complete image of the topic's sequence of messages per key). Compaction would be highly beneficial in this case because it would keep consumers from needing to rewind through obscured messages." %}
+
+### How topic compaction works
+
+When topic compaction is triggered [via the CLI](../../cookbooks/compaction), Pulsar will iterate over the entire topic from beginning to end. For each key that it encounters the {% popover broker %} responsible will keep a record of the latest occurrence of that key. When this iterative process is finished, the broker will create a [BookKeeper ledger](#ledgers) to store the compacted topic.
+
+After that, the broker will make a second iteration through each message on the topic. For each message, if the key matches the latest occurrence of that key, then the key's data payload, message ID, and metadata will be written to the newly created BookKeeper ledger. If the key doesn't match the latest then the message will be skipped and left alone. If any given message has an empty payload, it will be skipped and considered deleted (akin to the concept of [tombstones](http://docs.basho.com/riak/kv/2.2.3/using/reference/object-deletion/#tombstones) in key-value databases). At the end of this second iteration through the topic, the newly created BookKeeper ledger is closed and two things are written to the topic's metadata: the ID of the BookKeeper ledger and the message ID of the last compacted message (this is known as the **compaction horizon** of the topic). Once this metadata is written compaction is complete.
 
 Review comment:
   We shouldn't link to a blog belonging to a company in receivership. There's a Wikipedia page for it: https://en.wikipedia.org/wiki/Tombstone_(data_store)
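
For reference while reading the "How topic compaction works" text in the hunk above, here is a minimal in-memory sketch of the two passes it describes. It is not the broker's implementation: `Msg`, `Result`, and `compact` are hypothetical names, a plain list stands in for the BookKeeper ledger, every message is assumed to carry a key, and the compaction horizon is taken to be the ID of the last message the second pass examined, which is only one reading of "last compacted message".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {

    // Hypothetical stand-in for a topic message; not a Pulsar type.
    public static final class Msg {
        public final long id;        // position in the topic, strictly increasing
        public final String key;     // assumed non-null for this sketch
        public final byte[] payload; // empty payload is treated as a tombstone

        public Msg(long id, String key, byte[] payload) {
            this.id = id;
            this.key = key;
            this.payload = payload;
        }
    }

    // Hypothetical result: the list models the compacted ledger.
    public static final class Result {
        public final List<Msg> compacted;
        public final long horizon; // -1 when the topic is empty

        public Result(List<Msg> compacted, long horizon) {
            this.compacted = compacted;
            this.horizon = horizon;
        }
    }

    public static Result compact(List<Msg> topic) {
        // Pass 1: remember the ID of the latest message seen for each key.
        Map<String, Long> latestIdByKey = new HashMap<>();
        for (Msg m : topic) {
            latestIdByKey.put(m.key, m.id);
        }

        // Pass 2: keep a message only if it is the latest for its key
        // and its payload is non-empty.
        List<Msg> compacted = new ArrayList<>();
        long horizon = -1L;
        for (Msg m : topic) {
            horizon = m.id; // assumption: horizon = last message examined
            boolean isLatest = latestIdByKey.get(m.key) == m.id;
            boolean isTombstone = m.payload.length == 0;
            if (isLatest && !isTombstone) {
                compacted.add(m);
            }
        }
        return new Result(compacted, horizon);
    }

    public static void main(String[] args) {
        // Stock-ticker style input: keys are symbols, IDs are topic positions.
        List<Msg> topic = Arrays.asList(
            new Msg(1, "AAPL", new byte[] {1}),
            new Msg(2, "GOOG", new byte[] {2}),
            new Msg(3, "AAPL", new byte[] {3}),
            new Msg(4, "GOOG", new byte[0])); // tombstone for GOOG

        Result r = compact(topic);
        for (Msg m : r.compacted) {
            System.out.println(m.key + " @ " + m.id);
        }
        System.out.println("horizon = " + r.horizon);
    }
}
```

With the sample input in `main`, only the `AAPL` message with ID 3 survives: the earlier `AAPL` message is obscured, and the latest `GOOG` message has an empty payload, so that key is treated as deleted. The horizon in this sketch is 4.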

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
