ivankelly commented on a change in pull request #1466: Topic compaction documentation
URL: https://github.com/apache/incubator-pulsar/pull/1466#discussion_r190843622
##########
File path: site/docs/latest/getting-started/ConceptsAndArchitecture.md
##########
@@ -522,18 +541,55 @@ while (true) {
 To create a reader that will read from the latest available message:

 ```java
-MessageId id = MessageId.latest;
-Reader reader = pulsarClient.createReader(topic, id, new ReaderConfiguration());
+Reader<byte[]> reader = pulsarClient.newReader()
+    .topic(topic)
+    .startMessageId(MessageId.latest)
+    .create();
 ```

 To create a reader that will read from some message between earliest and latest:

 ```java
 byte[] msgIdBytes = // Some byte array
 MessageId id = MessageId.fromByteArray(msgIdBytes);
-Reader reader = pulsarClient.createReader(topic, id, new ReaderConfiguration());
+Reader<byte[]> reader = pulsarClient.newReader()
+    .topic(topic)
+    .startMessageId(id)
+    .create();
 ```

+## Topic compaction {#compaction}
+
+Pulsar was built with highly scalable [persistent storage](#persistent-storage) of message data as a primary objective. Pulsar {% popover topics %} enable you to persistently store as many unacknowledged messages as you need while preserving message ordering. By default, Pulsar stores *all* unacknowledged/unprocessed messages produced on a topic. Accumulating many unacknowledged messages on a topic is necessary for many Pulsar use cases but it can also be very time intensive for Pulsar {% popover consumers %} to "rewind" through the entire log of messages.
+
+{% include admonition.html type="success" content="For a more practical guide to topic compaction, see the [Topic compaction cookbook](../../cookbooks/compaction)." %}
+
+For some use cases, however, consumers don't need a complete "image" of the topic log. They may only need a few values to construct a more "shallow" image of the log, perhaps even just the most recent value. For these kinds of use cases Pulsar offers **topic compaction**. When you run compaction on a topic, Pulsar goes through a topic's backlog and removes messages that are *obscured* by later messages, i.e. it goes through the topic on a per-key basis and leaves only the most recent message associated with that key.
+
+Pulsar's topic compaction feature:
+
+* Can help preserve disk space and allow for much more efficient "rewind" of topic logs
+* Applies only to [persistent topics](#persistent-storage)
+* Is triggered manually via the command line. See the [Topic compaction cookbook](../../cookbooks/compaction)
+* Is conceptually and operationally distinct from [retention and expiry](#message-retention-and-expiry)
+
+{% include admonition.html type="info" title="Topic compaction example: the stock ticker"
+    content="An example use case for a compacted Pulsar topic would be a stock ticker topic. On a stock ticker topic, each message bears a timestamped dollar value for stocks for purchase (with the message key holding the stock symbol, e.g. `AAPL` or `GOOG`). With a stock ticker you may care only about the most recent value(s) of the stock and have no interest in historical data (i.e. you don't need to construct a complete image of the topic's sequence of messages per key). Compaction would be highly beneficial in this case because it would keep consumers from needing to rewind through obscured messages." %}
+
+### How topic compaction works
+
+When topic compaction is triggered [via the CLI](../../cookbooks/compaction), Pulsar will iterate over the entire topic from beginning to end. For each key that it encounters the {% popover broker %} responsible will keep a record of the latest occurrence of that key. When this iterative process is finished, the broker will create a [BookKeeper ledger](#ledgers) to store the compacted topic.
+
+After that, the broker will make a second iteration through each message on the topic. For each message, if the key matches the latest occurrence of that key, then the key's data payload, message ID, and metadata will be written to the newly created BookKeeper ledger. If the key doesn't match the latest then the message will be skipped and left alone. If any given message has an empty payload, it will be skipped and considered deleted (akin to the concept of [tombstones](http://docs.basho.com/riak/kv/2.2.3/using/reference/object-deletion/#tombstones) in key-value databases). At the end of this second iteration through the topic, the newly created BookKeeper ledger is closed and two things are written to the topic's metadata: the ID of the BookKeeper ledger and the message ID of the last compacted message (this is known as the **compaction horizon** of the topic). Once this metadata is written compaction is complete.

Review comment:
We shouldn't link to a blog belonging to a company in receivership. There's a wikipedia page for it: https://en.wikipedia.org/wiki/Tombstone_(data_store)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
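
As a reading aid for the two-pass procedure described in the quoted "How topic compaction works" text, here is a minimal in-memory sketch. It is not Pulsar's implementation: `CompactionSketch`, `Msg`, `compact`, and the integer message ids are hypothetical names chosen for illustration. It assumes every message has a non-null key and models the BookKeeper ledger as a plain list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory model of the two-pass compaction described
// in the quoted documentation. None of these names are Pulsar APIs.
final class CompactionSketch {

    // A simplified message: position in the topic, key, and payload.
    static final class Msg {
        final int id;
        final String key;
        final byte[] payload;

        Msg(int id, String key, byte[] payload) {
            this.id = id;
            this.key = key;
            this.payload = payload;
        }
    }

    // Returns the messages that would be written to the compacted ledger
    // (modelled here as a list; assumes keys are non-null).
    static List<Msg> compact(List<Msg> topic) {
        // First pass: record the id of the latest occurrence of each key.
        Map<String, Integer> latest = new HashMap<>();
        for (Msg m : topic) {
            latest.put(m.key, m.id);
        }

        // Second pass: keep a message only if it is the latest for its key
        // and its payload is non-empty (an empty payload acts as a tombstone).
        List<Msg> compacted = new ArrayList<>();
        for (Msg m : topic) {
            boolean isLatest = latest.get(m.key) == m.id;
            boolean isTombstone = m.payload.length == 0;
            if (isLatest && !isTombstone) {
                compacted.add(m);
            }
        }
        return compacted;
    }
}
```

In this model, a key whose most recent message has an empty payload disappears from the compacted output entirely, which matches the tombstone behaviour the quoted text describes.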