mjsax commented on a change in pull request #8995: URL: https://github.com/apache/kafka/pull/8995#discussion_r452481312
########## File path: docs/streams/core-concepts.html ########## @@ -170,13 +150,59 @@ <h3><a id="streams_concepts_duality" href="#streams_concepts_duality">Duality of or to run <a id="streams-developer-guide-interactive-queries" href="/{{version}}/documentation/streams/developer-guide/interactive-queries#interactive-queries">interactive queries</a> against your application's latest processing results. And, beyond its internal usage, the Kafka Streams API also allows developers to exploit this duality in their own applications. - </p> + </p> - <p> + <p> Before we discuss concepts such as <a id="streams-developer-guide-dsl-aggregating" href="/{{version}}/documentation/streams/developer-guide/dsl-api#aggregating">aggregations</a> in Kafka Streams, we must first introduce <strong>tables</strong> in more detail, and talk about the aforementioned stream-table duality. - Essentially, this duality means that a stream can be viewed as a table, and a table can be viewed as a stream. - </p> + Essentially, this duality means that a stream can be viewed as a table, and a table can be viewed as a stream. Kafka's log compaction feature, for example, exploits this duality. + </p> + + <p> + A simple form of a table is a collection of key-value pairs, also called a map or associative array. Such a table may look as follows: + </p> + <img class="centered" src="/{{version}}/images/streams-table-duality-01.png"> + + The <b>stream-table duality</b> describes the close relationship between streams and tables. + <ul> + <li><b>Stream as Table</b>: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can be easily turned into a "real" table by replaying the changelog from beginning to end to reconstruct the table. 
Similarly, in a more general analogy, aggregating data records in a stream - such as computing the total number of pageviews by user from a stream of pageview events - will return a table (here with the key and the value being the user and its corresponding pageview count, respectively).</li> + <li><b>Table as Stream</b>: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream's data records are key-value pairs). A table is thus a stream in disguise, and it can be easily turned into a "real" stream by iterating over each key-value entry in the table.</li> + </ul> + + <p> + Let's illustrate this with an example. Imagine a table that tracks the total number of pageviews by user (first column of diagram below). Over time, whenever a new pageview event is processed, the state of the table is updated accordingly. Here, the state changes between different points in time - and different revisions of the table - can be represented as a changelog stream (second column). + </p> + <img class="centered" src="/{{version}}/images/streams-table-duality-02.png" style="width:300px"> + + <p> + Interestingly, because of the stream-table duality, the same stream can be used to reconstruct the original table (third column): + </p> + <img class="centered" src="/{{version}}/images/streams-table-duality-03.png" style="width:600px"> + + <p> + The same mechanism is used, for example, to replicate databases via change data capture (CDC) and, within Kafka Streams, to replicate its so-called state stores across machines for fault-tolerance. + The stream-table duality is such an important concept that Kafka Streams models it explicitly via the <a href="#streams_kstream_ktable">KStream, KTable, and GlobalKTable</a> interfaces. 
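The changelog-replay mechanism described above can be sketched in a few lines. This is a hedged illustration in plain Python, not the Kafka Streams API: the function names and the pageview data are invented for the example, but the logic mirrors the text -- each changelog record overwrites the previous value for its key, so replaying the changelog from beginning to end reconstructs the table, and iterating over the table yields a stream again.

```python
# Plain-Python sketch of the stream-table duality (illustrative only;
# not the Kafka Streams API).

def replay_changelog(changelog):
    """Reconstruct a table (dict) by replaying a changelog from start to end.

    Each record is a (key, value) pair; a later record for the same key
    overwrites the earlier one, i.e. it captures a state change of the table.
    """
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

def table_to_stream(table):
    """Turn a table back into a stream by iterating over its key-value entries."""
    return list(table.items())

# Pageview-count example from the diagrams: each pageview event updates a
# user's count, and every update is emitted as a changelog record.
pageview_events = ["alice", "bob", "alice", "alice"]
changelog = []
counts = {}
for user in pageview_events:
    counts[user] = counts.get(user, 0) + 1
    changelog.append((user, counts[user]))  # emit the new aggregate value

print(changelog)                    # [('alice', 1), ('bob', 1), ('alice', 2), ('alice', 3)]
print(replay_changelog(changelog))  # {'alice': 3, 'bob': 1}
```

Note that the reconstructed table only retains the latest value per key -- which is exactly the invariant that Kafka's log compaction preserves, and why a compacted changelog topic suffices to restore a state store.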
+ </p> + + <h3><a id="streams_concepts_aggregations" href="#streams_concepts_aggregations">Aggregations</a></h3> + <p> + An <strong>aggregation</strong> operation takes one input stream or table, and yields a new table by combining multiple input records into a single output record. Examples of aggregations are computing counts or sums. + </p> + + <p> + In the <code>Kafka Streams DSL</code>, an input stream of an <code>aggregation</code> can be a KStream or a KTable, but the output stream will always be a KTable. This allows Kafka Streams to update an aggregate value upon the out-of-order arrival of further records after the value was produced and emitted. When such out-of-order arrival happens, the aggregating KStream or KTable emits a new aggregate value. Because the output is a KTable, the new value is considered to overwrite the old value with the same key in subsequent processing steps. Review comment: Unfortunately so, and I try to get it into better shape incrementally (reading GitHub diffs with long lines is just a pain). It would be awesome if somebody (*cough*) could do a PR just fixing it throughout the docs -- the current lazy approach is somewhat tiring. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org