mjsax commented on a change in pull request #8995: URL: https://github.com/apache/kafka/pull/8995#discussion_r452481312
########## File path: docs/streams/core-concepts.html ########## @@ -170,13 +150,59 @@ <h3><a id="streams_concepts_duality" href="#streams_concepts_duality">Duality of or to run <a id="streams-developer-guide-interactive-queries" href="/{{version}}/documentation/streams/developer-guide/interactive-queries#interactive-queries">interactive queries</a> against your application's latest processing results. And, beyond its internal usage, the Kafka Streams API also allows developers to exploit this duality in their own applications. - </p> + </p> - <p> + <p> Before we discuss concepts such as <a id="streams-developer-guide-dsl-aggregating" href="/{{version}}/documentation/streams/developer-guide/dsl-api#aggregating">aggregations</a> in Kafka Streams, we must first introduce <strong>tables</strong> in more detail, and talk about the aforementioned stream-table duality. - Essentially, this duality means that a stream can be viewed as a table, and a table can be viewed as a stream. - </p> + Essentially, this duality means that a stream can be viewed as a table, and a table can be viewed as a stream. Kafka's log compaction feature, for example, exploits this duality. + </p> + + <p> + A simple form of a table is a collection of key-value pairs, also called a map or associative array. Such a table may look as follows: + </p> + <img class="centered" src="/{{version}}/images/streams-table-duality-01.png"> + + The <b>stream-table duality</b> describes the close relationship between streams and tables. + <ul> + <li><b>Stream as Table</b>: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can be easily turned into a "real" table by replaying the changelog from beginning to end to reconstruct the table. 
Similarly, in a more general analogy, aggregating data records in a stream - such as computing the total number of pageviews by user from a stream of pageview events - will return a table (here with the key and the value being the user and its corresponding pageview count, respectively).</li> + <li><b>Table as Stream</b>: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream's data records are key-value pairs). A table is thus a stream in disguise, and it can be easily turned into a "real" stream by iterating over each key-value entry in the table.</li> + </ul> + + <p> + Let's illustrate this with an example. Imagine a table that tracks the total number of pageviews by user (first column of diagram below). Over time, whenever a new pageview event is processed, the state of the table is updated accordingly. Here, the state changes between different points in time - and different revisions of the table - can be represented as a changelog stream (second column). + </p> + <img class="centered" src="/{{version}}/images/streams-table-duality-02.png" style="width:300px"> + + <p> + Interestingly, because of the stream-table duality, the same stream can be used to reconstruct the original table (third column): + </p> + <img class="centered" src="/{{version}}/images/streams-table-duality-03.png" style="width:600px"> + + <p> + The same mechanism is used, for example, to replicate databases via change data capture (CDC) and, within Kafka Streams, to replicate its so-called state stores across machines for fault-tolerance. + The stream-table duality is such an important concept that Kafka Streams models it explicitly via the <a href="#streams_kstream_ktable">KStream, KTable, and GlobalKTable</a> interfaces. 
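The changelog-replay mechanism described above can be sketched in a few lines. This is a hedged illustration in plain Python, not the Kafka Streams API: the function names and the pageview data are invented for the example, but the logic mirrors the text -- each changelog record overwrites the previous value for its key, so replaying the changelog from beginning to end reconstructs the table, and iterating over the table yields a stream again.

```python
# Plain-Python sketch of the stream-table duality (illustrative only;
# not the Kafka Streams API).

def replay_changelog(changelog):
    """Reconstruct a table (dict) by replaying a changelog from start to end.

    Each record is a (key, value) pair; a later record for the same key
    overwrites the earlier one, i.e. it captures a state change of the table.
    """
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

def table_to_stream(table):
    """Turn a table back into a stream by iterating over its key-value entries."""
    return list(table.items())

# Pageview-count example from the diagrams: each pageview event updates a
# user's count, and every update is emitted as a changelog record.
pageview_events = ["alice", "bob", "alice", "alice"]
changelog = []
counts = {}
for user in pageview_events:
    counts[user] = counts.get(user, 0) + 1
    changelog.append((user, counts[user]))  # emit the new aggregate value

print(changelog)                    # [('alice', 1), ('bob', 1), ('alice', 2), ('alice', 3)]
print(replay_changelog(changelog))  # {'alice': 3, 'bob': 1}
```

Note that the reconstructed table only retains the latest value per key -- which is exactly the invariant that Kafka's log compaction preserves, and why a compacted changelog topic suffices to restore a state store.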
+ </p> + + <h3><a id="streams_concepts_aggregations" href="#streams_concepts_aggregations">Aggregations</a></h3> + <p> + An <strong>aggregation</strong> operation takes one input stream or table, and yields a new table by combining multiple input records into a single output record. Examples of aggregations are computing counts or sums. + </p> + + <p> + In the <code>Kafka Streams DSL</code>, an input stream of an <code>aggregation</code> can be a KStream or a KTable, but the output stream will always be a KTable. This allows Kafka Streams to update an aggregate value upon the out-of-order arrival of further records after the value was produced and emitted. When such out-of-order arrival happens, the aggregating KStream or KTable emits a new aggregate value. Because the output is a KTable, the new value is considered to overwrite the old value with the same key in subsequent processing steps. Review comment: Unfortunately so, and I try to get it into better shape incrementally (reading GitHub diffs with long lines is just a pain). It would be awesome if somebody (*cough*) could do a PR just fixing it throughout the docs -- the current lazy approach is somewhat tiring. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org