Hi all, I'd like to propose the idea of enhancing Structured Streaming's console sink to print event-time metrics and state store data, in addition to the sink's rows.
I've noticed beginners often struggle to understand how watermarks, operator state, and output rows are all intertwined. By printing all of this information in the same place, I think that this sink will make it easier for users to see—and our docs to explain—how these concepts work together. For example, our docs could walk the users through a query with a 10-second tumbling window aggregation (e.g. with a .count()) and a 15 second watermark. After processing something like (foo, 17) and (bar, 15), writing another record (baz, 36) to the source would cause the following to print for batch 2: +----------------------------------------+ | WRITES TO SINK (Batch = 2) | +--------------------------+-------------+ | window | count | +--------------------------+-------------+ | {10 seconds, 20 seconds} | 2 | +--------------------------+-------------+ | EVENT TIME | +----------------------------------------+ | watermark -> 21 seconds | | numDroppedRows -> 0 | +----------------------------------------+ | STATE ROWS | +--------------------------+-------------+ | key | value | +--------------------------+-------------+ | {30 seconds, 40 seconds} | {1} | +--------------------------+-------------+ >From this (especially with expository help), it would be more apparent that the record at 36 seconds did three things: it advanced the watermark to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put into the state for [30, 40]. One valid concern is that this sink would now be printing *metadata*, not just data: will users think that Structured Streaming writes metadata to sinks? Perhaps. But I think that we can clarify that in the documentation of the console sink. Finally, the specific behavior for handling queries with multiple stateful operations, joins, and (F)MGWS can be handled in a subsequent design discussion if the general idea is appreciated. *TLDR: I propose adding event-time and state store metadata to the console sink to better highlight the semantics of the Structured Streaming engine. * Neil