Hi,

As I understand it, your proposal is to add event-time and state store metadata to the console sink to better highlight the semantics of the Structured Streaming engine. While I agree this enhancement can provide valuable insight into the engine's behavior, especially for newcomers, there are potential challenges that we need to be aware of:

- Volume of output: including additional metadata in the console sink increases the amount of information printed, making it harder to distinguish the actual data from the metadata, especially in scenarios with high data throughput.
- Readability: the extra metadata may make the console output harder to read for users who are primarily interested in the processed data and not the internal engine details.
- Misinterpretation: users unfamiliar with the internal workings of Structured Streaming might mistake the metadata for part of the actual data, leading to confusion.
- Overhead: printing additional metadata to the console may introduce some overhead, especially with high-frequency updates. While this overhead might be minimal, it is worth considering in performance-sensitive applications.
- Learning curve: while the proposal aims to make it easier for beginners to understand concepts like watermarks, operator state, and output rows, it could steepen the learning curve by introducing additional terminology and information.
- Configurability: users might benefit from the ability to selectively enable or disable the display of certain metadata elements to tailor the console output to their needs. However, this introduces additional complexity.

As usual with these things, your mileage may vary. Whilst the proposed enhancements offer valuable insights into the behavior of Structured Streaming, we ought to think about the potential downsides, particularly in terms of increased verbosity, complexity, and the impact on user experience.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

LinkedIn: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk.
Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy <neil.ramasw...@databricks.com.invalid> wrote:

> Hi all,
>
> I'd like to propose the idea of enhancing Structured Streaming's console
> sink to print event-time metrics and state store data, in addition to the
> sink's rows.
>
> I've noticed beginners often struggle to understand how watermarks,
> operator state, and output rows are all intertwined. By printing all of
> this information in the same place, I think that this sink will make it
> easier for users to see, and our docs to explain, how these concepts work
> together.
>
> For example, our docs could walk the users through a query with a
> 10-second tumbling window aggregation (e.g. with a .count()) and a
> 15-second watermark. After processing something like (foo, 17) and
> (bar, 15), writing another record (baz, 36) to the source would cause the
> following to print for batch 2:
>
> +----------------------------------------+
> |       WRITES TO SINK (Batch = 2)       |
> +--------------------------+-------------+
> |          window          |    count    |
> +--------------------------+-------------+
> | {10 seconds, 20 seconds} |      2      |
> +--------------------------+-------------+
> |               EVENT TIME               |
> +----------------------------------------+
> | watermark -> 21 seconds                |
> | numDroppedRows -> 0                    |
> +----------------------------------------+
> |               STATE ROWS               |
> +--------------------------+-------------+
> |           key            |    value    |
> +--------------------------+-------------+
> | {30 seconds, 40 seconds} |     {1}     |
> +--------------------------+-------------+
>
> From this (especially with expository help), it would be more apparent
> that the record at 36 seconds did three things: it advanced the watermark
> to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put
> into the state for [30, 40].
>
> One valid concern is that this sink would now be printing *metadata*, not
> just data: will users think that Structured Streaming writes metadata to
> sinks? Perhaps. But I think that we can clarify that in the documentation
> of the console sink.
>
> Finally, the specific behavior for handling queries with multiple stateful
> operations, joins, and (F)MGWS can be handled in a subsequent design
> discussion if the general idea is appreciated.
>
> *TLDR: I propose adding event-time and state store metadata to the console
> sink to better highlight the semantics of the Structured Streaming engine.*
>
> Neil
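
For readers who want to check the arithmetic in the quoted example (watermark 36-15 = 21, the [10, 20] window closing, and a new state row for [30, 40]), here is a minimal sketch in plain Python. It is a simplified model, not Spark code: the names `WINDOW`, `DELAY`, `window_of` and `run` are made up for illustration, records are processed one at a time, and it ignores how Spark actually schedules watermark updates across micro-batches.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length in seconds (from the quoted example)
DELAY = 15   # watermark delay in seconds (from the quoted example)


def window_of(ts):
    """Return the (start, end) tumbling window that contains event time ts."""
    start = (ts // WINDOW) * WINDOW
    return (start, start + WINDOW)


def run(events):
    """Process (key, event_time) records in order.

    Returns (emitted_rows, watermark, remaining_state, dropped_count).
    Simplifying assumptions: the watermark is max event time minus DELAY,
    a window is emitted and evicted once its end <= watermark, and a record
    whose window has already closed is counted as dropped.
    """
    state = defaultdict(int)  # window -> running count
    emitted = []              # (window, count) rows written to the sink
    watermark = 0
    max_event_time = 0
    dropped = 0

    for _key, ts in events:
        win = window_of(ts)
        if win[1] <= watermark:
            dropped += 1      # too late: its window is already closed
            continue
        state[win] += 1
        max_event_time = max(max_event_time, ts)
        watermark = max(watermark, max_event_time - DELAY)
        # Emit and evict every window the watermark has passed.
        for closed in sorted(w for w in state if w[1] <= watermark):
            emitted.append((closed, state.pop(closed)))

    return emitted, watermark, dict(state), dropped


if __name__ == "__main__":
    emitted, watermark, state, dropped = run(
        [("foo", 17), ("bar", 15), ("baz", 36)]
    )
    print("writes to sink:", emitted)
    print("watermark:", watermark, "numDroppedRows:", dropped)
    print("state rows:", state)
```

Under these assumptions, the three records from the example reproduce the three sections of the proposed console output: one sink row for the closed window, the event-time metrics, and the remaining state row.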