Hi,

As I understand it, the proposal suggests adding event-time and state store
metadata to the console sink to better highlight the semantics of the
Structured Streaming engine. While I agree this enhancement can provide
valuable insights into the engine's behavior, especially for newcomers,
there are potential challenges that we need to be aware of:

- Including additional metadata in the console sink output increases the
volume of information printed. In scenarios with high data throughput, this
can make it harder to distinguish the actual data from the metadata.
- The extra metadata may also hurt readability for users who are primarily
interested in the processed data and not the internal engine details.
- Users unfamiliar with the internal workings of Structured Streaming might
misinterpret the metadata as part of the actual data, leading to confusion.
- Printing additional metadata to the console may introduce some overhead,
especially in scenarios where high-frequency updates occur. While this
overhead might be minimal, it is worth considering in performance-sensitive
applications.
- While the proposal aims to make it easier for beginners to understand
concepts like watermarks, operator state, and output rows, it could
steepen the learning curve by introducing additional terminology and
information.
- Users might benefit from the ability to selectively enable or disable the
display of certain metadata elements to tailor the console output to their
specific needs. However, this introduces additional complexity.
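
To make the last point concrete, here is a rough sketch of what per-section
toggles could look like. It is plain Python rather than Spark code, and the
names ConsoleOptions, show_event_time, show_state and render_batch are
hypothetical; none of them are existing console sink options. The sample
values are the ones from the batch 2 example in the proposal.

```python
from dataclasses import dataclass


@dataclass
class ConsoleOptions:
    # Hypothetical toggles for illustration only; these are NOT existing
    # options of the Structured Streaming console sink.
    show_event_time: bool = False
    show_state: bool = False


def render_batch(batch_id, rows, event_time, state_rows, opts):
    """Format one micro-batch, adding metadata sections only when enabled."""
    lines = [f"WRITES TO SINK (Batch = {batch_id})"]
    lines += [f"  {window} | {count}" for window, count in rows]
    if opts.show_event_time:
        lines.append("EVENT TIME")
        lines += [f"  {name} -> {value}" for name, value in event_time.items()]
    if opts.show_state:
        lines.append("STATE ROWS")
        lines += [f"  {key} | {value}" for key, value in state_rows]
    return "\n".join(lines)


# Sample values copied from the batch 2 example in the proposal.
rows = [("{10 seconds, 20 seconds}", 2)]
event_time = {"watermark": "21 seconds", "numDroppedRows": 0}
state_rows = [("{30 seconds, 40 seconds}", "{1}")]

# Default: data rows only, as the console sink prints today.
print(render_batch(2, rows, event_time, state_rows, ConsoleOptions()))

# Opt in to both metadata sections.
print(render_batch(2, rows, event_time, state_rows,
                   ConsoleOptions(show_event_time=True, show_state=True)))
```

Keeping both toggles off by default would leave today's output unchanged,
at the cost of two more options to document and test.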

As usual with these things, your mileage may vary. Whilst the proposed
enhancements offer valuable insights into the behavior of Structured
Streaming, we ought to think about the potential downsides, particularly in
terms of increased verbosity, complexity, and the impact on user experience.
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
<neil.ramasw...@databricks.com.invalid> wrote:

> Hi all,
>
> I'd like to propose the idea of enhancing Structured Streaming's console
> sink to print event-time metrics and state store data, in addition to the
> sink's rows.
>
> I've noticed beginners often struggle to understand how watermarks,
> operator state, and output rows are all intertwined. By printing all of
> this information in the same place, I think that this sink will make it
> easier for users to see—and our docs to explain—how these concepts work
> together.
>
> For example, our docs could walk the users through a query with a
> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
> second watermark. After processing something like (foo, 17) and (bar, 15),
> writing another record (baz, 36) to the source would cause the following to
> print for batch 2:
>
> +----------------------------------------+
> |      WRITES TO SINK (Batch = 2)        |
> +--------------------------+-------------+
> |          window          |   count     |
> +--------------------------+-------------+
> | {10 seconds, 20 seconds} |      2      |
> +--------------------------+-------------+
> |             EVENT TIME                 |
> +----------------------------------------+
> | watermark -> 21 seconds                |
> | numDroppedRows -> 0                    |
> +----------------------------------------+
> |             STATE ROWS                 |
> +--------------------------+-------------+
> |           key            |    value    |
> +--------------------------+-------------+
> | {30 seconds, 40 seconds} |     {1}     |
> +--------------------------+-------------+
>
> From this (especially with expository help), it would be more apparent
> that the record at 36 seconds did three things: it advanced the watermark
> to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put
> into the state for [30, 40].
>
> One valid concern is that this sink would now be printing *metadata*, not
> just data: will users think that Structured Streaming writes metadata to
> sinks? Perhaps. But I think that we can clarify that in the documentation
> of the console sink.
>
> Finally, the specific behavior for handling queries with multiple stateful
> operations, joins, and (F)MGWS can be handled in a subsequent design
> discussion if the general idea is appreciated.
>
> *TLDR: I propose adding event-time and state store metadata to the console
> sink to better highlight the semantics of the Structured Streaming engine. *
>
> Neil
>
>
>
