Thanks for the comments, Anish and Jerry. To summarize so far, we are in agreement that:
1. Enhanced console sink is a good tool for new users to understand Structured Streaming semantics.
2. It should be opt-in via an option (unlike my original proposal).
3. Of the two modes of verbosity I proposed, we're fine with the first mode for now (print sink data with event-time metadata and state data for stateful queries, with duration-rendered timestamps, with just the KeyWithIndexToValue state store for joins, and with a state table for every stateful operator, if there are multiple).

I think the last pending suggestion (from Raghu, Anish, and Jerry) is how to structure the output so that it's clear what is data and what is metadata. Here's my proposal:

------------------------------------------
BATCH: 1
------------------------------------------
+----------------------------------------+
| ROWS WRITTEN TO SINK                   |
+--------------------------+-------------+
| window                   | count       |
+--------------------------+-------------+
| {10 seconds, 20 seconds} | 2           |
+--------------------------+-------------+

+----------------------------------------+
| EVENT TIME METADATA                    |
+----------------------------------------+
| watermark -> 21 seconds                |
| numDroppedRows -> 0                    |
+----------------------------------------+

+----------------------------------------+
| ROWS IN STATE STORE                    |
+--------------------------+-------------+
| key                      | value       |
+--------------------------+-------------+
| {30 seconds, 40 seconds} | {1}         |
+--------------------------+-------------+

If there are no more major concerns, I think we can discuss smaller details in the JIRA ticket or the PR itself. I don't think a SPIP is needed for a flag-gated, benign change like this, but please let me know if you disagree.

Best,
Neil

On Thu, Feb 8, 2024 at 5:37 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> I am generally a +1 on this, as we can use this information in our docs to
> demonstrate certain concepts to potential users.
>
> I am in agreement with other reviewers that we should keep the existing
> default behavior of the console sink. This new style of output should be
> enabled behind a flag.
>
> As for the output of this "new mode" in the console sink, can we be more
> explicit about what is the actual output and what is the metadata? It is
> not clear from the logged output.
>
> On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
> <neil.ramasw...@databricks.com.invalid> wrote:
>
>> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
>> being off by default.
>>
>> I think it's reasonable to have 1 or 2 levels of verbosity:
>>
>> 1. The first verbose mode could target new users, and take a highly
>> opinionated view on what's important to understand streaming semantics.
>> This would include printing the sink rows, watermark, number of dropped
>> rows (if any), and state data. For state data, we should print for all
>> state stores (for multiple stateful operators), but for joins, I think
>> rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>> would render as durations (see original message) to make small examples
>> easy to understand.
>> 2. The second verbose mode could target more advanced users trying to
>> create a reproduction. In addition to the first verbose mode, it would
>> also print the other join state store, the number of evicted rows due to
>> the watermark, and print timestamps as extended ISO 8601 strings (same
>> as today).
>>
>> Rather than implementing both, I would prefer to implement the first
>> level, and evaluate later if the second would be useful.
>>
>> Mich, can you elaborate on why you don't think it's useful? To reiterate,
>> this proposal is to bring to light certain metrics/values that are
>> essential for understanding SS micro-batching semantics. It's to help
>> users go from 0 to 1, not 1 to 100.
(And the Spark UI can't be the place for
>> rendering sink data or state store values—there should be no sensitive
>> user data there.)
>>
>> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I don't think adding this to the streaming flow (at the micro level)
>>> will be that useful.
>>>
>>> However, this can be added to the Spark UI as an enhancement to the
>>> Streaming Query Statistics page.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary
>>> damages arising from such loss, damage or destruction.
>>>
>>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi <raghu.ang...@databricks.com>
>>> wrote:
>>>
>>>> Agree, the default behavior does not need to change.
>>>>
>>>> Neil, how about separating it into two sections:
>>>>
>>>> - Actual rows in the sink (same as current output)
>>>> - Followed by metadata
>>>>
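
[Editor's note: the duration-rendered timestamps discussed in the proposal above (e.g. `{10 seconds, 20 seconds}` instead of absolute ISO 8601 timestamps) could be sketched roughly as follows. This is an illustrative sketch only; `render_as_duration` and `render_window` are hypothetical helper names for this example, not actual Spark code.]

```python
# Sketch of duration-rendered timestamps: absolute event times are shown
# as offsets from a base instant (e.g. the query's earliest event time),
# so small examples stay easy to read.
# NOTE: these helpers are hypothetical, not part of Spark.
from datetime import datetime, timedelta


def render_as_duration(ts: datetime, base: datetime) -> str:
    """Render an absolute timestamp as a duration relative to `base`."""
    return f"{int((ts - base).total_seconds())} seconds"


def render_window(start: datetime, end: datetime, base: datetime) -> str:
    """Render a window's bounds the way the proposed console output does."""
    return f"{{{render_as_duration(start, base)}, {render_as_duration(end, base)}}}"


base = datetime(2024, 2, 8)
print(render_window(base + timedelta(seconds=10),
                    base + timedelta(seconds=20),
                    base))  # prints {10 seconds, 20 seconds}
```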