Thanks for the comments, Anish and Jerry. To summarize so far, we are in agreement that:
1. Enhanced console sink is a good tool for new users to understand Structured Streaming semantics.
2. It should be opt-in via an option (unlike my original proposal).
3. Of the two modes of verbosity I proposed, we're fine with the first mode for now (print sink data with event-time metadata and state data for stateful queries, with duration-rendered timestamps, with just the KeyWithIndexToValue state store for joins, and with a state table for every stateful operator, if there are multiple).

I think the last pending suggestion (from Raghu, Anish, and Jerry) is how to structure the output so that it's clear what is data and what is metadata. Here's my proposal:

------------------------------------------
BATCH: 1
------------------------------------------
+----------------------------------------+
| ROWS WRITTEN TO SINK                   |
+--------------------------+-------------+
| window                   | count       |
+--------------------------+-------------+
| {10 seconds, 20 seconds} | 2           |
+--------------------------+-------------+

+----------------------------------------+
| EVENT TIME METADATA                    |
+----------------------------------------+
| watermark -> 21 seconds                |
| numDroppedRows -> 0                    |
+----------------------------------------+

+----------------------------------------+
| ROWS IN STATE STORE                    |
+--------------------------+-------------+
| key                      | value       |
+--------------------------+-------------+
| {30 seconds, 40 seconds} | {1}         |
+--------------------------+-------------+

If there are no more major concerns, I think we can discuss smaller details in the JIRA ticket or the PR itself. I don't think a SPIP is needed for a flag-gated, benign change like this, but please let me know if you disagree.

Best,
Neil

On Thu, Feb 8, 2024 at 5:37 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> I am generally a +1 on this, as we can use this information in our docs to
> demonstrate certain concepts to potential users.
>
> I am in agreement with other reviewers that we should keep the existing
> default behavior of the console sink. This new style of output should be
> enabled behind a flag.
>
> As for the output of this "new mode" in the console sink, can we be more
> explicit about what is the actual output and what is the metadata? It is
> not clear from the logged output.
>
> On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
> <neil.ramasw...@databricks.com.invalid> wrote:
>
>> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
>> being off by default.
>>
>> I think it's reasonable to have 1 or 2 levels of verbosity:
>>
>> 1. The first verbose mode could target new users, and take a highly
>> opinionated view on what's important to understand streaming semantics.
>> This would include printing the sink rows, watermark, number of dropped
>> rows (if any), and state data. For state data, we should print for all
>> state stores (for multiple stateful operators), but for joins, I think
>> rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>> would render as durations (see original message) to make small examples
>> easy to understand.
>> 2. The second verbose mode could target more advanced users trying to
>> create a reproduction. In addition to the first verbose mode, it would
>> also print the other join state store, the number of evicted rows due to
>> the watermark, and print timestamps as extended ISO 8601 strings (same
>> as today).
>>
>> Rather than implementing both, I would prefer to implement the first
>> level, and evaluate later if the second would be useful.
>>
>> Mich, can you elaborate on why you don't think it's useful? To reiterate,
>> this proposal is to bring to light certain metrics/values that are
>> essential for understanding SS micro-batching semantics. It's to help
>> users go from 0 to 1, not 1 to 100.
(And the Spark UI can't be the place for
>> rendering sink data or state store values—there should be no sensitive
>> user data there.)
>>
>> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I don't think adding this to the streaming flow (at the micro level)
>>> will be that useful.
>>>
>>> However, this can be added to the Spark UI as an enhancement to the
>>> Streaming Query Statistics page.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary
>>> damages arising from such loss, damage or destruction.
>>>
>>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi <raghu.ang...@databricks.com>
>>> wrote:
>>>
>>>> Agree, the default behavior does not need to change.
>>>>
>>>> Neil, how about separating it into two sections:
>>>>
>>>> - Actual rows in the sink (same as current output)
>>>> - Followed by metadata
>>>>
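
[Editor's note: the duration-rendered timestamps discussed in the proposal above (e.g. `{10 seconds, 20 seconds}` instead of absolute ISO 8601 timestamps) could be sketched roughly as follows. This is an illustrative sketch only; `render_as_duration` and `render_window` are hypothetical helper names for this example, not actual Spark code.]

```python
# Sketch of duration-rendered timestamps: absolute event times are shown
# as offsets from a base instant (e.g. the query's earliest event time),
# so small examples stay easy to read.
# NOTE: these helpers are hypothetical, not part of Spark.
from datetime import datetime, timedelta


def render_as_duration(ts: datetime, base: datetime) -> str:
    """Render an absolute timestamp as a duration relative to `base`."""
    return f"{int((ts - base).total_seconds())} seconds"


def render_window(start: datetime, end: datetime, base: datetime) -> str:
    """Render a window's bounds the way the proposed console output does."""
    return f"{{{render_as_duration(start, base)}, {render_as_duration(end, base)}}}"


base = datetime(2024, 2, 8)
print(render_window(base + timedelta(seconds=10),
                    base + timedelta(seconds=20),
                    base))  # prints {10 seconds, 20 seconds}
```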