I am generally a +1 on this, as we can use this information in our docs to
demonstrate certain concepts to potential users.

I agree with the other reviewers that we should keep the existing default
behavior of the console sink; this new style of output should be enabled
behind a flag.

As for the output of this "new mode" in the console sink, can we be more
explicit about which lines are the actual sink output and which are
metadata? It is not clear from the logged output.
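To make the request concrete, here is a purely illustrative sketch (plain
Python, not actual Spark code or output) of how one micro-batch could be
rendered with the sink rows and the metadata in visibly separate sections.
The function name, the section headers, and the metadata keys are all
assumptions made up for this discussion, not anything the console sink
emits today:

```python
# Illustrative sketch only: mimics what a clearly separated console
# rendering *could* look like. None of these names come from the actual
# Spark console sink; they are assumptions for discussion.

def render_batch(batch_id, rows, metadata):
    """Render one micro-batch with sink rows and metadata in distinct sections."""
    lines = [f"-------- Batch: {batch_id} --------"]
    lines.append("== Sink rows ==")
    for row in rows:
        lines.append("  " + str(row))
    lines.append("== Metadata ==")
    for key, value in metadata.items():
        lines.append(f"  {key}: {value}")
    return "\n".join(lines)

print(render_batch(
    0,
    rows=[("user1", 3), ("user2", 5)],
    metadata={"watermark": "10 seconds", "droppedRows": 0},
))
```

Something along these lines would remove any ambiguity about which lines a
user should compare against their expected sink contents.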

On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
<neil.ramasw...@databricks.com.invalid> wrote:

> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
> being off by default.
>
> I think it's reasonable to have 1 or 2 levels of verbosity:
>
>    1. The first verbose mode could target new users, and take a highly
>    opinionated view on what's important to understand streaming semantics.
>    This would include printing the sink rows, watermark, number of dropped
>    rows (if any), and state data. For state data, we should print for all
>    state stores (for multiple stateful operators), but for joins, I think
>    rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>    would render as durations (see original message) to make small examples
>    easy to understand.
>    2. The second verbose mode could target more advanced users trying to
>    create a reproduction. In addition to the first verbose mode, it would also
>    print the other join state store, the number of evicted rows due to the
>    watermark, and print timestamps as extended ISO 8601 strings (same as
>    today).
>
> Rather than implementing both, I would prefer to implement the first
> level, and evaluate later if the second would be useful.
>
> Mich, can you elaborate on why you don't think it's useful? To reiterate,
> this proposal is to bring to light certain metrics/values that are
> essential for understanding SS micro-batching semantics. It's to help users
> go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
> rendering sink data or state store values—there should be no sensitive user
> data there.)
>
> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I don't think adding this to the streaming flow (at the micro level) will
>> be that useful.
>>
>> However, this can be added to Spark UI as an enhancement to the Streaming
>> Query Statistics page.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>
>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi <raghu.ang...@databricks.com>
>> wrote:
>>
>>> Agree, the default behavior does not need to change.
>>>
>>> Neil, how about separating it into two sections:
>>>
>>>    - Actual rows in the sink (same as current output)
>>>    - Followed by the metadata
>>>
>>>
