Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Anish Shrigondekar
Hi Neil,

Thanks for putting this together. +1 to the proposal of enhancing the
console sink further. I think it will help new users understand some of the
streaming/micro-batch semantics a bit better in Spark.

Agree with not having verbose mode enabled by default. I think step 1
described above sounds most useful/relevant to the large majority of users.
It might be useful to cleanly separate the output rows (what the sink
produces), the query metadata (watermark, etc.), and the state store data
maintained by stateful queries, so that the output is easier to
understand/consume.

Thanks,
Anish

On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
 wrote:

> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
> being off by default.
>
> I think it's reasonable to have 1 or 2 levels of verbosity:
>
>    1. The first verbose mode could target new users, and take a highly
>    opinionated view on what's important to understand streaming semantics.
>    This would include printing the sink rows, watermark, number of dropped
>    rows (if any), and state data. For state data, we should print for all
>    state stores (for multiple stateful operators), but for joins, I think
>    rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>    would render as durations (see original message) to make small examples
>    easy to understand.
>    2. The second verbose mode could target more advanced users trying to
>    create a reproduction. In addition to the first verbose mode, it would also
>    print the other join state store, the number of evicted rows due to the
>    watermark, and print timestamps as extended ISO 8601 strings (same as
>    today).
>
> Rather than implementing both, I would prefer to implement the first
> level, and evaluate later if the second would be useful.
>
> Mich, can you elaborate on why you don't think it's useful? To reiterate,
> this proposal is to bring to light certain metrics/values that are
> essential for understanding SS micro-batching semantics. It's to help users
> go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
> rendering sink data or state store values—there should be no sensitive user
> data there.)
>
> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh 
> wrote:
>
>> I don't think adding this to the streaming flow (at micro level) will be
>> that useful
>>
>> However, this can be added to Spark UI as an enhancement to the Streaming
>> Query Statistics page.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
>> wrote:
>>
>>> Agree, the default behavior does not need to change.
>>>
>>> Neil, how about separating it into two sections:
>>>
>>>- Actual rows in the sink (same as current output)
>>>- Followed by metadata
>>>
>>>
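
The separation discussed in this thread (sink rows first, then query metadata, then state store data) could be sketched roughly as below. This is only an illustration under stated assumptions, not the proposed implementation: the function names, section headers, and the compact duration format are all hypothetical, and the duration format from the original message is not reproduced here.

```python
def format_duration(seconds):
    """Render a non-negative number of seconds compactly, e.g. 75 -> '1m15s'.

    Hypothetical format; the thread only says timestamps would render as durations.
    """
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    if secs or not parts:
        parts.append(f"{secs}s")
    return "".join(parts)


def render_batch(batch_id, sink_rows, watermark_seconds, dropped_rows, state_rows):
    """Build the text for one micro-batch as three clearly separated sections."""
    lines = [f"Batch: {batch_id}", "-- Sink rows --"]
    lines += [f"  {row}" for row in sink_rows] or ["  (none)"]
    lines += [
        "-- Query metadata --",
        f"  watermark: {format_duration(watermark_seconds)}",
        f"  dropped rows: {dropped_rows}",
        "-- State store --",
    ]
    lines += [f"  {key} -> {value}" for key, value in state_rows.items()] or ["  (empty)"]
    return "\n".join(lines)


# Example with made-up values for a single micro-batch.
print(render_batch(
    batch_id=1,
    sink_rows=[("user-a", 3)],
    watermark_seconds=75,
    dropped_rows=0,
    state_rows={"user-b": 1},
))
```

Keeping the sink rows as the first section means the non-verbose output could stay a prefix of the verbose output, which matches the suggestion that the default behavior does not need to change.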


Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-08 Thread Anish Shrigondekar
Thanks Jungtaek for creating the Vote thread.

+1 (non-binding) from my side too.

Thanks,
Anish

On Tue, Jan 9, 2024 at 6:09 AM Jungtaek Lim 
wrote:

> Starting with my +1 (non-binding). Thanks!
>
> On Tue, Jan 9, 2024 at 9:37 AM Jungtaek Lim 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Structured Streaming - Arbitrary
>> State API v2.
>>
>> References:
>>
>>- JIRA ticket
>>- SPIP doc
>>- Discussion thread
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>> Jungtaek Lim (HeartSaVioR)
>>
>


Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2023-11-29 Thread Anish Shrigondekar
Hi dev,

Addressed the comments that Jungtaek had on the doc. Bumping the thread
once again to see if other folks have any feedback on the proposal.

Thanks,
Anish

On Mon, Nov 27, 2023 at 8:15 PM Jungtaek Lim 
wrote:

> Kindly bump for better reach after the long holiday. Please kindly review
> the proposal which opens the chance to address complex use cases of
> streaming. Thanks!
>
> On Thu, Nov 23, 2023 at 8:19 AM Jungtaek Lim 
> wrote:
>
>> Thanks Anish for proposing the SPIP and initiating this thread! I believe
>> this SPIP will help a bunch of complex use cases in streaming.
>>
>> dev@: We are coincidentally initiating this discussion during the
>> Thanksgiving holidays. We understand people in the US may not have time to
>> review the SPIP, and we plan to bump this thread early next week. We are
>> open to any feedback from outside the US during the holiday. We can either
>> address all feedback together after the holiday (Anish is in the US), or I
>> can answer if the feedback is more of a question. Thanks!
>>
>> On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar <
>> anish.shrigonde...@databricks.com> wrote:
>>
>>> Hi dev,
>>>
>>> I would like to start a discussion on "Structured Streaming - Arbitrary
>>> State API v2". This proposal aims to address a bunch of limitations we see
>>> today with the mapGroupsWithState/flatMapGroupsWithState operators. The
>>> detailed set of limitations is described in the SPIP doc.
>>>
>>> We propose to support various features such as multiple state variables
>>> (flexible data modeling), composite types, enhanced timer functionality,
>>> support for chaining operators after the new operator, handling initial
>>> state along with the state data source, schema evolution, etc. This will
>>> allow users to write more powerful streaming state management logic,
>>> primarily used in operational use cases. Other built-in stateful operators
>>> could also benefit from such changes in the future.
>>>
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
>>> SPIP:
>>> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
>>> Design Doc:
>>> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>>>
>>> cc - @Jungtaek Lim, who has graciously
>>> agreed to be the shepherd for this project
>>>
>>> Looking forward to your feedback!
>>>
>>> Thanks,
>>> Anish
>>>
>>


[DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2023-11-22 Thread Anish Shrigondekar
Hi dev,

I would like to start a discussion on "Structured Streaming - Arbitrary
State API v2". This proposal aims to address a bunch of limitations we see
today with the mapGroupsWithState/flatMapGroupsWithState operators. The
detailed set of limitations is described in the SPIP doc.

We propose to support various features such as multiple state variables
(flexible data modeling), composite types, enhanced timer functionality,
support for chaining operators after the new operator, handling initial
state along with the state data source, schema evolution, etc. This will
allow users to write more powerful streaming state management logic,
primarily used in operational use cases. Other built-in stateful operators
could also benefit from such changes in the future.

JIRA: https://issues.apache.org/jira/browse/SPARK-45939
SPIP:
https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
Design Doc:
https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing

cc - @Jungtaek Lim, who has graciously
agreed to be the shepherd for this project

Looking forward to your feedback!

Thanks,
Anish


Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread Anish Shrigondekar
+1 (non-binding)

Thanks,
Anish

On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan  wrote:

> +1
>
> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim 
> wrote:
>
>> Starting with my +1 (non-binding). Thanks!
>>
>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: State Data Source - Reader.
>>>
>>> The high-level summary of the SPIP is that we propose a new data source
>>> which enables reading the state store in the checkpoint via a batch
>>> query. This would enable two major use cases: 1) constructing tests that
>>> verify the state store, and 2) inspecting values in the state store
>>> during an incident.
>>>
>>> References:
>>>
>>>- JIRA ticket
>>>- SPIP doc
>>>- Discussion thread
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-16 Thread Anish Shrigondekar
Hi Jungtaek,

Thanks for putting this together. +1 from me and looks good overall. Posted
some minor comments/questions to the doc.

Thanks,
Anish

On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny 
wrote:

> Thank you, Jungtaek, for your answers! It's clear now.
>
> +1 for me. It seems like a prerequisite for further ops-related
> improvements to state store management. I'm thinking especially of
> state rebalancing that could rely on this read+write state store API. I
> don't mean dynamic state rebalancing, which could probably be
> implemented with lower latency directly in the stateful API. Instead, I'm
> thinking of an offline job that rebalances the state, after which the
> stateful pipeline is restarted with the changed number of shuffle partitions.
>
> Best,
> Bartosz.
>
> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim 
> wrote:
>
>> bump for better reach
>>
>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Sorry, please use this link instead for SPIP doc:
>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>
>>>
>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>
>>>> This proposal aims to introduce a new data source "statestore" which
>>>> enables reading the state rows from an existing checkpoint via an offline
>>>> (batch) query. This will enable users to 1) create unit tests against a
>>>> stateful query that verify the state value (especially
>>>> flatMapGroupsWithState), and 2) gather more context on the status when an
>>>> incident occurs, especially for incorrect output.
>>>>
>>>> *SPIP*:
>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>
>>>> Looking forward to your feedback!
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>> since the writer requires us to consider more cases. We are planning on it.
>>>>
>>>
>
> --
> Bartosz Konieczny
> freelance data engineer
> https://www.waitingforcode.com
> https://github.com/bartosz25/
> https://twitter.com/waitingforcode
>
>
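
The reader described in this thread (a "statestore" data source queried in batch mode against an existing checkpoint) could be used roughly as sketched below. This is a sketch under assumptions, not the final API: the source name "statestore" comes from the proposal text, while the helper name `read_state`, the `operatorId` option, and the checkpoint path are illustrative assumptions. The `spark` argument is expected to be a SparkSession (or any object with the same `read.format(...).option(...).load(...)` chain).

```python
def read_state(spark, checkpoint_dir, operator_id=0):
    """Load the state rows of one stateful operator from a checkpoint.

    Runs as an ordinary batch read, so it can be used in tests or while
    investigating an incident without touching the running query.
    """
    return (
        spark.read.format("statestore")     # source name taken from the proposal
        .option("operatorId", operator_id)  # assumed option for picking the operator
        .load(checkpoint_dir)               # checkpoint location of the streaming query
    )
```

In a unit test, the returned DataFrame could be collected and compared against the expected state after feeding known input to the stateful query. The exact schema of the result depends on the final design in the SPIP doc.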


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-13 Thread Anish Shrigondekar
+1 on the DStream deprecation proposal

On Fri, Jan 13, 2023 at 10:47 AM Jerry Peng 
wrote:

> +1 in general for marking the DStreams API as deprecated
>
> Jungtaek, can you please provide / elaborate on the concrete actions you
> intend to take for the deprecation process?
>
> Best,
>
> Jerry
>
> On Thu, Jan 12, 2023 at 11:16 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim
>>  wrote:
>> >
>> > Yes, exactly. Sorry for the confusion - I should have clarified the
>> action items in the proposal.
>> >
>> > On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> Then, could you elaborate `the proposed code change` specifically?
>> >> Maybe, usual deprecation warning logs and annotation on the API?
>> >>
>> >>
>> >> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>
>> >>> Maybe I need to clarify - my proposal is "explicitly" deprecating it,
>> which requires a code change for sure. Guidance on the Spark website is done
>> already as I mentioned - we updated the DStream doc page to mention that
>> DStream is a "legacy" project and users should move to SS. I don't feel
>> this is sufficient to discourage users from using it, hence this
>> proposal.
>> >>>
>> >>> Sorry for the confusion. I just wanted to make clear that the goal of the
>> proposal is not "removing" the API. Discussions on the removal of an API
>> don't tend to go well, so I wanted to make sure I'm not read that way.
>> >>>
>> >>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> 
>>  +1 for the proposal (guiding only without any code change).
>> 
>>  Thanks,
>>  Dongjoon.
>> 
>>  On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu 
>> wrote:
>> >
>> > +1
>> >
>> >
>> > On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>> >>
>> >> +1
>> >>
>> >> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
>> wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> 
>>  bump for more visibility.
>> 
>>  On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Hi dev,
>> >
>> > I'd like to propose the deprecation of DStream in Spark 3.4, in
>> favor of promoting Structured Streaming.
>> > (Sorry for the late proposal, if we don't make the change in
>> 3.4, we will have to wait for another 6 months.)
>> >
>> > We have been focusing on Structured Streaming for years (across
>> multiple major and minor versions), and during that time we haven't made any
>> improvements to DStream. Furthermore, we recently updated the DStream doc
>> to explicitly say DStream is a legacy project.
>> >
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>> >
>> > The baseline for deprecation is that we don't see a particular
>> use case which only DStream solves. This is a different story from GraphX
>> and MLlib, as we don't have replacements for those.
>> >
>> > The proposal does not mean we will remove the API soon, in line with
>> how the Spark project has handled deprecation of public APIs. I don't
>> intend to propose the target version for removal. The goal is to guide
>> users to refrain from constructing a new workload with DStream. We might
>> want to remove it in the future, but that would require a new discussion
>> thread at that time.
>> >
>> > What do you think?
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
>>
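
The "usual deprecation warning logs and annotation on the API" mentioned in this thread could look roughly like the sketch below on the Python side. This is a generic illustration, not the change that was made in Spark: the `deprecated` decorator, the `LegacyStreamingContext` class, and the warning text are hypothetical names used only to show the pattern of warning on use while keeping the API working.

```python
import functools
import warnings


def deprecated(message):
    """Decorator that emits a FutureWarning each time the wrapped callable is used."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not at this wrapper.
            warnings.warn(message, FutureWarning, stacklevel=2)
            return func(*args, **kwargs)
        return inner
    return wrap


class LegacyStreamingContext:
    """Hypothetical stand-in for a legacy entry point that stays usable."""

    @deprecated("DStream is deprecated; consider migrating to Structured Streaming.")
    def __init__(self, batch_seconds):
        self.batch_seconds = batch_seconds


# Constructing the object still works, but the user sees a warning.
ctx = LegacyStreamingContext(5)
```

The behavior of the API is unchanged, which fits the stated goal of steering users away from new DStream workloads without removing anything.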