Hi, I think one more thing we should consider for 2.0 is changing the default values of configuration options to improve the out-of-the-box user experience.
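To make this concrete, here is a rough sketch of the kind of overrides I have in mind. The option keys are the ones I recall from the Flink 1.17 configuration docs, the values are placeholders rather than proposed defaults, and the names `TYPICAL_OVERRIDES` and `render_flink_conf` are hypothetical helpers made up for this illustration.

```python
# Illustrative only: keys as recalled from the Flink 1.17 configuration docs;
# values are placeholders, not proposed defaults.
TYPICAL_OVERRIDES = {
    "execution.checkpointing.interval": "3 min",
    "execution.checkpointing.mode": "EXACTLY_ONCE",
    "state.backend.incremental": "true",
    "table.exec.mini-batch.enabled": "true",
    "table.exec.mini-batch.allow-latency": "5 s",
    "table.exec.mini-batch.size": "5000",
}


def render_flink_conf(options):
    """Render options as flat `key: value` lines, sorted by key (hypothetical helper)."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(options.items()))


if __name__ == "__main__":
    # Print the overrides in the flat style used by flink-conf.yaml.
    print(render_flink_conf(TYPICAL_OVERRIDES))
```

Every entry in that map is something a new user has to discover and set by hand today; with better defaults most of them could simply be left out.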
Currently, in order to run a Flink job, users may need to set a number of
configuration options, such as mini-batch, checkpoint interval, exactly-once,
incremental checkpoints, etc. This is verbose and hard for beginners. Most of
these options can have a universally applicable value. Because changing
default values is a breaking change, I think it's worth considering doing so
in 2.0.

What do you think?

Best,
Jark

On Wed, 28 Jun 2023 at 14:10, Sergey Nuyanzin <snuyan...@gmail.com> wrote:

> Hi Chesnay
>
> >"Move Calcite rules from Scala to Java": I would hope that this would be
> >an entirely internal change, and could thus be an incremental process
> >independent of major releases.
> >What is the actual scale of this item; how much are we actually
> >re-writing?
>
> Thanks for asking
> yes, you're right, that should be an internal change.
> Yeah I was also thinking about incremental change (rule by rule or
> a reasonably small group of rules).
> And yes, this could be an activity independent of major releases
>
> The problem is actually for children of RelOptRule.
> Currently I see 60+ such rules (in Scala) using the mentioned deprecated
> api.
> There are also children of ConverterRule (50+) which do not have such
> issues.
> Maybe it could be considered as the next step to have all the rules in
> Java.
>
> On Tue, Jun 27, 2023 at 1:34 PM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > Hi Alex & Gyula,
> >
> > > By compatibility discussion do you mean the "[DISCUSS] FLIP-321:
> > > Introduce an API deprecation process" thread [1]?
> >
> > Yes, I meant the FLIP-321 discussion. I just noticed I pasted the wrong
> > url in my previous email. Sorry for the mistake.
> >
> > > I am also curious to know if the rationale behind this new API has been
> > > previously discussed on the mailing list. Do we have a list of
> > > shortcomings in the current DataStream API that it tries to resolve?
> > > How does the current ProcessFunction functionality fit into the
> > > picture? Will it be kept as is or subsumed by new API?
> >
> > > I don't think we should create a replacement for the DataStream API
> > > unless we have a very good reason to do so and with a proper discussion
> > > about this as Alex said.
> >
> > The ProcessFunction API which is targeting to replace DataStream API is
> > still a proposal, not a decision. Sorry for the confusion, I should have
> > been more careful with my words, not giving the impression that this is
> > something we'll do anyway.
> >
> > There will be a FLIP describing the motivations and designs in detail, for
> > the community to discuss and vote on. We are still working on it. TBH,
> > this is not trivial and we would need more time on it.
> >
> > Just to quickly share some background:
> >
> >    - We see quite some problems with the current DataStream APIs
> >       - Users are working with concrete classes rather than interfaces,
> >         which means
> >          - Users can access methods that are designed to be used by
> >            internal classes, even though they are annotated with
> >            `@Internal`. E.g., `DataStream#getTransformation`.
> >          - Changes to the non-API implementations (e.g.,
> >            `Transformation`) would affect the API classes (e.g.,
> >            `DataStream`), which makes it hard to provide binary
> >            compatibility.
> >       - Internal classes are used as parameter / return-value of public
> >         APIs. E.g., while `AbstractStreamOperator` is PublicEvolving,
> >         `StreamTask`, which is returned from
> >         `AbstractStreamOperator#getContainingTask`, is Internal.
> >       - In many cases, users are asked to extend the API classes, rather
> >         than implementing interfaces. E.g., `AbstractStreamOperator`.
> >          - Any changes to the base classes, even the internal part, may
> >            affect the behavior of the user-provided sub-classes
> >          - Users can override the behavior of the base classes
> >       - The API module `flink-streaming-java` contains non-API classes,
> >         and depends on internal modules such as `flink-runtime`, which
> >         means
> >          - Changes to the internal modules may affect the API modules,
> >            which requires users to re-build their applications upon
> >            upgrading
> >          - The artifact users need for building their applications is
> >            larger than necessary.
> >       - We probably should not expose operators (e.g.,
> >         `AbstractStreamOperator`) to users. Functions should be enough
> >         for users to define their data processing logic. Exposing
> >         operator-level concepts (e.g., mailbox thread model, checkpoint
> >         barrier alignment, etc.) is unnecessary and limits the
> >         improvement regarding such exposed mechanisms with compatibility
> >         considerations.
> >       - The current DataStream API seems to be a mixture of many things,
> >         making it hard to understand especially for newcomers. It might
> >         be better to re-organize it into several parts: (the taxonomy
> >         below is just an example; we are still working on this)
> >          - The most fundamental stateful stream processing: streams,
> >            partitions / key, process functions, state, timeline-service
> >          - An extension for common batch-streaming unified functions:
> >            map, flatmap, filter, agg, reduce, join, etc.
> >          - An extension for windowing supports: window, triggering
> >          - An extension for event-time supports: event time, watermark
> >          - The extensions are like short-cuts / sugars, without which
> >            users can probably still achieve the same behavior by working
> >            with the fundamental APIs, but would be a lot easier with the
> >            extensions
> >    - The original plan was to do in-place refactors / changes on
> >      DataStream API. Some related items are listed in this doc [2]
> >      attached to the kicking off email [3].
> >      Not all of the above issues are listed, because we had not looked
> >      into this as deeply at that time.
> >    - We proposed this as a new API rather than in-place refactors in the
> >      2.0 work item list, because we realized the changes might be too big
> >      for an in-place change. First having a new API then gradually
> >      retiring the old one would help users to smoothly migrate between
> >      them.
> >
> > A thorough discussion is definitely needed once the FLIP is out. And of
> > course it's possible that the FLIP might be rejected. Given that we are
> > planning for release 2.0, I just feel it would be better to bring this up
> > early even if the concrete plan is not yet ready.
> >
> > Best,
> >
> > Xintong
> >
> >
> > [1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9
> > [2] https://docs.google.com/document/d/1_PMGl5RuDQGlV99_gL3y7OiRsF0DgCk91Coua6hFXhE/edit?usp=sharing
> > [3] https://lists.apache.org/thread/b8w5cx0qqbwzzklyn5xxf54vw9ymys1c
> >
> > On Tue, Jun 27, 2023 at 5:15 PM Gyula Fóra <gyf...@apache.org> wrote:
> >
> > > Hey!
> > >
> > > I share the same concerns mentioned above regarding the
> > > "ProcessFunction API".
> > >
> > > I don't think we should create a replacement for the DataStream API
> > > unless we have a very good reason to do so and with a proper discussion
> > > about this as Alex said.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Tue, Jun 27, 2023 at 11:03 AM Alexander Fedulov <
> > > alexander.fedu...@gmail.com> wrote:
> > >
> > > > Hi Xintong,
> > > >
> > > > By compatibility discussion do you mean the "[DISCUSS] FLIP-321:
> > > > Introduce an API deprecation process" thread [1]?
> > > >
> > > > I am also curious to know if the rationale behind this new API has
> > > > been previously discussed on the mailing list. Do we have a list of
> > > > shortcomings in the current DataStream API that it tries to resolve?
> > > > How does the current ProcessFunction functionality fit into the
> > > > picture? Will it be kept as is or subsumed by new API?
> > > >
> > > > [1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9
> > > >
> > > > Best,
> > > > Alex
> > > >
> > > > On Mon, 26 Jun 2023 at 14:33, Xintong Song <tonysong...@gmail.com>
> > > > wrote:
> > > >
> > > > > > The ProcessFunction API item is giving me the most headaches
> > > > > > because it's very unclear what it actually entails; like is it an
> > > > > > entirely separate API to DataStream (sounds like it is!) or an
> > > > > > extension of DataStream. How much will it share the internals
> > > > > > with DataStream etc.; how does it relate to the Table API (w.r.t.
> > > > > > switching APIs / what Table API uses underneath).
> > > > >
> > > > > I totally understand your confusion. We started planning this after
> > > > > kicking off the release 2.0, so there's still a lot to be explored
> > > > > and the plan keeps changing.
> > > > >
> > > > >    - In the beginning, we planned to do an in-place refactor of the
> > > > >      DataStream API, until the API migration period was proposed.
> > > > >    - Then we wanted to make it an entirely separate API from
> > > > >      DataStream, and listed it as a must-have for release 2.0 so
> > > > >      that we can remove DataStream once it's ready.
> > > > >    - However, depending on the outcome of the API compatibility
> > > > >      discussion [1], we may not be able to remove DataStream in 2.0
> > > > >      anyway, which means we might need to re-evaluate the necessity
> > > > >      of this item for 2.0.
> > > > >
> > > > > I'd say we wait a bit longer for the compatibility discussion [1]
> > > > > and decide the priority for this item afterwards.
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > > [1] https://lists.apache.org/list.html?dev@flink.apache.org
> > > > >
> > > > > On Mon, Jun 26, 2023 at 6:00 PM Chesnay Schepler <ches...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > By and large I'm quite happy with the list of items.
> > > > > >
> > > > > > I'm curious as to why the "Disaggregated State Management" item
> > > > > > is marked as a must-have; will it require changes that break
> > > > > > something? What prevents it from being added in 2.1?
> > > > > >
> > > > > > We may want to update the Java 17 item to "Make Java 17 the
> > > > > > default, drop Java 8/11". Maybe even split it into a must-have
> > > > > > "Drop Java 8" and a nice-to-have "Drop Java 11"?
> > > > > >
> > > > > > "Move Calcite rules from Scala to Java": I would hope that this
> > > > > > would be an entirely internal change, and could thus be an
> > > > > > incremental process independent of major releases.
> > > > > > What is the actual scale of this item; how much are we actually
> > > > > > re-writing?
> > > > > >
> > > > > > "Add MetricGroup#getLogicalScope": I'd raise this to a must-have;
> > > > > > I think I marked it down as nice-to-have only because it depends
> > > > > > on another item.
> > > > > >
> > > > > > The ProcessFunction API item is giving me the most headaches
> > > > > > because it's very unclear what it actually entails; like is it an
> > > > > > entirely separate API to DataStream (sounds like it is!) or an
> > > > > > extension of DataStream. How much will it share the internals
> > > > > > with DataStream etc.; how does it relate to the Table API (w.r.t.
> > > > > > switching APIs / what Table API uses underneath).
> > > > > >
> > > > > > There are a few items I added as ideas which don't have a
> > > > > > priority yet; would love to get some feedback on those.
> > > > > >
> > > > > > On 21/06/2023 08:41, Xintong Song wrote:
> > > > > >
> > > > > > Hi devs,
> > > > > >
> > > > > > As previously discussed in [1], we had been collecting work item
> > > > > > proposals for the 2.0 release until June 15th, on the wiki page
> > > > > > [2].
> > > > > >
> > > > > >    - As we have passed the due date, I'd like to kindly remind
> > > > > >      everyone *not to add / remove items directly on the wiki
> > > > > >      page*. If needed, please post in this thread or reach out to
> > > > > >      the release managers instead.
> > > > > >    - I've reached out to some folks for clarifications about
> > > > > >      their proposals. Some of them mentioned that they can not yet
> > > > > >      tell whether we should do an item or not, and would need more
> > > > > >      time / discussions to make the decision. So I added a new
> > > > > >      symbol for items whose priorities are `TBD`.
> > > > > >
> > > > > > Now it's time to collaboratively decide a minimum set of
> > > > > > must-have items. I've gone through the entire list of proposed
> > > > > > items, and found that most of them make a lot of sense. So I
> > > > > > think an online sync might not be necessary for this. I'd like to
> > > > > > go with this DISCUSS thread, where everyone can comment on how
> > > > > > they think the list can be improved, followed by a VOTE to
> > > > > > formally make the decision.
> > > > > >
> > > > > > Any feedback and opinions, including but not limited to the
> > > > > > following aspects, will be appreciated.
> > > > > >
> > > > > >    - Important items that are missing from the list
> > > > > >    - Concerns regarding the listed items or their priorities
> > > > > >
> > > > > > Looking forward to your feedback.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
> > > > > >
> > > > > >
> > > > > > [1] https://lists.apache.org/list?dev@flink.apache.org:lte=1M:release%202.0%20status%20updates
> > > > > >
> > > > > > [2] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
> > > > > >
>
> --
> Best regards,
> Sergey
>