Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Becket Qin
Thanks for the proposal. I like the FLIP.

My ranking:

1. Refresh(ing) / Live Table -> easy to understand and implies the dynamic
characteristic

2. Derived Table -> easy to understand.

3. Materialized Table -> sounds like just a table with physical data stored
somewhere.

4. Materialized View -> modifying a view directly is a little weird.

Thanks,

Jiangjie (Becket) Qin



On Tue, Apr 9, 2024 at 5:46 AM Lincoln Lee  wrote:

> Thanks Ron and Timo for your proposal!
>
> Here is my ranking:
>
> 1. Derived table -> extends the persistent semantics of a derived table in
>    the SQL standard, has a strong association with the query, and has
>    industry precedents such as Google Looker.
>
> 2. Live Table ->  an alternative for 'dynamic table'
>
> 3. Materialized Table -> combination of the Materialized View and Table,
> but still a table which accepts data changes
>
> 4. Materialized View -> needs to extend the understanding of the view to
> accept data changes
>
> The reason for not adding 'Refresh Table' is that I don't want to tell the user
> to 'refresh a refresh table'.
>
>
> Best,
> Lincoln Lee
>
>
> Ron liu wrote on Tue, Apr 9, 2024 at 20:11:
>
> > Hi, Dev
> >
> > My rankings are:
> >
> > 1. Derived Table
> > 2. Materialized Table
> > 3. Live Table
> > 4. Materialized View
> >
> > Best,
> > Ron
> >
> >
> >
> > Ron liu wrote on Tue, Apr 9, 2024 at 20:07:
> >
> > > Hi, Dev
> > >
> > > After several rounds of discussion, there is currently no consensus on
> > the
> > > name of the new concept. Timo has proposed that we decide the name
> > through
> > > a vote. This is a good solution when there is no clear preference, so
> we
> > > will adopt this approach.
> > >
> > > Regarding the name of the new concept, there are currently five
> > candidates:
> > > 1. Derived Table -> taken by SQL standard
> > > 2. Materialized Table -> similar to SQL materialized view but a table
> > > 3. Live Table -> similar to dynamic tables
> > > 4. Refresh Table -> states what it does
> > > 5. Materialized View -> needs to extend the standard to support
> modifying
> > > data
> > >
> > > For the above five candidates, you can give your rankings based on
> > > your preferences. You can choose up to five options or only choose some
> > > of them.
> > > We will use a scoring rule, where the *first rank gets 5 points, second
> > > rank gets 4 points, third rank gets 3 points, fourth rank gets 2 points,
> > > and fifth rank gets 1 point*.
> > > After the voting closes, I will score all the candidates based on
> > > everyone's votes, and the candidate with the highest score will be
> chosen
> > > as the name for the new concept.
> > >
> > > The voting will last up to 72 hours and is expected to close this
> Friday.
> > > I look forward to everyone voting on the name in this thread. Of
> course,
> > we
> > > also welcome new input regarding the name.
> > >
> > > Best,
> > > Ron
> > >
> > > Ron liu wrote on Tue, Apr 9, 2024 at 19:49:
> > >
> > >> Hi, Dev
> > >>
> > >> Sorry, my previous statement was not quite accurate. We will hold a
> > >> vote for the name within this thread.
> > >>
> > >> Best,
> > >> Ron
> > >>
> > >>
> > >> Ron liu wrote on Tue, Apr 9, 2024 at 19:29:
> > >>
> > >>> Hi, Timo
> > >>>
> > >>> Thanks for your reply.
> > >>>
> > >>> I agree with you that naming is sometimes the harder part. When no one
> > >>> has a clear preference, voting on the name is a good solution, so I'll
> > >>> send a separate email for the vote, clarify the voting rules, and then
> > >>> let everyone vote.
> > >>>
> > >>> One other point to confirm: in your ranking there is an option for
> > >>> Materialized View. Does it stand for the UPDATING Materialized View
> > >>> that you mentioned earlier in the discussion? If we use Materialized
> > >>> View, I think we need to extend it.
> > >>>
> > >>> Best,
> > >>> Ron
> > >>>
> > >>> Timo Walther wrote on Tue, Apr 9, 2024 at 17:20:
> > >>>
> > >>>> Hi Ron,
> > >>>>
> > >>>> yes

[DISCUSS] FLIP-421: Support Custom Conversion for LogicalTypes

2024-02-01 Thread Becket Qin
Hi folks,

I'd like to kick off the discussion of FLIP-421[1].

The motivation is to support custom conversions between Flink SQL internal
data classes and external classes. The typical scenarios for these
conversions are:
1. In sources / sinks.
2. Conversion between Table and DataStream.

I think this FLIP will help improve the user experience in
format development (it makes the implementation of FLIP-358 [2] much
easier), and it also makes the Table-DataStream conversion more usable.
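
To make scenario 2 concrete, here is a minimal sketch of the existing
Table-to-DataStream conversion that this FLIP would extend. The class name
ConversionScenario is just for illustration, and the new registration API
proposed by the FLIP is intentionally not shown here:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class ConversionScenario {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A tiny table with a TIMESTAMP_LTZ column.
        Table t = tEnv.sqlQuery("SELECT 1 AS id, CURRENT_TIMESTAMP AS ts");

        // Today, each logical type maps to a fixed set of conversion classes
        // (e.g. TIMESTAMP_LTZ <-> java.time.Instant). FLIP-421 would let users
        // plug in their own conversion for such types.
        DataStream<Row> stream = tEnv.toDataStream(t);
        stream.print();

        env.execute("flip-421-conversion-scenario");
    }
}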

Comments are welcome!

Thanks,

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-421%3A+Support+Custom+Conversion+for+LogicalTypes
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-358%3A+flink-avro+enhancement+and+cleanup


Re: Re: Re: [VOTE] Accept Flink CDC into Apache Flink

2024-01-12 Thread Becket Qin
+1 (binding)

Thanks,

Jiangjie (Becket) Qin

On Fri, Jan 12, 2024 at 5:58 AM Zhijiang 
wrote:

> +1 (binding)
> Best,
> Zhijiang
> --
> From:Kurt Yang 
> Send Time: Fri, Jan 12, 2024 15:31
> To:dev
> Subject:Re: Re: Re: [VOTE] Accept Flink CDC into Apache Flink
> +1 (binding)
> Best,
> Kurt
> On Fri, Jan 12, 2024 at 2:21 PM Hequn Cheng  wrote:
> > +1 (binding)
> >
> > Thanks,
> > Hequn
> >
> > On Fri, Jan 12, 2024 at 2:19 PM godfrey he  wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks,
> > > Godfrey
> > >
> > > Zhu Zhu wrote on Fri, Jan 12, 2024 at 14:10:
> > > >
> > > > +1 (binding)
> > > >
> > > > Thanks,
> > > > Zhu
> > > >
> > > > Hangxiang Yu wrote on Thu, Jan 11, 2024 at 14:26:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On Thu, Jan 11, 2024 at 11:19 AM Xuannan Su  >
> > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Best,
> > > > > > Xuannan
> > > > > >
> > > > > > On Thu, Jan 11, 2024 at 10:28 AM Xuyang 
> > wrote:
> > > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > Best!
> > > > > > > Xuyang
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 2024-01-11 10:00:11, "Yang Wang" wrote:
> > > > > > > >+1 (binding)
> > > > > > > >
> > > > > > > >
> > > > > > > >Best,
> > > > > > > >Yang
> > > > > > > >
> > > > > > > >On Thu, Jan 11, 2024 at 9:53 AM liu ron 
> > > wrote:
> > > > > > > >
> > > > > > > >> +1 non-binding
> > > > > > > >>
> > > > > > > >> Best
> > > > > > > >> Ron
> > > > > > > >>
> > > > > > > >> Matthias Pohl wrote on Wed, Jan 10, 2024 at 23:05:
> > > > > > > >>
> > > > > > > >> > +1 (binding)
> > > > > > > >> >
> > > > > > > >> > On Wed, Jan 10, 2024 at 3:35 PM ConradJam <
> > > jam.gz...@gmail.com>
> > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > +1 non-binding
> > > > > > > >> > >
> > > > > > > >> > > Dawid Wysakowicz wrote on Wed, Jan 10, 2024 at 21:06:
> > > > > > > >> > >
> > > > > > > >> > > > +1 (binding)
> > > > > > > >> > > > Best,
> > > > > > > >> > > > Dawid
> > > > > > > >> > > >
> > > > > > > >> > > > On Wed, 10 Jan 2024 at 11:54, Piotr Nowojski <
> > > > > > pnowoj...@apache.org>
> > > > > > > >> > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > +1 (binding)
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Wed, Jan 10, 2024 at 11:25, Martijn Visser <martijnvis...@apache.org> wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > > +1 (binding)
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > On Wed, Jan 10, 2024 at 4:43 AM Xingbo Huang <
> > > > > > hxbks...@gmail.com
> > > > > > > >> >
> > > > > > > >> > > > wrote:
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > +1 (binding)
> > > > > > > >> > > > > > >

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2024-01-11 Thread Becket Qin
Hi Qingsheng,

Thanks for the comment. I think the initial idea is to hide the queue
completely from the users, i.e. make FutureCompletingBlockingQueue class
internal. If it is OK to expose the class to the users, then just returning
the queue sounds reasonable to me.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jan 10, 2024 at 10:39 PM Hongshun Wang 
wrote:

> Hi Qingsheng,
>
>
> I agree with you that it would be clearer to have a new interface that
> extracts the SplitFetcher creation and management logic from the current
> SplitFetcherManager. However, extensive modifications to the interface may
> have a wide impact and cause compatibility issues. Perhaps we can consider
> doing this later, rather than in this FLIP.
>
>
> Adding a new internal method, SplitFetcherManager#getQueue(), to
> SourceReaderBase seems to be a better option than exposing methods like
> poll and notifyAvailable on SplitFetcherManager.
>
>
> I have taken this valuable suggestion and updated the FLIP accordingly.
>
>
> Thanks,
>
> Hongshun
>
> On Thu, Jan 11, 2024 at 2:09 PM Qingsheng Ren  wrote:
>
>> Hi Hongshun and Becket,
>>
>> Sorry for being late in the discussion! I went through the entire FLIP
>> but I still have some concerns about the new SplitFetcherManager.
>>
>> First of all I agree that we should hide the elementQueue from connector
>> developers. This could simplify the interface exposed to developers so that
>> they can focus on the interaction with external systems.
>>
>> However in the current FLIP, SplitFetcherManager exposes 4 more methods,
>> poll / getAvailabilityFuture / notifyAvailable / noAvailableElement, which
>> are tightly coupled with the implementation of the elementQueue. The naming
>> of these methods looks weird to me: what does it mean to "poll from a
>> SplitFetcherManager" / "notify a SplitFetcherManager available"? To clarify
>> these methods we have to explain to developers that "well we hide a queue
>> inside SplitFetcherMamager and the poll method is actually polling from the
>> queue". I'm afraid these methods will implicitly expose the concept and the
>> implementation of the queue to developers.
>>
>> I think a cleaner solution would be having a new interface that extracts
>> SplitFetcher creating and managing logic from the current
>> SplitFetcherManager, but having too many concepts might make the entire
>> Source API even harder to understand. To make a compromise, I'm considering
>> only exposing constructors of SplitFetcherManager as public APIs, and
>> adding a new internal method SplitFetcherManager#getQueue() for
>> SourceReaderBase (well it's a bit hacky I admit but I think exposing
>> methods like poll and notifyAvailable on SplitFetcherManager is even
>> worse). WDYT?
>>
>> Thanks,
>> Qingsheng
>>
>> On Thu, Dec 21, 2023 at 8:36 AM Becket Qin  wrote:
>>
>>> Hi Hongshun,
>>>
>>> I think the proposal in the FLIP is basically fine. A few minor comments:
>>>
>>> 1. In FLIPs, we define all the user-sensible changes as public
>>> interfaces.
>>> The public interface section should list all of them. So, the code blocks
>>> currently in the proposed changes section should be put into the public
>>> interface section instead.
>>>
>>> 2. It would be good to put all the changes of one class together. For
>>> example, for SplitFetcherManager, we can say:
>>> - Change SplitFetcherManager from Internal to PublicEvolving.
>>> - Deprecate the old constructor exposing the
>>> FutureCompletingBlockingQueue, and add new constructors as replacements
>>> which creates the FutureCompletingBlockingQueue instance internally.
>>> - Add a few new methods to expose the functionality of the internal
>>> FutureCompletingBlockingQueue via the SplitFetcherManager.
>>>And then follows the code block containing all the changes above.
>>> Ideally, the changes should come with something like "// <-- New", so
> >>> that it is easier to find.
>>>
>>> 3. In the proposed changes section, it would be good to add some more
>>> detailed explanation of the idea behind the public interface changes. So
>>> even people new to Flink can understand better how exactly the interface
>>> changes will help fulfill the motivation. For example, regarding the
> >>> constructor signature change, we can mention a few things in this section:
>>> - By exposing the SplitFetcherManager / SingleThreadFetcherManager, by

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-12-20 Thread Becket Qin
Hi Hongshun,

I think the proposal in the FLIP is basically fine. A few minor comments:

1. In FLIPs, we define all the user-sensible changes as public interfaces.
The public interface section should list all of them. So, the code blocks
currently in the proposed changes section should be put into the public
interface section instead.

2. It would be good to put all the changes of one class together. For
example, for SplitFetcherManager, we can say:
- Change SplitFetcherManager from Internal to PublicEvolving.
- Deprecate the old constructor exposing the
FutureCompletingBlockingQueue, and add new constructors as replacements
which creates the FutureCompletingBlockingQueue instance internally.
- Add a few new methods to expose the functionality of the internal
FutureCompletingBlockingQueue via the SplitFetcherManager.
   And then follows the code block containing all the changes above.
Ideally, the changes should come with something like "// <-- New", so
that it is easier to find.

3. In the proposed changes section, it would be good to add some more
detailed explanation of the idea behind the public interface changes. So
even people new to Flink can understand better how exactly the interface
changes will help fulfill the motivation. For example, regarding the
constructor signature change, we can mention a few things in this section:
- By exposing the SplitFetcherManager / SingleThreadFetcherManager and by
implementing addSplits() and removeSplits(), connector developers can
easily create their own threading models in the SourceReaderBase (see the
sketch after this list).
- Note that the SplitFetcher constructor is package private, so users
can only create SplitFetchers via SplitFetcherManager.createSplitFetcher().
This ensures each SplitFetcher is always owned by the SplitFetcherManager.
- This FLIP essentially embeds the element queue (a
FutureCompletingBlockingQueue) instance into the SplitFetcherManager. This
hides the element queue from the connector developers and simplifies the
SourceReaderBase to consist of only SplitFetcherManager and RecordEmitter
as major components.
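
To illustrate the intent, here is a rough sketch of a custom threading model
built on top of these changes. OneFetcherPerSplitFetcherManager is a made-up
example class, and the queue-less constructor is the one proposed in this
FLIP, so the exact signatures may differ:

import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher;
import org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager;
import org.apache.flink.connector.base.source.reader.splitreader.SplitReader;

// One fetcher thread per split; the element queue never shows up in user code.
public class OneFetcherPerSplitFetcherManager<E, SplitT extends SourceSplit>
        extends SplitFetcherManager<E, SplitT> {

    public OneFetcherPerSplitFetcherManager(
            Supplier<SplitReader<E, SplitT>> splitReaderSupplier, Configuration config) {
        // Assumes the new queue-less constructor proposed in this FLIP; the
        // FutureCompletingBlockingQueue is created internally by the base class.
        super(splitReaderSupplier, config);
    }

    @Override
    public void addSplits(List<SplitT> splits) {
        for (SplitT split : splits) {
            // Fetchers can only be obtained via the protected factory method,
            // so they are always owned by this manager.
            SplitFetcher<E, SplitT> fetcher = createSplitFetcher();
            fetcher.addSplits(Collections.singletonList(split));
            startFetcher(fetcher);
        }
    }

    @Override
    public void removeSplits(List<SplitT> splits) {
        // Left as a no-op in this sketch; a real implementation would locate
        // the owning fetchers and remove the splits from them.
    }
}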

In short, the public interface section answers the question of "what". We
should list all the user-sensible changes in the public interface section,
without verbose explanation. The proposed changes section answers "how",
where we can add more details to explain the changes listed in the public
interface section.

Thanks,

Jiangjie (Becket) Qin



On Wed, Dec 20, 2023 at 10:07 AM Hongshun Wang 
wrote:

> Hi Becket,
>
>
> It has been a long time since we last discussed. Are there any other
> problems with this Flip from your side? I am looking forward to hearing
> from you.
>
>
> Thanks,
> Hongshun Wang
>


Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-12-19 Thread Becket Qin
Hi Jiabao,

Thanks for updating the FLIP.
Can you add the behavior of the policies that are only applicable to some
but not all of the databases? This is a part of the intended behavior of
the proposed configuration. So, we should include that in the FLIP.
Otherwise, the FLIP looks good to me.

Cheers,

Jiangjie (Becket) Qin

On Tue, Dec 19, 2023 at 11:00 PM Jiabao Sun 
wrote:

> Hi Becket,
>
> I share the same view as you regarding the prefix for this configuration
> option.
>
> For the JDBC connector, I prefer setting 'filter.handling.policy' = 'FOO'
> and throwing an exception when the database does not support that specific
> policy.
>
> Not using a prefix can reduce the learning curve for users and avoid
> introducing a new set of configuration options for every supported JDBC
> database.
> I think the policies we provide can be compatible with most databases that
> follow the JDBC protocol.
> However, there may be cases where certain databases cannot support some
> policies.
> Nevertheless, we can ensure fast failure and allow users to choose a
> suitable policy in such situations.
>
> I have removed the contents about the configuration prefix.
> Please help review it again.
>
> Thanks,
> Jiabao
>
>
> > On Dec 19, 2023 at 19:46, Becket Qin wrote:
> >
> > Hi Jiabao,
> >
> > Thanks for updating the FLIP.
> >
> > One more question regarding the JDBC connector, since it is a connector
> > shared by multiple databases, what if there is a filter handling policy
> > that is only applicable to one of the databases, but not the others? In
> > that case, how would the users specify that policy?
> > Unlike the example of orc format with 2nd+ level config, JDBC connector
> > only looks at the URL to decide which driver to use.
> >
> > For example, MySql supports policy FOO while other databases do not. If
> > users want to use FOO for MySql, what should they do? Will they set
> > *'mysql.filter.handling.policy' = 'FOO'*, which will only be picked up
> > when the MySql driver is used? Or should they just set
> > *'filter.handling.policy' = 'FOO'* and throw exceptions when other JDBC
> > drivers are used? Personally, I prefer the latter. If we pick that, do we
> > still need to mention the following?
> >
> > *The prefix is needed when the option is for a 2nd+ level. *
> >> 'connector' = 'filesystem',
> >> 'format' = 'orc',
> >> 'orc.filter.handling.policy' = 'NUMERIC_ONLY'
> >>
> >> *In this case, the values of this configuration may be different
> depending
> >> on the format option. For example, orc format may have INDEXED_ONLY
> while
> >> parquet format may have something else. *
> >>
> >
> > I found this is somewhat misleading, because the example here is not a
> part
> > of the proposal of this FLIP. It is just an example explaining when a
> > prefix is needed, which seems orthogonal to the proposal in this FLIP.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Tue, Dec 19, 2023 at 10:09 AM Jiabao Sun  .invalid>
> > wrote:
> >
> >> Thanks Becket for the suggestions,
> >>
> >> Updated.
> >> Please help review it again when you have time.
> >>
> >> Best,
> >> Jiabao
> >>
> >>
> >>> On Dec 19, 2023 at 09:06, Becket Qin wrote:
> >>>
> >>> Hi JIabao,
> >>>
> >>> Thanks for updating the FLIP. It looks better. Some suggestions /
> >> questions:
> >>>
> >>> 1. In the motivation section:
> >>>
> >>>> *Currently, Flink Table/SQL does not expose fine-grained control for
> >> users
> >>>> to control filter pushdown. **However, filter pushdown has some side
> >>>> effects, such as additional computational pressure on external
> >>>> systems. Moreover, Improper queries can lead to issues such as full
> >> table
> >>>> scans, which in turn can impact the stability of external systems.*
> >>>
> >>> This statement sounds like the side effects are there for all the
> >> systems,
> >>> which is inaccurate. Maybe we can say:
> >>> *Currently, Flink Table/SQL does not expose fine-grained control for
> >> users
> >>> to control filter pushdown. **However, filter pushdown may have side
> >>> effects in some cases, **such as additional computational pressure on
> >>> external systems. The JDBC source is a typical example of that. If a
> >> filter
> >>> is

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-12-19 Thread Becket Qin
Hi Jiabao,

Thanks for updating the FLIP.

One more question regarding the JDBC connector, since it is a connector
shared by multiple databases, what if there is a filter handling policy
that is only applicable to one of the databases, but not the others? In
that case, how would the users specify that policy?
Unlike the example of orc format with 2nd+ level config, JDBC connector
only looks at the URL to decide which driver to use.

For example, MySql supports policy FOO while other databases do not. If
users want to use FOO for MySql, what should they do? Will they set
*'mysql.filter.handling.policy' = 'FOO'*, which will only be picked up when
the MySql driver is used? Or should they just set *'filter.handling.policy' =
'FOO'* and throw exceptions when other JDBC drivers are used? Personally, I
prefer the latter. If we pick that, do we still need to mention the following?

*The prefix is needed when the option is for a 2nd+ level. *
> 'connector' = 'filesystem',
> 'format' = 'orc',
> 'orc.filter.handling.policy' = 'NUMERIC_ONLY'
>
> *In this case, the values of this configuration may be different depending
> on the format option. For example, orc format may have INDEXED_ONLY while
> parquet format may have something else. *
>

I found this somewhat misleading, because the example here is not a part
of the proposal of this FLIP. It is just an example explaining when a
prefix is needed, which seems orthogonal to the proposal in this FLIP.

Thanks,

Jiangjie (Becket) Qin


On Tue, Dec 19, 2023 at 10:09 AM Jiabao Sun 
wrote:

> Thanks Becket for the suggestions,
>
> Updated.
> Please help review it again when you have time.
>
> Best,
> Jiabao
>
>
> > On Dec 19, 2023 at 09:06, Becket Qin wrote:
> >
> > Hi JIabao,
> >
> > Thanks for updating the FLIP. It looks better. Some suggestions /
> questions:
> >
> > 1. In the motivation section:
> >
> >> *Currently, Flink Table/SQL does not expose fine-grained control for
> users
> >> to control filter pushdown. **However, filter pushdown has some side
> >> effects, such as additional computational pressure on external
> >> systems. Moreover, Improper queries can lead to issues such as full
> table
> >> scans, which in turn can impact the stability of external systems.*
> >
> > This statement sounds like the side effects are there for all the
> systems,
> > which is inaccurate. Maybe we can say:
> > *Currently, Flink Table/SQL does not expose fine-grained control for
> users
> > to control filter pushdown. **However, filter pushdown may have side
> > effects in some cases, **such as additional computational pressure on
> > external systems. The JDBC source is a typical example of that. If a
> filter
> > is pushed down to the database, an expensive full table scan may happen
> if
> > the filter involves unindexed columns.*
> >
> > 2. Regarding the prefix, usually a prefix is not required for the top
> level
> > connector options. This is because the *connector* option is already
> there.
> > So
> >'connector' = 'jdbc',
> >  'filter.handling.policy' = 'ALWAYS'
> > is sufficient.
> >
> > The prefix is needed when the option is for a 2nd+ level. For example,
> >'connector' = 'jdbc',
> >'format' = 'orc',
> >'orc.some.option' = 'some_value'
> > In this case, the prefix of "orc" is needed to make it clear this option
> is
> > for the format.
> >
> > I am guessing that the reason that currently the connector prefix is
> there
> > is because the values of this configuration may be different depending on
> > the connectors. For example, jdbc may have INDEXED_ONLY while MongoDB may
> > have something else. Personally speaking, I am fine if we do not have a
> > prefix in this case because users have already specified the connector
> type
> > and it is intuitive enough that the option value is for that connector,
> not
> > others.
> >
> > 3. can we clarify on the following statement:
> >
> >> *Introduce the native configuration [prefix].filter.handling.policy in
> the
> >> connector.*
> >
> > What do you mean by "native configuration"? From what I understand, the
> > FLIP does the following:
> > - introduces a new configuration to the JDBC and MongoDB connector.
> > - Suggests a convention option name if other connectors are going to add
> an
> > option for the same purpose.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Mon, Dec 18, 2023 at 5:45 PM Jiabao Sun  .invalid>
> > wrote:
> >
> >> Hi Becket,
> >>

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-12-18 Thread Becket Qin
Hi JIabao,

Thanks for updating the FLIP. It looks better. Some suggestions / questions:

1. In the motivation section:

> *Currently, Flink Table/SQL does not expose fine-grained control for users
> to control filter pushdown. **However, filter pushdown has some side
> effects, such as additional computational pressure on external
> systems. Moreover, Improper queries can lead to issues such as full table
> scans, which in turn can impact the stability of external systems.*

This statement sounds like the side effects are there for all the systems,
which is inaccurate. Maybe we can say:
*Currently, Flink Table/SQL does not expose fine-grained control for users
to control filter pushdown. **However, filter pushdown may have side
effects in some cases, **such as additional computational pressure on
external systems. The JDBC source is a typical example of that. If a filter
is pushed down to the database, an expensive full table scan may happen if
the filter involves unindexed columns.*

2. Regarding the prefix, usually a prefix is not required for the top level
connector options. This is because the *connector* option is already there.
So
'connector' = 'jdbc',
  'filter.handling.policy' = 'ALWAYS'
is sufficient.

The prefix is needed when the option is for a 2nd+ level. For example,
'connector' = 'jdbc',
'format' = 'orc',
'orc.some.option' = 'some_value'
In this case, the prefix of "orc" is needed to make it clear this option is
for the format.

I am guessing that the reason the connector prefix is currently there
is that the values of this configuration may be different depending on
the connectors. For example, jdbc may have INDEXED_ONLY while MongoDB may
have something else. Personally speaking, I am fine if we do not have a
prefix in this case because users have already specified the connector type
and it is intuitive enough that the option value is for that connector, not
others.
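
To make the two cases concrete, a quick sketch in DDL form. The option name
'filter.handling.policy' is the one proposed in this thread and
'orc.some.option' is just a placeholder, so neither is an existing option:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FilterPolicyOptionExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Top-level connector option: no prefix, because 'connector' already
        // scopes the option to the JDBC connector.
        tEnv.executeSql(
                "CREATE TABLE orders_jdbc (id BIGINT, amount DECIMAL(10, 2)) WITH ("
                        + " 'connector' = 'jdbc',"
                        + " 'url' = 'jdbc:mysql://localhost:3306/shop',"
                        + " 'table-name' = 'orders',"
                        + " 'filter.handling.policy' = 'ALWAYS'"
                        + ")");

        // 2nd-level option: the 'orc' prefix scopes the option to the format,
        // not to the connector.
        tEnv.executeSql(
                "CREATE TABLE orders_fs (id BIGINT, amount DECIMAL(10, 2)) WITH ("
                        + " 'connector' = 'filesystem',"
                        + " 'path' = 'file:///tmp/orders',"
                        + " 'format' = 'orc',"
                        + " 'orc.some.option' = 'some_value'"
                        + ")");
    }
}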

3. can we clarify on the following statement:

> *Introduce the native configuration [prefix].filter.handling.policy in the
> connector.*

What do you mean by "native configuration"? From what I understand, the
FLIP does the following:
- introduces a new configuration to the JDBC and MongoDB connector.
- Suggests a convention option name if other connectors are going to add an
option for the same purpose.

Thanks,

Jiangjie (Becket) Qin



On Mon, Dec 18, 2023 at 5:45 PM Jiabao Sun 
wrote:

> Hi Becket,
>
> The FLIP document[1] has been updated.
> Could you help take a look again?
>
> Thanks,
> Jiabao
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768
>
>
> > On Dec 18, 2023 at 16:53, Becket Qin wrote:
> >
> > Yes, that sounds reasonable to me. We can start with ALWAYS and NEVER,
> and
> > add more policies as needed.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Dec 18, 2023 at 4:48 PM Jiabao Sun  .invalid>
> > wrote:
> >
> >> Thanks Becket,
> >>
> >> The jdbc.filter.handling.policy is good to me as it provides sufficient
> >> extensibility for future filter pushdown optimizations.
> >> However, currently, we don't have an implementation for the AUTO mode,
> and
> >> it seems that the AUTO mode can easily be confused with the ALWAYS mode
> >> because users don't have the opportunity to MANUALLY decide which
> filters
> >> to push down.
> >>
> >> I suggest that we only introduce the ALWAYS and NEVER modes for now, and
> >> we can consider introducing more flexible policies in the future,
> >> such as INDEX_ONLY, NUMERIC_ONLY, and so on.
> >>
> >> WDYT?
> >>
> >> Best,
> >> Jiabao
> >>
> >>
> >>
> >>> On Dec 18, 2023 at 16:27, Becket Qin wrote:
> >>>
> >>> Hi Jiabao,
> >>>
> >>> Please see the reply inline.
> >>>
> >>>
> >>>> The MySQL connector is currently in the flink-connector-jdbc
> repository
> >>>> and is not a standalone connector.
> >>>> Is it too unique to use "mysql" as the configuration option prefix?
> >>>
> >>> If the intended behavior makes sense to all the supported JDBC drivers,
> >> we
> >>> can make this a JDBC connector configuration.
> >>>
> >>> Also, I would like to ask about the difference in behavior between AUTO
> >> and
> >>>> ALWAYS.
> >>>> It seems that we cannot guarantee the pushing down of all filters to
> the
> >>>> external system under the ALWAYS
> >>>> mode because not all filters in Flink SQL are supported by the
> extern

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-12-18 Thread Becket Qin
Yes, that sounds reasonable to me. We can start with ALWAYS and NEVER, and
add more policies as needed.

Thanks,

Jiangjie (Becket) Qin

On Mon, Dec 18, 2023 at 4:48 PM Jiabao Sun 
wrote:

> Thanks Becket,
>
> The jdbc.filter.handling.policy is good to me as it provides sufficient
> extensibility for future filter pushdown optimizations.
> However, currently, we don't have an implementation for the AUTO mode, and
> it seems that the AUTO mode can easily be confused with the ALWAYS mode
> because users don't have the opportunity to MANUALLY decide which filters
> to push down.
>
> I suggest that we only introduce the ALWAYS and NEVER modes for now, and
> we can consider introducing more flexible policies in the future,
> such as INDEX_ONLY, NUMERIC_ONLY, and so on.
>
> WDYT?
>
> Best,
> Jiabao
>
>
>
> > On Dec 18, 2023 at 16:27, Becket Qin wrote:
> >
> > Hi Jiabao,
> >
> > Please see the reply inline.
> >
> >
> >> The MySQL connector is currently in the flink-connector-jdbc repository
> >> and is not a standalone connector.
> >> Is it too unique to use "mysql" as the configuration option prefix?
> >
> > If the intended behavior makes sense to all the supported JDBC drivers,
> we
> > can make this a JDBC connector configuration.
> >
> > Also, I would like to ask about the difference in behavior between AUTO
> and
> >> ALWAYS.
> >> It seems that we cannot guarantee the pushing down of all filters to the
> >> external system under the ALWAYS
> >> mode because not all filters in Flink SQL are supported by the external
> >> system.
> >> Should we throw an error when encountering a filter that cannot be
> pushed
> >> down in the ALWAYS mode?
> >
> > The idea of AUTO is to do efficiency-aware pushdowns. The source will
> query
> > the external system (MySQL, Oracle, SQL Server, etc) first to retrieve
> the
> > information of the table. With that information, the source will decide
> > whether to further push a filter to the external system based on the
> > efficiency. E.g. only push the indexed fields. In contrast, ALWAYS will
> > just always push the supported filters to the external system, regardless
> > of the efficiency. In case there are filters that are not supported,
> > according to the current contract of SupportsFilterPushdown, these
> > unsupported filters should just be returned by the
> > *SupportsFilterPushdown.applyFilters()* method as remaining filters.
> > Therefore, there is no need to throw exceptions here. This is likely the
> > desired behavior for most users, IMO. If there are cases that users
> really
> > want to get alerted when a filter cannot be pushed to the external
> system,
> > we can add another value like "ENFORCED_ALWAYS", which behaves like
> ALWAYS,
> > but throws exceptions when a filter cannot be applied to the external
> > system. But personally I don't see much value in doing this.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Mon, Dec 18, 2023 at 3:54 PM Jiabao Sun  .invalid>
> > wrote:
> >
> >> Hi Becket,
> >>
> >> The MySQL connector is currently in the flink-connector-jdbc repository
> >> and is not a standalone connector.
> >> Is it too unique to use "mysql" as the configuration option prefix?
> >>
> >> Also, I would like to ask about the difference in behavior between AUTO
> >> and ALWAYS.
> >> It seems that we cannot guarantee the pushing down of all filters to the
> >> external system under the ALWAYS
> >> mode because not all filters in Flink SQL are supported by the external
> >> system.
> >> Should we throw an error when encountering a filter that cannot be
> pushed
> >> down in the ALWAYS mode?
> >>
> >> Thanks,
> >> Jiabao
> >>
> >>> On Dec 18, 2023 at 15:34, Becket Qin wrote:
> >>>
> >>> Hi JIabao,
> >>>
> >>> Thanks for updating the FLIP. Maybe I did not explain it clearly
> enough.
> >> My
> >>> point is that given there are various good flavors of behaviors
> handling
> >>> filters pushed down, we should not have a common config of
> >>> "ignore.filter.pushdown", because the behavior is not *common*.
> >>>
> >>> It looks like the original motivation of this FLIP is just for MySql.
> >> Let's
> >>> focus on what is the best solution for MySql connector here first.
> After

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-12-18 Thread Becket Qin
Hi Jiabao,

Please see the reply inline.


> The MySQL connector is currently in the flink-connector-jdbc repository
> and is not a standalone connector.
> Is it too unique to use "mysql" as the configuration option prefix?

If the intended behavior makes sense to all the supported JDBC drivers, we
can make this a JDBC connector configuration.

Also, I would like to ask about the difference in behavior between AUTO and
> ALWAYS.
> It seems that we cannot guarantee the pushing down of all filters to the
> external system under the ALWAYS
> mode because not all filters in Flink SQL are supported by the external
> system.
> Should we throw an error when encountering a filter that cannot be pushed
> down in the ALWAYS mode?

The idea of AUTO is to do efficiency-aware pushdowns. The source will query
the external system (MySQL, Oracle, SQL Server, etc) first to retrieve the
information of the table. With that information, the source will decide
whether to further push a filter to the external system based on the
efficiency. E.g. only push the indexed fields. In contrast, ALWAYS will
just always push the supported filters to the external system, regardless
of the efficiency. In case there are filters that are not supported,
according to the current contract of SupportsFilterPushdown, these
unsupported filters should just be returned by the
*SupportsFilterPushdown.applyFilters()* method as remaining filters.
Therefore, there is no need to throw exceptions here. This is likely the
desired behavior for most users, IMO. If there are cases that users really
want to get alerted when a filter cannot be pushed to the external system,
we can add another value like "ENFORCED_ALWAYS", which behaves like ALWAYS,
but throws exceptions when a filter cannot be applied to the external
system. But personally I don't see much value in doing this.
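
For reference, a minimal sketch of that contract. The 'filter.handling.policy'
check is hypothetical and only shows where such an option could hook in; a
real connector would also implement ScanTableSource, which is omitted here:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

public class FilterPushDownSketch implements SupportsFilterPushDown {

    // Hypothetical policy proposed in this thread, e.g. "ALWAYS" or "NEVER".
    private final String filterHandlingPolicy;
    private final List<ResolvedExpression> pushedFilters = new ArrayList<>();

    public FilterPushDownSketch(String filterHandlingPolicy) {
        this.filterHandlingPolicy = filterHandlingPolicy;
    }

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        if ("NEVER".equals(filterHandlingPolicy)) {
            // Push nothing down; Flink keeps every filter and evaluates it itself.
            return Result.of(Collections.emptyList(), filters);
        }
        List<ResolvedExpression> accepted = new ArrayList<>();
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            if (canTranslate(filter)) {
                accepted.add(filter);
            } else {
                // Unsupported filters are simply returned as "remaining";
                // no exception is thrown, matching the contract described above.
                remaining.add(filter);
            }
        }
        pushedFilters.addAll(accepted);
        return Result.of(accepted, remaining);
    }

    private boolean canTranslate(ResolvedExpression filter) {
        // Placeholder for connector-specific translation logic.
        return true;
    }
}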

Thanks,

Jiangjie (Becket) Qin



On Mon, Dec 18, 2023 at 3:54 PM Jiabao Sun 
wrote:

> Hi Becket,
>
> The MySQL connector is currently in the flink-connector-jdbc repository
> and is not a standalone connector.
> Is it too unique to use "mysql" as the configuration option prefix?
>
> Also, I would like to ask about the difference in behavior between AUTO
> and ALWAYS.
> It seems that we cannot guarantee the pushing down of all filters to the
> external system under the ALWAYS
> mode because not all filters in Flink SQL are supported by the external
> system.
> Should we throw an error when encountering a filter that cannot be pushed
> down in the ALWAYS mode?
>
> Thanks,
> Jiabao
>
> > On Dec 18, 2023 at 15:34, Becket Qin wrote:
> >
> > Hi JIabao,
> >
> > Thanks for updating the FLIP. Maybe I did not explain it clearly enough.
> My
> > point is that given there are various good flavors of behaviors handling
> > filters pushed down, we should not have a common config of
> > "ignore.filter.pushdown", because the behavior is not *common*.
> >
> > It looks like the original motivation of this FLIP is just for MySql.
> Let's
> > focus on what is the best solution for MySql connector here first. After
> > that, if people think the best behavior for MySql happens to be a common
> > one, we can then discuss whether that is worth being added to the base
> > implementation of source. For MySQL, if we are going to introduce a
> config
> > to MySql, why not have something like "mysql.filter.handling.policy" with
> > value of AUTO / NEVER / ALWAYS? Isn't that better than
> > "ignore.filter.pushdown"?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Sun, Dec 17, 2023 at 11:30 PM Jiabao Sun  .invalid>
> > wrote:
> >
> >> Hi Becket,
> >>
> >> The FLIP document has been updated as well.
> >> Please take a look when you have time.
> >>
> >> Thanks,
> >> Jiabao
> >>
> >>
> >>> On Dec 17, 2023 at 22:54, Jiabao Sun wrote:
> >>>
> >>> Thanks Becket,
> >>>
> >>> I apologize for not being able to continue with this proposal due to
> >> being too busy during this period.
> >>>
> >>> The viewpoints you shared about the design of Flink Source make sense to
> >>> me. The native configuration ‘ignore.filter.pushdown’ is good to me.
> >>> Having a unified name or naming style can indeed prevent confusion for
> >> users regarding
> >>> the inconsistent naming of this configuration across different
> >> connectors.
> >>>
> >>> Currently, there are not many external connectors that support filter
> >> pushdown.
> >>> I p

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-12-17 Thread Becket Qin
Hi JIabao,

Thanks for updating the FLIP. Maybe I did not explain it clearly enough. My
point is that given there are various reasonable flavors of behavior for
handling pushed-down filters, we should not have a common config of
"ignore.filter.pushdown", because the behavior is not *common*.

It looks like the original motivation of this FLIP is just for MySql. Let's
focus on what is the best solution for MySql connector here first. After
that, if people think the best behavior for MySql happens to be a common
one, we can then discuss whether that is worth being added to the base
implementation of source. For MySQL, if we are going to introduce a config
to MySql, why not have something like "mysql.filter.handling.policy" with
value of AUTO / NEVER / ALWAYS? Isn't that better than
"ignore.filter.pushdown"?

Thanks,

Jiangjie (Becket) Qin



On Sun, Dec 17, 2023 at 11:30 PM Jiabao Sun 
wrote:

> Hi Becket,
>
> The FLIP document has been updated as well.
> Please take a look when you have time.
>
> Thanks,
> Jiabao
>
>
> > On Dec 17, 2023 at 22:54, Jiabao Sun wrote:
> >
> > Thanks Becket,
> >
> > I apologize for not being able to continue with this proposal due to
> being too busy during this period.
> >
> > The viewpoints you shared about the design of Flink Source make sense to
> > me. The native configuration ‘ignore.filter.pushdown’ is good to me.
> > Having a unified name or naming style can indeed prevent confusion for
> users regarding
> > the inconsistent naming of this configuration across different
> connectors.
> >
> > Currently, there are not many external connectors that support filter
> pushdown.
> > I propose that we first introduce it in flink-connector-jdbc and
> flink-connector-mongodb.
> > Do you think this is feasible?
> >
> > Best,
> > Jiabao
> >
> >
> >> On Nov 16, 2023 at 17:45, Becket Qin wrote:
> >>
> >> Hi Jiabao,
> >>
> >> Arguments like "because Spark has it so Flink should also have it" do not
> >> make sense. Different projects have different API flavors and styles. What
> >> is really important is the rationale and the design principle behind the
> >> API. They should conform to the convention of the project.
> >>
> >> First of all, Spark Source API itself has a few issues and they ended up
> >> introducing DataSource V2 in Spark 3.0, which added the decorative
> interfaces
> >> like SupportsPushdownXXX. Some of the configurations predating
> DataSource
> >> V2 may still be there.
> >>
> >> For the Spark configurations you mentioned, they are all the
> configurations
> >> for FileScanBuilder, which is equivalent to FileSource in Flink.
> Currently,
> >> regardless of the format (ORC, Parquet, Avro, etc), the FileSource
> pushes
> >> back all the filters to ensure correctness. The actual filters that got
> >> applied to the specific format might still be different. This
> >> implementation is the same in FileScanBuilder.pushFilters() for Spark. I
> >> don't know why Spark got separate configurations for each format. Maybe
> it
> >> is because the filters are actually implemented differently for
> different
> >> format.
> >>
> >> At least for the current implementation in FileScanBuilder, these
> >> configurations can be merged to one configuration like
> >> `apply.filters.to.format.enabled`. Note that this config, as well as the
> >> separate configs you mentioned, are just visible and used by the
> >> FileScanBuilder. It determines whether the filters should be passed
> down to
> >> the format of the FileScanBuilder instance. Regardless of the value of
> >> these configs, FileScanBuilder.pushFilters() will always be called, and
> >> FileScanBuilder (as well as FileSource in Flink) will always push back
> all
> >> the filters to the framework.
> >>
> >> A MySql source can have a very different way to handle this. For example,
> >> a config in this case might be "my.apply.filters" with three
> >> different values:
> >> - AUTO: The Source will issue a DESC Table request to understand
> whether a
> >> filter can be applied efficiently. And decide which filters can be
> applied
> >> and which cannot based on that.
> >> - NEVER: Never apply filtering. It will always do a full table read and
> >> let Flink do the filtering.
> >> - ALWAYS: Always apply the filtering to the MySql server.
> >>
> >> In the above examples of FileSource and M

Re: [DISCUSS] Resolve diamond inheritance of Sink.createWriter

2023-12-11 Thread Becket Qin
Hi Peter,

Thanks for updating the patch. The latest patch looks good to me. I've +1ed
on the PR.

Cheers,

Jiangjie (Becket) Qin

On Mon, Dec 11, 2023 at 9:21 PM Péter Váry 
wrote:

> Thanks everyone for the lively discussion!
>
> The PR is available which strictly adheres the accepted changes from
> FLIP-371. Thanks Gyula and Marton for the review. Becket, if you have any
> questions left, please let me know, so I can fix and we can merge the
> changes.
>
> I would like to invite everyone involved here to take a look at FLIP-372
> [1], and the related mailing thread [2]. The discussion there is also at
> the stage where we are debating the merits of migrating to a mixin based
> Sink API. So if you are interested, please join us there.
>
> Thanks,
> Peter
>
> [1] -
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Allow+TwoPhaseCommittingSink+WithPreCommitTopology+to+alter+the+type+of+the+Committable
> [2] - https://lists.apache.org/thread/344pzbrqbbb4w0sfj67km25msp7hxlyd
>
>
> On Tue, Dec 5, 2023, 18:05 Márton Balassi 
> wrote:
>
> > Thanks, Peter. Given the discussion I also agree that the consensus is to
> > move towards the mixin interface approach (and accept its disadvantages
> > given its advantages).
> >
> > +1 for the general direction of your proposed code change in
> > https://github.com/apache/flink/pull/23876.
> >
> > On Tue, Dec 5, 2023 at 3:44 PM Péter Váry 
> > wrote:
> >
> > > It seems to me we have a consensus to move forward with the mixin
> > approach.
> > > I hope that everyone is aware that with the mixin interfaces we lose the
> > > opportunity for strong type checks. This will be especially painful for
> > > generic types, where we will not have a way to ensure that the generic
> > > types are correctly synchronized between the different interfaces, even
> > > at DAG creation time.
> > >
> > > Even with this drawback, I like this approach too, so +1 from my side.
> > >
> > > As a first step in the direction of the mixin approach, we can remove
> the
> > > specific implementations of the `createWriter` methods from the
> > > `StatefulSink` and the `TwoPhaseCommitingSink` interfaces (and replace
> > them
> > > with an instanceof check where needed).
> > > - This would remove the diamond inheritance and enable us to create
> > > default methods for backward compatibility.
> > > - This would not break the API, as the same method with wider return
> > value
> > > will be inherited from the `Sink` interface.
> > >
> > > Since, it might be easier to understand the proposed changes, I have
> > > created a new PR: https://github.com/apache/flink/pull/23876
> > > The PR has 2 commits:
> > > - Reverting the previous change - non-clean, since there were some
> > > additional fixes on the tests -
> > >
> > >
> >
> https://github.com/apache/flink/pull/23876/commits/c7625d5fa62a6e9a182f39f53fb7e5626105f3b0
> > > - The new change with mixin approach, and deprecation -
> > >
> > >
> >
> https://github.com/apache/flink/pull/23876/commits/99ec936966af527598ca49712c1263bc4aa03c15
> > >
> > > Thanks,
> > > Peter
> > >
> > > > weijie guo wrote on Tue, Dec 5, 2023 at 8:01:
> > >
> > > > Thanks Martijn for driving this!
> > > >
> > > > I'm +1  to reverting the breaking change.
> > > >
> > > > > For new functionality or changes we can make easily, we should
> switch
> > > to
> > > > the decorative/mixin interface approach used successfully in the
> source
> > > and
> > > > table interfaces.
> > > >
> > > > I like the way of switching to mixin interface.
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > > > Becket Qin wrote on Tue, Dec 5, 2023 at 14:50:
> > > >
> > > > > I am with Gyula about fixing the current SinkV2 API.
> > > > >
> > > > > A SinkV3 seems not necessary because we are not changing the
> > > fundamental
> > > > > design of the API. Hopefully we can modify the interface structure
> a
> > > > little
> > > > > bit to make it similar to the Source while still keep the backwards
> > > > > compatibility.
> > > > > For example, one approach is:

Re: [DISCUSS] Resolve diamond inheritance of Sink.createWriter

2023-12-04 Thread Becket Qin
I am with Gyula about fixing the current SinkV2 API.

A SinkV3 seems not necessary because we are not changing the fundamental
design of the API. Hopefully we can modify the interface structure a little
bit to make it similar to the Source while still keep the backwards
compatibility.
For example, one approach is:

- Add snapshotState(int checkpointId) and precommit() methods to the
SinkWriter with default implementation doing nothing. Deprecate
StatefulSinkWriter and PrecommittingSinkWriter.
- Add two mixin interfaces of SupportsStatefulWrite and
SupportsTwoPhaseCommit. Deprecate the StatefulSink and
TwoPhaseCommittingSink.
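
A rough sketch of the mixin style, shown on the writer side for brevity. The
interface names come from the bullets above, but the method signatures, the
defaults, and the BufferingWriter class are assumptions for illustration only,
not the current SinkV2 API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.flink.api.connector.sink2.SinkWriter;

// Hypothetical mixin interfaces; a writer opts into capabilities by mixing
// them in instead of picking a place in the StatefulSink /
// TwoPhaseCommittingSink inheritance tree. The framework would discover the
// capabilities with instanceof checks.
interface SupportsStatefulWrite<WriterStateT> {
    // No-op default keeps existing stateless writers source compatible.
    default List<WriterStateT> snapshotState(long checkpointId) throws IOException {
        return Collections.emptyList();
    }
}

interface SupportsTwoPhaseCommit<CommT> {
    // Called at pre-commit time to hand committables to the committer.
    Collection<CommT> prepareCommit() throws IOException, InterruptedException;
}

class BufferingWriter implements SinkWriter<String>,
        SupportsStatefulWrite<String>, SupportsTwoPhaseCommit<String> {

    private final List<String> buffer = new ArrayList<>();

    @Override
    public void write(String element, Context context) {
        buffer.add(element);
    }

    @Override
    public void flush(boolean endOfInput) {
        // Nothing to do here; committables are produced in prepareCommit().
    }

    @Override
    public List<String> snapshotState(long checkpointId) {
        return new ArrayList<>(buffer);
    }

    @Override
    public Collection<String> prepareCommit() {
        List<String> committables = new ArrayList<>(buffer);
        buffer.clear();
        return committables;
    }

    @Override
    public void close() {}
}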

Thanks,

Jiangjie (Becket) Qin

On Mon, Dec 4, 2023 at 7:25 PM Gyula Fóra  wrote:

> Hi All!
>
> Based on the discussion above, I feel that the most reasonable approach
> from both the developers' and users' perspective at this point is what Becket
> lists as Option 1:
>
> Revert the naming change to the backward compatible version and accept that
> the names are not perfect (treat it as legacy).
>
> On a different note, I agree that the current sink v2 interface is very
> difficult to evolve and structuring the interfaces the way they are now is
> not a good design in the long run.
> For new functionality or changes we can make easily, we should switch to
> the decorative/mixin interface approach used successfully in the source and
> table interfaces. Let's try to do this as much as possible within the v2
> and compatibility boundaries and we should only introduce a v3 if we really
> must.
>
> So from my side, +1 to reverting the naming to keep backward compatibility.
>
> Cheers,
> Gyula
>
>
> On Fri, Dec 1, 2023 at 10:43 AM Péter Váry 
> wrote:
>
> > Thanks Becket for your reply!
> >
> > *On Option 1:*
> > - I personally consider API inconsistencies more important, since they
> will
> > remain with us "forever", but this is up to the community. I can
> implement
> > whichever solution we decide upon.
> >
> > *Option 2:*
> > - I don't think this specific issue merits a rewrite, but if we decide to
> > change our approach, then it's a different story.
> >
> > *Evolvability:*
> > This discussion reminds me of a similar discussion on FLIP-372 [1], where
> > we are trying to decide if we should use mixin interfaces, or use
> interface
> > inheritance.
> > With the mixin approach, we have a more flexible interface, but we can't
> > check the generic types of the interfaces/classes at compile time, or even
> > when we create the DAG. The issue only surfaces when we call the method and
> > it fails.
> > The issue here is similar:
> > - *StatefulSink* needs a writer with a method to `*snapshotState*`
> > - *TwoPhaseCommittingSink* needs a writer with `*prepareCommit*`
> > - If there is a Sink which is stateful and needs to commit, then it needs
> > both of these methods.
> >
> > If we use the mixin solution here, we lose the possibility to check the
> > types at compile time. We could do the type check at runtime using
> > `*instanceof*`, so we are better off than with the FLIP-372 example above,
> > but still lose an important capability. I personally prefer the mixin
> > approach, but that would mean we rewrite the Sink API again - likely a
> > SinkV3. Are we ready to move down that path?
> >
> > Thanks,
> > Peter
> >
> > [1] - https://lists.apache.org/thread/344pzbrqbbb4w0sfj67km25msp7hxlyd
> >
> > On Thu, Nov 30, 2023, 14:53 Becket Qin  wrote:
> >
> > > Hi folks,
> > >
> > > Sorry for replying late on the thread.
> > >
> > > For this particular FLIP, I see two solutions:
> > >
> > > Option 1:
> > > 1. On top of the current status, rename
> > > *org.apache.flink.api.connector.sink2.InitContext* to *CommonInitContext*
> > > (should probably be package private).
> > > 2. Change the name *WriterInitContext* back to *InitContext*, and revert
> > > the deprecation. We can change the parameter name to writerContext if
> we
> > > want to.
> > > Admittedly, this does not have full symmetric naming of the
> InitContexts
> > -
> > > we will have CommonInitContext / InitContext / CommitterInitContext
> > instead
> > > of InitContext / WriterInitContext / CommitterInitContext. However, the
> > > naming seems clear without much confusion. Personally, I can live with
> > > that, treating the class InitContext as a non-ideal legacy class name
> > > without much material harm.
> > >
> > > Option 2:
> > > Theoretically speaking, if 

Re: [DISCUSS] Resolve diamond inheritance of Sink.createWriter

2023-11-30 Thread Becket Qin
Hi folks,

Sorry for replying late on the thread.

For this particular FLIP, I see two solutions:

Option 1:
1. On top of the current status, rename
*org.apache.flink.api.connector.sink2.InitContext* to *CommonInitContext*
(should probably be package private).
2. Change the name *WriterInitContext* back to *InitContext*, and revert
the deprecation. We can change the parameter name to writerContext if we
want to.
Admittedly, this does not have full symmetric naming of the InitContexts -
we will have CommonInitContext / InitContext / CommitterInitContext instead
of InitContext / WriterInitContext / CommitterInitContext. However, the
naming seems clear without much confusion. Personally, I can live with
that, treating the class InitContext as a non-ideal legacy class name
without much material harm.

Option 2:
Theoretically speaking, if we really want to reach the perfect state while
being backwards compatible, we can create a brand new set of Sink
interfaces and deprecate the old ones. But I feel this is overkill here.

The solution to this particular issue aside, the evolvability of the
current interface hierarchy seems a more fundamental issue and worries me
more. I haven't completely thought it through, but there are two noticeable
differences in the interface design principles between Source and Sink.
1. Source uses decorative interfaces. For example, we have a
SupportsFilterPushdown interface, instead of a subclass of
FilterableSource. This seems to provide better flexibility.
2. Source tends to have a more coarse-grained interface. For example,
SourceReader always has the methods snapshotState() and
notifyCheckpointComplete(). Even though they may not always be required, we
do not separate them into different interfaces.
My hunch is that if we follow a similar approach as Source, the evolvability
might be better. If we want to do this, we'd better do it before 2.0.
What do you think?

Process wise,
- I agree that if there is a change to the passed FLIP during
implementation, it should be brought back to the mailing list.
- There might be value for the connector nightly build to depend on the
latest snapshot of the same Flink major version. It helps catch
unexpected breaking changes sooner.
- I'll update the website to reflect the latest API stability policy.
Apologies for the confusion caused by the stale doc.

Thanks,

Jiangjie (Becket) Qin



On Wed, Nov 29, 2023 at 10:55 PM Márton Balassi 
wrote:

> Thanks, Martijn and Peter.
>
> In terms of the concrete issue:
>
>- I am following up with the author of FLIP-321 [1] (Becket) to update
>the docs [2] to reflect the right state.
>- I see two reasonable approaches in terms of proceeding with the
>specific changeset:
>
>
>    1. We allow the exception from FLIP-321 for this change and let the
>    PublicEvolving API change happen between Flink 1.18 and 1.19, which is
>    consistent with the current state of the relevant documentation. [2] We
>    commit to helping the connector repos make the necessary (one-liner)
>    changes.
>    2. We revert back to the original implementation plan as explicitly voted
>    on in FLIP-371 [3]. That has no API-breaking changes. However, we end up
>    with an inconsistently named API with duplicated internal methods. Peter
>    has also discovered additional bad patterns during his work in FLIP-372
>    [4]; the total of these changes could be handled in a separate FLIP that
>    would do multiple PublicEvolving breaking changes to clean up the API.
>
> In terms of the general issues:
>
>    - I agree that if a PR review of an accepted FLIP newly introduces a
>    breaking API change, that warrants an update to the mailing list
>    discussion and possibly even a new vote.
>- I agree with the general sentiment of FLIP-321 to provide stronger API
>guarantees with the minor note that if we have changes in mind we should
>prioritize them now such that they can be validated by Flink 2.0.
>- I agree that ideally the connector repos should build against the
>latest release and not the master branch.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-321%3A+Introduce+an+API+deprecation+process
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#api-compatibility-guarantees
> [3]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-371%3A+Provide+initialization+context+for+Committer+creation+in+TwoPhaseCommittingSink
> [4]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Allow+TwoPhaseCommittingSink+WithPreCommitTopology+to+alter+the+type+of+the+Committable
>
> Best,
> Marton
>
> On Mon, Nov 27, 2023 at 7:23 PM Péter Váry 
> wrote:
>
> > I think we should try to separate the di

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-11-21 Thread Becket Qin
Hi Hongshun,

The constructor of the SplitFetcher is already package private. So it can
only be accessed from the classes in the package
org.apache.flink.connector.base.source.reader.fetcher. And apparently, user
classes should not be in this package. Therefore, even if we mark the
SplitFetcher class as PublicEvolving, the constructor is not available to
the users. Only the public and protected methods are considered public API
in this case. Private / package private methods and fields are still
internal.

Thanks,

Jiangjie (Becket) Qin

On Wed, Nov 22, 2023 at 9:46 AM Hongshun Wang 
wrote:

> Hi Becket,
>
> If SplitFetcherManager becomes PublicEvolving, that also means SplitFetcher
> > needs to be PublicEvolving, because it is returned by the protected
> method
> > SplitFetcherManager.createSplitFetcher().
>
>
>
> > it looks like there is no need to expose the constructor of SplitFetcher
> > to the end users. Having an interface of SplitFetcher is also fine, but
> > might not be necessary in this case.
>
>
>
> I don't know how to make SplitFetcher PublicEvolving without exposing
> the constructor of SplitFetcher to the end users.
>
> Thanks,
> Hongshun Wang
>
> On Tue, Nov 21, 2023 at 7:23 PM Becket Qin  wrote:
>
> > Hi Hongshun,
> >
> > Do we need to expose the constructor of SplitFetcher to the users?
> Ideally,
> > users should always get a new fetcher instance by calling
> > SplitFetcherManager.createSplitFetcher(). Or, they can get an existing
> > SplitFetcher by looking up in the SplitFetcherManager.fetchers map. I
> think
> > this makes sense because a SplitFetcher should always belong to a
> > SplitFetcherManager. Therefore, it should be created via a
> > SplitFetcherManager as well. So, it looks like there is no need to expose
> > the constructor of SplitFetcher to the end users.
> >
> > Having an interface of SplitFetcher is also fine, but might not be
> > necessary in this case.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Nov 21, 2023 at 10:36 AM Hongshun Wang 
> > wrote:
> >
> > > Hi Becket,
> > >
> > > > Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue
> > as
> > > a constructor parameter, which is not allowed  now.
> > > Sorry, it was my writing mistake. What I meant is that *SplitFetcher*
> > > requires FutureCompletingBlockingQueue as a constructor parameter.
> > > SplitFetcher is a class rather than an interface. Therefore, I want to
> > > change SplitFetcher to a public interface and move its implementation
> > > details to an implementation subclass.
> > >
> > > Thanks,
> > > Hongshun Wang
> > >
> > > On Fri, Nov 17, 2023 at 6:21 PM Becket Qin 
> wrote:
> > >
> > > > Hi Hongshun,
> > > >
> > > > SplitFetcher.enqueueTask() returns void, right? SplitFetcherTask is
> > > already
> > > > an interface, and we need to make that as a PublicEvolving API as
> well.
> > > >
> > > > So overall, a source developer can potentially do a few things in the
> > > > SplitFetcherManager.
> > > > 1. for customized logic including split-to-fetcher assignment,
> > threading
> > > > model, etc.
> > > > 2. create their own SplitFetcherTask for the SplitFetcher /
> SplitReader
> > > to
> > > > execute in a coordinated manner.
> > > >
> > > > It should be powerful enough for the vast majority of the source
> > > > implementation, if not all.
> > > >
> > > >
> > > > Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue
> > > > > as a
> > > > > constructor parameter, which is not allowed
> > > > > now.
> > > >
> > > > Are you referring to FetchTask which implements SplitFetcherTask?
> That
> > > > class will remain internal.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Fri, Nov 17, 2023 at 5:23 PM Hongshun Wang <
> loserwang1...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi, Jiangjie(Becket) ,
> > > > > Thank you for your advice. I have learned a lot.
> > > > >
> > > > > If SplitFetcherManager becomes PublicEvolving, that also means
> > > > > > SplitFetcher needs to be PublicEvolving, because it is returned
> by
> > >

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-11-21 Thread Becket Qin
Hi Hongshun,

Do we need to expose the constructor of SplitFetcher to the users? Ideally,
users should always get a new fetcher instance by calling
SplitFetcherManager.createSplitFetcher(). Or, they can get an existing
SplitFetcher by looking up in the SplitFetcherManager.fetchers map. I think
this makes sense because a SplitFetcher should always belong to a
SplitFetcherManager. Therefore, it should be created via a
SplitFetcherManager as well. So, it looks like there is no need to expose
the constructor of SplitFetcher to the end users.

Having an interface of SplitFetcher is also fine, but might not be
necessary in this case.
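
To make the intended usage concrete, here is a rough sketch of a fetcher
manager subclass that never calls the SplitFetcher constructor and instead
goes through createSplitFetcher() or the protected fetchers map. The class
name is made up and the constructor / method signatures are simplified, so
they may differ slightly between Flink versions.

import java.util.List;
import java.util.function.Supplier;

import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.connector.base.source.reader.RecordsWithSplitIds;
import org.apache.flink.connector.base.source.reader.fetcher.SingleThreadFetcherManager;
import org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher;
import org.apache.flink.connector.base.source.reader.splitreader.SplitReader;
import org.apache.flink.connector.base.source.reader.synchronization.FutureCompletingBlockingQueue;

public class ReusingFetcherManager<E, SplitT extends SourceSplit>
        extends SingleThreadFetcherManager<E, SplitT> {

    public ReusingFetcherManager(
            FutureCompletingBlockingQueue<RecordsWithSplitIds<E>> elementsQueue,
            Supplier<SplitReader<E, SplitT>> splitReaderSupplier) {
        super(elementsQueue, splitReaderSupplier);
    }

    @Override
    public void addSplits(List<SplitT> splitsToAdd) {
        // Reuse an existing fetcher if one is alive, otherwise ask the
        // manager to create one; the SplitFetcher constructor is never used.
        SplitFetcher<E, SplitT> fetcher =
                fetchers.isEmpty()
                        ? createSplitFetcher()
                        : fetchers.values().iterator().next();
        fetcher.addSplits(splitsToAdd);
        startFetcher(fetcher);
    }
}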

Thanks,

Jiangjie (Becket) Qin

On Tue, Nov 21, 2023 at 10:36 AM Hongshun Wang 
wrote:

> Hi Becket,
>
> > Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue  as
> a constructor parameter, which is not allowed  now.
> Sorry, it was my writing mistake. What I meant is that *SplitFetcher*
> requires FutureCompletingBlockingQueue as a constructor parameter.
> SplitFetcher is a class rather than an interface. Therefore, I want to
> change SplitFetcher to a public interface and move its implementation
> details to an implementation subclass.
>
> Thanks,
> Hongshun Wang
>
> On Fri, Nov 17, 2023 at 6:21 PM Becket Qin  wrote:
>
> > Hi Hongshun,
> >
> > SplitFetcher.enqueueTask() returns void, right? SplitFetcherTask is
> already
> > an interface, and we need to make that as a PublicEvolving API as well.
> >
> > So overall, a source developer can potentially do a few things in the
> > SplitFetcherManager.
> > 1. for customized logic including split-to-fetcher assignment, threading
> > model, etc.
> > 2. create their own SplitFetcherTask for the SplitFetcher / SplitReader
> to
> > execute in a coordinated manner.
> >
> > It should be powerful enough for the vast majority of the source
> > implementation, if not all.
> >
> >
> > Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue
> > > as a
> > > constructor parameter, which is not allowed
> > > now.
> >
> > Are you referring to FetchTask which implements SplitFetcherTask? That
> > class will remain internal.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, Nov 17, 2023 at 5:23 PM Hongshun Wang 
> > wrote:
> >
> > > Hi, Jiangjie(Becket) ,
> > > Thank you for your advice. I have learned a lot.
> > >
> > > If SplitFetcherManager becomes PublicEvolving, that also means
> > > > SplitFetcher needs to be PublicEvolving, because it is returned by
> the
> > > > protected method SplitFetcherManager.createSplitFetcher().
> > >
> > > I completely agree with you. However, if SplitFetcher becomes
> > > PublicEvolving, SplitFetcherTask also needs to be PublicEvolving
> > > because it is returned by the public method SplitFetcher#enqueueTask.
> > > Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue
> > > as a
> > > constructor parameter, which is not allowed
> > > now. Therefore, I propose changing SplitFetcher to a public interface
> > > and moving its implementation details to an implementation class (e.g.,
> > > SplitFetcherImpl or another suitable name). SplitFetcherImpl will be
> > > marked as internal, managed by SplitFetcherManager, and will put data
> > > in the queue. Subclasses of SplitFetcherManager can only use the
> > > SplitFetcher interface, which also ensures that the current subclasses
> > > are not affected.
> > >
> > >
> > >
> > > The current SplitFetcherManager basically looks up
> > > > the SplitT from the fetcher with the split Id, and immediately passes
> > the
> > > > SplitT back to the fetcher, which is unnecessary.
> > >
> > > I inferred that this is because SplitReader#pauseOrResumeSplits
> > > requires SplitT instead of the split id. Perhaps some external source
> > > requires more information to pause. However, SplitReader doesn't store
> > > all its split data, while SplitFetcherManager saves it.
> > > CC, @Dawid Wysakowicz
> > >
> > >
> > >
> > >  If not, SplitFetcher.pause() and
> > > > SplitFetcher.resume() can be removed. In fact, they seem no longer
> used
> > > > anywhere.
> > >
> > > It seems to be of no use anymore. CC, @Arvid Heise
> > >
> > >
> > >
> > > Thanks,
> > > Hongshun Wang
> > >
> > > On Fri, Nov 17, 2023 at 11:42 AM Becket Qin 
> > wrote:
> 

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-11-17 Thread Becket Qin
Hi Hongshun,

SplitFetcher.enqueueTask() returns void, right? SplitFetcherTask is already
an interface, and we need to make that a PublicEvolving API as well.

So overall, a source developer can potentially do a few things in the
SplitFetcherManager:
1. customize logic, including split-to-fetcher assignment, the threading
model, etc.
2. create their own SplitFetcherTask for the SplitFetcher / SplitReader to
execute in a coordinated manner.

It should be powerful enough for the vast majority of the source
implementation, if not all.
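
As an illustration of point 2, a custom task could look roughly like the
sketch below. The class and the commit callback are made up for this example;
only the SplitFetcherTask interface itself (with the assumed contract that
run() returns true once the task is finished) comes from Flink.

import java.io.IOException;

import org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherTask;

public class CommitOffsetsTask implements SplitFetcherTask {

    // Hypothetical callback, e.g. () -> splitReader.commitOffsets(...).
    private final Runnable commitAction;
    private volatile boolean wakenUp;

    public CommitOffsetsTask(Runnable commitAction) {
        this.commitAction = commitAction;
    }

    @Override
    public boolean run() throws IOException {
        if (!wakenUp) {
            commitAction.run();
        }
        // Assumed contract: return true once the task is done and should not
        // be re-run by the fetcher.
        return true;
    }

    @Override
    public void wakeUp() {
        wakenUp = true;
    }
}

Such a task would then be handed to the fetcher via SplitFetcher.enqueueTask()
and executed on the fetcher thread, coordinated with the regular fetch tasks.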


Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue
> as a
> constructor parameter, which is not allowed
> now.

Are you referring to FetchTask which implements SplitFetcherTask? That
class will remain internal.

Thanks,

Jiangjie (Becket) Qin

On Fri, Nov 17, 2023 at 5:23 PM Hongshun Wang 
wrote:

> Hi, Jiangjie(Becket) ,
> Thank you for your advice. I have learned a lot.
>
> If SplitFetcherManager becomes PublicEvolving, that also means
> > SplitFetcher needs to be PublicEvolving, because it is returned by the
> > protected method SplitFetcherManager.createSplitFetcher().
>
> I completely agree with you. However, if SplitFetcher becomes
> PublicEvolving, SplitFetcherTask also needs to be PublicEvolving
> because it is returned by the public method SplitFetcher#enqueueTask.
> Additionally, SplitFetcherTask requires FutureCompletingBlockingQueue
> as a
> constructor parameter, which is not allowed
> now. Therefore, I propose changing SplitFetcher to a public interface
> and moving its implementation details to an implementation class (e.g.,
> SplitFetcherImpl or another suitable name). SplitFetcherImpl will be
> marked as internal, managed by SplitFetcherManager, and will put data
> in the queue. Subclasses of SplitFetcherManager can only use the
> SplitFetcher interface, which also ensures that the current subclasses
> are not affected.
>
>
>
> The current SplitFetcherManager basically looks up
> > the SplitT from the fetcher with the split Id, and immediately passes the
> > SplitT back to the fetcher, which is unnecessary.
>
> I inferred that this is because SplitReader#pauseOrResumeSplits
> requires SplitT instead of the split id. Perhaps some external source
> requires more information to pause. However, SplitReader doesn't store
> all its split data, while SplitFetcherManager saves it.
> CC, @Dawid Wysakowicz
>
>
>
>  If not, SplitFetcher.pause() and
> > SplitFetcher.resume() can be removed. In fact, they seem no longer used
> > anywhere.
>
> It seems to be of no use anymore. CC, @Arvid Heise
>
>
>
> Thanks,
> Hongshun Wang
>
> On Fri, Nov 17, 2023 at 11:42 AM Becket Qin  wrote:
>
> > Hi Hongshun,
> >
> > Thanks for updating the FLIP. I think that makes sense. A few comments
> > below:
> >
> > 1. If SplitFetcherManager becomes PublicEvolving, that also means
> > SplitFetcher needs to be PublicEvolving, because it is returned by the
> > protected method SplitFetcherManager.createSplitFetcher().
> >
> > 2. When checking the API of the classes to be marked as PublicEvolving,
> > there might be a few methods' signatures worth some discussion.
> >
> > For SplitFetcherManager:
> > a) Currently removeSplits() methods takes a list of SplitT. I am
> wondering
> > if it should be a list of splitIds. SplitT actually contains two parts of
> > information, the static split Id and some dynamically changing state of
> the
> > split (e.g. Kafka consumer offset). The source of truth for the dynamic
> > state is SourceReaderBase. Currently we are passing in the full source
> > split with the dynamic state for split removal. But it looks like only
> > split id is needed for the split removal.
> > Maybe this is intentional, as sometimes when a SplitReader removes a
> split,
> > it also wants to know the dynamic state of the split. If so, we can keep
> it
> > as is. But then the question is why
> > SplitFetcherManager.pauseAndResumeSplits() only takes split ids instead
> of
> > SplitT. Should we make them consistent?
> >
> > For SplitFetcher:
> > a) The SplitFetcher.pauseOrResumeSplits() method takes collections of
> > SplitT as arguments. We may want to adjust that according to what we do
> to
> > the SplitFetcherManager. The current SplitFetcherManager basically looks
> up
> > the SplitT from the fetcher with the split Id, and immediately passes the
> > SplitT back to the fetcher, which is unnecessary.
> > b) After supporting split level pause and resume, do we still need split
> > fetcher level pause and resume? If not, SplitFetcher.pause() and
> > SplitFetcher.resume() can be removed. In fa

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-11-16 Thread Becket Qin
Hi Hongshun,

Thanks for updating the FLIP. I think that makes sense. A few comments
below:

1. If SplitFetcherManager becomes PublicEvolving, that also means
SplitFetcher needs to be PublicEvolving, because it is returned by the
protected method SplitFetcherManager.createSplitFetcher().

2. When checking the API of the classes to be marked as PublicEvolving,
there might be a few methods' signatures worth some discussion.

For SplitFetcherManager:
a) Currently the removeSplits() method takes a list of SplitT. I am wondering
if it should be a list of split ids. SplitT actually contains two parts of
information: the static split id and some dynamically changing state of the
split (e.g. Kafka consumer offset). The source of truth for the dynamic
state is SourceReaderBase. Currently we are passing in the full source
split with the dynamic state for split removal. But it looks like only the
split id is needed for the split removal.
Maybe this is intentional, as sometimes when a SplitReader removes a split,
it also wants to know the dynamic state of the split. If so, we can keep it
as is. But then the question is why
SplitFetcherManager.pauseOrResumeSplits() only takes split ids instead of
SplitT. Should we make them consistent?

For SplitFetcher:
a) The SplitFetcher.pauseOrResumeSplits() method takes collections of
SplitT as arguments. We may want to adjust that according to what we do to
the SplitFetcherManager. The current SplitFetcherManager basically looks up
the SplitT from the fetcher with the split Id, and immediately passes the
SplitT back to the fetcher, which is unnecessary.
b) After supporting split level pause and resume, do we still need split
fetcher level pause and resume? If not, SplitFetcher.pause() and
SplitFetcher.resume() can be removed. In fact, they seem no longer used
anywhere.

Other than the above potential API adjustment before we mark the classes
PublicEvolving, the API looks fine to me.
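
Purely to illustrate the consistency question above, the id-based variant
could look like the following hypothetical signatures (this is not the
current API, just a sketch of the direction):

import java.util.Collection;

// Hypothetical, id-based operations; removal and pause/resume would then use
// the same convention and no longer need the full SplitT objects.
public interface IdBasedFetcherOperations {

    void removeSplits(Collection<String> splitIdsToRemove);

    void pauseOrResumeSplits(
            Collection<String> splitIdsToPause, Collection<String> splitIdsToResume);
}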

I think it is good timing for deprecation now. We will mark the impacted
constructors as deprecated in 1.19, and remove them in the 2.0 release.

Thanks,

Jiangjie (Becket) Qin



On Thu, Nov 16, 2023 at 8:26 PM Hongshun Wang 
wrote:

> Hi Devs,
>
> I have just modified the content of FLIP-389: Annotate
> SingleThreadFetcherManager as PublicEvolving[1].
>
> Now this FLIP mainly does two things:
>
>1. Annotate SingleThreadFetcherManager as PublicEvolving
>2. Remove all public constructors which use
>FutureCompletingBlockingQueue. This will mark many constructors as
>@Deprecated.
>
> This may influence many connectors, so I am looking forward to hearing from
> you.
>
>
> Best regards,
> Hongshun
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278465498
>
> On Wed, Nov 15, 2023 at 7:57 AM Becket Qin  wrote:
>
> > Hi Hongshun,
> > >
> > >
> > > However, it will be tricky because SplitFetcherManager includes <SplitT
> > > extends SourceSplit>, while FutureCompletingBlockingQueue includes <T>.
> > > This means that SplitFetcherManager would have to be modified to <T,
> > > SplitT extends SourceSplit>, which would affect the compatibility of the
> > > SplitFetcherManager class. I'm afraid this change will influence other
> > > sources.
> >
> > Although the FutureCompletingBlockingQueue class itself has a template
> > class <T>. In the SourceReaderBase and SplitFetcherManager, this <T> is
> > actually RecordsWithSplitIds. So it looks like we can just let
> > SplitFetcherManager.poll() return a RecordsWithSplitIds.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Nov 14, 2023 at 8:11 PM Hongshun Wang 
> > wrote:
> >
> > > Hi Becket,
> > >   I agree with you and try to modify this FLIP [1], which includes
> > > these changes:
> > >
> > >1. Mark the constructor of SingleThreadMultiplexSourceReaderBase as
> > >*@Deprecated*.
> > >2. Mark the constructor of SourceReaderBase as *@Deprecated* and
> > >provide a new constructor without FutureCompletingBlockingQueue.
> > >3. Mark the constructors of SplitFetcherManager and
> > >SingleThreadFetcherManager as *@Deprecated* and provide new
> > >constructors without FutureCompletingBlockingQueue. Mark
> > >SplitFetcherManager and SingleThreadFetcherManager as *@PublicEvolving*.
> > >4. SplitFetcherManager provides wrapper methods for
> > >FutureCompletingBlockingQueue to replace its usage in SourceReaderBase.
> > >Then we can use FutureCompletingBlockingQueue only in
> >

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-11-16 Thread Becket Qin
Hi Jiabao,

Arguments like "because Spark has it so Flink should also have it" does not
make sense. Different projects have different API flavors and styles. What
is really important is the rationale and the design principle behind the
API. They should conform to the convention of the project.

First of all, the Spark Source API itself has a few issues, and they ended up
introducing DataSource V2 in Spark 3.0, which added the decorative interfaces
like SupportsPushdownXXX. Some of the configurations predating DataSource
V2 may still be there.

The Spark configurations you mentioned are all configurations for
FileScanBuilder, which is equivalent to FileSource in Flink. Currently,
regardless of the format (ORC, Parquet, Avro, etc.), the FileSource pushes
back all the filters to ensure correctness. The actual filters that get
applied to the specific format might still be different. This
implementation is the same in FileScanBuilder.pushFilters() for Spark. I
don't know why Spark has separate configurations for each format. Maybe it
is because the filters are actually implemented differently for different
formats.

At least for the current implementation in FileScanBuilder, these
configurations can be merged into one configuration like
`apply.filters.to.format.enabled`. Note that this config, as well as the
separate configs you mentioned, is only visible to and used by the
FileScanBuilder. It determines whether the filters should be passed down to
the format of the FileScanBuilder instance. Regardless of the value of
these configs, FileScanBuilder.pushFilters() will always be called, and
FileScanBuilder (as well as FileSource in Flink) will always push back all
the filters to the framework.

A MySql source can have a very different way to handle this. For example, a
config for the MySql source in this case might be "my.apply.filters" with
three different values:
 - AUTO: The Source will issue a DESC Table request to understand whether a
filter can be applied efficiently. And decide which filters can be applied
and which cannot based on that.
 - NEVER: Never apply filtering. It will always do a full table read and
let Flink do the filtering.
 - ALWAYS: Always apply the filtering to the MySql server.
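
For illustration only, a hypothetical MySql-style source could interpret such
a mode roughly as follows. The mode, class name and index lookup are made up;
SupportsFilterPushDown, Result and ResolvedExpression are the real Flink
types.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

public abstract class MySqlLikeTableSource implements SupportsFilterPushDown {

    enum ApplyFiltersMode { AUTO, NEVER, ALWAYS }

    private final ApplyFiltersMode mode;

    protected MySqlLikeTableSource(ApplyFiltersMode mode) {
        this.mode = mode;
    }

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        switch (mode) {
            case NEVER:
                // Full table read; Flink applies all filters itself.
                return Result.of(Collections.emptyList(), filters);
            case ALWAYS:
                // Everything is evaluated on the MySql server.
                return Result.of(filters, Collections.emptyList());
            case AUTO:
            default:
                // e.g. DESC TABLE to find indexed columns, then only push the
                // filters that can be evaluated efficiently on the server.
                List<ResolvedExpression> accepted = selectIndexedFilters(filters);
                List<ResolvedExpression> remaining = new ArrayList<>(filters);
                remaining.removeAll(accepted);
                return Result.of(accepted, remaining);
        }
    }

    // Hypothetical helper; how the index metadata is obtained is up to the
    // specific implementation.
    protected abstract List<ResolvedExpression> selectIndexedFilters(
            List<ResolvedExpression> filters);
}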

In the above examples of FileSource and MySql Source, I don't think it is a
good idea to shoehorn the behaviors into a naive config of
`ignore.filter.pushdown`. That is why I don't think this is a common config.

To recap, like I said, I do agree that in some cases, we may want to behave
differently when filters are pushed down to the sources, even if a source
implements SupportsFilterPushDown, but I don't think there is a suitable
common config for this. The behavior is very likely source specific.

Thanks,

Jiangjie (Becket) Qin



On Thu, Nov 16, 2023 at 3:41 PM Jiabao Sun 
wrote:

> Thanks Becket,
>
> I still believe that adding a configuration at the source level to disable
> filter pushdown is needed. This demand exists in Spark as well [1].
>
> In Spark, most sources that support filter pushdown provide their own
> corresponding configuration options to enable or disable filter pushdown.
> The PRs [2-4] that add filter pushdown capability also provide
> configuration options to disable this capability.
>
> I believe this configuration is applicable to most scenarios, and there is
> no need to dwell on why this configuration option was not introduced
> earlier than the SupportsFilterPushDown interface.
>
> spark.sql.parquet.filterPushdown
> spark.sql.orc.filterPushdown
> spark.sql.csv.filterPushdown.enabled
> spark.sql.json.filterPushdown.enabled
> spark.sql.avro.filterPushdown.enabled
> JDBC Option: pushDownPredicate
>
> We can see that the lack of consistency is caused by each connector
> introducing different configuration options for the same behavior.
> This is one of the motivations for advocating the introduction of a
> unified configuration name.
>
> [1] https://issues.apache.org/jira/browse/SPARK-24288
> [2] https://github.com/apache/spark/pull/27366
> [3]https://github.com/apache/spark/pull/26973
> [4] https://github.com/apache/spark/pull/29145
>
> Best,
> Jiabao
>
> > 2023年11月16日 08:10,Becket Qin  写道:
> >
> > Hi Jiabao,
> >
> > While we can always fix the formality of the config, a more fundamental
> > issue here is whether this configuration is common enough. Personally I
> am
> > still not convinced it is.
> >
> > Remember we don't have a common implementation for SupportsFilterPushdown
> > itself. Why does a potential behavior of the
> > SupportsFilterPushdown.applyFilters() method deserve a common
> > configuration? A common implementation should always come first, then its
> > configuration becomes a common configuration as a natural result. But
> here
> > we are trying to a

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-11-15 Thread Becket Qin
Hi Jiabao,

While we can always fix the formality of the config, a more fundamental
issue here is whether this configuration is common enough. Personally I am
still not convinced it is.

Remember we don't have a common implementation for SupportsFilterPushdown
itself. Why does a potential behavior of the
SupportsFilterPushdown.applyFilters() method deserve a common
configuration? A common implementation should always come first, then its
configuration becomes a common configuration as a natural result. But here
we are trying to add an impl to a configuration just to fix its formality.

I agree that there might be a few Source implementations that may want to
avoid additional burdens on the remote system in some circumstances. And
these circumstances are very specific:
1. The source talks to a remote service that can help perform the actual
filtering.
2. The filtering done by the remote service is inefficient for some reason
(e.g. missing index)
3. The external service does not want to perform the inefficient filtering
for some reason (e.g. it is a shared service with others)

There are multiple approaches to address the issue. Pushing back the
filters is just one way of achieving this. So here we are talking about a
config for one of the possible solutions to a scenario with all the above
situations. I don't think there is enough justification for the config to
be common.

There is always this trade-off between the proliferation of public
interfaces and the API standardization. As an extreme example, we can make
our public API a union of all the configs potentially used in all the cases
in the name of standardization. Apparently this won't work. So there must
be a bar here and this bar might be somewhat subjective. For this FLIP,
personally I don't think the config meets my bar for the reason stated
above.

Therefore, my suggestion remains the same. Keep the config as a Source
implementation specific configuration.

Thanks,

Jiangjie (Becket) Qin



On Thu, Nov 16, 2023 at 12:36 AM Jiabao Sun 
wrote:

> Thanks Becket for the feedback,
>
> Regarding concerns about common configurations, I think we can introduce
> FiltersApplier to unify the behavior of various connectors.
>
> public static class FiltersApplier {
>
>     private final ReadableConfig config;
>     private final Function<List<ResolvedExpression>, Result> action;
>
>     private FiltersApplier(
>             ReadableConfig config,
>             Function<List<ResolvedExpression>, Result> action) {
>         this.config = config;
>         this.action = action;
>     }
>
>     public Result applyFilters(List<ResolvedExpression> filters) {
>         if (config.get(ENABLE_FILTER_PUSH_DOWN)) {
>             return action.apply(filters);
>         } else {
>             return Result.of(Collections.emptyList(), filters);
>         }
>     }
>
>     public static FiltersApplier of(
>             ReadableConfig config,
>             Function<List<ResolvedExpression>, Result> action) {
>         return new FiltersApplier(config, action);
>     }
> }
>
> For the connector implementation:
>
> @Override
> public Result applyFilters(List<ResolvedExpression> filters) {
>     return FiltersApplier.of(config,
>                     f -> Result.of(new ArrayList<>(f), Collections.emptyList()))
>             .applyFilters(filters);
> }
>
> As for the name, whether it is "source.filter-push-down.enabled" or
> "source.ignore-pushed-down-filters.enabled", I think both are okay.
>
> Do you think this change is feasible?
>
>
> Best,
> Jiabao
>
>
> > 2023年11月15日 23:44,Becket Qin  写道:
> >
> > Hi Jiabao,
> >
> > Yes, I still have concerns.
> >
> > The FLIP violates the following two principles regarding configuration:
> >
> > 1.* A config of a class should never negate the semantic of a decorative
> > interface implemented by that class. *
> > A decorative interface is a public contract with other components, while
> a
> > config is only internal to the class itself. The configurations for the
> > Sources are not (and should never be) visible or understood to
> > other components (e.g. optimizer). A configuration of a Source only
> > controls the behavior of that Source, provided it is not violating the
> API
> > contract / semantic defined by the decorative interface. So when a Source
> > implementation implements SupportsFilterPushdown, this is a clear public
> > contract with Flink that filters should be pushed down to that Source.
> > Therefore, for the same source, there should not be a configuration
> > "source.filter-push-down.enabled" which stops the filters from being
> pushed
> > down to that Source. However, that specific source implementation can
> have
> > its own config to control its internal behavior, e.g.
> > "ignore-pushed-down-filters.enabled" whic

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-11-15 Thread Becket Qin
Hi Jiabao,

Yes, I still have concerns.

The FLIP violates the following two principles regarding configuration:

1. *A config of a class should never negate the semantic of a decorative
interface implemented by that class.*
A decorative interface is a public contract with other components, while a
config is only internal to the class itself. The configurations for the
Sources are not (and should never be) visible or understood to
other components (e.g. optimizer). A configuration of a Source only
controls the behavior of that Source, provided it is not violating the API
contract / semantic defined by the decorative interface. So when a Source
implementation implements SupportsFilterPushdown, this is a clear public
contract with Flink that filters should be pushed down to that Source.
Therefore, for the same source, there should not be a configuration
"source.filter-push-down.enabled" which stops the filters from being pushed
down to that Source. However, that specific source implementation can have
its own config to control its internal behavior, e.g.
"ignore-pushed-down-filters.enabled" which may push back all the pushed
down filters back to the Flink optimizer.

2. When we are talking about "common configs", in fact we are talking about
"configs for common (abstract) implementation classes". With that as a
context, *a common config should always be backed by a common
implementation class, so that consistent behavior can be guaranteed.*
The LookupOptions you mentioned are configurations defined for classes
DefaultLookupCache / PeriodicCacheReloadTrigger / TimedCacheReloadTrigger.
These configs are considered as "common" only because the implementation
classes using them are common building blocks for lookup table
implementations. It would not make sense to have a dangling config in the
LookupOptions without the underlying common implementation class, but only
relies on a specific source to implement the stated behavior.
As a bad example, there is this outlier config "max-retries" in
LookupOptions, which I don't think should be here. This is because the
retry behavior can be very implementation specific. For example, there can
be many different flavors of retry related configurations, retry-backoff,
retry-timeout, retry-async, etc. Why only max-retry is put here? should all
of them be put here? If we put all such kinds of configs in the common
configs for "standardization and unification", the number of "common
configs" can easily go crazy. And I don't see material benefits of doing
that. So here I don't think the configuration "max-retry" should be in
LookupOptions, because it is not backed by any common implementation
classes. If max-retry is implemented in the HBase source, it should stay
there. For the same reason, the config proposed in this FLIP (probably with
a name less confusing for the first reason mentioned above)  should stay in
the specific Source implementation.

For the two reasons above, I am -1 to what the FLIP currently proposes.

I think the right way to address the motivation here is still to have a
config like "ignore-pushed-down-filters.enabled" for the specific source
implementation. Please let me know if this solves the problem you are
facing.
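
As a rough sketch of that suggestion (the option name, default value and
class are illustrative only), the source-specific config could be wired like
this:

import java.util.Collections;
import java.util.List;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.ReadableConfig;
import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

public abstract class SomeTableSource implements SupportsFilterPushDown {

    // Source-specific option, not a common Flink option.
    public static final ConfigOption<Boolean> IGNORE_PUSHED_DOWN_FILTERS =
            ConfigOptions.key("ignore-pushed-down-filters.enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription(
                            "If true, this source pushes all pushed-down filters back "
                                    + "to the planner and scans the full data.");

    private final ReadableConfig options;

    protected SomeTableSource(ReadableConfig options) {
        this.options = options;
    }

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        if (options.get(IGNORE_PUSHED_DOWN_FILTERS)) {
            // The contract with the planner is still honored: applyFilters()
            // is called, the source simply declines to evaluate anything and
            // hands every filter back as a remaining filter.
            return Result.of(Collections.emptyList(), filters);
        }
        // Otherwise accept all filters and evaluate them in the source.
        return Result.of(filters, Collections.emptyList());
    }
}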

Thanks,

Jiangjie (Becket) Qin


On Wed, Nov 15, 2023 at 11:52 AM Jiabao Sun 
wrote:

> Hi Becket,
>
> The purpose of introducing this configuration is that not all filter
> pushdowns can improve overall performance.
> If the filter can hit the external index, then pushdown is definitely
> worth it, as it can not only improve query time but also decrease network
> overhead.
> However, for filters that do not hit the external index, it may increase a
> lot of performance overhead on the external system.
>
> Undeniably, if the connector can make accurate decisions for good and bad
> filters, we may not need to introduce this configuration option to disable
> pushing down filters to the external system.
> However, it is currently not easy to achieve.
>
> IMO, supporting filter pushdown does not mean that pushing filters down is
> always better.
> In the absence of automatic decision-making, I think we should leave this
> decision to users.
>
> The newly introduced configuration option is similar to LookupOptions,
> providing unified naming and default values to avoid confusion caused by
> inconsistent naming in different connectors for users.
> Setting the default value to true allows it to maintain compatibility with
> the default behavior of "always pushdown".
>
> Do you have any other concerns about this proposal? Please let me know.
>
> Thanks,
> Jiabao
>
>
> > 2023年10月31日 17:29,Jiabao Sun  写道:
> >
> > Hi Becket,
> >
> > Actually, for FileSystemSource, it is not always desired, only OCR file
> formats suppor

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-11-14 Thread Becket Qin
Hi Hongshun,
>
>
> However, it will be tricky because SplitFetcherManager includes <SplitT
> extends SourceSplit>, while FutureCompletingBlockingQueue includes <T>.
> This means that SplitFetcherManager would have to be modified to <T,
> SplitT extends SourceSplit>, which would affect the compatibility of the
> SplitFetcherManager class. I'm afraid this change will influence other
> sources.

Although the FutureCompletingBlockingQueue class itself has a template
class <T>. In the SourceReaderBase and SplitFetcherManager, this <T> is
actually RecordsWithSplitIds. So it looks like we can just let
SplitFetcherManager.poll() return a RecordsWithSplitIds.
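
A simplified sketch of that wrapper idea (not the real class; the exact queue
methods and their signatures may differ between versions): the manager owns
the queue and only re-exposes what SourceReaderBase needs, with the element
type fixed to RecordsWithSplitIds<E> instead of a free type parameter.

import java.util.concurrent.CompletableFuture;

import org.apache.flink.connector.base.source.reader.RecordsWithSplitIds;
import org.apache.flink.connector.base.source.reader.synchronization.FutureCompletingBlockingQueue;

public class QueueOwningManagerSketch<E> {

    // The queue is created and owned here; it never appears in a public
    // constructor signature.
    private final FutureCompletingBlockingQueue<RecordsWithSplitIds<E>> elementsQueue =
            new FutureCompletingBlockingQueue<>();

    public RecordsWithSplitIds<E> poll() {
        return elementsQueue.poll();
    }

    public CompletableFuture<Void> getAvailabilityFuture() {
        return elementsQueue.getAvailabilityFuture();
    }
}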

Thanks,

Jiangjie (Becket) Qin

On Tue, Nov 14, 2023 at 8:11 PM Hongshun Wang 
wrote:

> Hi Becket,
> >   I agree with you and try to modify this FLIP [1], which includes these
> > changes:
> >
> >1. Mark the constructor of SingleThreadMultiplexSourceReaderBase as
> >*@Deprecated*.
> >2. Mark the constructor of SourceReaderBase as *@Deprecated* and provide
> >a new constructor without FutureCompletingBlockingQueue.
> >3. Mark the constructors of SplitFetcherManager and
> >SingleThreadFetcherManager as *@Deprecated* and provide new constructors
> >without FutureCompletingBlockingQueue. Mark SplitFetcherManager and
> >SingleThreadFetcherManager as *@PublicEvolving*.
> >4. SplitFetcherManager provides wrapper methods for
> >FutureCompletingBlockingQueue to replace its usage in SourceReaderBase.
> >Then we can use FutureCompletingBlockingQueue only in
> >SplitFetcherManager.
>
> However, it will be tricky because SplitFetcherManager includes <SplitT
> extends SourceSplit>, while FutureCompletingBlockingQueue includes <T>.
> This means that SplitFetcherManager would have to be modified to <T,
> SplitT extends SourceSplit>, which would affect the compatibility of the
> SplitFetcherManager class. I'm afraid this change will influence other
> sources.
>
>
>
> Looking forward to hearing from you.
>
> Best regards,
> Hongshun
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278465498
>
> On Sat, Nov 11, 2023 at 10:55 AM Becket Qin  wrote:
>
> > Hi Hongshun and Martijn,
> >
> > Sorry for the late reply as I was on travel and still catching up with
> the
> > emails. Please allow me to provide more context.
> >
> > 1. The original design of SplitFetcherManager and its subclasses was to
> > make them public to the Source developers. The goal is to let us take
> care
> > of the threading model, while the Source developers can just focus on the
> > SplitReader implementation. Therefore, I think making
> SplitFetcherManater /
> > SingleThreadFetcherManager public aligns with the original design. That
> is
> > also why these classes are exposed in the constructor of
> SourceReaderBase.
> >
> > 2. For FutureCompletingBlockingQueue, as a hindsight, it might be better
> to
> > not expose it to the Source developers. They are unlikely to use it
> > anywhere other than just constructing it. The reason that
> > FutureCompletingBlockingQueue is currently exposed in the
> SourceReaderBase
> > constructor is because both the SplitFetcherManager and SourceReaderBase
> > need it. One way to hide the FutureCompletingBlockingQueue from the
> public
> > API is to make SplitFetcherManager the only owner class of the queue, and
> > expose some of its methods via SplitFetcherManager. This way, the
> > SourceReaderBase can invoke the methods via SplitFetcherManager. I
> believe
> > this also makes the code slightly cleaner.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, Nov 10, 2023 at 12:28 PM Hongshun Wang 
> > wrote:
> >
> > > @Martijn, I agree with you.
> > >
> > >
> > > I also have two questions at the beginning:
> > >
> > >- Why is an Internal class
> > >exposed as a constructor param of a Public class?
> > >- Should these classes be exposed as public?
> > >
> > > For the first question,  I noticed that before the original Jira[1] ,
> > > all these classes missed the annotate , so it was not abnormal that
> > > FutureCompletingBlockingQueue and SingleThreadFetcherManager were
> > > constructor params of SingleThreadMultiplexSourceReaderBase.
> > >  However,
> > > this jira marked FutureCompletingBlockingQueue and
> > > SingleThreadFetcherManager as Internal, while marked
> > > SingleThreadMultiplexSourceReaderBase as Public. It's a good choice,
> > > but also forget that FutureCompletingBlockingQueue and
> > >

Re: [DISCUSS] FLIP-389: Annotate SingleThreadFetcherManager and FutureCompletingBlockingQueue as PublicEvolving

2023-11-10 Thread Becket Qin
Hi Hongshun and Martijn,

Sorry for the late reply as I was on travel and still catching up with the
emails. Please allow me to provide more context.

1. The original design of SplitFetcherManager and its subclasses was to
make them public to the Source developers. The goal is to let us take care
of the threading model, while the Source developers can just focus on the
SplitReader implementation. Therefore, I think making SplitFetcherManater /
SingleThreadFetcherManager public aligns with the original design. That is
also why these classes are exposed in the constructor of SourceReaderBase.

2. For FutureCompletingBlockingQueue, in hindsight, it might be better to
not expose it to the Source developers. They are unlikely to use it
anywhere other than just constructing it. The reason that
FutureCompletingBlockingQueue is currently exposed in the SourceReaderBase
constructor is because both the SplitFetcherManager and SourceReaderBase
need it. One way to hide the FutureCompletingBlockingQueue from the public
API is to make SplitFetcherManager the only owner class of the queue, and
expose some of its methods via SplitFetcherManager. This way, the
SourceReaderBase can invoke the methods via SplitFetcherManager. I believe
this also makes the code slightly cleaner.
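
Sketched very roughly (the class name and details are made up, not the real
SourceReaderBase), the resulting constructor change could look like this:

import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.connector.base.source.reader.RecordsWithSplitIds;
import org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager;
import org.apache.flink.connector.base.source.reader.synchronization.FutureCompletingBlockingQueue;

public class ReaderBaseSketch<E, SplitT extends SourceSplit> {

    private final SplitFetcherManager<E, SplitT> splitFetcherManager;

    @Deprecated
    public ReaderBaseSketch(
            FutureCompletingBlockingQueue<RecordsWithSplitIds<E>> elementsQueue,
            SplitFetcherManager<E, SplitT> splitFetcherManager) {
        // Old path: the caller wires the shared queue explicitly.
        this.splitFetcherManager = splitFetcherManager;
    }

    public ReaderBaseSketch(SplitFetcherManager<E, SplitT> splitFetcherManager) {
        // New path: the queue is created and owned by the fetcher manager and
        // accessed only through its wrapper methods (e.g. poll()).
        this.splitFetcherManager = splitFetcherManager;
    }
}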

Thanks,

Jiangjie (Becket) Qin

On Fri, Nov 10, 2023 at 12:28 PM Hongshun Wang 
wrote:

> @Martijn, I agree with you.
>
>
> I also have two questions at the beginning:
>
>- Why is an Internal class
>exposed as a constructor param of a Public class?
>- Should these classes be exposed as public?
>
> For the first question, I noticed that before the original Jira [1],
> all these classes were missing annotations, so it was not abnormal that
> FutureCompletingBlockingQueue and SingleThreadFetcherManager were
> constructor params of SingleThreadMultiplexSourceReaderBase.
> However, this Jira marked FutureCompletingBlockingQueue and
> SingleThreadFetcherManager as Internal, while marking
> SingleThreadMultiplexSourceReaderBase as Public. It's a good choice,
> but it also overlooked that FutureCompletingBlockingQueue and
> SingleThreadFetcherManager had already been exposed by
> SingleThreadMultiplexSourceReaderBase.
> Thus, this problem occurs because we didn't clearly define the
> boundaries in the original design. We should pay more attention to
> this when creating a new class.
>
>
> For the second question, I think at least SplitFetcherManager
> should be Public. There are few reasons:
>
>- Connector developers want to decide their own threading model. For
>example, whether to recycle fetchers when idle by overriding
>SplitFetcherManager#maybeShutdownFinishedFetchers. Sometimes developers
>want SplitFetcherManager to act as a FixedThreadPool, because each time
>a thread is recycled and then recreated, the context resources need to
>be rebuilt. I met a related issue in flink cdc [2].
>- KafkaSourceFetcherManager [3] also extends SingleThreadFetcherManager
>to commit offsets. But now the Kafka source is not in the Flink
>repository, so this is no longer allowed.
>
> [1] https://issues.apache.org/jira/browse/FLINK-22358
>
> [2]
>
> https://github.com/ververica/flink-cdc-connectors/pull/2571#issuecomment-1797585418
>
> [3]
>
> https://github.com/apache/flink-connector-kafka/blob/979791c4c71e944c16c51419cf9a84aa1f8fea4c/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/reader/fetcher/KafkaSourceFetcherManager.java#L52
>
> Looking forward to hearing from you.
>
> Best regards,
> Hongshun
>
> On Thu, Nov 9, 2023 at 11:46 PM Martijn Visser 
> wrote:
>
> > Hi all,
> >
> > I'm looking at the original Jira that introduced these stability
> > designations [1] and I'm just curious if it was intended that these
> > Internal classes would be used directly, or if we just haven't created
> > the right abstractions? The reason for asking is because moving
> > something from Internal to a public designation is an easy fix, but I
> > want to make sure that it's also the right fix. If we are missing good
> > abstractions, then I would rather invest in those.
> >
> > Best regards,
> >
> > Martijn
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-22358
> >
> > On Wed, Nov 8, 2023 at 12:40 PM Leonard Xu  wrote:
> > >
> > > Thanks Hongshun for starting this discussion.
> > >
> > > +1 from my side.
> > >
> > > IIRC, @Jiangjie(Becket) also mentioned this in FLINK-31324 comment[1].
> > >
> > > Best,
> > > Leonard
> > >
> > > [1]
> >
> https://issues.apache.org/jira/browse/FLINK-31324?focusedCommentId=17696756=com.atlassian.jira.plugin.system.issue

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-31 Thread Becket Qin
Hi Jiabao,

Thanks for the explanation. Maybe it's easier to explain with an example.

Let's take FileSystemTableSource as an example. Currently it implements
SupportsFilterPushDown interface. With your proposal, does it have to
support `source.filter-push-down.enabled` as well? But this configuration
does not quite make sense for the FileSystemTableSource because filter
pushdown is always desired. However, because this configuration is a part
of the SupportsFilterPushDown interface (which sounds confusing to begin
with), the FileSystemTableSource can only do one of the following:

1. Ignore the user configuration to always apply the pushed down filters -
this is an apparent anti-pattern because a configuration should always do
what it says.
2. Throw an exception telling users that this configuration is not
applicable to the FileSystemTableSource.
3. Implement this configuration to push back the pushed down filters, even
though this is never desired.

None of the above options looks good. I am curious what your solution is
here?

Thanks,

Jiangjie (Becket) Qin

On Tue, Oct 31, 2023 at 3:11 PM Jiabao Sun 
wrote:

> Thanks Becket for the further explanation.
>
> Perhaps I didn't explain it clearly.
>
> 1. If a source does not implement the SupportsFilterPushDown interface,
> the newly added configurations do not need to be added to either the
> requiredOptions or optionalOptions.
> Similar to LookupOptions, if a source does not implement
> LookupTableSource, there is no need to add LookupOptions to either
> requiredOptions or optionalOptions.
>
> 2. "And these configs are specific to those sources, instead of common
> configs."
> The newly introduced configurations define standardized names and default
> values.
> They still belong to the configuration at the individual source level.
> The purpose is to avoid scattered configuration items when different
> sources implement the same logic.
> Whether a source should accept these configurations is determined by the
> source's Factory.
>
> Best,
> Jiabao
>
>
> > 2023年10月31日 13:47,Becket Qin  写道:
> >
> > Hi Jiabao,
> >
> > Please see the replies inline.
> >
> > Introducing common configurations does not mean that all sources must
> >> accept these configuration options.
> >> The configuration options supported by a source are determined by the
> >> requiredOptions and optionalOptions in the Factory interface.
> >
> > This is not true. Both required and optional options are SUPPORTED. That
> > means they are implemented and if one specifies an optional config it
> will
> > still take effect. An OptionalConfig is "Optional" because this
> > configuration has a default value. Hence, it is OK that users do not
> > specify their own value. In another word, it is "optional" for the end
> > users to set the config, but the implementation and support for that
> config
> > is NOT optional. In case a source does not support a common config, an
> > exception must be thrown when the config is provided by the end users.
> > However, the config we are talking about in this FLIP is a common config
> > optional to implement, meaning that sometimes the claimed behavior won't
> be
> > there even if users specify that config.
> >
> > Similar to sources that do not implement the LookupTableSource interface,
> >> sources that do not implement the SupportsFilterPushDown interface also
> do
> >> not need to accept newly introduced options.
> >
> > First of all, filter pushdown is a behavior of the query optimizer, not
> the
> > behavior of Sources. The Sources tells the optimizer that it has the
> > ability to accept pushed down filters by implementing the
> > SupportsFilterPushDown interface. And this is the only contract between
> the
> > Source and Optimizer regarding whether filters should be pushed down. As
> > long as a specific source implements this decorative interface, filter
> > pushdown should always take place, i.e.
> > *SupportsFilterPushDown.applyFilters()* will be called. There should be
> no
> > other config to disable that call. However, Sources can decide how to
> > behave based on their own configurations after *applyFilters()* is
> called.
> > And these configs are specific to those sources, instead of common
> configs.
> > Please see the examples I mentioned in the previous email.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Oct 31, 2023 at 10:27 AM Jiabao Sun  .invalid>
> > wrote:
> >
> >> Hi Becket,
> >>
> >> Sorry, there was a typo in the second point. Let me correct it:
> >>

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-30 Thread Becket Qin
Hi Jiabao,

Please see the replies inline.

Introducing common configurations does not mean that all sources must
> accept these configuration options.
> The configuration options supported by a source are determined by the
> requiredOptions and optionalOptions in the Factory interface.

This is not true. Both required and optional options are SUPPORTED. That
means they are implemented and if one specifies an optional config it will
still take effect. An OptionalConfig is "Optional" because this
configuration has a default value. Hence, it is OK that users do not
specify their own value. In other words, it is "optional" for the end
users to set the config, but the implementation and support for that config
is NOT optional. In case a source does not support a common config, an
exception must be thrown when the config is provided by the end users.
However, the config we are talking about in this FLIP is a common config
optional to implement, meaning that sometimes the claimed behavior won't be
there even if users specify that config.

Similar to sources that do not implement the LookupTableSource interface,
> sources that do not implement the SupportsFilterPushDown interface also do
> not need to accept newly introduced options.

First of all, filter pushdown is a behavior of the query optimizer, not the
behavior of Sources. The Sources tells the optimizer that it has the
ability to accept pushed down filters by implementing the
SupportsFilterPushDown interface. And this is the only contract between the
Source and Optimizer regarding whether filters should be pushed down. As
long as a specific source implements this decorative interface, filter
pushdown should always take place, i.e.
*SupportsFilterPushDown.applyFilters()* will be called. There should be no
other config to disable that call. However, Sources can decide how to
behave based on their own configurations after *applyFilters()* is called.
And these configs are specific to those sources, instead of common configs.
Please see the examples I mentioned in the previous email.

Thanks,

Jiangjie (Becket) Qin

On Tue, Oct 31, 2023 at 10:27 AM Jiabao Sun 
wrote:

> Hi Becket,
>
> Sorry, there was a typo in the second point. Let me correct it:
>
> Introducing common configurations does not mean that all sources must
> accept these configuration options.
> The configuration options supported by a source are determined by the
> requiredOptions and optionalOptions in the Factory interface.
>
> Similar to sources that do not implement the LookupTableSource interface,
> sources that do not implement the SupportsFilterPushDown interface also do
> not need to accept newly introduced options.
>
> Best,
> Jiabao
>
>
> > 2023年10月31日 10:13,Jiabao Sun  写道:
> >
> > Thanks Becket for the feedback.
> >
> > 1. Currently, the SupportsFilterPushDown#applyFilters method returns a
> result that includes acceptedFilters and remainingFilters. The source can
> decide to push down some filters or not accept any of them.
> > 2. Introducing common configuration options does not mean that a source
> that supports the SupportsFilterPushDown capability must accept this
> configuration. Similar to LookupOptions, only sources that implement the
> LookupTableSource interface are necessary to accept these configuration
> options.
> >
> > Best,
> > Jiabao
> >
> >
> >> 2023年10月31日 07:49,Becket Qin  写道:
> >>
> >> Hi Jiabao and Ruanhang,
> >>
> >> Adding a configuration of source.filter-push-down.enabled as a common
> >> source configuration seems problematic.
> >> 1. The config name is misleading. filter pushdown should only be
> determined
> >> by whether the SupportsFilterPushdown interface is implemented or not.
> >> 2. The behavior of this configuration is only applicable to some source
> >> implementations. Why is it a common configuration?
> >>
> >> Here's my suggestion for design principles:
> >> 1. Only add source impl specific configuration to corresponding sources.
> >> 2. The configuration name should not overrule existing common contracts.
> >>
> >> For example, in the case of MySql source. There are several options:
> >> 1. Have a configuration of `*mysql.avoid.remote.full.table.scan`*. If
> this
> >> configuration is set, and a filter pushdown does not hit an index, the
> >> MySql source impl would not further pushdown the filter to MySql
> servers.
> >> Note that this assumes the MySql source can retrieve the index
> information
> >> from the MySql servers.
> >> 2. If the MySql index information is not available to the MySql source,
> the
> >> configuration could be something like
> 

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-30 Thread Becket Qin
Hi Jiabao and Ruanhang,

Adding a configuration of source.filter-push-down.enabled as a common
source configuration seems problematic.
1. The config name is misleading. filter pushdown should only be determined
by whether the SupportsFilterPushdown interface is implemented or not.
2. The behavior of this configuration is only applicable to some source
implementations. Why is it a common configuration?

Here's my suggestion for design principles:
1. Only add source impl specific configuration to corresponding sources.
2. The configuration name should not overrule existing common contracts.

For example, in the case of a MySql source, there are several options:
1. Have a configuration of `mysql.avoid.remote.full.table.scan`. If this
configuration is set, and a pushed-down filter does not hit an index, the
MySql source impl would not further push down the filter to the MySql
servers. Note that this assumes the MySql source can retrieve the index
information from the MySql servers.
2. If the MySql index information is not available to the MySql source, the
configuration could be something like `mysql.pushback.pushed.down.filters`.
Once set to true, the MySql source would just add all the filters to the
RemainingFilters in the Result returned by
SupportsFilterPushdown.applyFilters().
3. An alternative to option 2 is to have a
`mysql.apply.predicates.after.scan`. When it is set to true, the MySql source
will not push the filters down to the MySql servers, but apply the filters
inside the MySql source itself.

As you may see, the above configurations do not disable filter pushdown
itself. They just allow various implementations of filter pushdown. And the
configuration name does not give any illusion that filter pushdown is
disabled.

Thanks,

Jiangjie (Becket) Qin

On Mon, Oct 30, 2023 at 11:58 PM Jiabao Sun 
wrote:

> Thanks Hang for the suggestion.
>
>
> I think the configuration of TableSource is not closely related to
> SourceReader, so I prefer to introduce an independent configuration class
> TableSourceOptions in the flink-table-common module, similar to
> LookupOptions.
>
> For the second point, I suggest adding Javadoc to the SupportsXXXPushDown
> interfaces, providing detailed information on the options that need to
> be supported.
>
> I have made updates in the FLIP document.
> Please help check it again.
>
>
> Best,
> Jiabao
>
>
> > 2023年10月30日 17:23,Hang Ruan  写道:
> >
> > Thanks for the improvements, Jiabao.
> >
> > There are some details that I am not sure about.
> > 1. The new option `source.filter-push-down.enabled` will be added to
> which
> > class? I think it should be `SourceReaderOptions`.
> > 2. How are the connector developers able to know and follow the FLIP? Do
> we
> > need an abstract base class or provide a default method?
> >
> > Best,
> > Hang
> >
> > Jiabao Sun  于2023年10月30日周一 14:45写道:
> >
> >> Hi, all,
> >>
> >> Thanks for the lively discussion.
> >>
> >> Based on the discussion, I have made some adjustments to the FLIP
> document:
> >>
> >> 1. The name of the newly added option has been changed to
> >> "source.filter-push-down.enabled".
> >> 2. Considering compatibility with older versions, the newly added
> >> "source.filter-push-down.enabled" option needs to respect the
> optimizer's
> >> "table.optimizer.source.predicate-pushdown-enabled" option.
> >>But there is a consideration to remove the old option in Flink 2.0.
> >> 3. We can provide more options to disable other source abilities with
> side
> >> effects, such as “source.aggregate.enabled” and
> “source.projection.enabled"
> >>This is not urgent and can be continuously introduced.
> >>
> >> Looking forward to your feedback again.
> >>
> >> Best,
> >> Jiabao
> >>
> >>
> >>> 2023年10月29日 08:45,Becket Qin  写道:
> >>>
> >>> Thanks for digging into the git history, Jark. I agree it makes sense
> to
> >>> deprecate this API in 2.0.
> >>>
> >>> Cheers,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Fri, Oct 27, 2023 at 5:47 PM Jark Wu  wrote:
> >>>
> >>>> Hi Becket,
> >>>>
> >>>> I checked the history of "
> >>>> *table.optimizer.source.predicate-pushdown-enabled*",
> >>>> it seems it was introduced since the legacy FilterableTableSource
> >>>> interface
> >>>> which might be an experiential feature at that time. I don't see the
> >>>> necessity
> >>>>

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-28 Thread Becket Qin
Thanks for digging into the git history, Jark. I agree it makes sense to
deprecate this API in 2.0.

Cheers,

Jiangjie (Becket) Qin

On Fri, Oct 27, 2023 at 5:47 PM Jark Wu  wrote:

> Hi Becket,
>
> I checked the history of "
> *table.optimizer.source.predicate-pushdown-enabled*",
> it seems it was introduced since the legacy FilterableTableSource
> interface
> which might be an experimental feature at that time. I don't see the
> necessity
> of this option at the moment. Maybe we can deprecate this option and drop
> it
> in Flink 2.0[1] if it is not necessary anymore. This may help to
> simplify this discussion.
>
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/FLINK-32383
>
>
>
> On Thu, 26 Oct 2023 at 10:14, Becket Qin  wrote:
>
>> Thanks for the proposal, Jiabao. My two cents below:
>>
>> 1. If I understand correctly, the motivation of the FLIP is mainly to
>> make predicate pushdown optional on SOME of the Sources. If so, intuitively
>> the configuration should be Source specific instead of general. Otherwise,
>> we will end up with general configurations that may not take effect for
>> some of the Source implementations. This violates the basic rule of a
>> configuration - it does what it says, regardless of the implementation.
>> While configuration standardization is usually a good thing, it should not
>> break the basic rules.
>> If we really want to have this general configuration, for the sources
>> this configuration does not apply, they should throw an exception to make
>> it clear that this configuration is not supported. However, that seems ugly.
>>
>> 2. I think the actual motivation of this FLIP is about "how a source
>> should implement predicate pushdown efficiently", not "whether predicate
>> pushdown should be applied to the source." For example, if a source wants
>> to avoid additional computing load in the external system, it can always
>> read the entire record and apply the predicates by itself. However, from
>> the Flink perspective, the predicate pushdown is applied, it is just
>> implemented differently by the source. So the design principle here is that
>> Flink only cares about whether a source supports predicate pushdown or not,
>> it does not care about the implementation efficiency / side effect of the
>> predicates pushdown. It is the Source implementation's responsibility to
>> ensure the predicates pushdown is implemented efficiently and does not
>> impose excessive pressure on the external system. And it is OK to have
>> additional configurations to achieve this goal. Obviously, such
>> configurations will be source specific in this case.
>>
>> 3. Regarding the existing configuration of
>> *table.optimizer.source.predicate-pushdown-enabled*.
>> I am not sure why we need it. Supposedly, if a source implements a
>> SupportsXXXPushDown interface, the optimizer should push the corresponding
>> predicates to the Source. I am not sure in which case this configuration
>> would be used. Any ideas @Jark Wu ?
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>> On Wed, Oct 25, 2023 at 11:55 PM Jiabao Sun
>>  wrote:
>>
>>> Thanks Jane for the detailed explanation.
>>>
>>> I think that for users, we should respect conventions over
>>> configurations.
>>> Conventions can be default values explicitly specified in
>>> configurations, or they can be behaviors that follow previous versions.
>>> If the same code has different behaviors in different versions, it would
>>> be a very bad thing.
>>>
>>> I agree that for regular users, it is not necessary to understand all
>>> the configurations related to Flink.
>>> By following conventions, they can have a good experience.
>>>
>>> Let's get back to the practical situation and consider it.
>>>
>>> Case 1:
>>> The user is not familiar with the purpose of the
>>> table.optimizer.source.predicate-pushdown-enabled configuration but follows
>>> the convention of allowing predicate pushdown to the source by default.
>>> Just understanding the source.predicate-pushdown-enabled configuration
>>> and performing fine-grained toggle control will work well.
>>>
>>> Case 2:
>>> The user understands the meaning of the
>>> table.optimizer.source.predicate-pushdown-enabled configuration and has set
>>> its value to false.
>>> We have reason to believe that the user understands the meaning of the
>>> predicate pushdown configuration and 

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-25 Thread Becket Qin
Thanks for the proposal, Jiabao. My two cents below:

1. If I understand correctly, the motivation of the FLIP is mainly to make
predicate pushdown optional on SOME of the Sources. If so, intuitively the
configuration should be Source specific instead of general. Otherwise, we
will end up with general configurations that may not take effect for some
of the Source implementations. This violates the basic rule of a
configuration - it does what it says, regardless of the implementation.
While configuration standardization is usually a good thing, it should not
break the basic rules.
If we really want to have this general configuration, for the sources this
configuration does not apply, they should throw an exception to make it
clear that this configuration is not supported. However, that seems ugly.
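
For illustration, a minimal sketch of how the existing global switch and a
source-specific switch could sit side by side. The per-source option key
'source.predicate-pushdown-enabled' follows the naming used in this thread;
whether it ends up as a connector WITH option, plus the connector name and
table below, are assumptions, not an agreed design.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PredicatePushdownOptionsSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(
                        EnvironmentSettings.newInstance().inStreamingMode().build());

        // Existing, planner-wide switch: whether the optimizer pushes predicates at all.
        tEnv.getConfig()
                .getConfiguration()
                .setString("table.optimizer.source.predicate-pushdown-enabled", "true");

        // Hypothetical per-source switch living in the table's WITH options; the key
        // name and 'some-connector' are placeholders, so this DDL is illustrative only.
        tEnv.executeSql(
                "CREATE TABLE orders (id BIGINT, amount DOUBLE) WITH ("
                        + " 'connector' = 'some-connector',"
                        + " 'source.predicate-pushdown-enabled' = 'false'"
                        + ")");
    }
}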

2. I think the actual motivation of this FLIP is about "how a source should
implement predicate pushdown efficiently", not "whether predicate pushdown
should be applied to the source." For example, if a source wants to avoid
additional computing load in the external system, it can always read the
entire record and apply the predicates by itself. However, from the Flink
perspective, the predicate pushdown is applied, it is just implemented
differently by the source. So the design principle here is that Flink only
cares about whether a source supports predicate pushdown or not, it does
not care about the implementation efficiency / side effect of the
predicates pushdown. It is the Source implementation's responsibility to
ensure the predicates pushdown is implemented efficiently and does not
impose excessive pressure on the external system. And it is OK to have
additional configurations to achieve this goal. Obviously, such
configurations will be source specific in this case.
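
As a concrete but hedged sketch of this principle with the existing
SupportsFilterPushDown interface: the source below accepts the pushed filters
as far as Flink is concerned, and how it evaluates them, locally over full
records or inside the external system, stays an implementation detail. The
class name and bookkeeping field are made up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

/** Illustrative only; a real source would also implement ScanTableSource etc. */
public class LocallyFilteringSource implements SupportsFilterPushDown {

    // Predicates the source will evaluate itself while scanning full records.
    private final List<ResolvedExpression> locallyAppliedFilters = new ArrayList<>();

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        locallyAppliedFilters.addAll(filters);
        // acceptedFilters: handled by the source (here: applied locally during the scan);
        // remainingFilters: what the planner must still evaluate after the source.
        // A source that wants to reject the pushdown entirely could instead return
        // Result.of(Collections.emptyList(), filters).
        return Result.of(new ArrayList<>(filters), Collections.emptyList());
    }
}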

3. Regarding the existing configurations of
*table.optimizer.source.predicate-pushdown-enabled.
*I am not sure why we need it. Supposedly, if a source implements a
SupportsXXXPushDown interface, the optimizer should push the corresponding
predicates to the Source. I am not sure in which case this configuration
would be used. Any ideas @Jark Wu ?

Thanks,

Jiangjie (Becket) Qin


On Wed, Oct 25, 2023 at 11:55 PM Jiabao Sun 
wrote:

> Thanks Jane for the detailed explanation.
>
> I think that for users, we should respect conventions over configurations.
> Conventions can be default values explicitly specified in configurations,
> or they can be behaviors that follow previous versions.
> If the same code has different behaviors in different versions, it would
> be a very bad thing.
>
> I agree that for regular users, it is not necessary to understand all the
> configurations related to Flink.
> By following conventions, they can have a good experience.
>
> Let's get back to the practical situation and consider it.
>
> Case 1:
> The user is not familiar with the purpose of the
> table.optimizer.source.predicate-pushdown-enabled configuration but follows
> the convention of allowing predicate pushdown to the source by default.
> Just understanding the source.predicate-pushdown-enabled configuration and
> performing fine-grained toggle control will work well.
>
> Case 2:
> The user understands the meaning of the
> table.optimizer.source.predicate-pushdown-enabled configuration and has set
> its value to false.
> We have reason to believe that the user understands the meaning of the
> predicate pushdown configuration and the intention is to disable predicate
> pushdown (rather than whether or not to allow it).
> The previous choice of globally disabling it is likely because it couldn't
> be disabled on individual sources.
> From this perspective, if we provide more fine-grained configuration
> support and provide detailed explanations of the configuration behaviors in
> the documentation,
> users can clearly understand the differences between these two
> configurations and use them correctly.
>
> Also, I don't agree that table.optimizer.source.predicate-pushdown-enabled
> = true and source.predicate-pushdown-enabled = false means that the local
> configuration overrides the global configuration.
> On the contrary, both configurations are functioning correctly.
> The optimizer allows predicate pushdown to all sources, but some sources
> can reject the filters pushed down by the optimizer.
> This is natural, just like different components at different levels are
> responsible for different tasks.
>
> The more serious issue is that if "source.predicate-pushdown-enabled" does
> not respect "table.optimizer.source.predicate-pushdown-enabled”,
> the "table.optimizer.source.predicate-pushdown-enabled" configuration will
> be invalidated.
> This means that regardless of whether
> "table.optimizer.source.predicate-pushdown-enabled" is

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-20 Thread Becket Qin
Hi Martijn,

This FLIP has passed voting[1]. It is a modification on top of the FLIP-95
interface.

Thanks,

Jiangjie (Becket) Qin

[1] https://lists.apache.org/thread/hysv9y1f48gtpr5vx3x40wtjb6cp9ky6

On Wed, Sep 20, 2023 at 9:29 PM Martijn Visser 
wrote:

> For clarity purposes, this FLIP is being abandoned because it was part
> of FLIP-95?
>
> On Thu, Sep 7, 2023 at 3:01 AM Venkatakrishnan Sowrirajan
>  wrote:
> >
> > Hi everyone,
> >
> > Posted a PR (https://github.com/apache/flink/pull/23313) to add nested
> > fields filter pushdown. Please review. Thanks.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Sep 5, 2023 at 10:04 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> > wrote:
> >
> > > Based on an offline discussion with Becket Qin, I added *fieldIndices *
> > > back which is the field index of the nested field at every level to
> the *NestedFieldReferenceExpression
> > > *in FLIP-356
> > > <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> >
> > > *. *2 reasons to do it:
> > >
> > > 1. Agree with using *fieldIndices *as the only contract to refer to the
> > > column from the underlying datasource.
> > > 2. To keep it consistent with *FieldReferenceExpression*
> > >
> > > Having said that, I see that with *projection pushdown, *index of the
> > > fields are used whereas with *filter pushdown (*based on scanning few
> > > tablesources) *FieldReferenceExpression*'s name is used for eg: even in
> > > the Flink's *FileSystemTableSource, IcebergSource, JDBCDatsource*. This
> > > way, I feel the contract is not quite clear and explicit. Wanted to
> > > understand other's thoughts as well.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Tue, Sep 5, 2023 at 5:34 PM Becket Qin 
> wrote:
> > >
> > >> Hi Venkata,
> > >>
> > >>
> > >> > Also I made minor changes to the *NestedFieldReferenceExpression,
> > >> *instead
> > >> > of *fieldIndexArray* we can just do away with *fieldNames *array
> that
> > >> > includes fieldName at every level for the nested field.
> > >>
> > >>
> > >> I don't think keeping only the field names array would work. At the
> end of
> > >> the day, the contract between Flink SQL and the connectors is based
> on the
> > >> indexes, not the names. Technically speaking, the connectors only
> emit a
> > >> bunch of RowData which is based on positions. The field names are
> added by
> > >> the SQL framework via the DDL for those RowData. In this sense, the
> > >> connectors may not be aware of the field names in Flink DDL at all.
> The
> > >> common language between Flink SQL and source is just positions. This
> is
> > >> also why ProjectionPushDown would work by only relying on the
> indexes, not
> > >> the field names. So I think the field index array is a must have here
> in
> > >> the NestedFieldReferenceExpression.
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >> On Fri, Sep 1, 2023 at 8:12 AM Venkatakrishnan Sowrirajan <
> > >> vsowr...@asu.edu>
> > >> wrote:
> > >>
> > >> > Gentle ping on the vote for FLIP-356: Support Nested fields filter
> > >> pushdown
> > >> > <
> > >>
> https://urldefense.com/v3/__https://www.mail-archive.com/dev@flink.apache.org/msg69289.html__;!!IKRxdwAv5BmarQ!bOW26WlafOQQcb32eWtUiXBAl0cTCK1C6iYhDI2f_z__eczudAWmTRvjDiZg6gzlXmPXrDV4KJS5cFxagFE$
> > >> >.
> > >> >
> > >> > Regards
> > >> > Venkata krishnan
> > >> >
> > >> >
> > >> > On Tue, Aug 29, 2023 at 9:18 PM Venkatakrishnan Sowrirajan <
> > >> > vsowr...@asu.edu>
> > >> > wrote:
> > >> >
> > >> > > Sure, will reference this discussion to resume where we started as
> > >> part
> > >> > of
> > >> > > the flip to refactor SupportsProjectionPushDown.
> > >> > >
> > >> > > On Tue, Aug 29, 2023, 7:22 PM Jark Wu  wrote:
> > >> > >
> > >> > >> I'm fine with this. `ReferenceExpression` and
> > >> > `SupportsProjectionPushDown`
> > >>

Re: [VOTE] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-09-13 Thread Becket Qin
+1 (binding)

Thanks for the FLIP, Archit.

Cheers,

Jiangjie (Becket) Qin


On Thu, Sep 14, 2023 at 10:31 AM Dong Lin  wrote:

> Thanks Archit for the FLIP.
>
> +1 (binding)
>
> Regards,
> Dong
>
> On Thu, Sep 14, 2023 at 1:47 AM Archit Goyal  >
> wrote:
>
> > Hi everyone,
> >
> > Thanks for reviewing the FLIP-355 Add parent dir of files to classpath
> > using yarn.provided.lib.dirs :
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+355%3A+Add+parent+dir+of+files+to+classpath+using+yarn.provided.lib.dirs
> >
> > Following is the discussion thread :
> > https://lists.apache.org/thread/gv0ro4jsq4o206wg5gz9z5cww15qkvb9
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours (until September 15, 12:00AM GMT) unless there is an objection or
> an
> > insufficient number of votes.
> >
> > Thanks,
> > Archit Goyal
> >
>


Re: [DISCUSS] Flink annotation strategy/consensus

2023-09-13 Thread Becket Qin
>
> Does it make sense to clearly define that APIs without annotation are
> internal APIs and should not be used outside of Flink? And deprecate @Internal?


We can do this. Although I think it is OK to keep the @Internal annotation
in case extra clarity is needed sometimes.

Thanks,

Jiangjie (Becket) Qin

On Tue, Sep 12, 2023 at 7:11 PM Jing Ge  wrote:

> Hi Becket,
>
> Thanks for your reply with details.
>
>
> > 2. I agree it would be too verbose to annotate every internal method /
> > class / interface. Currently we already treat the methods / interfaces /
> > classes without annotations as effectively @Internal.
> >
>
> Does it make sense to clearly define that APIs without annotation are
> internal APIs and should not be used outside of Flink? And deprecate @Internal?
>
> Best regards,
> Jing
>
> On Mon, Sep 11, 2023 at 5:05 AM Becket Qin  wrote:
>
> > Hi Jing,
> >
> > Thanks for bringing up the discussion. My two cents:
> >
> > 1. All the public methods / classes / interfaces MUST be annotated with
> one
> > of the @Experimental / @PublicEvolving / @Public. In practice, all the
> > methods by default inherit the annotation from the containing class,
> unless
> > annotated otherwise. e.g. an @Internal method in a @Public class.
>
>
> >
> 2. I agree it would be too verbose to annotate every internal method /
> > class / interface. Currently we already treat the methods / interfaces /
> > classes without annotations as effectively @Internal.
>
>
>
> 3. Per our discussion in the other thread, @Deprecated SHOULD coexist with
> > one of the @Experimental / @PublicEvolving / @Public. In that
> > case, @Deprecated overrides the other annotation, which means that public
> > API will not evolve and will be removed according to the deprecation
> > process.
> >
> > 4. The internal methods / classes / interfaces SHOULD NOT be marked as
> > deprecated. Instead, an immediate refactor should be done to remove the
> > "deprecated" internal methods / classes / interfaces, and migrate the
> code
> > to its successor. Otherwise, technical debts will build up.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Sat, Sep 9, 2023 at 5:29 AM Jing Ge 
> wrote:
> >
> > > Hi devs,
> > >
> > > While I was joining the flink-avro enhancement and cleanup discussion
> > > driven by Becket[1], I realized that there are some issues with the
> > current
> > > Flink API annotation usage in the source code.
> > >
> > > As far as I am concerned, Flink wants to control the access/visibility
> of
> > > APIs across modules and for downstreams. Since no OSGI is used(it
> should
> > > not be used because of its complexity, IMHO), Flink decided to use a
> very
> > > lightweight but manual solution: customized annotation like @Internal,
> > > @Experimental, @PublicEvolving,
> > > etc. This is a Flink only concept on top of JDK annotation and is
> > therefore
> > > orthogonal to @Deprecated or any other annotations offered by JDK.
> After
> > > this concept has been used, APIs without one of these annotations are
> in
> > > the kind of gray area which means they have no contract in the context
> of
> > > this new concept. Without any given metadata they could be considered
> > > as @Internal or @Experimental, because changes are allowed to be
> applied
> > at
> > > any time. But there is no clear definition and therefore different
> people
> > > will understand it differently.
> > >
> > > There are two options to improve it, as far as I could figure out:
> > >
> > > option 1: All APIs must have one of those annotations. We should put
> some
> > > effort into going through all source code and add missing annotations.
> > > There were discussions[2] and activities going in this direction.
> > > option 2: the community comes to a new consensus that APIs without
> > > annotation equals one of @Internal, @Experimental, or @PublicEvolving.
> I
> > > personally will choose @Internal, because it is the safest one. And if
> > > @Internal is chosen as the default one, it could also be deprecated,
> > > because no annotation equals @Internal. If it makes sense, I can
> create a
> > > FLIP and help the community reach this consensus.
> > >
> > > Both options have their own pros and cons. I would choose option 2,
> since
> > > we will not end up with a lot of APIs marked as @Internal.
> > >
> > > Looking forward to hearing your thoughts.
> > >
> > > Best regards
> > > Jing
> > >
> > >
> > > [1] https://lists.apache.org/thread/7zsv528swbjxo5zk0bxq33hrkvd77d6f
> > > [2] https://lists.apache.org/thread/zl2rmodsjsdb49tt4hn6wv3gdwo0m31o
> > >
> >
>


Re: [DISCUSS] Flink annotation strategy/consensus

2023-09-11 Thread Becket Qin
Hi Shammon,

Thanks for the reply.

Do we really need to have `@Internal` methods in an `@Public` interface or
> class? In general, if a class or interface is marked as `@Public `, it is
> better that their public methods should also be `@Public`, because even if
> marked as `@Internal`, users are not aware of it when using it, which could
> be strange.

It is more like a legacy issue that the public and internal usage share the
same concrete class. e.g. DataStream.getId() is for internal usage, but
happens to be in DataStream which is a public class. This should be avoided
in the future. It is a good practice to create separate interfaces for the
users in this case.

Regarding the API stability promotion, you may want to check the
FLIP-197[1].

Thanks,

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process

On Mon, Sep 11, 2023 at 11:43 AM Shammon FY  wrote:

> Thanks Jing for starting this discussion.
>
> For @Becket
> > 1. All the public methods / classes / interfaces MUST be annotated with
> one of the @Experimental / @PublicEvolving / @Public. In practice, all the
> methods by default inherit the annotation from the containing class, unless
> annotated otherwise. e.g. an @Internal method in a @Public class.
>
> Do we really need to have `@Internal` methods in an `@Public` interface or
> class? In general, if a class or interface is marked as `@Public `, it is
> better that their public methods should also be `@Public`, because even if
> marked as `@Internal`, users are not aware of it when using it, which could
> be strange.
>
> @Jing Besides `@Internal`, I have some cents about `@PublicEvolving` and
> `@Public`. Currently when we add an interface which will be used by
> external systems, we often annotate it as `@PublicEvolving`. But when
> should this interface be marked as `@Public`? I find it is difficult to
> determine this. Is `@PublicEvolving` really necessary? Should we directly
> remove `@PublicEvolving` and use `@Public` instead? I think it would be
> simpler.
>
> Best,
> Shammon FY
>
>
> On Mon, Sep 11, 2023 at 11:05 AM Becket Qin  wrote:
>
> > Hi Jing,
> >
> > Thanks for bringing up the discussion. My two cents:
> >
> > 1. All the public methods / classes / interfaces MUST be annotated with
> one
> > of the @Experimental / @PublicEvolving / @Public. In practice, all the
> > methods by default inherit the annotation from the containing class,
> unless
> > annotated otherwise. e.g. an @Internal method in a @Public class.
> >
> > 2. I agree it would be too verbose to annotate every internal method /
> > class / interface. Currently we already treat the methods / interfaces /
> > classes without annotations as effectively @Internal.
> >
> > 3. Per our discussion in the other thread, @Deprecated SHOULD coexist
> with
> > one of the @Experimental / @PublicEvolving / @Public. In that
> > case, @Deprecated overrides the other annotation, which means that public
> > API will not evolve and will be removed according to the deprecation
> > process.
> >
> > 4. The internal methods / classes / interfaces SHOULD NOT be marked as
> > deprecated. Instead, an immediate refactor should be done to remove the
> > "deprecated" internal methods / classes / interfaces, and migrate the
> code
> > to its successor. Otherwise, technical debts will build up.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Sat, Sep 9, 2023 at 5:29 AM Jing Ge 
> wrote:
> >
> > > Hi devs,
> > >
> > > While I was joining the flink-avro enhancement and cleanup discussion
> > > driven by Becket[1], I realized that there are some issues with the
> > current
> > > Flink API annotation usage in the source code.
> > >
> > > As far as I am concerned, Flink wants to control the access/visibility
> of
> > > APIs across modules and for downstreams. Since no OSGI is used(it
> should
> > > not be used because of its complexity, IMHO), Flink decided to use a
> very
> > > lightweight but manual solution: customized annotation like @Internal,
> > > @Experimental, @PublicEvolving,
> > > etc. This is a Flink only concept on top of JDK annotation and is
> > therefore
> > > orthogonal to @Deprecated or any other annotations offered by JDK.
> After
> > > this concept has been used, APIs without one of these annotations are
> in
> > > the kind of gray area which means they have no contract in the context
> of
> > > this new concept. Without any given metadata they could be c

Re: [DISCUSS] Flink annotation strategy/consensus

2023-09-10 Thread Becket Qin
Hi Jing,

Thanks for bringing up the discussion. My two cents:

1. All the public methods / classes / interfaces MUST be annotated with one
of the @Experimental / @PublicEvolving / @Public. In practice, all the
methods by default inherit the annotation from the containing class, unless
annotated otherwise. e.g. an @Internal method in a @Public class.

2. I agree it would be too verbose to annotate every internal method /
class / interface. Currently we already treat the methods / interfaces /
classes without annotations as effectively @Internal.

3. Per our discussion in the other thread, @Deprecated SHOULD coexist with
one of the @Experimental / @PublicEvolving / @Public. In that
case, @Deprecated overrides the other annotation, which means that public
API will not evolve and will be removed according to the deprecation
process.

4. The internal methods / classes / interfaces SHOULD NOT be marked as
deprecated. Instead, an immediate refactor should be done to remove the
"deprecated" internal methods / classes / interfaces, and migrate the code
to its successor. Otherwise, technical debts will build up.
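
A small, hedged illustration of points 1 and 3. The classes below are
invented; the annotations themselves are the existing ones in
org.apache.flink.annotation.

import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;

// Point 1: the containing class carries a stability annotation; methods inherit it
// unless annotated otherwise, e.g. an @Internal method inside a @Public class.
@Public
class ExampleStream {

    @Internal // framework-only method, even though the class itself is @Public
    public int getId() {
        return 42;
    }
}

// Point 3: a deprecated public API keeps its stability annotation, but @Deprecated
// overrides it: the API stops evolving and follows the removal process.
@Deprecated
@PublicEvolving
interface ExampleLegacyApi {}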

Thanks,

Jiangjie (Becket) Qin



On Sat, Sep 9, 2023 at 5:29 AM Jing Ge  wrote:

> Hi devs,
>
> While I was joining the flink-avro enhancement and cleanup discussion
> driven by Becket[1], I realized that there are some issues with the current
> Flink API annotation usage in the source code.
>
> As far as I am concerned, Flink wants to control the access/visibility of
> APIs across modules and for downstreams. Since no OSGI is used(it should
> not be used because of its complexity, IMHO), Flink decided to use a very
> lightweight but manual solution: customized annotation like @Internal,
> @Experimental, @PublicEvolving,
> etc. This is a Flink only concept on top of JDK annotation and is therefore
> orthogonal to @Deprecated or any other annotations offered by JDK. After
> this concept has been used, APIs without one of these annotations are in
> the kind of gray area which means they have no contract in the context of
> this new concept. Without any given metadata they could be considered
> as @Internal or @Experimental, because changes are allowed to be applied at
> any time. But there is no clear definition and therefore different people
> will understand it differently.
>
> There are two options to improve it, as far as I could figure out:
>
> option 1: All APIs must have one of those annotations. We should put some
> effort into going through all source code and add missing annotations.
> There were discussions[2] and activities going in this direction.
> option 2: the community comes to a new consensus that APIs without
> annotation equals one of @Internal, @Experimental, or @PublicEvolving. I
> personally will choose @Internal, because it is the safest one. And if
> @Internal is chosen as the default one, it could also be deprecated,
> because no annotation equals @Internal. If it makes sense, I can create a
> FLIP and help the community reach this consensus.
>
> Both options have their own pros and cons. I would choose option 2, since
> we will not end up with a lot of APIs marked as @Internal.
>
> Looking forward to hearing your thoughts.
>
> Best regards
> Jing
>
>
> [1] https://lists.apache.org/thread/7zsv528swbjxo5zk0bxq33hrkvd77d6f
> [2] https://lists.apache.org/thread/zl2rmodsjsdb49tt4hn6wv3gdwo0m31o
>


Re: [DISCUSS] FLIP-358: flink-avro enhancement and cleanup

2023-09-06 Thread Becket Qin
Hi Stephen,

I don't think you should compare the DataType with the AvroSchema directly.
They are for different purposes and sometimes cannot be mapped in both
directions.

As of now, the following conversions are needed in Flink format:
1. Avro Schema -> Flink Table Schema (DataType). This is required when
registering the Flink table.
2. Flink Table Schema (DataType) -> Avro Schema. This is because after
projection pushdown, maybe only some of the fields need to be read from the
Avro record. So Flink Avro format needs to generate an Avro reader schema
from the projected fields represented in DataType.

The issue today is when you convert an AvroSchema_A in step 1 to get the
DataType, and try to convert that DataType back to AvroSchema_B,
AvroSchema_A and AvroSchema_B are not compatible. The idea is to use the
original AvroSchema_A as the assistance in step 2, so that AvroSchema_A and
AvroSchema_B are compatible. In your case, the Avro schema stored in the
schema registry will be that original Avro schema, i.e. AvroSchema_A.
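
A hedged sketch of the two steps with the existing AvroSchemaConverter; the
schema string is just an example.

import org.apache.avro.Schema;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.table.types.DataType;

public class AvroRoundTripSketch {
    public static void main(String[] args) {
        String avroSchemaA =
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                        + "{\"name\":\"name\",\"type\":\"string\"},"
                        + "{\"name\":\"age\",\"type\":\"int\"}]}";

        // Step 1: Avro schema -> Flink DataType, used when registering the table.
        DataType dataType = AvroSchemaConverter.convertToDataType(avroSchemaA);

        // Step 2: Flink DataType -> Avro reader schema, used e.g. after projection pushdown.
        Schema avroSchemaB = AvroSchemaConverter.convertToSchema(dataType.getLogicalType());

        // Today avroSchemaB is not guaranteed to be read-compatible with avroSchemaA
        // (nullability, record names etc. can differ); keeping the original avroSchemaA
        // around as assistance for step 2 is the idea described above.
        System.out.println(avroSchemaB);
    }
}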

Thanks,

Jiangjie (Becket) Qin

On Wed, Sep 6, 2023 at 8:32 PM 吴 stephen  wrote:

> Hi Becket,
> I notice that a new config will introduce to Avro Format and user can
> input their own schema. Since the user can input their schema , should Avro
> Format support a validation utils that validate whether the input schema is
> compatible with table columns?
>
> I’m modifying the Avro-Confulent Format in my team and want to make it
> serialize/deserialize by the schema exists on the schema-registry instead
> of using the schema generate by datatype. And I am thinking how to compare
> the datatype from the ddl with Avro schema. As I see the
> AvroSchemaConverter can transfer the Avro schema to datatype, can
> validation be simple as to judge whether the dataype from ddl is equal to
> datatype from Avro schema? If no, may I ask what's your opinion about the
> validation.
>
> I'm interested in the flip. If there's anything I can help with, please
> feel free to reach out to me.
>
> Best regards,
> Stephen
>
>
> > On Sep 5, 2023 at 3:15 PM, Becket Qin  wrote:
> >
> > Hi Jing,
> >
> > Thanks for the comments.
> >
> > 1. "For the batch cases, currently the BulkFormat for DataStream is
> >> missing" - true, and there is another option to leverage
> >> StreamFormatAdapter[1]
> >>
> > StreamFormatAdapter is internal and it requires a StreamFormat
> > implementation for Avro files which does not exist either.
> >
> > 2. "The following two interfaces should probably be marked as Public for
> >> now and Deprecated once we deprecate the InputFormat / OutputFormat" -
> >> would you like to share some background info of the deprecation of the
> >> InputFormat / OutputFormat? It is for me a little bit weird to mark
> APIs as
> >> public that are now known to be deprecated.
> >
> > InputFormat and OutputFormat are legacy APIs for SourceFunction and
> > SinkFunction. So when the SourceFunction and SinkFunction are deprecated,
> > the InputFormat and OutputFormat should also be deprecated accordingly.
> As
> > of now, technically speaking, we have not deprecated these two APIs. So,
> > making them public for now is just to fix the stability annotation
> because
> > they are already used publicly by the users.
> >
> > 3. "Remove the PublicEvolving annotation for the following deprecated
> >> classes. It does not make sense for an API to be PublicEvolving and
> >> Deprecated at the same time" - this is very common in the Flink code
> base
> >> to have PublicEvolving and Deprecated at the same time. APIs that do not
> >> survive the PublicEvolving phase will be marked as deprecated in
> addition.
> >> Removing PublicEvolving in this case will break Flink API graduation
> rule.
> >
> > Both PublicEvolving and Deprecated are status in the API lifecycle, they
> > are by definition mutually exclusive. When an API is marked as
> deprecated,
> > either the functionality is completely going away, or another API is
> > replacing the deprecated one. In either case, it does not make sense to
> > evolve that API any more. Even though Flink has some APIs marked with
> both
> > PublicEvolving and Deprecated at the same time, that does not make sense
> > and needs to be fixed. If a PublicEvolving API is deprecated, it should
> > only be marked as Deprecated, just like a Public API. I am not sure how
> > this would violate the API graduation rule, can you explain?
> >
> > By the way, there is another orthogonal abuse of the Deprecated
> annotation
> > in the Flink code base. For private methods, we shoul

Re: [DISCUSS] FLIP-358: flink-avro enhancement and cleanup

2023-09-06 Thread Becket Qin
Hi Jing,

Thanks for the explanation.

Since SourceFunction is already deprecated and we are working on
> SinkFunction deprecation for 1.19, I would suggest directly
> marking InputFormat and OutputFormat as deprecated. Because, once we mark
> them as public in one release, users might start to use them(they are
> public APIs). It will be weird for them to have freshly graduated public
> APIs get deprecated just after one minor release.

OK, then let's mark them as deprecated as well.

According to the definition of PublicEvolving [1]:
> "Classes and methods with this annotation are intended for public use and
> have stable behavior.
>  However, their interfaces and signatures are not considered to be stable
> and might be changed
>  across versions."


> Let's think about it from users' point of view. Once APIs are marked as
> PublicEvolving, it means the APIs are public, users will be using and
> depending on them. If we remove @PublicEvolving between minor releases, it
> means for me a regression. The APIs are downgraded from public(evolving)
> back to non-public. They could even be removed in the next minor release,
> since they only have @Deprecated annotation. No one knows they were
> PublicEvolving if developers don't go through the git history (in most
> cases on one will check and care the git history). This, for me, breaks the
> contract of @PublicEvolving.


A deprecated API is still a *public* API. In fact, the Deprecated
annotation should only be applied to public APIs. For internal APIs, an
immediate refactor should be done. So, I don't think it will break the API
contract.
That said, I think the part that confuses me with both PublicEvolving and
Deprecated is whether this API will still evolve or not, as Deprecated
basically means "public, no more change, to be removed", while
PublicEvolving indicates still evolving. But I guess by intuition users
will just consider the evolving part overridden by the deprecation. Maybe
it is fine to keep both. I'll update the FLIP.

Thanks,

Jiangjie (Becket) Qin


On Thu, Sep 7, 2023 at 12:35 AM Jing Ge  wrote:

> Hi Becket,
>
> Thanks for the clarification.
>
>
> > StreamFormatAdapter is internal and it requires a StreamFormat
> > implementation for Avro files which does not exist either.
> >
>
> I thought the cases 1-6 described in the FLIP mean there is a StreamFormat
> implementation for Avro. That was my fault. I didn't understand it
> correctly.
>
>
> > InputFormat and OutputFormat are legacy APIs for SourceFunction and
> > SinkFunction. So when the SourceFunction and SinkFunction are deprecated,
> > the InputFormat and OutputFormat should also be deprecated accordingly.
> As
> > of now, technically speaking, we have not deprecated these two APIs. So,
> > making them public for now is just to fix the stability annotation
> because
> > they are already used publicly by the users.
> >
>
> Since SourceFunction is already deprecated and we are working on
> SinkFunction deprecation for 1.19, I would suggest directly
> marking InputFormat and OutputFormat as deprecated. Because, once we mark
> them as public in one release, users might start to use them(they are
> public APIs). It will be weird for them to have freshly graduated public
> APIs get deprecated just after one minor release.
>
>
> > Both PublicEvolving and Deprecated are status in the API lifecycle, they
> > are by definition mutually exclusive. When an API is marked as
> deprecated,
> > either the functionality is completely going away, or another API is
> > replacing the deprecated one. In either case, it does not make sense to
> > evolve that API any more. Even though Flink has some APIs marked with
> both
> > PublicEvolving and Deprecated at the same time, that does not make sense
> > and needs to be fixed. If a PublicEvolving API is deprecated, it should
> > only be marked as Deprecated, just like a Public API. I am not sure how
> > this would violate the API graduation rule, can you explain?
> >
>
> According to the definition of PublicEvolving [1]:
> "Classes and methods with this annotation are intended for public use and
> have stable behavior.
>  However, their interfaces and signatures are not considered to be stable
> and might be changed
>  across versions."
>
> Let's think about it from users' point of view. Once APIs are marked as
> PublicEvolving, it means the APIs are public, users will be using and
> depending on them. If we remove @PublicEvolving between minor releases, it
> means for me a regression. The APIs are downgraded from public(evolving)
> back to non-public. They could even be removed in the next minor release,
> sinc

Re: [VOTE] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-05 Thread Becket Qin
Thanks for pushing the FLIP through.

+1 on the updated FLIP wiki.

Cheers,

Jiangjie (Becket) Qin

On Wed, Sep 6, 2023 at 1:12 PM Venkatakrishnan Sowrirajan 
wrote:

> Based on the recent discussions in the thread [DISCUSS] FLIP-356: Support
> Nested Fields Filter Pushdown
> <https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7>, I made
> some changes to the FLIP-356
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> >.
> Unless anyone else has any concerns, we can continue with this vote to
> reach consensus.
>
> Regards
> Venkata krishnan
>
>
> On Tue, Sep 5, 2023 at 8:04 AM Sergey Nuyanzin 
> wrote:
>
> > +1 (binding)
> >
> > On Tue, Sep 5, 2023 at 4:55 PM Jiabao Sun  > .invalid>
> > wrote:
> >
> > > +1 (non-binding)
> > >
> > > Best,
> > > Jiabao
> > >
> > >
> > > > On Sep 5, 2023 at 10:33 PM, Martijn Visser  wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > On Tue, Sep 5, 2023 at 4:16 PM ConradJam 
> wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >> On Fri, Sep 1, 2023 at 15:43, Yuepeng Pan  wrote:
> > > >>
> > > >>> +1 (non-binding)
> > > >>>
> > > >>> Best,
> > > >>> Yuepeng
> > > >>>
> > > >>>
> > > >>>
> > > >>> At 2023-09-01 14:32:19, "Jark Wu"  wrote:
> > > >>>> +1 (binding)
> > > >>>>
> > > >>>> Best,
> > > >>>> Jark
> > > >>>>
> > > >>>>> On Aug 30, 2023 at 02:40, Venkatakrishnan Sowrirajan  wrote:
> > > >>>>>
> > > >>>>> Hi everyone,
> > > >>>>>
> > > >>>>> Thank you all for your feedback on FLIP-356. I'd like to start a
> > > vote.
> > > >>>>>
> > > >>>>> Discussion thread:
> > > >>>>>
> >
> https://urldefense.com/v3/__https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7__;!!IKRxdwAv5BmarQ!eNR1R48e8jbqDCSdXqWj6bjfmP1uMn-IUIgVX3uXlgzYp_9rcf-nZOaAZ7KzFo2JwMAJPGYv8wfRxuRMAA$
> > > >>>>> FLIP:
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!eNR1R48e8jbqDCSdXqWj6bjfmP1uMn-IUIgVX3uXlgzYp_9rcf-nZOaAZ7KzFo2JwMAJPGYv8wdkI0waFw$
> > > >>>>>
> > > >>>>> Regards
> > > >>>>> Venkata krishnan
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > Best regards,
> > Sergey
> >
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-05 Thread Becket Qin
Hi Venkata,


> Also I made minor changes to the *NestedFieldReferenceExpression, *instead
> of *fieldIndexArray* we can just do away with *fieldNames *array that
> includes fieldName at every level for the nested field.


I don't think keeping only the field names array would work. At the end of
the day, the contract between Flink SQL and the connectors is based on the
indexes, not the names. Technically speaking, the connectors only emit a
bunch of RowData which is based on positions. The field names are added by
the SQL framework via the DDL for those RowData. In this sense, the
connectors may not be aware of the field names in Flink DDL at all. The
common language between Flink SQL and source is just positions. This is
also why ProjectionPushDown would work by only relying on the indexes, not
the field names. So I think the field index array is a must have here in
the NestedFieldReferenceExpression.
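
A minimal sketch (helper and parameter names are invented) of why an index
path is a workable contract: a connector can navigate RowData purely by
positions, without ever seeing the DDL field names.

import org.apache.flink.table.data.RowData;

public final class NestedFieldAccessSketch {

    /**
     * Walks a nested index path down to the row that holds the leaf field.
     * Example: a column addressed as row.f2.f1.name with fieldIndices = {2, 1, 0}
     * resolves to row.getRow(2, ...).getRow(1, ...); the caller then reads position 0
     * with the accessor matching the leaf's LogicalType (getString, getInt, ...).
     * childArity[i] is the field count of the row entered at step i, which
     * RowData#getRow requires.
     */
    public static RowData parentOfLeaf(RowData row, int[] fieldIndices, int[] childArity) {
        RowData current = row;
        for (int i = 0; i < fieldIndices.length - 1; i++) {
            current = current.getRow(fieldIndices[i], childArity[i]);
        }
        return current;
    }
}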

Thanks,

Jiangjie (Becket) Qin

On Fri, Sep 1, 2023 at 8:12 AM Venkatakrishnan Sowrirajan 
wrote:

> Gentle ping on the vote for FLIP-356: Support Nested fields filter pushdown
> <https://www.mail-archive.com/dev@flink.apache.org/msg69289.html>.
>
> Regards
> Venkata krishnan
>
>
> On Tue, Aug 29, 2023 at 9:18 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> wrote:
>
> > Sure, will reference this discussion to resume where we started as part
> of
> > the flip to refactor SupportsProjectionPushDown.
> >
> > On Tue, Aug 29, 2023, 7:22 PM Jark Wu  wrote:
> >
> >> I'm fine with this. `ReferenceExpression` and
> `SupportsProjectionPushDown`
> >> can be another FLIP. However, could you summarize the design of this
> part
> >> in the future part of the FLIP? This can be easier to get started with
> in
> >> the future.
> >>
> >>
> >> Best,
> >> Jark
> >>
> >>
> >> On Wed, 30 Aug 2023 at 02:45, Venkatakrishnan Sowrirajan <
> >> vsowr...@asu.edu>
> >> wrote:
> >>
> >> > Thanks Jark. Sounds good.
> >> >
> >> > One more thing, earlier in my summary I mentioned,
> >> >
> >> > Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
> >> > > abstract class which will be extended by both
> >> *FieldReferenceExpression*
> >> > >  and *NestedFieldReferenceExpression* (to be introduced as part of
> >> this
> >> > > FLIP)
> >> >
> >> > This can be punted for now and can be handled as part of refactoring
> >> > SupportsProjectionPushDown.
> >> >
> >> > Also I made minor changes to the *NestedFieldReferenceExpression,
> >> *instead
> >> > of *fieldIndexArray* we can just do away with *fieldNames *array that
> >> > includes fieldName at every level for the nested field.
> >> >
> >> > Updated the FLIP-357
> >> > <
> >> >
> >>
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!YAk6kV4CYvUSPfpoUDQRs6VlbmJXVX8KOKqFxKbNDkUWKzShvwpkLRGkAV1tgV3EqClNrjGS-Ij86Q$
> >> > >
> >> > wiki as well.
> >> >
> >> > Regards
> >> > Venkata krishnan
> >> >
> >> >
> >> > On Tue, Aug 29, 2023 at 5:21 AM Jark Wu  wrote:
> >> >
> >> > > Hi Venkata,
> >> > >
> >> > > Your summary looks good to me. +1 to start a vote.
> >> > >
> >> > > I think we don't need "inputIndex" in
> NestedFieldReferenceExpression.
> >> > > Actually, I think it is also not needed in FieldReferenceExpression,
> >> > > and we should try to remove it (another topic). The RexInputRef in
> >> > Calcite
> >> > > also doesn't require an inputIndex because the field index should
> >> > represent
> >> > > index of the field in the underlying row type. Field references
> >> shouldn't
> >> > > be
> >> > >  aware of the number of inputs.
> >> > >
> >> > > Best,
> >> > > Jark
> >> > >
> >> > >
> >> > > On Tue, 29 Aug 2023 at 02:24, Venkatakrishnan Sowrirajan <
> >> > vsowr...@asu.edu
> >> > > >
> >> > > wrote:
> >> > >
> >> > > > Hi Jinsong,
> >> > > >
> >> > > > Thanks for your comments.
> >> > > >
> >> > &

Re: [DISCUSS] FLIP-358: flink-avro enhancement and cleanup

2023-09-05 Thread Becket Qin
Hi Jing,

Thanks for the comments.

1. "For the batch cases, currently the BulkFormat for DataStream is
> missing" - true, and there is another option to leverage
> StreamFormatAdapter[1]
>
StreamFormatAdapter is internal and it requires a StreamFormat
implementation for Avro files which does not exist either.

2. "The following two interfaces should probably be marked as Public for
> now and Deprecated once we deprecate the InputFormat / OutputFormat" -
> would you like to share some background info of the deprecation of the
> InputFormat / OutputFormat? It is for me a little bit weird to mark APIs as
> public that are now known to be deprecated.

InputFormat and OutputFormat are legacy APIs for SourceFunction and
SinkFunction. So when the SourceFunction and SinkFunction are deprecated,
the InputFormat and OutputFormat should also be deprecated accordingly. As
of now, technically speaking, we have not deprecated these two APIs. So,
making them public for now is just to fix the stability annotation because
they are already used publicly by the users.

3. "Remove the PublicEvolving annotation for the following deprecated
> classes. It does not make sense for an API to be PublicEvolving and
> Deprecated at the same time" - this is very common in the Flink code base
> to have PublicEvolving and Deprecated at the same time. APIs that do not
> survive the PublicEvolving phase will be marked as deprecated in addition.
> Removing PublicEvolving in this case will break Flink API graduation rule.

Both PublicEvolving and Deprecated are status in the API lifecycle, they
are by definition mutually exclusive. When an API is marked as deprecated,
either the functionality is completely going away, or another API is
replacing the deprecated one. In either case, it does not make sense to
evolve that API any more. Even though Flink has some APIs marked with both
PublicEvolving and Deprecated at the same time, that does not make sense
and needs to be fixed. If a PublicEvolving API is deprecated, it should
only be marked as Deprecated, just like a Public API. I am not sure how
this would violate the API graduation rule, can you explain?

By the way, there is another orthogonal abuse of the Deprecated annotation
in the Flink code base. For private methods, we should not mark them as
deprecated and leave the existing code base using it, while introducing a
new method. This is a bad practice adding to technical debts. Instead, a
proper refactor should be done immediately in the same patch to just remove
that private method and migrate all the usage to the new method.

Thanks,

Jiangjie (Becket) Qin



On Fri, Sep 1, 2023 at 12:00 AM Jing Ge  wrote:

> Hi Becket,
>
> It is a very useful proposal, thanks for driving it. +1. I'd like to ask
> some questions to make sure I understand your thoughts correctly:
>
> 1. "For the batch cases, currently the BulkFormat for DataStream is
> missing" - true, and there is another option to leverage
> StreamFormatAdapter[1]
> 2. "The following two interfaces should probably be marked as Public for
> now and Deprecated once we deprecate the InputFormat / OutputFormat" -
> would you like to share some background info of the deprecation of the
> InputFormat / OutputFormat? It is for me a little bit weird to mark APIs as
> public that are now known to be deprecated.
> 3. "Remove the PublicEvolving annotation for the following deprecated
> classes. It does not make sense for an API to be PublicEvolving and
> Deprecated at the same time" - this is very common in the Flink code base
> to have PublicEvolving and Deprecated at the same time. APIs that do not
> survive the PublicEvolving phase will be marked as deprecated in addition.
> Removing PublicEvolving in this case will break Flink API graduation rule.
>
> Best regards,
> Jing
>
>
>
> [1]
>
> https://github.com/apache/flink/blob/1d1247d4ae6d4313f7d952c4b2d66351314c9432/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/StreamFormatAdapter.java#L61
>
> On Thu, Aug 31, 2023 at 4:16 PM Becket Qin  wrote:
>
> > Hi Ryan, thanks for the reply.
> >
> > Verifying the component with the schemas you have would be super helpful.
> >
> > I think enum is actually a type that is generally useful. Although it is
> > not a part of ANSI SQL, MySQL and some other databases have this type.
> > BTW, ENUM_STRING proposed in this FLIP is actually not a type by itself.
> > The ENUM_STRING is just a syntax sugar which actually creates a "new
> > AtomicDataType(new VarCharType(Integer.MAX_VALUE), ENUM_CLASS)".  So, we
> > are not really introducing a new type here. However, in order to make the
> > VARCHAR to ENUM conversion work, the ENUM class has to be considere

Re: [DISCUSS] FLIP-358: flink-avro enhancement and cleanup

2023-08-31 Thread Becket Qin
Hi Ryan, thanks for the reply.

Verifying the component with the schemas you have would be super helpful.

I think enum is actually a type that is generally useful. Although it is
not a part of ANSI SQL, MySQL and some other databases have this type.
BTW, ENUM_STRING proposed in this FLIP is actually not a type by itself.
The ENUM_STRING is just a syntax sugar which actually creates a "new
AtomicDataType(new VarCharType(Integer.MAX_VALUE), ENUM_CLASS)".  So, we
are not really introducing a new type here. However, in order to make the
VARCHAR to ENUM conversion work, the ENUM class has to be considered as a
ConversionClass of the VARCHAR type, and a StringToEnum converter is
required.
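
A hedged sketch of the sugar described above. The Color enum is made up, and
the commented-out line is the expansion quoted earlier, which only becomes
legal once the enum class is accepted as a conversion class of VARCHAR.

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;

public class EnumStringSketch {

    enum Color { RED, GREEN, BLUE }

    public static void main(String[] args) {
        // Today an enum value simply travels as a VARCHAR.
        DataType plainString = DataTypes.VARCHAR(Integer.MAX_VALUE);

        // What ENUM_STRING(Color.class) would expand to once the FLIP lands
        // (kept as a comment because the conversion class is not accepted yet):
        // DataType enumString =
        //         new AtomicDataType(new VarCharType(Integer.MAX_VALUE), Color.class);

        // The StringToEnum conversion mentioned above boils down to:
        Color c = Enum.valueOf(Color.class, "RED");

        System.out.println(plainString + " / " + c);
    }
}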

And yes, AvroSchemaUtils should be annotated as @PublicEvolving.

Thanks,

Jiangjie (Becket) Qin



On Thu, Aug 31, 2023 at 5:22 PM Ryan Skraba 
wrote:

> Hey -- I have a certain knowledge of Avro, and I'd be willing to help
> out with some of these enhancements, writing tests and reviewing.  I
> have a *lot* of Avro schemas available for validation!
>
> The FLIP looks pretty good and covers the possible cases pretty
> rigorously. I wasn't aware of some of the gaps you've pointed out
> here!
>
> How useful do you think the new ENUM_STRING DataType would be outside
> of the Avro use case?  It seems like a good enough addition that would
> solve the problem here.
>
> A small note: I assume the AvroSchemaUtils is meant to be annotated
> @PublicEvolving as well.
>
> All my best, Ryan
>
>
> On Tue, Aug 29, 2023 at 4:35 AM Becket Qin  wrote:
> >
> > Hi folks,
> >
> > I would like to start the discussion about FLIP-358[1] which proposes to
> > clean up and enhance the Avro support in Flink. More specifically, it
> > proposes to:
> >
> > 1. Make it clear what are the public APIs in flink-avro components.
> > 2. Fix a few buggy cases in flink-avro
> > 3. Add more supported Avro use cases out of the box.
> >
> > Feedbacks are welcome!
> >
> > Thanks
> >
> > Jiangjie (Becket) Qin
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-358%3A+flink-avro+enhancement+and+cleanup
>


[DISCUSS] FLIP-358: flink-avro enhancement and cleanup

2023-08-28 Thread Becket Qin
Hi folks,

I would like to start the discussion about FLIP-358[1] which proposes to
clean up and enhance the Avro support in Flink. More specifically, it
proposes to:

1. Make it clear what are the public APIs in flink-avro components.
2. Fix a few buggy cases in flink-avro
3. Add more supported Avro use cases out of the box.

Feedbacks are welcome!

Thanks

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-358%3A+flink-avro+enhancement+and+cleanup


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-24 Thread Becket Qin
Hi Jark,

How about having a separate NestedFieldReferenceExpression, and
> abstracting a common base class "ReferenceExpression" for
> NestedFieldReferenceExpression and FieldReferenceExpression? This makes
> unifying expressions in
> "SupportsProjectionPushdown#applyProjections(List
> ...)"
> possible.


I'd be fine with this. It at least provides a consistent API style /
formality.
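
For reference, a rough sketch of the hierarchy being discussed; the names and
fields are simplified assumptions, and the real FieldReferenceExpression /
NestedFieldReferenceExpression carry more (e.g. data types).

/** Common parent floated above; everything below is a simplified assumption. */
abstract class ReferenceExpressionSketch {
    private final String name;
    ReferenceExpressionSketch(String name) { this.name = name; }
    public String getName() { return name; }
}

/** Reference to a top-level column by its position in the row. */
final class FieldRefSketch extends ReferenceExpressionSketch {
    final int fieldIndex;
    FieldRefSketch(String name, int fieldIndex) { super(name); this.fieldIndex = fieldIndex; }
}

/** Reference to a nested column via an index path, one entry per nesting level. */
final class NestedFieldRefSketch extends ReferenceExpressionSketch {
    final int[] fieldIndices;
    NestedFieldRefSketch(String name, int[] fieldIndices) { super(name); this.fieldIndices = fieldIndices; }
}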

 Re: Yunhong,

3. Finally, I think we need to look at the costs and benefits of unifying
> the SupportsFilterPushDown and SupportsProjectionPushDown (or others) from
> the perspective of interface implementers. A stable API can reduce user
> development and change costs, if the current API can fully meet the
> functional requirements at the framework level, I personal suggest reducing
> the impact on connector developers.
>

I agree that the cost and benefit should be measured. And the measurement
should be in the long term instead of short term. That is why we always
need to align on the ideal end state first.
Meeting functionality requirements is the bare minimum bar for an API.
Simplicity, intuitiveness, robustness and evolvability are also important.
In addition, for projects with many APIs, such as Flink, a consistent API
style is also critical for the user adoption as well as bug avoidance. It
is very helpful for the community to agree on some API design conventions /
principles.
For example, in this particular case, via our discussion, hopefully we sort
of established the following API design conventions / principles for all
the Supports*PushDown interfaces.

1. By default, expressions should be used if applicable instead of other
representations.
2. In general, the pushdown method should not assume all the pushdowns will
succeed. So the applyX() method should return a boolean or List, to
handle the cases that some of the pushdowns cannot be fulfilled by the
implementation.

Establishing such conventions and principles demands careful thinking for
the aspects I mentioned earlier in addition to the API functionalities.
This helps lower the bar of understanding, reduces the chance of having
loose ends in the API, and will benefit all the participants in the project
over time. I think this is the right way to achieve real API stability.
Otherwise, we may end up chasing our tails to find ways not to change the
existing non-ideal APIs.

Thanks,

Jiangjie (Becket) Qin

On Fri, Aug 25, 2023 at 9:33 AM yh z  wrote:

> Hi, Venkat,
>
> Thanks for the FLIP, it sounds good to support nested fields filter
> pushdown. Based on the design of flip and the above options, I would like
> to make a few suggestions:
>
> 1.  At present, introducing NestedFieldReferenceExpression looks like a
> better solution, which can fully meet our requirements while reducing
> modifications to base class FieldReferenceExpression. In the long run, I
> tend to abstract a basic class for NestedFieldReferenceExpression and
> FieldReferenceExpression as u suggested.
>
> 2. Personally, I don't recommend introducing *supportsNestedFilters() in
> supportsFilterPushdown. We just need to better declare the return value of
> the method *applyFilters.
>
> 3. Finally, I think we need to look at the costs and benefits of unifying
> the SupportsFilterPushDown and SupportsProjectionPushDown (or others) from
> the perspective of interface implementers. A stable API can reduce user
> development and change costs, if the current API can fully meet the
> functional requirements at the framework level, I personal suggest reducing
> the impact on connector developers.
>
> Regards,
> Yunhong Zheng (Swuferhong)
>
>
> On Fri, Aug 25, 2023 at 01:25, Venkatakrishnan Sowrirajan  wrote:
>
> > To keep it backwards compatible, introduce another API *applyAggregates
> > *with
> > *List *when nested field support is added and
> > deprecate the current API. This will by default throw an exception. In
> > flink planner, *applyAggregates *with nested fields and if it throws
> > exception then *applyAggregates* without nested fields.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Thu, Aug 24, 2023 at 10:13 AM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu> wrote:
> >
> > > Jark,
> > >
> > > How about having a separate NestedFieldReferenceExpression, and
> > >> abstracting a common base class "ReferenceExpression" for
> > >> NestedFieldReferenceExpression and FieldReferenceExpression? This
> makes
> > >> unifying expressions in
> > >> "SupportsProjectionPushdown#applyProjections(List
> > >> ...)"
> > >> possible.
> > >
> > > This should be fine for *SupportsProjectionPushDown* and
> > > *SupportsFilterPushDown*. One con

Re: [DISCUSS] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-08-23 Thread Becket Qin
Hi Weihua,

Just want to clarify. "client.attached.after.submission" is going to be a
pure client side configuration.

On the cluster side, it is only "execution.shutdown-on-attached-exit"
controlling whether the cluster will shutdown or not when an attached
client is disconnected. In order to honor this configuration, the cluster
needs to know if the client submitting the job is attached or not. But the
cluster will not retrieve this information by reading the configuration of
"client.attached.after.submission". In fact this configuration should not
even be visible to the cluster. The cluster only knows if a client is
attached or not when a client submits a job.

Thanks,

Jiangjie (Becket) Qin



On Wed, Aug 23, 2023 at 2:35 PM Weihua Hu  wrote:

> Hi, Jiangjie
>
> Thanks for the clarification.
>
> My key point is the meaning of the "submission" in
> "client.attached.after.submission".
> At first glance, I thought only job submissions were taken into account.
> After your clarification, this option also works for cluster submissions.
>
> It's fine for me.
>
> Best,
> Weihua
>
>
> On Wed, Aug 23, 2023 at 8:35 AM Becket Qin  wrote:
>
> > Hi Weihua,
> >
> > Thanks for the explanation. From the doc, it looks like the current
> > behaviors of "execution.attached=true" between Yarn and K8S session
> > cluster are exactly the opposite. For YARN it basically means the cluster
> > will shutdown if the client disconnects. For K8S, it means the cluster
> will
> > not shutdown until a client explicitly stops it. This sounds like a bad
> > situation to me and needs to be fixed.
> >
> > My guess is that the YARN behavior here is the original intended
> behavior,
> > while K8S reused the configuration for a different purpose. If we
> deprecate
> > the execution.attached config here. The behavior would be:
> >
> > For YARN session clusters:
> > 1. Current "execution.attached=true" would be equivalent to
> > "execution.shutdown-on-attached-exit=true" +
> > "client.attached.after.submission=true".
> > 2. Current "execution.attached=false" would be equivalent to
> > "execution.shutdown-on-attached-exit=false", i.e. the cluster will keep
> > running until explicitly stopped.
> >
> > I am not sure what the current behavior of "execution.attached=true" +
> > "execution.shutdown-on-attached-exit=false" is. Supposedly, it should be
> > equivalent to "execution.shutdown-on-attached-exit=false", which means
> > "execution.attached" only controls the client side behavior, while the
> > cluster side behavior is controlled by
> > "execution.shutdown-on-attached-exit".
> >
> > For K8S session clusters:
> > 1. Current "execution.attached=true" would be equivalent to
> > "execution.shutdown-on-attached-exit=false".
> > 2. Current "execution.attached=false" would be equivalent to
> > "execution.shutdown-on-attached-exit=true" +
> > "client.attached.after.submission=true".
> >
> > This will make the same config behave the same for YARN and K8S.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Aug 22, 2023 at 11:04 PM Weihua Hu 
> wrote:
> >
> > > Hi, Jiangjie
> > >
> > > 'execution.attached' can be used to attach an existing cluster and stop
> > it
> > > [1][2],
> > > which is not related to job submission. So does YARN session mode[3].
> > > IMO, this behavior should not be controlled by the new option
> > > 'client.attached.after.submission'.
> > >
> > > [1]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#session-mode
> > > [2]
> > >
> > >
> >
> https://github.com/apache/flink/blob/a85ffc491874ecf3410f747df3ed09f61df52ac6/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesSessionCli.java#L126
> > > [3]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#session-mode
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Tue, Aug 22, 2023 at 5:16 PM Becket Qin 
> wrote:
> > >
> > > > Hi Weihua,
> > > >
> > > > Just want to clarify a little bit, what is the impact of
> > > > `execution.attached` on a cluster startup before a client submits a
> job
> &

Re: [DISCUSS] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-08-22 Thread Becket Qin
Hi Weihua,

Thanks for the explanation. From the doc, it looks like the current
behaviors of "execution.attached=true" between Yarn and K8S session
cluster are exactly the opposite. For YARN it basically means the cluster
will shutdown if the client disconnects. For K8S, it means the cluster will
not shutdown until a client explicitly stops it. This sounds like a bad
situation to me and needs to be fixed.

My guess is that the YARN behavior here is the original intended behavior,
while K8S reused the configuration for a different purpose. If we deprecate
the execution.attached config here, the behavior would be:

For YARN session clusters:
1. Current "execution.attached=true" would be equivalent to
"execution.shutdown-on-attached-exit=true" +
"client.attached.after.submission=true".
2. Current "execution.attached=false" would be equivalent to
"execution.shutdown-on-attached-exit=false", i.e. the cluster will keep
running until explicitly stopped.

I am not sure what the current behavior of "execution.attached=true" +
"execution.shutdown-on-attached-exit=false" is. Supposedly, it should be
equivalent to "execution.shutdown-on-attached-exit=false", which means
"execution.attached" only controls the client side behavior, while the
cluster side behavior is controlled by
"execution.shutdown-on-attached-exit".

For K8S session clusters:
1. Current "execution.attached=true" would be equivalent to
"execution.shutdown-on-attached-exit=false".
2. Current "execution.attached=false" would be equivalent to
"execution.shutdown-on-attached-exit=true" +
"client.attached.after.submission=true".

This will make the same config behave the same for YARN and K8S.
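
To make the mapping concrete, a minimal sketch for a YARN session cluster
(client.attached.after.submission is only the option proposed in this FLIP and
does not exist yet; execution.shutdown-on-attached-exit is an existing option):

import org.apache.flink.configuration.Configuration;

// Sketch only: what today's "execution.attached" values would translate to on a
// YARN session cluster if the option were deprecated as proposed above.
public class AttachedModeMapping {
    public static void main(String[] args) {
        // Today: execution.attached=true
        Configuration attached = new Configuration();
        attached.setString("execution.shutdown-on-attached-exit", "true");
        attached.setString("client.attached.after.submission", "true"); // proposed key

        // Today: execution.attached=false -> the cluster keeps running until stopped.
        Configuration detached = new Configuration();
        detached.setString("execution.shutdown-on-attached-exit", "false");
    }
}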

Thanks,

Jiangjie (Becket) Qin

On Tue, Aug 22, 2023 at 11:04 PM Weihua Hu  wrote:

> Hi, Jiangjie
>
> 'execution.attached' can be used to attach an existing cluster and stop it
> [1][2],
> which is not related to job submission. So does YARN session mode[3].
> IMO, this behavior should not be controlled by the new option
> 'client.attached.after.submission'.
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#session-mode
> [2]
>
> https://github.com/apache/flink/blob/a85ffc491874ecf3410f747df3ed09f61df52ac6/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesSessionCli.java#L126
> [3]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#session-mode
>
> Best,
> Weihua
>
>
> On Tue, Aug 22, 2023 at 5:16 PM Becket Qin  wrote:
>
> > Hi Weihua,
> >
> > Just want to clarify a little bit, what is the impact of
> > `execution.attached` on a cluster startup before a client submits a job
> to
> > that cluster? Does this config only become effective after a job
> > submission?
> >
> > Currently, the cluster behavior has an independent config of
> > 'execution.shutdown-on-attached-exit'. So if a client submitted a job in
> > attached mode, and this `execution.shutdown-on-attached-exit` is set to
> > true, the cluster will shutdown if the client detaches from the cluster.
> Is
> > this sufficient? Or do you mean we need another independent
> configuration?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Aug 22, 2023 at 2:20 PM Weihua Hu 
> wrote:
> >
> > > Hi Jiangjie
> > >
> > > Sorry for the late reply, I fully agree with the three user sensible
> > > behaviors you described.
> > >
> > > I would like to bring up a point.
> > >
> > > Currently, 'execution.attached' is not only used for submitting jobs,
> > > But also for starting a new cluster (YARN and Kubernetes). If it's
> true,
> > > the starting cluster script will
> > > wait for the user to input the next command (quit or stop).
> > >
> > > In my opinion, this behavior should have an independent option besides
> > > "client.attached.after.submission" for control.
> > >
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Thu, Aug 17, 2023 at 10:07 AM liu ron  wrote:
> > >
> > > > Hi, Jiangjie
> > > >
> > > > Thanks for your detailed explanation, I got your point. If the
> > > > execution.attached is only used for client currently, removing it
> also
> > > make
> > > > sense to me.
> > > >
> > > > Best,
> > > > Ron
> > > >
> > > > On Thu, Aug 17, 2023 at 7:37 AM Becket Qin  wrote:
>

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-22 Thread Becket Qin
Hi Jark,

Regarding the migration path, it would be useful to scrutinize the use cases
of FieldReferenceExpression and ResolvedExpression. There are two kinds of
use cases:

1. A ResolvedExpression is constructed by the user or connector / plugin
developers.
2. A ResolvedExpression is constructed by the framework and passed to user
or connector / plugin developers.

For the first case, both of the approaches provide the same migration
experience.

For the second case, generally speaking, introducing
NestedFieldReferenceExpression and extending FieldReferenceExpression would
have the same impact on backward compatibility. SupportsFilterPushDown is
a special case here because understanding the filter expressions is
optional for the source implementation. In other use cases, if
understanding the reference to a nested field is a must-have, the user code
has to be changed, regardless of which approach we take to support nested
fields.

Therefore, I think we have to check each public API where the nested field
reference is exposed. If we have many public APIs where understanding
nested fields is optional for the user / plugin / connector developers,
having a separate NestedFieldReferenceExpression would allow a smoother
migration. Otherwise, there seems to be no difference between the two
approaches.

Migration path aside, the main reason I prefer extending
FieldReferenceExpression over a new NestedFieldReferenceExpression is
that it makes the SupportsProjectionPushDown interface simpler.
Otherwise, we have to treat it as a special case that does not match the
overall API style. Or we have to introduce two different applyProjections()
methods for FieldReferenceExpression / NestedFieldReferenceExpression
respectively. This issue further extends to implementation in addition to
public API. A single FieldReferenceExpression might help simplify the
implementation code a little bit. For example, in a recursive processing of
a row with nested rows, we may not need to switch between
FieldReferenceExpression and NestedFieldReferenceExpression depending on
whether the record being processed is a top level record or nested record.
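
For concreteness, a rough sketch of what the extended expression could look
like under this option (the int[] index path, getFieldIndexArray() and the
deprecated getFieldIndex() are names from this discussion, not the actual
Flink API):

// Sketch only: a simplified stand-in for an extended FieldReferenceExpression.
// The real class lives in org.apache.flink.table.expressions; the nested index
// path below is hypothetical.
final class NestedAwareFieldReference {
    private final String name;        // e.g. "user" or "user.address.zip"
    private final int[] fieldIndexes; // index path from the top-level row, e.g. {2, 0, 1}

    NestedAwareFieldReference(String name, int[] fieldIndexes) {
        this.name = name;
        this.fieldIndexes = fieldIndexes;
    }

    String getName() {
        return name;
    }

    /** Index path into the (possibly nested) row type; length 1 means a top-level field. */
    int[] getFieldIndexArray() {
        return fieldIndexes;
    }

    /** Kept for compatibility in the proposal; only meaningful for top-level references. */
    @Deprecated
    int getFieldIndex() {
        return fieldIndexes[0];
    }
}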

Thanks,

Jiangjie (Becket) Qin


On Tue, Aug 22, 2023 at 11:43 PM Jark Wu  wrote:

> Hi Becket,
>
> I totally agree we should try to have a consistent API for a final state.
> The only concern I have mentioned is the "smooth" migration path.
> The FieldReferenceExpression is widely used in many public APIs,
> not only in the SupportsFilterPushDown. Yes, we can change every
> methods in 2-steps, but is it good to change API back and forth for this?
> Personally, I'm fine with a separate NestedFieldReferenceExpression class.
> TBH, I prefer the separated way because it makes the reference expression
> more clear and concise.
>
> Best,
> Jark
>
>
> On Tue, 22 Aug 2023 at 16:53, Becket Qin  wrote:
>
> > Thanks for the reply, Jark.
> >
> > I think it will be helpful to understand the final state we want to
> > eventually achieve first, then we can discuss the steps towards that
> final
> > state.
> >
> > It looks like there are two proposed end states now:
> >
> > 1. Have a separate NestedFieldReferenceExpression class; keep
> > SupportsFilterPushDown and SupportsProjectionPushDown the same. It is
> just
> > a one step change.
> >- Regarding the supportsNestedFilterPushDown() method, if our contract
> > with the connector developer today is "The implementation should ignore
> > unrecognized expressions by putting them into the remaining filters,
> > instead of throwing exceptions". Then there is no need for this method. I
> > am not sure about the current contract. We should probably make it clear
> in
> > the interface Java doc.
> >
> > 2. Extend the existing FiledReferenceExpression class to support nested
> > fields; SupportsFilterPushDown only has one method of
> > applyFilters(List); SupportsProjectionPushDown only
> has
> > one method of applyProjections(List, DataType).
> > It could just be two steps if we are not too obsessed with the exact
> names
> > of "applyFilters" and "applyProjections". More specifically, it takes two
> > steps to achieve this final state:
> > a. introduce a new method tryApplyFilters(List)
> to
> > SupportsFilterPushDown, which may have FiledReferenceExpression with
> nested
> > fields. The default implementation throws an exception. The runtime will
> > first call tryApplyFilters() with nested fields. In case of exception, it
> > calls the existing applyFilters() without including the nested filters.
> > Similarly, in SupportsProjectionPushDown, introduce a
> > tryApplyProjections method returning a Result.
> > The Result also contains the ac

Re: [DISCUSS] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-08-22 Thread Becket Qin
Hi Weihua,

Just want to clarify a little bit, what is the impact of
`execution.attached` on a cluster startup before a client submits a job to
that cluster? Does this config only become effective after a job submission?

Currently, the cluster behavior has an independent config of
'execution.shutdown-on-attached-exit'. So if a client submitted a job in
attached mode, and this `execution.shutdown-on-attached-exit` is set to
true, the cluster will shutdown if the client detaches from the cluster. Is
this sufficient? Or do you mean we need another independent configuration?

Thanks,

Jiangjie (Becket) Qin

On Tue, Aug 22, 2023 at 2:20 PM Weihua Hu  wrote:

> Hi Jiangjie
>
> Sorry for the late reply, I fully agree with the three user sensible
> behaviors you described.
>
> I would like to bring up a point.
>
> Currently, 'execution.attached' is not only used for submitting jobs,
> But also for starting a new cluster (YARN and Kubernetes). If it's true,
> the starting cluster script will
> wait for the user to input the next command (quit or stop).
>
> In my opinion, this behavior should have an independent option besides
> "client.attached.after.submission" for control.
>
>
> Best,
> Weihua
>
>
> On Thu, Aug 17, 2023 at 10:07 AM liu ron  wrote:
>
> > Hi, Jiangjie
> >
> > Thanks for your detailed explanation, I got your point. If the
> > execution.attached is only used for client currently, removing it also
> make
> > sense to me.
> >
> > Best,
> > Ron
> >
> > On Thu, Aug 17, 2023 at 7:37 AM Becket Qin  wrote:
> >
> > > Hi Ron,
> > >
> > > Isn't the cluster (session or per job) only using the
> execution.attached
> > to
> > > determine whether the client is attached? If so, the client can always
> > > include the information of whether it's an attached client or not in
> the
> > > JobSubmissoinRequestBody, right? For a shared session cluster, there
> > could
> > > be multiple clients submitting jobs to it. These clients may or may not
> > be
> > > attached. A static execution.attached configuration for the session
> > cluster
> > > does not work in this case, right?
> > >
> > > The current problem of execution.attached is that it is not always
> > honored.
> > > For example, if a session cluster was started with execution.attached
> set
> > > to false. And a client submits a job later to that session cluster with
> > > execution.attached set to true. In this case, the cluster won't (and
> > > shouldn't) shutdown after the job finishes or the attached client loses
> > > connection. So, in fact, the execution.attached configuration is only
> > > honored by the client, but not the cluster. Therefore, I think removing
> > it
> > > makes sense.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Thu, Aug 17, 2023 at 12:31 AM liu ron  wrote:
> > >
> > > > Hi, Jiangjie
> > > >
> > > > Sorry for late reply. Thank you for such a detailed response. As you
> > say,
> > > > there are three behaviours here for users and I agree with you. The
> > goal
> > > of
> > > > this FLIP is to clarify the behaviour of the client side, which I
> also
> > > > agree with. However, as weihua said, the config execution.attached is
> > not
> > > > only for per-job mode, but also for session mode, but the FLIP says
> > that
> > > > this is only for per-job mode, and this config will be removed in the
> > > > future because the per-job mode has been deprecated. I don't think
> this
> > > is
> > > > correct and we should change the description in the corresponding
> > section
> > > > of the FLIP. Since execution.attached is used in session mode, there
> > is a
> > > > compatibility issue here if we change it directly to
> > > > client.attached.after.submission, and I think we should make this
> clear
> > > in
> > > > the FLIP.
> > > >
> > > > Best,
> > > > Ron
> > > >
> > > > On Mon, Aug 14, 2023 at 8:33 PM Becket Qin  wrote:
> > > >
> > > > > Hi Ron and Weihua,
> > > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > > There seem three user sensible behaviors that we are talking about:
> > > > >
> > > > > 1. The behavior on the client side, i.e. whether blocking until the
> > job
> > > > > finishes or not

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-22 Thread Becket Qin
Thanks for the reply, Jark.

I think it will be helpful to understand the final state we want to
eventually achieve first, then we can discuss the steps towards that final
state.

It looks like there are two proposed end states now:

1. Have a separate NestedFieldReferenceExpression class; keep
SupportsFilterPushDown and SupportsProjectionPushDown the same. It is just
a one-step change.
   - Regarding the supportsNestedFilterPushDown() method: if our contract
with the connector developer today is "the implementation should ignore
unrecognized expressions by putting them into the remaining filters,
instead of throwing exceptions", then there is no need for this method. I
am not sure about the current contract. We should probably make it clear in
the interface Javadoc.

2. Extend the existing FieldReferenceExpression class to support nested
fields; SupportsFilterPushDown only has one method,
applyFilters(List<ResolvedExpression>); SupportsProjectionPushDown only has
one method, applyProjections(List<FieldReferenceExpression>, DataType).
It could just be two steps if we are not too obsessed with the exact names
of "applyFilters" and "applyProjections". More specifically, it takes two
steps to achieve this final state:
a. Introduce a new method tryApplyFilters(List<ResolvedExpression>) to
SupportsFilterPushDown, whose argument may contain FieldReferenceExpressions
with nested fields. The default implementation throws an exception. The
runtime will first call tryApplyFilters() with nested fields. In case of an
exception, it calls the existing applyFilters() without including the nested
filters. Similarly, in SupportsProjectionPushDown, introduce a
tryApplyProjections method returning a Result.
The Result also contains the accepted and inapplicable projections. The
default implementation also throws an exception. Deprecate all the other
methods except tryApplyFilters() and tryApplyProjections().
b. remove the deprecated methods in the next major version bump.

Now the question is: putting the migration steps aside, which end state do
we prefer? While the first end state is acceptable for me, personally, I
prefer the latter if we are designing from scratch. It is clean, consistent
and intuitive. Given the size of Flink, keeping APIs in the same style over
time is important. The migration is also not that complicated.
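
As a rough illustration of step (a) above (tryApplyFilters() and its default
behavior are only the proposal in this thread, not an existing Flink method;
the Result placeholder stands in for the existing result type):

import java.util.List;
import org.apache.flink.table.expressions.ResolvedExpression;

// Sketch only: an illustrative stand-in for the evolved SupportsFilterPushDown.
interface EvolvedSupportsFilterPushDown {

    // Existing contract (simplified): filters without nested field references.
    Result applyFilters(List<ResolvedExpression> filters);

    // Proposed: the list may contain nested field references. The default throws,
    // so the planner can fall back to applyFilters() without the nested filters.
    default Result tryApplyFilters(List<ResolvedExpression> filters) {
        throw new UnsupportedOperationException("Nested filter push-down not supported");
    }

    interface Result {
        // accepted + remaining filters, as in the current API
    }
}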

Thanks,

Jiangjie (Becket) Qin


On Tue, Aug 22, 2023 at 2:23 PM Jark Wu  wrote:

> Hi Venkat,
>
> Thanks for the proposal.
>
> I have some minor comments about the FLIP.
>
> 1. I think we don't need to
> add SupportsFilterPushDown#supportsNestedFilters() method,
> because connectors can skip nested filters by putting them in
> Result#remainingFilters().
> And this is backward-compatible because unknown expressions were added to
> the remaining filters.
> Planner should push predicate expressions as more as possible. If we add a
> flag for each new filter,
> the interface will be filled with lots of flags (e.g., supportsBetween,
> supportsIN).
>
> 2. NestedFieldReferenceExpression#nestedFieldName should be an array of
> field names?
> Each string represents a field name part of the field path. Just keep
> aligning with `nestedFieldIndexArray`.
>
> 3. My concern about making FieldReferenceExpression support nested fields
> is the compatibility.
> It is a public API and users/connectors are already using it. People
> assumed it is a top-level column
> reference, and applied logic on it. But that's not true now and this may
> lead to unexpected errors.
> Having a separate NestedFieldReferenceExpression sounds safer to me. Mixing
> them in a class may
>  confuse users what's the meaning of getFieldName() and getFieldIndex().
>
>
> Regarding using NestedFieldReferenceExpression in
> SupportsProjectionPushDown, do you
> have any concerns @Timo Walther  ?
>
> Best,
> Jark
>
>
>
> On Tue, 22 Aug 2023 at 05:55, Venkatakrishnan Sowrirajan  >
> wrote:
>
> > Sounds like a great suggestion, Becket. +1. Agree with cleaning up the
> APIs
> > and making it consistent in all the pushdown APIs.
> >
> > Your suggested approach seems fine to me, unless anyone else has any
> other
> > concerns. Just have couple of clarifying questions:
> >
> > 1. Do you think we should standardize the APIs across all the pushdown
> > supports like SupportsPartitionPushdown, SupportsDynamicFiltering etc in
> > the end state?
> >
> > The current proposal works if we do not want to migrate
> > > SupportsFilterPushdown to also use NestedFieldReferenceExpression in
> the
> > > long term.
> > >
> > Did you mean *FieldReferenceExpression* instead of
> > *NestedFieldReferenceExpression*?
> >
> > 2. Extend the FieldReferenceExpression to support nested fields.
> > > - Change the index field type from int to in

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-18 Thread Becket Qin
Thanks for the proposal, Venkata.

The current proposal works if we do not want to migrate
SupportsFilterPushDown to also use NestedFieldReferenceExpression in the
long term.

Otherwise, the alternative solution briefly mentioned in the rejected
alternatives would be the following:
Phase 1:
1. Introduce a supportsNestedFilters() method to the SupportsFilterPushDown
interface (same as the current proposal).
2. Extend the FieldReferenceExpression to support nested fields.
- Change the index field type from int to int[].
- Add a new method int[] getFieldIndexArray().
- Deprecate the int getFieldIndex() method, the code will be removed in
the next major version bump.
3. In the SupportsProjectionPushDown interface
- add a new method applyProjection(List<FieldReferenceExpression>,
DataType), with default implementation invoking applyProjection(int[][],
DataType)
- deprecate the current applyProjection(int[][], DataType) method

Phase 2 (in the next major version bump)
1. remove the deprecated methods.

Phase 3 (optional)
1. deprecate and remove the supportsNestedFilters() /
supportsNestedProjection() methods from the SupportsFilterPushDown /
SupportsProjectionPushDown interfaces.
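
As a rough sketch of Phase 1, item 3 (the new overload, the FieldRef stand-in
and its getFieldIndexArray() are assumptions taken from this proposal, not
existing API):

import java.util.List;
import org.apache.flink.table.types.DataType;

// Sketch only: how the proposed default method could bridge to the existing
// int[][]-based method.
interface EvolvedSupportsProjectionPushDown {

    // Existing method (to be deprecated under this proposal).
    void applyProjection(int[][] projectedFields, DataType producedDataType);

    // Proposed overload with a default implementation delegating to the old one.
    default void applyProjection(List<FieldRef> projectedFields, DataType producedDataType) {
        int[][] indexPaths = projectedFields.stream()
                .map(FieldRef::getFieldIndexArray)
                .toArray(int[][]::new);
        applyProjection(indexPaths, producedDataType);
    }

    // Stand-in for the extended FieldReferenceExpression.
    interface FieldRef {
        int[] getFieldIndexArray();
    }
}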

Personally I prefer this alternative. It takes longer to finish the work,
but the API eventually becomes clean and consistent. But I can live with
the current proposal.

Thanks,

Jiangjie (Becket) Qin

On Sat, Aug 19, 2023 at 12:09 AM Venkatakrishnan Sowrirajan <
vsowr...@asu.edu> wrote:

> Gentle ping for reviews/feedback.
>
> On Tue, Aug 15, 2023, 5:37 PM Venkatakrishnan Sowrirajan  >
> wrote:
>
> > Hi All,
> >
> > I am opening this thread to discuss FLIP-356: Support Nested Fields
> > Filter Pushdown. The FLIP can be found at
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> >
> > This FLIP adds support for pushing down nested fields filters to the
> > underlying TableSource. In our data lake, we find a lot of datasets have
> > nested fields and also user queries with filters defined on the nested
> > fields. This would drastically improve the performance for those sets of
> > queries.
> >
> > Appreciate any comments or feedback you may have on this proposal.
> >
> > Regards
> > Venkata krishnan
> >
>


Re: [DISCUSS] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-08-16 Thread Becket Qin
Hi Ron,

Isn't the cluster (session or per job) only using the execution.attached to
determine whether the client is attached? If so, the client can always
include the information of whether it's an attached client or not in the
JobSubmissoinRequestBody, right? For a shared session cluster, there could
be multiple clients submitting jobs to it. These clients may or may not be
attached. A static execution.attached configuration for the session cluster
does not work in this case, right?

The current problem of execution.attached is that it is not always honored.
For example, if a session cluster was started with execution.attached set
to false. And a client submits a job later to that session cluster with
execution.attached set to true. In this case, the cluster won't (and
shouldn't) shutdown after the job finishes or the attached client loses
connection. So, in fact, the execution.attached configuration is only
honored by the client, but not the cluster. Therefore, I think removing it
makes sense.

Thanks,

Jiangjie (Becket) Qin

On Thu, Aug 17, 2023 at 12:31 AM liu ron  wrote:

> Hi, Jiangjie
>
> Sorry for late reply. Thank you for such a detailed response. As you say,
> there are three behaviours here for users and I agree with you. The goal of
> this FLIP is to clarify the behaviour of the client side, which I also
> agree with. However, as weihua said, the config execution.attached is not
> only for per-job mode, but also for session mode, but the FLIP says that
> this is only for per-job mode, and this config will be removed in the
> future because the per-job mode has been deprecated. I don't think this is
> correct and we should change the description in the corresponding section
> of the FLIP. Since execution.attached is used in session mode, there is a
> compatibility issue here if we change it directly to
> client.attached.after.submission, and I think we should make this clear in
> the FLIP.
>
> Best,
> Ron
>
> On Mon, Aug 14, 2023 at 8:33 PM Becket Qin  wrote:
>
> > Hi Ron and Weihua,
> >
> > Thanks for the feedback.
> >
> > There seem three user sensible behaviors that we are talking about:
> >
> > 1. The behavior on the client side, i.e. whether blocking until the job
> > finishes or not.
> >
> > 2. The behavior of the submitted job, whether stop the job execution if
> the
> > client is detached from the Flink cluster, i.e. whether bind the
> lifecycle
> > of the job with the connection status of the attached client. For
> example,
> > one might want to keep a batch job running until finish even after the
> > client connection is lost. But it makes sense to stop the job upon client
> > connection lost if the job invokes collect() on a streaming job.
> >
> > 3. The behavior of the Flink cluster (JM and TMs), whether shutdown the
> > Flink cluster if the client is detached from the Flink cluster, i.e.
> > whether bind the cluster lifecycle with the job lifecycle. For dedicated
> > clusters (application cluster or dedicated session clusters), the
> lifecycle
> > of the cluster should be bound with the job lifecycle. But for shared
> > session clusters, the lifecycle of the Flink cluster should be
> independent
> > of the jobs running in it.
> >
> > As we can see, these three behaviors are sort of independent, the current
> > configurations fail to support all the combination of wanted behaviors.
> > Ideally there should be three separate configurations, for example:
> > - client.attached.after.submission and client.heartbeat.timeout control
> the
> > behavior on the client side.
> > - jobmanager.cancel-on-attached-client-exit controls the behavior of the
> > job when an attached client lost connection. The client heartbeat timeout
> > and attach-ness will be also passed to the JM upon job submission.
> > - cluster.shutdown-on-first-job-finishes *(*or
> > jobmanager.shutdown-cluster-after-job-finishes) controls the cluster
> > behavior after the job finishes normally / abnormally. This is a cluster
> > level setting instead of a job level setting. Therefore it can only be
> set
> > when launching the cluster.
> >
> > The current code sort of combines config 2 and 3 into
> > execution.shutdown-on-attach-exit.
> > This assumes the the life cycle of the cluster is the same as the job
> when
> > the client is attached. This FLIP does not intend to change that. but
> using
> > the execution.attached config for the client behavior control looks
> > misleading. So this FLIP proposes to replace it with a more intuitive
> > config of client.attached.after.submission. This makes it clear that it
> is
> > a configuration controlling the client s

Re: [DISCUSS] FLIP-330: Support specifying record timestamp requirement

2023-08-14 Thread Becket Qin
Hi Jark,

Thanks for the comments. I agree that at this point SQL is the only API
that we can apply this optimization transparently to the users.

For the other APIs (DataStream or the new Process Function targeted in
2.0), the hope is that in the future they will evolve so that the framework
can derive the necessity of the optimization. BTW, I think conceptually
watermarks should only be generated in the source, and the rest of the
operators should just merge and pass on the watermark merging result.
Therefore, if the sources in a job do not generate watermarks, it seems we
can assume the job won't have watermarks at all, and therefore, the
timestamp field is not needed.
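
As a small illustration of that assumption, a minimal job whose only path
declares no watermarks at all (this uses the existing DataStream API and is
only meant to show the kind of job the optimization would target):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch only: no timestamps are assigned and no watermarks are generated, so
// under the reasoning above the job would never carry watermarks downstream.
public class NoWatermarkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("CN", "US", "UK")
                .assignTimestampsAndWatermarks(WatermarkStrategy.noWatermarks())
                .print();
        env.execute("no-watermark example");
    }
}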

I also agree with Xintong and Gyula that we need to prove the benefit. We
will probably see that the perf improvement is more obvious in some cases
and negligible in others. The question is whether the use
cases that the optimization helps are important enough. And this judgement
is somewhat subjective.

Thanks,

Jiangjie (Becket) Qin

On Mon, Aug 14, 2023 at 9:13 PM Jark Wu  wrote:

> Hi Becket,
>
> > I kind of think that we can
> restrain the scope to just batch mode, and only for StreamRecord class.
> That means only in batch mode, the timestamp in the StreamRecord will be
> dropped when the config is enabled.
>
> However, IIUC, dropping timestamp in StreamRecord has been supported.
> This is an existing optimization in StreamElementSerializer that the 8bytes
> of
> the timestamp is not serialized if there is no timestamp on the
> StreamRecord.
>
> -
>
> Reducing 1-byte of StreamElement tag is a good idea to improve performance.
> But I agree with Xintong and Gyula that we should have a balance between
> complexity and performance. I'm fine to introduce this optimization if only
> for
> pure batch SQL. Because this is the only way (not even batch DataStream
> and batch Table API) to enable it by default. But I have concerns about
> other options.
>
> The largest concern from my side is it exposing a configuration to users
> which
> is hard to understand and afraid to enable and not worth enabling it. If
> users
> rarely enable this configuration, this would be an overhead to maintain for
> the community without benefits.
>
> Besides, I suspect whether we can remove "pipeline.force-timestamp-support"
> in the future. From my understanding, it is pretty hard for the framework
> to detect
> whether the job does not have a watermark strategy. Because the watermark
> may be assigned in any operators by using Output#emitWatermark.
>
> Best,
> Jark
>
>
> On Sat, 12 Aug 2023 at 13:23, Gyula Fóra  wrote:
>
> > Hey Devs,
> >
> > It would be great to see some other benchmarks, not only the dummy
> > WordCount example.
> >
> > I would love to see a few SQL queries documented and whether there is any
> > measurable benefit at all.
> >
> > Prod pipelines usually have some IO component etc which will add enough
> > overhead to make this even less noticeable. I agree that even small
> > improvements are worthwhile but they should be observable/significant on
> > real workloads. Otherwise complicating the runtime layer, types and
> configs
> > are not worth it in my opinion.
> >
> > Cheers
> > Gyula
> >
> > On Sat, 12 Aug 2023 at 04:39, Becket Qin  wrote:
> >
> > > Thanks for the FLIP, Yunfeng.
> > >
> > > I had a brief offline discussion with Dong, and here are my two cents:
> > >
> > > ## The benefit
> > > The FLIP is related to one of the perf benchmarks we saw at LinkedIn
> > which
> > > is pretty much doing a word count, except that the words are country
> > code,
> > > so it is typically just two bytes, e.g. CN, US, UK. What I see is that
> > the
> > > amount of data going through shuffle is much higher in Flink
> DataStream
> > > batch mode compared with the Flink DataSet API. And in this case,
> because
> > > the actual key is just 2 bytes so the overhead is kind of high. In
> batch
> > > processing, it is not rare that people first tokenize the data before
> > > processing to save cost. For example, imagine in word count the words
> are
> > > coded as 4-byte Integers instead of String. So the 1 byte overhead can
> > > still introduce 25 percent of the overhead. Therefore, I think the
> > > optimization in the FLIP can still benefit a bunch of batch processing
> > > cases. For streaming, the benefit still applies, although less compared
> > > with batch.
> > >
> > > ## The complexity and long term solution
> > >

Re: [DISCUSS] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-08-14 Thread Becket Qin
Hi Ron and Weihua,

Thanks for the feedback.

There seem to be three user-visible behaviors that we are talking about:

1. The behavior on the client side, i.e. whether the client blocks until the
job finishes or not.

2. The behavior of the submitted job, i.e. whether to stop the job execution
if the client is detached from the Flink cluster, in other words whether to
bind the lifecycle of the job to the connection status of the attached
client. For example, one might want to keep a batch job running until it
finishes even after the client connection is lost. But it makes sense to
stop the job upon losing the client connection if the job invokes collect()
on a streaming job.

3. The behavior of the Flink cluster (JM and TMs), i.e. whether to shut down
the Flink cluster if the client is detached from it, in other words whether
to bind the cluster lifecycle to the job lifecycle. For dedicated clusters
(application clusters or dedicated session clusters), the lifecycle of the
cluster should be bound to the job lifecycle. But for shared session
clusters, the lifecycle of the Flink cluster should be independent of the
jobs running in it.

As we can see, these three behaviors are sort of independent, but the current
configurations fail to support all combinations of the wanted behaviors.
Ideally, there should be three separate configurations, for example:
- client.attached.after.submission and client.heartbeat.timeout control the
behavior on the client side.
- jobmanager.cancel-on-attached-client-exit controls the behavior of the
job when an attached client loses the connection. The client heartbeat
timeout and the attached flag will also be passed to the JM upon job
submission.
- cluster.shutdown-on-first-job-finishes (or
jobmanager.shutdown-cluster-after-job-finishes) controls the cluster
behavior after the job finishes normally or abnormally. This is a
cluster-level setting instead of a job-level setting. Therefore, it can only
be set when launching the cluster.
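
For illustration, the three example options above might be declared roughly
like this (a sketch; only execution.shutdown-on-attached-exit exists today,
and the key names and defaults here are just the examples from this email):

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Sketch only: illustrative declarations of the three example options above.
public class ProposedAttachedOptions {

    public static final ConfigOption<Boolean> CLIENT_ATTACHED_AFTER_SUBMISSION =
            ConfigOptions.key("client.attached.after.submission")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Whether the client blocks until the submitted job finishes.");

    public static final ConfigOption<Boolean> CANCEL_JOB_ON_ATTACHED_CLIENT_EXIT =
            ConfigOptions.key("jobmanager.cancel-on-attached-client-exit")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Whether to cancel the job when its attached client loses the connection.");

    public static final ConfigOption<Boolean> SHUTDOWN_CLUSTER_AFTER_JOB_FINISHES =
            ConfigOptions.key("cluster.shutdown-on-first-job-finishes")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Cluster-level: shut down the cluster once the first job finishes.");
}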

The current code sort of combines configs 2 and 3 into
execution.shutdown-on-attached-exit.
This assumes that the life cycle of the cluster is the same as the job's when
the client is attached. This FLIP does not intend to change that, but using
the execution.attached config for the client behavior control looks
misleading. So this FLIP proposes to replace it with a more intuitive
config, client.attached.after.submission. This makes it clear that it is
a configuration controlling the client-side behavior, instead of the
execution of the job.

Thanks,

Jiangjie (Becket) Qin





On Thu, Aug 10, 2023 at 10:34 PM Weihua Hu  wrote:

> Hi Allison
>
> Thanks for driving this FLIP. It's a valuable feature for batch jobs.
> This helps keep "Drop Per-Job Mode [1]" going.
>
> +1 for this proposal.
>
> However, it seems that the change in this FLIP is not detailed enough.
> I have a few questions.
>
> 1. The config 'execution.attached' is not only used in per-job mode,
> but also in session mode to shutdown the cluster. IMHO, it's better to
> keep this option name.
>
> 2. This FLIP only mentions YARN mode. I believe this feature should
> work in both YARN and Kubernetes mode.
>
> 3. Within the attach mode, we support two features:
> execution.shutdown-on-attached-exit
> and client.heartbeat.timeout. These should also be taken into account.
>
> 4. The Application Mode will shut down once the job has been completed.
> So, if we use the flink client to poll job status via REST API for attach
> mode,
> there is a chance that the client will not be able to retrieve the job
> finish status.
> Perhaps FLINK-24113[3] will help with this.
>
>
> [1]https://issues.apache.org/jira/browse/FLINK-26000
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#session-mode
> [3]https://issues.apache.org/jira/browse/FLINK-24113
>
> Best,
> Weihua
>
>
> On Thu, Aug 10, 2023 at 10:47 AM liu ron  wrote:
>
> > Hi, Allison
> >
> > Thanks for driving this proposal, it looks cool for batch jobs under
> > application mode. But after reading your FLIP document and [1], I have a
> > question. Why do you want to rename the execution.attached configuration
> to
> > client.attached.after.submission and at the same time deprecate
> > execution.attached? Based on your design, I understand the role of these
> > two options are the same. Introducing a new option would increase the
> cost
> > of understanding and use for the user, so why not follow the idea
> discussed
> > in FLINK-25495 and make Application mode support attached.execution.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-25495
> >
> > Best,
> > Ron
> >
> > On Wed, Aug 9, 2023 at 2:07 AM Venkatakrishnan Sowrirajan  wrote:
> >
> > > This is definitely a useful feature especially for the fli

Re: [DISCUSS] FLIP-330: Support specifying record timestamp requirement

2023-08-11 Thread Becket Qin
Thanks for the FLIP, Yunfeng.

I had a brief offline discussion with Dong, and here are my two cents:

## The benefit
The FLIP is related to one of the perf benchmarks we saw at LinkedIn, which
is pretty much doing a word count, except that the words are country codes,
so each is typically just two bytes, e.g. CN, US, UK. What I see is that the
amount of data going through shuffle is much higher in Flink DataStream
batch mode compared with the Flink DataSet API. And in this case, because
the actual key is just 2 bytes, the overhead is kind of high. In batch
processing, it is not rare that people first tokenize the data before
processing to save cost. For example, imagine that in word count the words
are coded as 4-byte Integers instead of Strings. So the 1-byte overhead can
still introduce 25 percent overhead. Therefore, I think the optimization in
the FLIP can still benefit a bunch of batch processing cases. For streaming,
the benefit still applies, although it is smaller than in batch.

## The complexity and long term solution
In terms of the complexity of the FLIP, I kind of think that we can
restrain the scope to just batch mode, and only to the StreamRecord class.
That means that only in batch mode the timestamp in the StreamRecord will be
dropped when the config is enabled. This should give most of the benefit
while significantly reducing the complexity of the FLIP.
In practice, I think people rarely use StreamRecord timestamps in batch
jobs. But because this is not an explicit API contract for users, from what
I understand, the configuration is introduced to make it 100% safe for the
users. In other words, we won't need this configuration if our contract
with users does not support timestamps in batch mode. In order to make the
contract clear, maybe we can print a warning if the timestamp field in
StreamRecord is accessed in batch mode starting from the next release. So
we can drop the configuration completely in 2.0. By then, Flink should have
enough information to determine whether timestamps in StreamRecords should
be supported for a job/operator or not, e.g. batch mode, processing-time-only
jobs, etc.
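
As a small illustration of the user-facing contract in question: the usual way
user code observes the record timestamp already treats it as optional, e.g.
via ProcessFunction (existing DataStream API, shown here only to make that
contract concrete):

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: ctx.timestamp() is already nullable; it is null whenever the
// record carries no timestamp (e.g. no timestamp assigner was configured).
public class TimestampProbe extends ProcessFunction<String, String> {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        Long ts = ctx.timestamp(); // may be null
        out.collect(ts == null ? value + " (no timestamp)" : value + " @ " + ts);
    }
}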

Thanks,

Jiangjie (Becket) Qin


On Fri, Aug 11, 2023 at 9:46 PM Dong Lin  wrote:

> Hi Xintong,
>
> Thanks for the quick reply. I also agree that we should hear from others
> about
> whether this optimization is worthwhile.
>
> Please see my comments inline.
>
> On Fri, Aug 11, 2023 at 5:54 PM Xintong Song 
> wrote:
>
> > Thanks for the quick replies.
> >
> > Overall, it seems that the main concern with this FLIP is that the 2%
> > > throughput saving might not be worth the added implementation
> complexity.
> > >
> >
> > Yes, my main concern is that the performance improvement may not be worth
> > the added complexity.
> >
> > And I'd like to point out that, even the number 2% was achieved with
> > relative record size. Records in WordCount consist of a string token and
> an
> > integer count. While I don't know the exact average length of the tokens,
> > I'd expect the record size to be ~20B, which from my experiences is much
> > smaller than most real workloads. Since you are arguing that this
> > improvement is important for batch workloads, have you tried it with the
> > widely-used TPD-DS benchmark? I doubt whether the differences can be
> > observed with that benchmark.
> >
>
> Yes, we have tried the TPC-DS benchmark. Unfortunately, it seems that the
> TPC-DS benchmarks' performance variation is more than 2%, using the same
> hardware and setup we obtained from other developers at Alibaba. Thus we
> are not able to use TPC-DS benchmark to identify the performance
> improvements like 2%.
>
> If you or anyone has a way to run the TPC-DS benchmark with very low
> performance variation (e.g. below 1%), then we would be happy to evaluate
> this FLIP's optimization again using TPC-DS.
>
>
> >
> > And assuming there's a 2% performance improvement for all batch scenarios
> > (which I don't think so), its value still depends on the development
> phase
> >
>
> I don't think we can get 2% improvement for all batch workloads.
>
> I suppose we can agree that there exists realistic workloads that can
> obtain 2% or more improvement using the proposed optimization. Now the only
> question is whether the resource footprint of those workloads will be small
> enough to be ignored.
>
> IMHO, unless there is evidence that such workloads will indeed be trivial
> enough to be ignored, then we should just fix this known performance issue
> early on, rather than *assuming* it is trivial, and waiting to fix it
> until it bites us (and Flink users).
>
> of the project. Setting a long-term goal to have the best possible
> > performance is nice, but that does

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-02 Thread Becket Qin
Hi Jark,

If the FieldReferenceExpression contains an int[] to support a nested field
reference, List<FieldReferenceExpression> (or FieldReferenceExpression[])
and int[][] are actually equivalent. If we are designing this from scratch,
personally I prefer using List<FieldReferenceExpression> for consistency,
i.e. always resolving everything to expressions for users. Projection is a
simpler case, but should not be a special case. This avoids doing the same
thing in different ways which is also a confusion to the users. To me, the
int[][] format would become kind of a technical debt after we extend the
FieldReferenceExpression. Although we don't have to address it right away
in the same FLIP, this kind of debt accumulates over time and makes the
project harder to learn and maintain. So, personally I prefer to address
these technical debts as soon as possible.
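
To illustrate the equivalence: assuming the extended expression can expose its
index path as an int[] (the getFieldIndexArray() idea from this thread, which
does not exist today), going from the expression form back to the int[][]
form that the Projection utility (org.apache.flink.table.connector.Projection)
consumes is mechanical:

import java.util.List;
import org.apache.flink.table.connector.Projection;
import org.apache.flink.table.types.DataType;

// Sketch only: index paths collected from the (extended) field references.
class ProjectionBridge {
    static DataType project(List<int[]> indexPaths, DataType producedDataType) {
        int[][] projectedFields = indexPaths.toArray(new int[0][]);
        // Same projection utility mentioned below, driven by the converted paths.
        return Projection.of(projectedFields).project(producedDataType);
    }
}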

Thanks,

Jiangjie (Becket) Qin

On Wed, Aug 2, 2023 at 8:19 PM Jark Wu  wrote:

> Hi,
>
> I agree with Becket that we may need to extend FieldReferenceExpression to
> support nested field access (or maybe a new
> NestedFieldReferenceExpression).
> But I have some concerns about evolving the
> SupportsProjectionPushDown.applyProjection.
> A projection is much simpler than Filter Expression which only needs to
> represent the field indexes.
> If we evolve `applyProjection` to accept `List
> projectedFields`,
> users have to convert the `List` back to int[][]
> which is an overhead for users.
> Field indexes (int[][]) is required to project schemas with the
> utility org.apache.flink.table.connector.Projection.
>
>
> Best,
> Jark
>
>
>
> On Wed, 2 Aug 2023 at 07:40, Venkatakrishnan Sowrirajan 
> wrote:
>
> > Thanks Becket for the suggestion. That makes sense. Let me try it out and
> > get back to you.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Aug 1, 2023 at 9:04 AM Becket Qin  wrote:
> >
> > > This is a very useful feature in practice.
> > >
> > > It looks to me that the key issue here is that Flink ResolvedExpression
> > > does not have necessary abstraction for nested field access. So the
> > Calcite
> > > RexFieldAccess does not have a counterpart in the ResolvedExpression.
> The
> > > FieldReferenceExpression only supports direct access to the fields, not
> > > nested access.
> > >
> > > Theoretically speaking, this nested field reference is also required by
> > > projection pushdown. However, we addressed that by using an int[][] in
> > the
> > > SupportsProjectionPushDown interface. Maybe we can do the following:
> > >
> > > 1. Extend the FieldReferenceExpression to include an int[] for nested
> > field
> > > access,
> > > 2. By doing (1),
> > > SupportsFilterPushDown#applyFilters(List) can
> support
> > > nested field access.
> > > 3. Evolve the SupportsProjectionPushDown.applyProjection(int[][]
> > > projectedFields, DataType producedDataType) to
> > > applyProjection(List projectedFields,
> DataType
> > > producedDataType)
> > >
> > > This will need a FLIP.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Tue, Aug 1, 2023 at 11:42 PM Venkatakrishnan Sowrirajan <
> > > vsowr...@asu.edu>
> > > wrote:
> > >
> > > > Thanks for the response. Looking forward to your pointers. In the
> > > > meanwhile, let me figure out how we can implement it. Will keep you
> > > posted.
> > > >
> > > > On Mon, Jul 31, 2023, 11:43 PM liu ron  wrote:
> > > >
> > > > > Hi, Venkata
> > > > >
> > > > > Thanks for reporting this issue. Currently, Flink doesn't support
> > > nested
> > > > > filter pushdown. I also think that this optimization would be
> useful,
> > > > > especially for jobs, which may need to read a lot of data from the
> > > > parquet
> > > > > or orc file. We didn't move forward with this for some priority
> > > reasons.
> > > > >
> > > > > Regarding your three questions, I will respond to you later after
> my
> > > > > on-call is finished because I need to dive into the source code.
> > About
> > > > your
> > > > > commit, I don't think it's the right solution because
> > > > > FieldReferenceExpression doesn't currently support nested field
> > filter
> > > > > pushdown, maybe we need to extend it.
> > > > >
> > > > > You can also look further into reasonable solutions, which we

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-01 Thread Becket Qin
This is a very useful feature in practice.

It looks to me that the key issue here is that Flink ResolvedExpression
does not have necessary abstraction for nested field access. So the Calcite
RexFieldAccess does not have a counterpart in the ResolvedExpression. The
FieldReferenceExpression only supports direct access to the fields, not
nested access.

Theoretically speaking, this nested field reference is also required by
projection pushdown. However, we addressed that by using an int[][] in the
SupportsProjectionPushDown interface. Maybe we can do the following:

1. Extend the FieldReferenceExpression to include an int[] for nested field
access,
2. By doing (1),
SupportsFilterPushDown#applyFilters(List<ResolvedExpression>) can support
nested field access.
3. Evolve the SupportsProjectionPushDown.applyProjection(int[][]
projectedFields, DataType producedDataType) to
applyProjection(List<FieldReferenceExpression> projectedFields, DataType
producedDataType)

This will need a FLIP.

Thanks,

Jiangjie (Becket) Qin

On Tue, Aug 1, 2023 at 11:42 PM Venkatakrishnan Sowrirajan 
wrote:

> Thanks for the response. Looking forward to your pointers. In the
> meanwhile, let me figure out how we can implement it. Will keep you posted.
>
> On Mon, Jul 31, 2023, 11:43 PM liu ron  wrote:
>
> > Hi, Venkata
> >
> > Thanks for reporting this issue. Currently, Flink doesn't support nested
> > filter pushdown. I also think that this optimization would be useful,
> > especially for jobs, which may need to read a lot of data from the
> parquet
> > or orc file. We didn't move forward with this for some priority reasons.
> >
> > Regarding your three questions, I will respond to you later after my
> > on-call is finished because I need to dive into the source code. About
> your
> > commit, I don't think it's the right solution because
> > FieldReferenceExpression doesn't currently support nested field filter
> > pushdown, maybe we need to extend it.
> >
> > You can also look further into reasonable solutions, which we'll discuss
> > further later on.
> >
> > Best,
> > Ron
> >
> >
> > On Sat, Jul 29, 2023 at 3:31 AM Venkatakrishnan Sowrirajan  wrote:
> >
> > > Hi all,
> > >
> > > Currently, I am working on adding support for nested fields filter push
> > > down. In our use case running Flink on Batch, we found nested fields
> > filter
> > > push down is key - without it, it is significantly slow. Note: Spark
> SQL
> > > supports nested fields filter push down.
> > >
> > > While debugging the code using IcebergTableSource as the table source,
> > > narrowed down the issue to missing support for
> > > RexNodeExtractor#RexNodeToExpressionConverter#visitFieldAccess.
> > > As part of fixing it, I made changes by returning an
> > > Option(FieldReferenceExpression)
> > > with appropriate reference to the parent index and the child index for
> > the
> > > nested field with the data type info.
> > >
> > > But this new ResolvedExpression cannot be converted to RexNode which
> > > happens in PushFilterIntoSourceScanRuleBase
> > > <
> > >
> >
> https://github.com/apache/flink/blob/3f63e03e83144e9857834f8db1895637d2aa218a/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushFilterIntoSourceScanRuleBase.java#L104
> > > >
> > > .
> > >
> > > Few questions
> > >
> > > 1. Does FieldReferenceExpression support nested fields currently or
> > should
> > > it be extended to support nested fields? I couldn't figure this out
> from
> > > the PushProjectIntoTableScanRule that supports nested column projection
> > > push down.
> > > 2. ExpressionConverter
> > > <
> > >
> >
> https://github.com/apache/flink/blob/3f63e03e83144e9857834f8db1895637d2aa218a/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/expressions/converter/ExpressionConverter.java#L197
> > > >
> > > converts ResolvedExpression -> RexNode but the new
> > FieldReferenceExpression
> > > with the nested field cannot be converted to RexNode. This is why the
> > > answer to the 1st question is key.
> > > 3. Anything else that I'm missing here? or is there an even easier way
> to
> > > add support for nested fields filter push down?
> > >
> > > Partially working changes - Commit
> > > <
> > >
> >
> https://github.com/venkata91/flink/commit/00cdf34ecf9be3ba669a97baaed4b69b85cd26f9
> > > >
> > > Please
> > > feel free to leave a comment directly in the commit.
> > >
> > > Any pointers here would be much appreciated! Thanks in advance.
> > >
> > > Disclaimer: Relatively new to Flink code base especially Table planner
> > :-).
> > >
> > > Regards
> > > Venkata krishnan
> > >
> >
>


Re: [VOTE] FLIP-321: introduce an API deprecation process

2023-07-09 Thread Becket Qin
Thanks everyone for voting.

We have got 11 approving votes and no disapproving votes. FLIP-321 has now
passed!

The voting details are as follows:

9 binding votes:
Dong Lin
Xintong Song
Martijn Visser
Stefan Richter
Jing Ge
Matthias Pohl
Zhu Zhu
Jingsong Li
Becket Qin

2 non-binding votes:
ConradJam
John Roesler

Cheers!

Jiangjie (Becket) Qin



On Fri, Jul 7, 2023 at 6:43 AM John Roesler  wrote:

> +1 (non-binding)
>
> Thanks for the FLIP!
> -John
>
> On Mon, Jul 3, 2023, at 22:21, Jingsong Li wrote:
> > +1 binding
> >
> > On Tue, Jul 4, 2023 at 10:40 AM Zhu Zhu  wrote:
> >>
> >> +1 (binding)
> >>
> >> Thanks,
> >> Zhu
> >>
> >> > On Mon, Jul 3, 2023 at 10:39 PM ConradJam  wrote:
> >> >
> >> > +1 (no-binding)
> >> >
> >> > > On Mon, Jul 3, 2023 at 10:33 PM Matthias Pohl  wrote:
> >> >
> >> > > Thanks, Becket
> >> > >
> >> > > +1 (binding)
> >> > >
> >> > > On Mon, Jul 3, 2023 at 10:44 AM Jing Ge  >
> >> > > wrote:
> >> > >
> >> > > > +1(binding)
> >> > > >
> >> > > > On Mon, Jul 3, 2023 at 10:19 AM Stefan Richter
> >> > > >  wrote:
> >> > > >
> >> > > > > +1 (binding)
> >> > > > >
> >> > > > >
> >> > > > > > On 3. Jul 2023, at 10:08, Martijn Visser <
> martijnvis...@apache.org>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > +1 (binding)
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Mon, Jul 3, 2023 at 10:03 AM Xintong Song <
> tonysong...@gmail.com
> >> > > > > <mailto:tonysong...@gmail.com>> wrote:
> >> > > > > >
> >> > > > > >> +1 (binding)
> >> > > > > >>
> >> > > > > >> Best,
> >> > > > > >>
> >> > > > > >> Xintong
> >> > > > > >>
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> On Sat, Jul 1, 2023 at 11:26 PM Dong Lin <
> lindon...@gmail.com>
> >> > > wrote:
> >> > > > > >>
> >> > > > > >>> Thanks for the FLIP.
> >> > > > > >>>
> >> > > > > >>> +1 (binding)
> >> > > > > >>>
> >> > > > > >>> On Fri, Jun 30, 2023 at 5:39 PM Becket Qin <
> becket@gmail.com>
> >> > > > > wrote:
> >> > > > > >>>
> >> > > > > >>>> Hi folks,
> >> > > > > >>>>
> >> > > > > >>>> I'd like to start the VOTE for FLIP-321[1] which proposes
> to
> >> > > > introduce
> >> > > > > >> an
> >> > > > > >>>> API deprecation process to Flink. The discussion thread
> for the
> >> > > FLIP
> >> > > > > >> can
> >> > > > > >>> be
> >> > > > > >>>> found here[2].
> >> > > > > >>>>
> >> > > > > >>>> The vote will be open until at least July 4, following the
> >> > > consensus
> >> > > > > >>> voting
> >> > > > > >>>> process.
> >> > > > > >>>>
> >> > > > > >>>> Thanks,
> >> > > > > >>>>
> >> > > > > >>>> Jiangjie (Becket) Qin
> >> > > > > >>>>
> >> > > > > >>>> [1]
> >> > > > > >>>>
> >> > > > > >>>>
> >> > > > > >>>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-321%3A+Introduce+an+API+deprecation+process
> >> > > > > >>>> [2]
> >> > > > >
> >> > > >
> >> > >
> https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
>


[VOTE] FLIP-321: introduce an API deprecation process

2023-06-30 Thread Becket Qin
Hi folks,

I'd like to start the VOTE for FLIP-321[1] which proposes to introduce an
API deprecation process to Flink. The discussion thread for the FLIP can be
found here[2].

The vote will be open until at least July 4, following the consensus voting
process.

Thanks,

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-321%3A+Introduce+an+API+deprecation+process
[2] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-29 Thread Becket Qin
Hi Martijn,

Thanks for the reply. Regarding the behavioral stability guarantee, by
definition, an "API" always consists of the signature + behavior. So, the
behavior is a native part of an API. Therefore, behavioral changes should
be treated with the same guarantee as signature changes.

+1 on having toolings to enforce the conventions.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jun 28, 2023 at 5:09 AM Martijn Visser 
wrote:

> Hi all,
>
> Thanks for the lively and good discussion. Given the length of the
> discussion, I skimmed through and then did a deep dive on the latest state
> of the FLIP. I think the FLIP is overall in a good state and ready to bring
> to a vote.
>
> One thing that I did notice while skimming through the discussions is that
> I think there are some follow-ups that could be worthy of a FLIP and a
> discussion. For example, if/what/how does the Flink community offer any
> behavioral stability guarantees, or other types of expectations and
> guarantees. I also do think that we must have tooling in place to implement
> this FLIP (and also FLIP-196 and FLIP-197), to avoid that we're not
> creating a policy on paper, but also have the means to enforce it. Last but
> not least, I echo Jark's point that we can't estimate maintenance cost with
> a concrete design and code/POC. For me, that means that a contributor can
> choose to propose a deviation, but that the contributor would need to
> explicitly mention it in the FLIP and get it discussed/voted on as part of
> the FLIP process. But the starting point is as defined in this and other
> relevant FLIPs.
>
> Best regards,
>
> Martijn
>
> On Tue, Jun 27, 2023 at 3:38 AM Becket Qin  wrote:
>
> > Hi Xintong, Jark and Jing,
> >
> > Thanks for the reply. Yes, we can only mark the DataStream API as
> > @Deprecated after the ProcessFunction API is fully functional and mature.
> >
> > It is a fair point that the condition of marking a @Public API as
> > deprecated should also be a part of this FLIP. I just added that to the
> > FLIP wiki. This is probably more of a clarification on the existing
> > convention, rather than a change.
> >
> > It looks like we are on the same page now for this FLIP. If so, I'll
> start
> > a VOTE thread in two days.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jun 26, 2023 at 8:09 PM Xintong Song 
> > wrote:
> >
> > > >
> > > > Considering DataStream API is the most fundamental and complex API of
> > > > Flink, I think it is worth a longer time than the general process for
> > the
> > > > deprecation period to wait for the new API be mature.
> > > >
> > >
> > > This inspires me. In this specific case, compared to how long should
> > > DataStream API be removed after deprecation, it's probably more
> important
> > > to answer the question how long would ProcessFunction API become mature
> > and
> > > stable after being introduced. According to FLIP-197[1], it requires 4
> > > minor releases by default to promote an @Experimental API to @Public.
> And
> > > for ProcessFunction API, which aims to replace DataStream API as one of
> > the
> > > most fundamental API of Flink, I'd expect this to take at least the
> > default
> > > time, or even longer. And we probably should wait until we believe
> > > ProcessFunction API is stable to mark DataStream API as deprecated,
> > rather
> > > than as soon as it's introduced. Assuming we introduce the
> > ProcessFunction
> > > API in 2.0, that means we would need to wait for 6 minor releases (4
> for
> > > the new API to become stable, and 2 for the migration period) to remove
> > > DataStream API, which is ~2.5 year (assuming 5 months / minor release),
> > > which sounds acceptable for another major version bump.
> > >
> > > To wrap things up, it seems to me, sadly, that anyway we cannot avoid
> the
> > > overhead for maintaining both DataStream & ProcessFunction APIs for at
> > > least 6 minor releases.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process
> > >
> > >
> > >
> > > On Mon, Jun 26, 2023 at 5:41 PM Jing Ge 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Just want to make sure we are on the same page. There is another
> > > example[1]
> > > &g

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-26 Thread Becket Qin
Hi Xintong, Jark and Jing,

Thanks for the reply. Yes, we can only mark the DataStream API as
@Deprecated after the ProcessFunction API is fully functional and mature.

It is a fair point that the condition of marking a @Public API as
deprecated should also be a part of this FLIP. I just added that to the
FLIP wiki. This is probably more of a clarification on the existing
convention, rather than a change.

It looks like we are on the same page now for this FLIP. If so, I'll start
a VOTE thread in two days.

Thanks,

Jiangjie (Becket) Qin

On Mon, Jun 26, 2023 at 8:09 PM Xintong Song  wrote:

> >
> > Considering DataStream API is the most fundamental and complex API of
> > Flink, I think it is worth a longer time than the general process for the
> > deprecation period to wait for the new API to be mature.
> >
>
> This inspires me. In this specific case, compared to how soon the DataStream
> API should be removed after deprecation, it's probably more important to
> answer the question of how long it would take for the ProcessFunction API to
> become mature and stable after being introduced. According to FLIP-197[1], it requires 4
> minor releases by default to promote an @Experimental API to @Public. And
> for ProcessFunction API, which aims to replace DataStream API as one of the
> most fundamental API of Flink, I'd expect this to take at least the default
> time, or even longer. And we probably should wait until we believe
> ProcessFunction API is stable to mark DataStream API as deprecated, rather
> than as soon as it's introduced. Assuming we introduce the ProcessFunction
> API in 2.0, that means we would need to wait for 6 minor releases (4 for
> the new API to become stable, and 2 for the migration period) to remove
> DataStream API, which is ~2.5 years (assuming 5 months / minor release),
> which sounds acceptable for another major version bump.
>
> To wrap things up, it seems to me, sadly, that anyway we cannot avoid the
> overhead for maintaining both DataStream & ProcessFunction APIs for at
> least 6 minor releases.
>
> Best,
>
> Xintong
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process
>
>
>
> On Mon, Jun 26, 2023 at 5:41 PM Jing Ge 
> wrote:
>
> > Hi all,
> >
> > Just want to make sure we are on the same page. There is another
> example[1]
> > I was aware of recently that shows why more factors need to be taken care
> > of than just the migration period. Thanks Galen for your hint.
> >
> > To put it simply, the concern about API deprecation is not that
> deprecated
> > APIs have been removed too early (min migration period is required). The
> > major concern is that APIs are marked as deprecated for a (too) long
> time,
> > much longer than the migration period discussed in this thread, afaik.
> > Since there is no clear picture/definition, no one knows when to do the
> > migration for users(after the migration period has expired) and when to
> > remove deprecated APIs for Flink developers.
> >
> > Based on all the information I knew, there are two kinds of obstacles
> that
> > will and should block the deprecation process:
> >
> > 1. Lack of functionalities in new APIs. It happens e.g. with the
> > SourceFunction to FLIP-27 Source migration. Users who rely on those
> > functions can not migrate to new APIs.
> > 2. new APIs have critical bugs. An example could be found at [1]. Users
> > have to stick to the deprecated APIs.
> >
> > Since FLIP-321 is focusing on the API deprecation process, those blocking
> > issues deserve attention and should be put into the FLIP. The current
> FLIP
> > seems to only focus on migration periods. If we consider those blocking
> > issues as orthogonal issues that are beyond the scope of this discussion,
> > does it make sense to change the FLIP title to something like "Introduce
> > minimum migration periods of API deprecation process"?
> >
> > Best regards,
> > Jing
> >
> > [1] https://lists.apache.org/thread/wxoo7py5pqqlz37l4w8jrq6qdvsdq5wc
> >
> > On Sun, Jun 25, 2023 at 2:01 PM Jark Wu  wrote:
> >
> > > I agree with Jingsong and Becket.
> > >
> > > Look at the legacy SourceFunction (a small part of DataStream API),
> > > the SourceFunction is still not and can't be marked deprecated[1] until
> > > now after the new Source was released 2 years ago, because the new
> Source
> > > still can't fully consume the abilities of legacy API. Considering
> > > DataStream
> > > API is the most fundamental and complex API of Flink, I think it is
> worth
> > > a longer time than the 

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-25 Thread Becket Qin
Hi Jingsong,

Thanks for the reply. I completely agree with you.

The above 2 options are based on the assumption that the community cannot
afford to maintain the deprecated DataStream API for long. I'd say we
should try everything we can to maintain it for as long as possible.
DataStream API is actually the most used API in Flink by so many users at
this point. Removing it any time soon will dramatically hurt our users. So
ideally we should keep it for at least 2 years after deprecation, if not
more.

The prohibitively high maintenance overhead is just an assumption.
Personally speaking, I don't feel this assumption is necessarily true. We
should re-evaluate once we have the new ProcessFunction API in place.
Without the code it is hard to tell for sure. I am actually kind of
optimistic about the maintenance cost.

Thanks,

Jiangjie (Becket) Qin



On Sun, Jun 25, 2023 at 11:30 AM Jingsong Li  wrote:

> Thanks Becket and all for your discussion.
>
> > 1. We say this FLIP is enforced starting release 2.0. For current 1.x
> APIs,
> we provide a migration period with best effort, while allowing exceptions
> for immediate removal in 2.0. That means we will still try with best effort
> to get the ProcessFuncion API ready and deprecate the DataStream API in
> 1.x, but will also be allowed to remove DataStream API in 2.0 if it's not
> deprecated 2 minor releases before the major version bump.
>
> > 2. We strictly follow the process in this FLIP, and will quickly bump the
> major version from 2.x to 3.0 once the migration period for DataStream API
> is reached.
>
> Sorry, I didn't read the previous detailed discussion because the
> discussion list was so long.
>
> I don't really like either of these options.
>
> Considering that DataStream is such an important API, can we offer a third
> option:
>
> 3. Maintain the DataStream API throughout 2.X and remove it until 3.x. But
> there's no need to assume that 2.X is a short version, it's still a normal
> major version.
>
> Best,
> Jingsong
>
> Becket Qin 于2023年6月22日 周四16:02写道:
>
> > Thanks much for the input, John, Stefan and Jing.
> >
> > I think Xingtong has well summarized the pros and cons of the two
> options.
> > Let's collect a few more opinions here and we can move forward with the
> one
> > more people prefer.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Jun 21, 2023 at 3:20 AM Jing Ge 
> > wrote:
> >
> > > Hi all,
> > >
> > > Thanks Xingtong for the summary. If I could only choose one of the
> given
> > > two options, I would go with option 1. I understood that option 2
> worked
> > > great with Kafka. But the bridge release will still confuse users and
> my
> > > gut feeling is that many users will skip 2.0 and be waiting for 3.0 or
> > even
> > > 3.x. And since fewer users will use Flink 2.x, the development focus
> will
> > > be on Flink 3.0 with the fact that the current Flink release is 1.17
> and
> > we
> > > are preparing 2.0 release. That is weird for me.
> > >
> > > TBH, I would not name the change from @Public to @Retired as a
> demotion.
> > > The purpose of @Retire is to extend the API lifecycle with one more
> > stage,
> > > like in the real world, people born, studied, graduated, worked, and
> > > retired. Afaiu from the previous discussion, there are two rules we'd
> > like
> > > to follow simultaneously:
> > >
> > > 1. Public APIs can only be changed between major releases.
> > > 2. A smooth migration phase should be offered to users, i.e. at least 2
> > > minor releases after APIs are marked as @deprecated. There should be
> new
> > > APIs as the replacement.
> > >
> > > Agree, those rules are good to improve the user friendliness. Issues we
> > > discussed are rising because we want to fulfill both of them. If we
> take
> > > care of deprecation very seriously, APIs can be marked as @Deprecated,
> > only
> > > when the new APIs as the replacement provide all functionalities the
> > > deprecated APIs have. In an ideal case without critical bugs that might
> > > stop users adopting the new APIs. Otherwise the expected "replacement"
> > will
> > > not happen. Users will still stick to the deprecated APIs, because the
> > new
> > > APIs can not be used. For big features, it will need at least 4 minor
> > > releases(ideal case), i.e. 2+ years to remove deprecated APIs:
> > >
> > > - 1st minor release to build the new APIs as the replacement and
> waiting
> > > for feedback. 

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-22 Thread Becket Qin
Thanks much for the input, John, Stefan and Jing.

I think Xingtong has well summarized the pros and cons of the two options.
Let's collect a few more opinions here and we can move forward with the one
more people prefer.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jun 21, 2023 at 3:20 AM Jing Ge  wrote:

> Hi all,
>
> Thanks Xingtong for the summary. If I could only choose one of the given
> two options, I would go with option 1. I understood that option 2 worked
> great with Kafka. But the bridge release will still confuse users and my
> gut feeling is that many users will skip 2.0 and be waiting for 3.0 or even
> 3.x. And since fewer users will use Flink 2.x, the development focus will
> be on Flink 3.0 with the fact that the current Flink release is 1.17 and we
> are preparing 2.0 release. That is weird for me.
>
> TBH, I would not name the change from @Public to @Retired as a demotion.
> The purpose of @Retire is to extend the API lifecycle with one more stage,
> like in the real world, people born, studied, graduated, worked, and
> retired. Afaiu from the previous discussion, there are two rules we'd like
> to follow simultaneously:
>
> 1. Public APIs can only be changed between major releases.
> 2. A smooth migration phase should be offered to users, i.e. at least 2
> minor releases after APIs are marked as @deprecated. There should be new
> APIs as the replacement.
>
> Agree, those rules are good to improve the user friendliness. Issues we
> discussed are rising because we want to fulfill both of them. If we take
> care of deprecation very seriously, APIs can be marked as @Deprecated, only
> when the new APIs as the replacement provide all functionalities the
> deprecated APIs have. In an ideal case without critical bugs that might
> stop users adopting the new APIs. Otherwise the expected "replacement" will
> not happen. Users will still stick to the deprecated APIs, because the new
> APIs can not be used. For big features, it will need at least 4 minor
> releases(ideal case), i.e. 2+ years to remove deprecated APIs:
>
> - 1st minor release to build the new APIs as the replacement and waiting
> for feedback. It might be difficult to mark the old API as deprecated in
> this release, because we are not sure if the new APIs could cover 100%
> functionalities.
> -  In the lucky case,  mark all old APIs as deprecated in the 2nd minor
> release. (I would even suggest having the new APIs released at least for
> two minor releases before marking it as deprecated to make sure they can
> really replace the old APIs, in case we care more about smooth migration)
> - 3rd minor release for the migration period
> -  In another lucky case, the 4th release is a major release, the
> deprecated APIs could be removed.
>
> The above described scenario works only in an ideal case. In reality, it
> might take longer to get the new APIs ready and mark the old API
> deprecated. Furthermore, if the 4th release is not a major release, we will
> have to maintain both APIs for many further minor releases. The question is
> how to know the next major release in advance, especially 4 minor releases'
> period, i.e. more than 2 years in advance? Given that Flink contains many
> modules, it is difficult to ask devs to create a 2-3 years deprecation plan
> for each case. In case we want to build major releases at a fast pace,
> let's say every two years, it means devs must plan any API deprecation
> right after each major release. Afaiac, it is quite difficult.
>
> The major issue is, afaiu, if we follow rule 2, we have to keep all @Public
> APIs, e.g. DataStream, that are not marked as deprecated yet, to 2.0. Then
> we have to follow rule 1 to keep it unchanged until we have 3.0. That is
> why @Retired is useful to give devs more flexibility and still fulfill both
> rules. Let's check it with examples:
>
> - we have @Public DataStream API in 1.18. It will not be marked
> as @Deprecated, because the new APIs as the replacement are not ready.
> - we keep the DataStream API itself unchanged in 2.0, but change the
> annotation from @Public to @Retire. New APIs will be introduced too. In
> this case, Rule 1 is ok, since public API is allowed to change between
> major releases. Only changing annotation is the minimal change we could do
> and it does not break rule 1. Rule 2 is ok too, since the DataStream APIs
> work exactly the same.  Attention: the change of @Public -> @Retired can
> only be done between major releases, because @Public APIs can only be
> changed between major releases.
> - in 2.1, DataStream API will be marked as deprecated.
> - in 2.2, DataStream will be kept for the migration period.
> - in 2.3, DataStream will be removed.
>
> Becket mentioned previously (please correct me if I 

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-19 Thread Becket Qin
Hi Xintong,

Let's compare the following cases:

A. If the maintenance overhead of the deprecated API is high, and we want
to remove it after two minor releases. Then there are two options:
A1: Demote the API in the major version bump and remove the code with a
minor version bump.
A2: Do exactly the same as A1, except that when removing the code,
bump up the major version. So this might be a short major version.

B. If the maintenance overhead of the deprecated API is not high, therefore
keeping it for a long time is affordable. There are also two options:
B1: Same as A1: demote the API in the major version bump and remove
the code with a minor version bump.
B2: Keep the API for all the minor versions in the major version, and
only remove the code in the next major version.

For case B, do we agree that B2 is the way to go?

For case A:
The following stays the same for A1 and A2:
  - users will lose the API after two minor releases. So the migration period
is the same for A1 and A2.
  - A1 and A2 will have exactly the same code after the removal of the
deprecated API; the only difference is versioning.
  - To move forward, users need to move to the next minor version in A1, or
to the next major version in A2. Because the code is the same, the actual
effort to upgrade is the same for A1 and A2.

The differences between A1 and A2 are:
  - A1 allows keeping the major version release cadence. A2 will have a
short major version release.
  - A1 breaks the well-understood API semantics, while A2 does not.

From what I see, since there is no well-established standard regarding how
long a major version should live, a short major version release is more of a
perception and emotional issue. It is not ideal, but it does not have material
downsides compared with A1. I did not hear anyone complaining that Kafka
only has two 1.x releases. However, A1 actually breaks the well-understood
API semantics, which has a more material impact.

Also, I'd imagine 90% or more of the Public APIs should fall into case B.
So, short major versions should be very occasional. I'd be very concerned
if the reason we choose A1 is simply because we cannot afford maintaining a
bunch of deprecated APIs until the next major version. This indicates that
the actual problem we need to solve is to lower the maintenance overhead of
deprecated APIs, so that we are comfortable to keep them longer. As John
and I mentioned earlier, there are ways to achieve this and we need to
learn how to do it in Flink. Otherwise, our discussion about versioning
here does not bring much value, because we will end up with a bunch of
short-lived APIs which upset our users, no matter how we version the
releases.

So, if there are concrete examples that you think will block us from
keeping API stability with affordable cost, let's take a look together and
see if that can be improved.

Thanks,

Jiangjie (Becket) Qin




On Mon, Jun 19, 2023 at 4:45 PM Xintong Song  wrote:

> >
> > The part I don't understand is if we are willing to have a migration
> > period, and do a minor version bump to remove an API, what do we lose to
> do
> > a major version bump instead, so we don't break the common versioning
> > semantic?
> >
>
> I think we are talking about the cases where a major version bump happens
> before a deprecated Public API reaches its migration period. So removing
> the API with another major version bump means we have two consecutive major
> versions in a very short time. And as previously mentioned, having frequent
> major version bumps would weaken the value of the commitment "Public API
> stay compatible within a major version". I think users also have
> expectations about how long a major version should live, which should be at
> least a couple of years rather than 1-2 minor releases.
>
> This is another option, but I think it is very likely more expensive than
> > simply bumping the major version. And I imagine there will be questions
> > like "why this feature is in 1.20, but does not exist in 2.0"?
> >
>
> Yes, it's more expensive than simply bumping the major version, but IMHO
> it's probably cheaper than carrying the API until the next major version if
> we don't bump the major version very soon. And see my reply above on why
> not bumping immediately.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Jun 19, 2023 at 4:22 PM Becket Qin  wrote:
>
> > Hi Xintong,
> >
> > Please see the replies below.
> >
> > I see your point, that users would feel surprised if they find things no
> > > longer work when upgrading to another 2.x minor release. However, I'd
> > like
> > > to point out that PublicEvolving APIs would have the similar problem
> > > anyway. So the question is, how do we catch users' attention and make
> > sur

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-19 Thread Becket Qin
Hi Xintong,

Please see the replies below.

I see your point, that users would feel surprised if they find things no
> longer work when upgrading to another 2.x minor release. However, I'd like
> to point out that PublicEvolving APIs would have the similar problem
> anyway. So the question is, how do we catch users' attention and make sure
> they are aware that the Public APIs in 1.x may no longer be Public in 2.0.
> There're various ways to do that, e.g., release notes, warnings in logs,
> etc.

First of all, I am not a fan of removing PublicEvolving APIs in minor
version changes. Personally speaking, I would prefer that they, too, are only
removed in major version changes.

Empirically, I rarely see projects with complex API semantics work well.
Many, if not most, users don't read docs / release notes / warnings in
logs. I don't think that is their fault. Typically an application developer
will have to deal with dozens of libraries maintained by various
communities / groups. It is just too difficult for them to keep track of
all the specific semantics each project puts there. If you think about it,
how many of Flink developers read all the release notes of Guava, Apache
Commons, Apache Avro, ProtoBuf, etc, when you upgrade a version of them?
One would probably try bumping the dependency version, see if there are
exceptions, and solve them case by case. And if it is a minor version
bump, one probably does not expect an exception at all. Another example is
that in Flink we still have so many usages of deprecated methods all over
the place, and I strongly doubt everyone knows when the source code of
these methods was deprecated and when they should be removed.

So, the most important principle of an API is to be simple and intuitive. The
versioning semantics are a simple and universally accepted API stability
standard. If we ourselves as Flink developers rely on this for our own
dependencies, I don't think we can expect more from our users.

Another possible alternative: whenever there's a deprecated Public API that
> reaches a major version bump before the migration period, and we also don't
> want to carry it for all the next major release series, we may consider
> releasing more minor releases for the previous major version after the
> bump. E.g., an Public API is deprecated in 1.19, and then we bump to 2.0,
> we can release one more 1.20 after 2.0. That should provide users another
> choice rather than upgrading to 2.0, while satisfying the 2-minor-release
> migration period.

This is another option, but I think it is very likely more expensive than
simply bumping the major version. And I imagine there will be questions
like "why this feature is in 1.20, but does not exist in 2.0"?

I think my major point is, we should not carry APIs deprecated in a
> previous major version along all the next major version series. I'd like to
> try giving users more commitments, i.e. the migration period, as long as it
> does not prevent us from making breaking changes. If it doesn't work, I'd
> be in favor of not providing the migration period, but fallback to only
> guarantee the compatibility within the major version.

The part I don't understand is: if we are willing to have a migration
period and do a minor version bump to remove an API, what do we lose by doing
a major version bump instead, so that we don't break the common versioning
semantics?

Thanks,

Jiangjie (Becket) Qin


On Mon, Jun 19, 2023 at 3:20 PM Xintong Song  wrote:

> >
> > As an end user who only uses Public APIs, if I don't change my code at
> > all, my expectation is the following:
> > 1. Upgrading from 1.x to 2.x may have issues.
> > 2. If I can upgrade from 1.x to 2.x without an issue, I am fine with all
> > the 2.x versions.
> > Actually I think there are some dependency version resolution policies
> out
> > there which picks the highest minor version when the dependencies pull in
> > multiple minor versions of the same jar, which may be broken if we remove
> > the API in minor releases.
> >
>
> I see your point, that users would feel surprised if they find things no
> longer work when upgrading to another 2.x minor release. However, I'd like
> to point out that PublicEvolving APIs would have the similar problem
> anyway. So the question is, how do we catch users' attention and make sure
> they are aware that the Public APIs in 1.x may no longer be Public in 2.0.
> There're various ways to do that, e.g., release notes, warnings in logs,
> etc.
>
> Another possible alternative: whenever there's a deprecated Public API that
> reaches a major version bump before the migration period, and we also don't
> want to carry it for all the next major release series, we may consider
> releasing more minor releases for the previous major version after the
> bump. E.g., an Public API is deprecated in 1.19, a

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-18 Thread Becket Qin
Hi John,

Completely agree with all you said.

Can we consider only dropping deprecated APIs in major releases across the
> board? I understand that Experimental and PublicEvolving APIs are by
> definition less stable, but it seems like this should be reflected in the
> required deprecation period alone. I.e. that we must keep them around for
> at least zero or one minor release, not that we can drop them in a minor or
> patch release.

Personally speaking, I would love to do this, for exactly the reason you
mentioned. However, I did not propose this due to the following reasons:

1. I am hesitating a little bit about changing the accepted FLIPs too soon.
2. More importantly, to avoid slowing down our development. At this point,
Flink still lacks some design / routines to support good API evolvability /
extensibility. Just like you said, it takes some time to be good at this.
In this case, my concern is that only removing Experimental /
PublicEvolving APIs in major version changes may result in too much
overhead and dramatically slow down the development of Flink. So, I was
thinking that we can start with the current status. Hopefully after we are
more comfortable with the maintenance overhead of deprecated APIs, we can
then have a stronger guarantee for Experimental / PublicEvolving APIs.

Thanks,

Jiangjie (Becket) Qin



On Sun, Jun 18, 2023 at 6:44 AM John Roesler  wrote:

> Hi Becket,
>
> Thanks for this FLIP! Having a deprecation process is really important. I
> understand some people’s concerns about the additional burden for project
> maintainers, but my personal experience with Kafka has been that it’s very
> liveable and that it’s well worth the benefit to users. In fact, users
> being able to confidently upgrade is also a benefit to maintainers, as we
> will get fewer questions from people stuck on very old versions.
>
> One question:
> Can we consider only dropping deprecated APIs in major releases across the
> board? I understand that Experimental and PublicEvolving APIs are by
> definition less stable, but it seems like this should be reflected in the
> required deprecation period alone. I.e. that we must keep them around for
> at least zero or one minor release, not that we can drop them in a minor or
> patch release.
>
> The advantage of forbidding the removal of any API in minor or patch
> releases is that users will get a strong guarantee that they can bump the
> minor or patch version and still be able to compile, or even just re-link
> and know that they won’t face “MethodDef” exceptions at run time. This is a
> binary guarantee: if we allow removing  even Experimental APIs outside of
> major releases, users can no longer confidently upgrade.
>
> Aside from that, I’d share my 2 cents on a couple of points:
> * I’d use the official Deprecated annotation instead of introducing our
> own flavor (Retired, etc), since Deprecated is well integrated into build
> tools and IDEs.
> * I wouldn’t worry about a demotion process in this FLIP; it seems
> orthogonal, and something that should probably be taken case-by-case
> anyway.
> * Aside from deprecation and removal, there have been some discussions
> about how to evolve APIs and behavior in compatible ways. This is somewhat
> of an art, and if folks haven’t wrestled with it before, it’ll take some
> time to become good at it. I feel like this topic should also be orthogonal
> to this FLIP, but FWIW, my suggestion would be to adopt a simple policy not
> to break existing user programs, and leave the “how” up to implementers and
> reviewers.
>
> Thanks again,
> John
>
> On Sat, Jun 17, 2023, at 11:03, Jing Ge wrote:
> > Hi All,
> >
> > The @Public -> @PublicEvolving proposed by Xintong is a great idea.
> > Especially, after he suggest @PublicRetired, i.e. @PublicEvolving --(2
> > minor release)--> @Public --> @deprecated --(1 major
> > release)--> @PublicRetired. It will provide a lot of flexibility without
> > breaking any rules we had. @Public APIs are allowed to change between
> major
> > releases. Changing annotations is acceptable and provides additional
> > tolerance i.e. user-friendliness, since the APIs themself are not
> changed.
> >
> > I had similar thoughts when I was facing those issues. I want to move one
> > step further and suggest introducing one more annotation @Retired.
> >
> > Not like the @PublicRetired which is a compromise of downgrading @Public
> to
> > @PublicEvolving. As I mentioned earlier in my reply, Java standard
> > @deprecated should be used in the early stage of the deprecation process
> > and doesn't really meet our requirement. Since Java does not allow us to
> > extend annotation, I think it would be feasible to have the new @Retired
> t

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-16 Thread Becket Qin
Hi Xintong,

Thanks for the explanation. Please see the replies inline below.

I agree. And from my understanding, demoting a Public API is also a kind of
> such change, just like removing one, which can only happen with major
> version bumps. I'm not proposing to allow demoting Public APIs anytime, but
> only in the case major version bumps happen before reaching the
> 2-minor-release migration period. Actually, demoting would be a weaker
> change compared to removing the API immediately upon major version bumps,
> in order to keep the commitment about the 2-minor-release migration period.
> If the concern is that `@Public` -> `@PublicEvolving` sounds against
> conventions, we may introduce a new annotation if necessary, e.g.,
> `@PublicRetiring`, to avoid confusions.

As an end user who only uses Public APIs, if I don't change my code at all,
my expectation is the following:
1. Upgrading from 1.x to 2.x may have issues.
2. If I can upgrade from 1.x to 2.x without an issue, I am fine with all
the 2.x versions.
Actually I think there are some dependency version resolution policies out
there which pick the highest minor version when the dependencies pull in
multiple minor versions of the same jar, which may be broken if we remove
the API in minor releases.

I'm not sure about this. Yes, it's completely "legal" that we bump up the
> major version whenever a breaking change is needed. However, this also
> weakens the value of the commitment that public APIs will stay stable
> within the major release series, as the series can end anytime. IMHO, short
> major release series are not something "make the end users happy", but
> backdoors that allow us as the developers to make frequent breaking
> changes. On the contrary, with the demoting approach, we can still have
> longer major release series, while only allowing Public APIs deprecated at
> the end of the previous major version to be removed in the next major
> version.

I totally agree that frequent major version bumps are not ideal, but here
we are comparing it with a minor version bump which removes a Public API.
So the context is that we have already decided to remove this Public API
while keeping everything else backwards compatible. I think a major version
bump is a commonly understood signal here, compared with a minor version
change. From end users' perspective, for those who are not impacted, in
this case upgrading a major version is not necessarily more involved than
upgrading a minor version - both should be as smooth as a dependency
version change. For those who are impacted, they will lose the Public API
anyway, and a major version bump ensures there are no surprises.

Thanks,

Jiangjie (Becket) Qin

On Fri, Jun 16, 2023 at 10:13 AM Xintong Song  wrote:

> Public API is a well defined common concept, and one of its
>> convention is that it only changes with a major version change.
>>
>
> I agree. And from my understanding, demoting a Public API is also a kind
> of such change, just like removing one, which can only happen with major
> version bumps. I'm not proposing to allow demoting Public APIs anytime, but
> only in the case major version bumps happen before reaching the
> 2-minor-release migration period. Actually, demoting would be a weaker
> change compared to removing the API immediately upon major version bumps,
> in order to keep the commitment about the 2-minor-release migration period.
> If the concern is that `@Public` -> `@PublicEvolving` sounds against
> conventions, we may introduce a new annotation if necessary, e.g.,
> `@PublicRetiring`, to avoid confusions.
>
> But it should be
>> completely OK to bump up the major version if we really want to get rid of
>> a public API, right?
>>
>
> I'm not sure about this. Yes, it's completely "legal" that we bump up the
> major version whenever a breaking change is needed. However, this also
> weakens the value of the commitment that public APIs will stay stable
> within the major release series, as the series can end anytime. IMHO, short
> major release series are not something "make the end users happy", but
> backdoors that allow us as the developers to make frequent breaking
> changes. On the contrary, with the demoting approach, we can still have
> longer major release series, while only allowing Public APIs deprecated at
> the end of the previous major version to be removed in the next major
> version.
>
> Given our track record I would prefer a regular cycle (1-2 years) to
>> force us to think about this whole topic, and not put it again to the
>> wayside and giving us (and users) a clear expectation on when breaking
>> changes can be made.
>>
>
> +1. I personally think 2-3 years would be a good time for new major
> versions, or longer if

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-16 Thread Becket Qin
Hi Chesnay,

I think that there are two things we are discussing here:
1. The API stability story we WANT to have.
2. The API stability guarantees we CAN have.

We can only design for what we want. Good API stability with affordable
maintenance overhead does demand careful design from the high level
architecture to specific APIs. I also believe that the proposed stability
guarantees are achievable with good practices. I understand the
concern that there might be some existing code that makes it more difficult
for us to get where we want to be. In that case, I think we should discuss
how to improve the code, instead of compromising on what we want.

So I think it is valuable that we bring up the parts of the code that
block us and see if we can solve them.

On the current 2.0 agenda is potentially dropping support for Java 8/11,
> which may very well be a problem for our current users.

Java 8 support was deprecated in May 2022 with release 1.15. Assuming we
have 2.0 released by the end of 2024, that is a migration window of about 2.5
years. If our survey shows that most people have migrated off Java 8,
I think it is reasonable to drop Java 8 support. When doing so, that basically
means users on Java 8 will be stuck on 1.x, and not getting new features
because all the new features are going to be in the 2.x releases.
Personally I think this is reasonable, given the long migration window
there.

Technically yes, but look at how long it took to get us to 2.0. ;)
>
> There's a separate discussion to be had on the cadence of major releases
> going forward, and there seem to be different opinions on that.
>
> If we take the Kafka example of 2 minor releases between major ones, that
> for us means that users have to potentially deal with breaking changes
> every 6 months, which seems like a lot.
>
> Given our track record I would prefer a regular cycle (1-2 years) to force
> us to think about this whole topic, and not put it again to the wayside and
> giving us (and users) a clear expectation on when breaking changes can be
> made.
>
> But again, maybe this should be in a separate thread.
>
I agree it makes sense for us to review the necessity of a major release
with a regular cycle.


> For a concrete example, consider the job submission. A few releases back
> we made changes such that the initialization of the job master happens
> asynchronously.
> This meant the job submission call returns sooner, and the job state enum
> was extended to cover this state.
> API-wise we consider this a compatible change, but the observed behavior
> may be different.
> Metrics are another example; I believe over time we changed what some
> metrics returned a few times.


For the job submission example, here's what I think we should do:
1. If we consider dispatcher gateway as a public API as well, in the
DispatcherGateway, introduce a new RPC method version submitJobV2() for the
async submission. Otherwise, we can change the RPC method in place, maybe
with an option of async or not.
2. On the client side, have a separate method of submitAsync(), while
keeping the original synchronous API. Whether the sync API should be
removed or not is debatable as users may want to block on it to fail fast
in case of some failures. The implementation of submit() and submitAsync()
can potentially share most of the code.
3. Depending on how the JobStatus enum is exposed to the users, we may or
may not need to bump the API version of the related APIs as well. For
example, if we assume that users only query the job status after submit() /
submitAsync() returns, then we don't need to do anything because the
existing users only invoke  submit() which only returns after the job
status becomes CREATED. Therefore the new status of INITIALIZING is not
exposed to them. On the other hand, if we think users might query the job
status before submit() / submitAsync() returns, then we may need to create
RestfulApi.requestJobStatusV2() which may return Initializing status, while
we make sure RestfulApi.requestJobStatus() keeps the current behavior and
does not return Initializing status to the users.
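
To make point 2 a bit more concrete, here is a minimal sketch of how the
blocking submit() and a new submitAsync() could coexist during the migration
period. All names below are made up for illustration; this is not the actual
Flink client or RPC interface.

import java.util.concurrent.CompletableFuture;

// Hypothetical client interface, only to illustrate the coexistence idea above.
public interface JobSubmissionClient {

    /**
     * Old synchronous submission, kept during the migration period.
     *
     * @deprecated Since 1.x. Use {@link #submitAsync(String)} instead.
     */
    @Deprecated
    default String submit(String jobGraph) throws Exception {
        // The old blocking behavior is implemented on top of the new async method,
        // so both entry points share most of the code.
        return submitAsync(jobGraph).get();
    }

    /** New asynchronous submission that returns as soon as the job is accepted. */
    CompletableFuture<String> submitAsync(String jobGraph);
}

The same versioning idea applies to the dispatcher-side RPC and the REST
handlers: the old entry point stays in place and simply delegates to (or
coexists with) the new one until the deprecated path is removed.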

Does this introduce maintenance overhead? Sure. But this is what Kafka has
been doing for the past 10 years. If you check the Kafka protocol guide[1],
it has all the versions of all the RPC requests/responses. Therefore the
client side behavior can be kept the same. Is it affordable? From my
experience, once you have this pattern set up, the maintenance overhead is
not that high.

Metrics are another example; I believe over time we changed what some
> metrics returned a few times.

For metrics it is usually easier. We just need to add a new metric and
meanwhile deprecate the previous one.
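
A minimal sketch of that, assuming the usual MetricGroup#counter registration
(the metric names below are made up):

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

// During the migration period both the old and the new metric are registered and
// updated, so existing dashboards keep working; the old name is documented as
// deprecated and dropped later.
public class RecordCountMetrics {

    private final Counter deprecatedNumRecords; // old name, to be removed later
    private final Counter numRecordsProcessed;  // new name

    public RecordCountMetrics(MetricGroup group) {
        this.deprecatedNumRecords = group.counter("numRecords");
        this.numRecordsProcessed = group.counter("numRecordsProcessed");
    }

    public void onRecordProcessed() {
        deprecatedNumRecords.inc();
        numRecordsProcessed.inc();
    }
}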

Thanks again for raising these examples. This is a good discussion, as we
are getting to some root causes of our hesitation about API stability.

Thanks,

Jiangjie (Becket) Qin


On Fri, Jun 16, 2023 a

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-15 Thread Becket Qin
Hi Xintong,

I think the key of stability guarantees is about not breaking the
> commitments we made to users. From my understanding, when we mark an API as
> Public in for example 1.18, we make the following two commitments.
> a) The API will not be removed in all the 1.x serials.
> b) The API will be kept at least 2 minor releases after being marked
> Deprecated.

There is one more thing we need to follow, which is not breaking the
conventions. Public API is a well-defined common concept, and one of its
conventions is that it only changes with a major version change. While it is
fine for us to define our own graduation path of an API like Experimental
=> PublicEvolving => Public, once an API reaches the Public API state, we
need to follow the common convention. Not every project has Experimental /
PublicEvolving APIs, and even for those that do have similar annotations, I am
not aware of any demotion process. So the demotion of API stability is
something extra we put on our users, which is specific to Flink and breaks
the Public API convention.

As far as I understand, the reason for the demotion of a Public API is to
allow us to remove it without a major version bump. But it should be
completely OK to bump up the major version if we really want to get rid of
a public API, right? Isn't it just a version change which has almost no
additional cost to us? As an example, Kafka deprecated the old consumer in
0.11, kept it for 1.0 and 1.1, and then the community bumped up the major
version to 2.0 and removed the code. So there are only two 1.x minor
versions, while 2.x has 8 minor versions and 3.x has 5 minor versions at
present.

I think the first priority about API stability is to make the end users
happy. Sometimes this does mean there will be maintenance overhead for us
as Flink maintainers. When there is a conflict and no way around it, having
some trade-off is reasonable. However, in this particular case, there seems
to be no material benefit in having a stability demotion process, while it
does weaken the user experience.

Thanks,

Jiangjie (Becket) Qin



On Thu, Jun 15, 2023 at 7:31 PM Martijn Visser 
wrote:

> Hi all,
>
> First off, thanks for opening the discussion on this topic. I think it's an
> important one.
>
> From my perspective, I think that one of the major pain points that users
> have expressed over time, is the API stability of Flink overall. I think
> that every Flink user wants to have reliable, stable and good APIs, and
> that this should be an essential thing for the Flink community to achieve.
> I think there's still a lot of room to improve here. I think that Spark has
> a nice overview of considerations when to break an API [1]. What I really
> like and would emphasize is that they strive to avoid breaking APIs or
> silently changing behaviour. I think we should do the same for the Flink
> APIs.
>
> Overall, I would agree with the proposal regarding deprecation for both
> Experimental and PublicEvolving. I do think that we can only enforce this
> if we actually enforce the opposite as well (which is the promotion from
> Experimental -> PublicEvolving -> Public).
>
> What I don't think is a good principle, is the example where a Public API
> is deprecated in 1.20 and the next release is 2.0 with the requirement to
> keep it there until the next major release. I can see why it's being
> proposed (because it would avoid that the user needs to change their
> implementation towards a new API), the user is already faced with the
> situation that their implementation must be changed, given the fact that
> the community decided to go for a new major version. That already implies
> breaking changes and impact for the user. I think it's the primary reason
> why there is no Flink 2.0 yet.
>
> I'm not in favour of downgrading APIs from Public -> PublicEvolving ->
> Experimental. When doing that, we are breaking the contract with the Flink
> users who believe they are on an API that won't break, only to figure out a
> couple of releases later that this has actually happened.
>
> I believe we should treat API Stability as a first class citizen, so that
> each API is annotated (either Internal, else it's considered public with an
> annotation of either Experimental, PublicEvolving or Public) and users know
> how to rely on them. An accepted deprecation proposal will only help our
> users in understanding the guarantees that they can expect.
>
> Best regards,
>
> Martijn
>
> [1] https://spark.apache.org/versioning-policy.html
>
> On Thu, Jun 15, 2023 at 5:29 AM Xintong Song 
> wrote:
>
> > I agree that Public APIs should require a longer migration period. I
> think
> > that is why the FLIP requires at least 2 minor releases (compared to 1
> > minor release for PublicEvolving and 1 patch release for Experi

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-14 Thread Becket Qin
Thanks for the explanation, Matthias.

In the example you raised, would it be better to just keep both YARN and
K8S support in the new major version, but with YARN support deprecated if
we want to? We can say for YARN we will only provide bug fixes but no
feature development anymore. Given that these two features are probably in two
independent modules, keeping both modules in the same new major version
likely has zero additional cost compared with maintaining them in two
different major versions respectively. This way we don't have the
non-linear version issue, have fewer releases, and save a bunch of
maintenance effort for multiple development branches.

Regarding the stability demotion, I see your point. However, I still feel
it is a little weird that we demote a Public API to PublicEvolving just for
the purpose of code removal in minor versions. This also results in some
counter-intuitive issues. For example, assuming users only use Public APIs,
they may be able to upgrade from 1.19.0 to 2.0 fine, but upgrading from
1.19 to 2.1 does not work because the Public API is removed, even though
from the users' perspective, both of them are major version upgrades. So,
in this case, I would rather bump up the major version again to remove the
deprecated Public API. That seems simpler and does not complicate the
well-established versioning semantics and conventions.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jun 14, 2023 at 9:27 PM Matthias Pohl
 wrote:

> One (made-up) example from the top of my head would have been that the
> community decides to focus fully on Kubernetes without considering Yarn
> anymore because of some must-have feature on the Kubernetes side. At the
> same time there are still some users for whom it would be tricky to migrate
> from Yarn to Kubernetes. Therefore, there would be some desire to still
> maintain the older major version of Flink that supports Yarn.
>
> But now that I'm thinking about the example, I realize: A discussion on how
> we would handle the two major versions in that case, could be started when
> we actually run into this issue. It shouldn't be too hard to migrate to a
> non-linear versioning from what you are proposing if such a scenario comes
> up.
>
> And on the topic of downgrading from @Public to @PublicEvolving:
>
> Demoting a Public API to PublicEvolving API sounds hacky. From what I
> > understand the stability guarantee is not revocable because users will
> rely
> > on the stability guarantee to plan their usage of Flink. Demoting an API
> > essentially defeats the very purpose of stability guarantee to begin
> with.
> >
>
> I understand the stability guarantee in a way that it only applies within a
> major version. Downgrading the stability constraint for an API with a new
> major version still seems to comply with the definition of a @Public
> annotation as it's similar to changing the API in other ways. But I'm not
> insisting on that approach. I just found it a reasonable workaround.
>
> Thanks,
> Matthias
>
> On Wed, Jun 14, 2023 at 11:38 AM Becket Qin  wrote:
>
> > Hi Matthias,
> >
> > Thanks for the feedback.
> >
> > Do you have an example of behavioral change in mind? Not sure I fully
> > understand the concern for behavioral change here. From what I
> understand,
> > any user sensible change in an existing API, regardless of its kind (API
> > signature or behavioral change), can always be done in the following way:
> >
> > 1. Introduce a new API (new package, new class/interface, new method, new
> > config, new metric, etc) while marking the old one as deprecated.
> > 2. Let the new API and deprecated API coexist for the migration period to
> > allow planned migration from the users.
> > 3. Remove the deprecated API.
> >
> > For example, Kafka deprecated its old consumer and replaced it with a new
> > Consumer - basically everything changes. The source code of the old
> > consumer was kept there for a few years across multiple major versions.
> > This does mean we have to keep both of the APIs for a few releases, and
> > even fix bugs in the old consumer, so additional maintenance effort is
> > required. But this allows the users to keep up with Kafka releases which
> is
> > extremely rewarding.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Jun 14, 2023 at 5:06 PM Matthias Pohl
> >  wrote:
> >
> > > Thanks for starting this discussion, Becket. A few good points were
> > raised.
> > > Here's what I want to add:
> > >
> > > Stefan raised the point of behavioral stability (in contrast to API
> > > stability). That might be a reason for users to not be able to go ahead
> > > with a major version bump

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-14 Thread Becket Qin
Hi Matthias,

Thanks for the feedback.

Do you have an example of behavioral change in mind? Not sure I fully
understand the concern for behavioral change here. From what I understand,
any user-visible change in an existing API, regardless of its kind (API
signature or behavioral change), can always be done in the following way:

1. Introduce a new API (new package, new class/interface, new method, new
config, new metric, etc) while marking the old one as deprecated.
2. Let the new API and deprecated API coexist for the migration period to
allow planned migration from the users.
3. Remove the deprecated API.
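
As a rough illustration of steps 1 and 2 (all class and method names below are
made up, not an actual Flink API):

// The old interface is kept and marked deprecated while the new one is introduced;
// both coexist until the migration period has passed (step 3).
public final class SinkApiExample {

    /**
     * Old API.
     *
     * @deprecated Since 1.x. Use {@link RecordSink} instead. Will be removed after
     *     the migration period (step 3).
     */
    @Deprecated
    public interface LegacySink {
        void write(String record) throws Exception;
    }

    /** New API replacing {@link LegacySink}. */
    public interface RecordSink {
        void write(String record, long timestamp) throws Exception;
    }

    /** Step 2: an adapter can bridge old implementations onto the new API while both coexist. */
    public static RecordSink adapt(LegacySink legacy) {
        return (record, timestamp) -> legacy.write(record);
    }

    private SinkApiExample() {}
}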

For example, Kafka deprecated its old consumer and replaced it with a new
Consumer - basically everything changed. The source code of the old
consumer was kept there for a few years across multiple major versions.
This does mean we have to keep both of the APIs for a few releases, and
even fix bugs in the old consumer, so additional maintenance effort is
required. But this allows the users to keep up with Kafka releases, which is
extremely rewarding.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jun 14, 2023 at 5:06 PM Matthias Pohl
 wrote:

> Thanks for starting this discussion, Becket. A few good points were raised.
> Here's what I want to add:
>
> Stefan raised the point of behavioral stability (in contrast to API
> stability). That might be a reason for users to not be able to go ahead
> with a major version bump. Working around behavioral changes might be
> trickier than just switching from deprecated to newer APIs. I see your
> motivation of having a more linear versioning even between major versions
> to avoid backports. Backports are painful enough for minor versions.
>
> But with major versions becoming a thing in the Flink cosmos, I could
> imagine that the behavioral stability Stefan mentions actually could become
> a bigger issue: Major versions down the road might include bigger
> behavioral changes which would prevent users from going ahead with the
> major version bump. I understand that this is out of the original scope of
> this FLIP. But nevertheless, it does support Chesnay's concerns that a
> linear versioning without maintaining older major versions might not be
> feasible. It sounds like we should have a discussion about how we treat
> older major versions here (or have a dedicated discussion on that topic
> before going ahead with that FLIP).
>
> On another note: I like Xintong's proposal of downgrading an API
> from @Public to @PublicEvolving in the new major version. That would allow
> us to keep the original intention of the @Public annotation alive (i.e.
> that those APIs are only removed in the next major version).
>
> Matthias
>
> On Wed, Jun 14, 2023 at 10:10 AM Xintong Song 
> wrote:
>
> > Thanks for bringing up this discussion, Becket.
> >
> > My two cents:
> >
> > 1. Do we allow deprecation & removal of APIs without adding a new one as
> a
> > replacement? The examples in the table give me an impression that marking
> > an API as `@Deprecated` should only happen at the same time of
> introducing
> > a new replacing API, which I think is true in most but not all the cases.
> >
> > If there is a major version bump before 2 minor releases in the current
> > > major version are reached, the major version should keep the source
> code
> > in
> > > its own minor version until two minor versions are reached. For
> example,
> > in
> > > the above case, if Flink 2.0 is released after 1.20, then the
> deprecated
> > > source code of foo will be kept in 2.0 and all the 2.x versions. It can
> > > only be removed in 3.0.
> > >
> >
> > 2. I think this might be a bit too strict. For an API that we already
> > decided to remove, having to keep it for all the 2.x versions simply
> > because there's less than 2 minor releases between making the decision
> and
> > the major release bump seems not necessarily. Alternatively, I'd like to
> > propose to remove the `@Public` annotation (or downgrade it to
> > `@PublicEvolving`) in 2.0, and remove it in 2.2.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Jun 14, 2023 at 3:56 PM Becket Qin  wrote:
> >
> > > Hi Jing,
> > >
> > > Thanks for the feedback. Please see the answers to your questions
> below:
> > >
> > > *"Always add a "Since X.X.X" comment to indicate when was a class /
> > > > interface / method marked as deprecated."*
> > > >  Could you describe it with a code example? Do you mean Java
> comments?
> > >
> > > It is just a comment such as "Since 1.18. Use X
> > > &

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-14 Thread Becket Qin
Hi Xintong,

Thanks for the comment. Please see the replies below:

1. Do we allow deprecation & removal of APIs without adding a new one as a
> replacement? The examples in the table give me an impression that marking
> an API as `@Deprecated` should only happen at the same time of introducing
> a new replacing API, which I think is true in most but not all the cases.

Right, it is not necessary to have a replacement for the deprecated API, if
we decide to sunset the functionality. That does not change the migration
period, though.

2. I think this might be a bit too strict. For an API that we already
> decided to remove, having to keep it for all the 2.x versions simply
> because there's less than 2 minor releases between making the decision and
> the major release bump seems not necessarily. Alternatively, I'd like to
> propose to remove the `@Public` annotation (or downgrade it to
> `@PublicEvolving`) in 2.0, and remove it in 2.2.

I am not sure this is a good practice. The purpose of the migration period
is to give users enough time to adapt to a breaking API change without
holding them back from upgrading Flink. The reason we say Public API needs
at least two minor releases is because there are probably more users picked
them up over time and more jobs are running using these APIs. So, the
public APIs just require a larger migration window. Admittedly, this will
introduce a higher maintenance cost for us, this is why Public APIs should
be treated seriously. If the promotion of a PublicEvolving API to a Public
API requires two minor version releases, deprecation of a Public API should
only take longer.

Demoting a Public API to PublicEvolving API sounds hacky. From what I
understand, the stability guarantee is not revocable, because users will rely
on the stability guarantee to plan their usage of Flink. Demoting an API
essentially defeats the very purpose of the stability guarantee to begin with.

If the concern of keeping a migration period of two minor releases across
major versions is about the maintenance overhead, we can choose to bump up
the major version to 3.0 at some point after the migration period has
passed, assuming by then most of the users have migrated away from the
deprecated Public API.

Thanks,

Jiangjie (Becket) Qin


On Wed, Jun 14, 2023 at 4:10 PM Xintong Song  wrote:

> Thanks for bringing up this discussion, Becket.
>
> My two cents:
>
> 1. Do we allow deprecation & removal of APIs without adding a new one as a
> replacement? The examples in the table give me an impression that marking
> an API as `@Deprecated` should only happen at the same time of introducing
> a new replacing API, which I think is true in most but not all the cases.
>
> If there is a major version bump before 2 minor releases in the current
> > major version are reached, the major version should keep the source code
> in
> > its own minor version until two minor versions are reached. For example,
> in
> > the above case, if Flink 2.0 is released after 1.20, then the deprecated
> > source code of foo will be kept in 2.0 and all the 2.x versions. It can
> > only be removed in 3.0.
> >
>
> 2. I think this might be a bit too strict. For an API that we already
> decided to remove, having to keep it for all the 2.x versions simply
> because there's less than 2 minor releases between making the decision and
> the major release bump seems not necessarily. Alternatively, I'd like to
> propose to remove the `@Public` annotation (or downgrade it to
> `@PublicEvolving`) in 2.0, and remove it in 2.2.
>
> Best,
>
> Xintong
>
>
>
> On Wed, Jun 14, 2023 at 3:56 PM Becket Qin  wrote:
>
> > Hi Jing,
> >
> > Thanks for the feedback. Please see the answers to your questions below:
> >
> > *"Always add a "Since X.X.X" comment to indicate when was a class /
> > > interface / method marked as deprecated."*
> > >  Could you describe it with a code example? Do you mean Java comments?
> >
> > It is just a comment such as "Since 1.18. Use XXX instead." (for an
> > example, see the Javadoc of
> > https://kafka.apache.org/31/javadoc/org/apache/kafka/clients/admin/Admin.html#incrementalAlterConfigs(java.util.Map)
> > ). And we can then look it up in the deprecated list[1] in each
> > release and see which method should / can be deprecated.
> >
> > *"At least 1 patch release for the affected minor release for
> > > Experimental APIs"*
> > > The rule is absolutely right. However, afaiac, deprecation is different
> > as
> > > modification. As a user/dev, I would appreciate, if I do not need to do
> > any
> > > migration work for any deprecated API between patch releases upgrade.
> > BTW,
> > > if experimental 

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-14 Thread Becket Qin
Hi Jing,

Thanks for the feedback. Please see the answers to your questions below:

*"Always add a "Since X.X.X" comment to indicate when was a class /
> interface / method marked as deprecated."*
>  Could you describe it with a code example? Do you mean Java comments?

It is just a comment such as "Since 1.18. Use XXX
<https://kafka.apache.org/31/javadoc/org/apache/kafka/clients/admin/Admin.html#incrementalAlterConfigs(java.util.Map)>
instead.". And we can then look it up in the deprecated list[1] in each
release and see which method should / can be deprecated.
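
For illustration, such a note is usually carried in the Javadoc @deprecated
tag next to the @Deprecated annotation, so it also shows up in the generated
deprecated list. A minimal sketch, with made-up class and method names (not
Flink APIs):

// Sketch only: FooService, foo() and bar() are invented names for illustration.
public class FooService {

    /**
     * @deprecated Since 1.18. Use {@link #bar(String)} instead. Planned for
     *     removal after the migration period defined in FLIP-321.
     */
    @Deprecated
    public int foo(String arg) {
        // Delegate to the replacement so both paths behave the same during
        // the migration period.
        return bar(arg);
    }

    /** Replacement for {@link #foo(String)}. */
    public int bar(String arg) {
        return arg.length();
    }
}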

*"At least 1 patch release for the affected minor release for
> Experimental APIs"*
> The rule is absolutely right. However, afaiac, deprecation is different from
> modification. As a user/dev, I would appreciate it if I did not need to do any
> migration work for any deprecated API between patch release upgrades. BTW,
> if experimental APIs are allowed to change between patches, could we just
> change them instead of marking them as deprecated and creating new ones to
> replace them?

Deprecating an API is just a more elegant way of replacing an API with a
new one. The only difference between the two is whether the old API is kept
and coexists with the new API for some releases or not. For end users,
deprecation-then-remove is much more friendly than direct replacement.

1. How to make sure the new APIs cover all functionality, i.e. backward
> compatible, before removing the deprecated APIs? Since the
> functionalities could only be built with the new APIs iteratively, there
> will be a while (might be longer than the migration period) that the new
> APIs are not backward compatible with the deprecated ones.

This is orthogonal to the deprecation process, and may not even be required
in some cases, e.g. when the functionality itself changes. But in general,
this is up to the developer to decide. A simple readiness test is to see
whether all the UT / IT cases written with the old API can be migrated to
the new one and still work. If the new API is not ready, we probably should
not deprecate the old one to begin with.

2. Is it allowed to remove the deprecated APIs after the defined migration
> period expires while the new APis are still not backward compatible?

By "backwards compatible", do you mean functionally equivalent? If the new
APIs are designed to be not backwards compatible, then removing the
deprecated source code is definitely allowed. If we don't think the new API
is ready to take over the place for the old one, then we should wait. The
migration period is the minimum time we have to wait before removing the
source code. A longer migration period is OK.

3. For the case of core API upgrade with downstream implementations, e.g.
> connectors, What is the feasible deprecation strategy? Option1 bottom-up:
> make sure the downstream implementation is backward compatible before
> removing the deprecated core APIs. Option2 top-down: once the downstream
> implementation of new APIs works fine, we can remove the deprecated core
> APIs after the migration period expires. The implementation of the
> deprecated APIs will not get any further update in upcoming releases(it has
> been removed). There might be some missing features in the downstream
> implementation of new APIs compared to the old implementation. Both options
> have their own pros and cons.

The downstream projects, such as the connectors in Flink, should also follow
the migration path we tell our users. That is, if there is a cascading
backwards incompatible change in the connectors due to a backwards
incompatible change in the core, and as a result a longer migration period
is required, then I think we should postpone the removal of the source code.
But in general, we should be able to provide the same migration period in
the connectors as in Flink core, provided the connectors are upgraded to the
latest version of the core promptly.

Thanks,

Jiangjie (Becket) Qin

[1]
https://nightlies.apache.org/flink/flink-docs-master/api/java/deprecated-list.html


On Wed, Jun 14, 2023 at 1:15 AM Jing Ge  wrote:

> > This is by design. Most of these are @Public APIs that we had to carry
> > around until Flink 2.0, because that was the initial guarantee that we
> > gave people.
> >
>
> True, I knew @Public APIs could not be removed before the next major
> release. I meant house cleaning without violation of these annotations'
> design concept. i.e especially cleaning up for @PublicEvolving APIs since
> they are customer-facing. Regular cleaning up with all other @Experimental
> and @Internal APIs would be even better, if there might be some APIs marked
> as @deprecated.
>
> Best regards,
> Jing
>
>
>
> On Tue, Jun 13, 2023 at 4:25 PM Chesnay Schepler 
> wrote:
>
> > On 13/06/2023 12:50, Jing Ge wrote:
> > > One major is

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-13 Thread Becket Qin
Hi Chesnay and Stefan,

Thanks for the feedback and sorry for the confusion about Public API
deprecation. I just noticed that there was a mistake in the NOTES part for
Public API due to a copy-paste error... I just fixed it.

To clarify on the deprecation of "Public" APIs.

The FLIP does not propose to remove the code of "Public" APIs with a minor
version bump. The Public APIs still can only be removed with a major
version bump. What this FLIP enforces is the minimum migration period in
addition to this rule.

For example, if the next version after 1.19 is 2.0, then without the
migration period specified in this FLIP (that is, 2 minor versions), we
could technically remove a Public API deprecated in 1.19 right away, because
the next release bumps the major version. However, this leaves no time for
the users to migrate. With this FLIP, we have to keep the deprecated API for
at least 2 minor releases, i.e. in at least 2.0 and 2.1, and then also in
all the 2.x versions, because removing a Public API requires a major version
bump. Therefore, that deprecated Public API can only be removed in 3.0.

As another example, if after 1.19, we deprecate a Public API in 1.20, and
1.21, then we release 2.0. In this case, the deprecated Public API has been
kept for 2 minor releases (1.20 and 1.21), we can remove it in 2.0 if we
want to. We can also choose to keep it there for longer if we want to. But
if we decide to keep the deprecated API in 2.0, we can only remove it in
3.0.

The "carry recent Public APIs forward into the next major release" thing
> seems to presume a linear release history (aka, if 2.0 is released after
> 1.20, then there will be no 1.21), which I doubt will be the case. The
> idea behind it is good, but I'd say the right conclusion would be to not
> make that API public if we know a new major release hits in 3 months and
> is about to modify it. With a regular schedule for major releases this
> wouldn't be difficult to do.


I agree that if we know a brand new Public API (not a replacement for a
deprecated API) will be changed in 3 months, it would not make sense to
introduce it in the current major version in the first place. But that is a
different discussion from the deprecation process. The deprecation process
is mainly for APIs that are already there. In the other case, if we are
talking about a new API as a replacement for a deprecated API, we have to
introduce it in the current major version, so that users can migrate from
the deprecated API to the new replacement within the current major version;
that, in turn, allows them to move to the next major version, which only has
the new API, once it is released.

Regarding the linear vs. non-linear release process: if we can enforce the
migration period, I wonder if we still need to release 1.x after 2.x is
released. The idea is that if we can make the major version upgrade
experience smooth for the users, we don't need to release 1.x anymore after
2.x is released; instead we can focus on evolving 2.x. With a decent
migration period, users should be able to upgrade from 1.x to 2.x because
they have been given time to migrate away from all the deprecated APIs in
1.x before 2.0 is released: at least 2 minor releases for Public APIs, 1
minor release for PublicEvolving APIs and 1 patch release for Experimental
APIs.

It would be valuable if we could avoid releasing minor versions for previous
major versions. Releasing a minor version for an old major version usually
means we are backporting features to it, which likely introduces a
non-trivial maintenance burden. If users want new features, they should
upgrade to 2.x.

Thanks,

Jiangjie (Becket) Qin


On Tue, Jun 13, 2023 at 10:24 PM Chesnay Schepler 
wrote:

> On 13/06/2023 12:50, Jing Ge wrote:
> > One major issue we have, afaiu, is caused by the lack of
> housekeeping/house
> > cleaning, there are many APIs that were marked as deprecated a few years
> > ago and still don't get removed. Some APIs should be easy to remove and
> > others will need some more clear rules, like the issue discussed at [1].
>
> This is by design. Most of these are @Public APIs that we had to carry
> around until Flink 2.0, because that was the initial guarantee that we
> gave people.
>
>
> As for the FLIP, I like the idea of explicitly writing down a
> deprecation period for APIs, particularly PublicEvolving ones.
> For Experimental I don't think it'd be a problem if we could change them
> right away,
> but looking back a bit I don't think it hurts us to also enforce some
> deprecation period.
> 1 release for both of these sound fine to me.
>
>
> My major point of contention is the removal of Public APIs between minor
> versions.
> This to me would a major setback towards a simpler upgrade path for users.
> If these can be removed in minor versions than what ev

Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-13 Thread Becket Qin
Hi Stefan,

Thanks for the comments. I agree the API stability and compatibility
guarantees should be clear.
I think we need to answer the following four questions in order to complete
the public API stability and compatibility story:

*1. What is the scope of public interfaces?*
We actually already have a definition of public interfaces in the FLIP[1]
wiki: it is basically anything that is visible to the users, including
packages, classes, method signatures and behavior (blocking vs.
non-blocking, for example), metrics, configurations, CLI tools and
arguments, and so on.

*2. What API changes can happen across different versions?*
FLIP-196 answers this question by defining in which versions
programming APIs (methods and classes annotated with "Public",
"PublicEvolving" or "Experimental") can have breaking changes. This allows
us to get rid of deprecated APIs in patch / minor / major versions, while
providing a clear expectation for our users.

*3. When can breaking changes happen?*
This FLIP tries to answer this question, i.e. when breaking changes can take
effect and how users can adapt to them. We should avoid sudden API changes
between two versions, and always leave time for the users to plan a
migration away from breaking changes. This basically means we will always
take two steps when making breaking changes:
1. Introduce the new API (if needed), and mark the old API as deprecated.
2. After some time, which is the migration period defined in this FLIP,
remove the old API.

In this FLIP, we would like to define the minimum lengths of these migration
periods. It is a trade-off between the maintenance cost of keeping
deprecated APIs around and the time users have to migrate away from them.

*4. How do users upgrade to Flink versions with breaking changes?*
With the answers to the above three questions, the user upgrade path should
be simple and clear:
1. upgrade to a Flink version that contains both the deprecated and new API.
2. have a planned migration to move to the new API.
3. upgrade to later Flink versions in which the code of the deprecated API
is removed.

So, it looks like our story for API stability and compatibility would be
complete with this FLIP.

Thanks,

Jiangjie (Becket) Qin


On Tue, Jun 13, 2023 at 12:30 AM Stefan Richter
 wrote:

> Hi,
>
> Thanks a lot for bringing up this topic and for the initial proposal. As
> more and more people are looking into running Flink as a continuous service
> this discussion is becoming very relevant.
>
> What I would like to see is a clearer definition for what we understand by
> stability and compatibility. Our current policy only talks about being able
> to “compile” and “run” with a different version. As far as I can see, there
> is no guarantee about the stability of observable behavior. I believe it’s
> important for the community to include this important aspect in the
> guarantees that we give as our policy.
>
> For all changes that we do to the stable parts of the API, we should also
> consider how easy or difficult different types of changes would be for
> running Flink as a service with continuous delivery. For example,
> introducing a new interface to evolve the methods would make it easier to
> write adapter code than changing method signatures in-place on the existing
> interface. Those concerns should be considered in our process for evolving
> interfaces.
>
> Best,
> Stefan
>
>
>
> Stefan Richter
> Principal Engineer II
>
>
>
> > On 11. Jun 2023, at 14:30, Becket Qin  wrote:
> >
> > Hi folks,
> >
> > As one of the release 2.0 efforts, the release managers were discussing
> our
> > API lifecycle policies. There have been FLIP-196[1] and FLIP-197[2] that
> > are relevant to this topic. These two FLIPs defined the stability
> guarantee
> > of the programming APIs with various different stability annotations, and
> > the promotion process. A recap of the conclusion is following:
> >
> > Stability:
> > @Internal API: can change between major/minor/patch releases.
> > @Experimental API: can change between major/minor/patch releases.
> > @PublicEvolving API: can change between major/minor releases.
> > @Public API: can only change between major releases.
> >
> > Promotion:
> > An @Experimental API should be promoted to @PublicEvolving after two
> > releases, and a @PublicEvolving API should be promoted to @Public API
> after
> > two releases, unless there is a documented reaso

[DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-11 Thread Becket Qin
Hi folks,

As one of the release 2.0 efforts, the release managers were discussing our
API lifecycle policies. There have been FLIP-196[1] and FLIP-197[2] that
are relevant to this topic. These two FLIPs defined the stability guarantees
of the programming APIs with the various stability annotations, and
the promotion process. A recap of the conclusions follows:

Stability:
@Internal API: can change between major/minor/patch releases.
@Experimental API: can change between major/minor/patch releases.
@PublicEvolving API: can change between major/minor releases.
@Public API: can only change between major releases.

Promotion:
An @Experimental API should be promoted to @PublicEvolving after two
releases, and a @PublicEvolving API should be promoted to @Public API after
two releases, unless there is a documented reason not to do so.
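
For readers less familiar with these markers, here is a small sketch of how
they appear in code. The interface names are invented; the annotations
themselves come from the org.apache.flink.annotation package (assuming
flink-annotations is on the classpath):

import org.apache.flink.annotation.Experimental;
import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;

@Public            // breaking changes only across major releases
interface StableExample {}

@PublicEvolving    // may change between minor releases
interface EvolvingExample {}

@Experimental      // may change in any release, including patch releases
interface ExperimentalExample {}

@Internal          // implementation detail, no stability guarantee at all
interface InternalExample {}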

One thing not mentioned in these two FLIPs is the API deprecation process,
which is in fact critical and fundamental to how the stability guarantee is
provided in practice, because stability is ultimately about when existing
APIs can be removed or changed. For example, if we want to change a method
"ResultFoo foo(ArgumentFoo arg)" to "ResultBar bar(ArgumentBar arg)", there
are two ways to do this:

1. Mark method "foo" as deprecated and add the new method "bar". At some
point later, remove the method "foo".
2. Simply change the API in place, which basically means removing method foo
and adding method bar at the same time.

In the first option, users are given a period, with the stability guarantee
intact, to migrate from "foo" to "bar". With the second option, this
migration period is effectively 0. A zero migration period is problematic
because end users may need a feature / bug fix from a new version, but
cannot upgrade right away due to some backwards incompatible changes, even
though those changes perfectly comply with the API stability guarantees
defined above. So the migration period is critical to the API stability
guarantees for the end users.
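
To make option 1 concrete, here is a minimal sketch of the coexistence
window it creates (all type names are the hypothetical ones from the example
above, not real Flink APIs):

// Placeholder types for the hypothetical example; not real Flink classes.
class ArgumentFoo {}
class ResultFoo {}
class ArgumentBar {}
class ResultBar {}

interface ExampleService {

    /**
     * @deprecated Since 1.x. Use {@link #bar(ArgumentBar)} instead. Kept
     *     around for the migration period and removed only afterwards
     *     (option 1).
     */
    @Deprecated
    ResultFoo foo(ArgumentFoo arg);

    /** Replacement; coexists with {@link #foo(ArgumentFoo)} until removal. */
    ResultBar bar(ArgumentBar arg);
}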

The migration period is essentially how long a deprecated API must be kept
before it can be removed from the source code. So with this FLIP, I'd like
to kick off the discussion about our deprecation process.

https://cwiki.apache.org/confluence/display/FLINK/FLIP-321%3A+Introduce+an+API+deprecation+process

Comments are welcome!

Thanks,

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-196%3A+Source+API+stability+guarantees
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process


Re: [VOTE] FLIP-312: Add Yarn ACLs to Flink Containers

2023-06-05 Thread Becket Qin
+1 (binding)

Thanks for driving the FLIP, Archit.

Cheers,

Jiangjie (Becket) Qin

On Tue, Jun 6, 2023 at 4:33 AM Venkatakrishnan Sowrirajan 
wrote:

> Thanks for starting the vote on this one, Archit.
>
> +1 (non-binding)
>
> Regards
> Venkata krishnan
>
>
> On Mon, Jun 5, 2023 at 9:55 AM Archit Goyal 
> wrote:
>
> > Hi everyone,
> >
> > Thanks for all the feedback for FLIP-312: Add Yarn ACLs to Flink
> > Containers<
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*312*3A*Add*Yarn*ACLs*to*Flink*Containers__;KyUrKysrKys!!IKRxdwAv5BmarQ!aWkLc7eHAWyHz5kwEq8kKzEAgbtKtlMmi9ifOy_1GNbiO93taxiMcwdHfENc4inLU_cxZIKPDMwBP97Z4oibXUIM$
> > >.
> > Following is the discussion thread : Link<
> >
> https://urldefense.com/v3/__https://lists.apache.org/thread/xj3ytkwj9lsl3hpjdb4n8pmy7lk3l8tv__;!!IKRxdwAv5BmarQ!aWkLc7eHAWyHz5kwEq8kKzEAgbtKtlMmi9ifOy_1GNbiO93taxiMcwdHfENc4inLU_cxZIKPDMwBP97Z4u3tNMqI$
> > >
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours (until June 9th, 12:00AM GMT) unless there is an objection or an
> > insufficient number of votes.
> >
> > Thanks,
> > Archit Goyal
> >
>


Re: [DISCUSS] FLIP-312: Add Yarn ACLs to Flink Containers

2023-05-12 Thread Becket Qin
Thanks for the FLIP, Archit.

The motivation sounds reasonable and it looks like a straightforward
proposal. +1 from me.

Thanks,

Jiangjie (Becket) Qin

On Fri, May 12, 2023 at 1:30 AM Archit Goyal 
wrote:

> Hi all,
>
> I am opening this thread to discuss the proposal to support Yarn ACLs to
> Flink containers which has been documented in FLIP-312 <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+312%3A+Add+Yarn+ACLs+to+Flink+Containers
> >.
>
> This FLIP mentions about providing Yarn application ACL mechanism on Flink
> containers to be able to provide specific rights to users other than the
> one running the Flink application job. This will restrict other users in
> two ways:
>
>   *   view logs through the Resource Manager job history
>   *   kill the application
>
> Please feel free to reply to this email thread and share your opinions.
>
> Thanks,
> Archit Goyal
>
>


Re: [DISCUSS] Planning Flink 2.0

2023-04-25 Thread Becket Qin
Hi Xintong and Jark,

Thanks for starting the discussion about Flink 2.0. This is indeed
something that people talk about all the time but without material actions
taken. It is good timing to kick off this effort, so we can bring Flink to
the next stage and move faster.

I'd also volunteer to be a release manager of the Flink 2.0 release.

Cheers,

Jiangjie (Becket) Qin

On Tue, Apr 25, 2023 at 7:53 PM Leonard Xu  wrote:

> Thanks Xintong and Jark for kicking off the great discussion!
>
> The time goes so fast, it is already the 10th anniversary of Flink as an
> Apache project. Although I haven't gone through the proposal carefully, +1
> for the perfect release time and the release managers candidates.
>
> Best,
> Leonard
>
> > Furthermore, next year is the 10th year for Flink as an Apache project.
> > Flink joined the Apache incubator in April 2014, and became a top-level
> > project in December 2014. That makes 2024 a perfect time for bringing out
> > the release 2.0 milestone.
>
>


Re: [DISCUSS] SourceCoordinator and ExternallyInducedSourceReader do not work well together

2023-02-27 Thread Becket Qin
Hi Ming,

I am not sure if I fully understand what you want. It seems what you are
looking for is to have a checkpoint triggered at a custom time which aligns
with some data semantics. This is not what the current checkpoint in Flink
was designed for. I think the basic idea of a checkpoint is to just take a
snapshot of the current state, so we can restore to that state in case of
failure. This is completely orthogonal to the data semantics.

Even with the ExternallyInducedSourceReader, the checkpoint is still
triggered by the JM. It is just that the effective checkpoint barrier
message (a custom message in this case) is not sent by the JM, but by the
external source storage. This helps when the external source storage needs
its own internal state to be aligned with the state of the Flink
SourceReader. For example, if the external source storage can only seek at
some bulk boundary, then it might wait until the current bulk finishes
before it sends the custom checkpoint barrier to the SourceReader.
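
As a rough illustration of that "wait for the bulk boundary" behavior, here
is a simplified stand-in, not the actual ExternallyInducedSourceReader
interface; the names and shape are assumptions for illustration only:

import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for a reader whose checkpoints are induced by the
// external storage: the JM request is remembered, but only acted upon once
// the storage reaches a boundary it can seek back to.
class BulkBoundaryReader {
    private final AtomicReference<Long> pendingCheckpointId = new AtomicReference<>();

    /** Called when the JM-triggered checkpoint request reaches this reader. */
    void onCheckpointRequested(long checkpointId) {
        pendingCheckpointId.set(checkpointId);
    }

    /** Polled by the runtime; "induces" the checkpoint only once the current bulk is done. */
    Optional<Long> maybeInduceCheckpoint(boolean currentBulkFinished) {
        if (!currentBulkFinished) {
            return Optional.empty();
        }
        return Optional.ofNullable(pendingCheckpointId.getAndSet(null));
    }
}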

 Considering this scenario, if the data we want has not been produced yet,
> but the *SourceCoordinator* receives the c*heckpoint* message, it will
> directly make a *checkpoint*, and the *ExternallyInducedSource* will not
> make a *checkpoint* immediately after receiving the *checkpoint*, but
> continues to wait for a new split. Even if a new split is generated, due to
> the behavior of closing *gateway* in *FLINK-28606*, the new split cannot be
> assigned to the *Source*, resulting in a deadlock (or forced to wait for
> checkpoint to time out).


In this case, the source reader should not "wait" for the splits that are
not included in this checkpoint. These splits should be a part of the next
checkpoint. It would be the Sink's responsibility to ensure the output is
committed in a way that aligns with the user's semantics.

That said, I agree it might be useful in some cases if users could decide
the checkpoint triggering time. But that would be a new feature which needs
some careful design.

Thanks,

Jiangjie (Becket) Qin


On Mon, Feb 27, 2023 at 8:35 PM ming li  wrote:

> Hi, dev,
>
> We recently used *SourceCoordinator* and *ExternallyInducedSource* to work
> together on some file type connectors to fulfill some requirements, but we
> found that these two interfaces do not work well together.
>
> *SourceCoordinator* (FLINK-15101) and *ExternallyInducedSource*
> (FLINK-20270) were introduced in Flip27. *SourceCoordinator* is responsible
> for running *SplitEnumerator* and coordinating the allocation of *Split*.
> *ExternallyInducedSource* allows us to delay making a c*heckpoint* in
> Source or make a c*heckpoint* at specified data. This works fine with
> connectors like *Kafka*.
>
> But in some connectors (such as hive connector), the split is completely
> allocated by the *SourceCoordinator*, and after the consumption is
> completed, it needs to wait for the allocation of the next batch of splits
> (it is not like kafka that continuously consumes the same split). In
> FLINK-28606, we introduced another mechanism: the *OperatorCoordinator* is
> not allowed to send *OperatorEvents* to the *Operator* before the
> *Operator's* checkpoint is completed.
>
> Considering this scenario, if the data we want has not been produced yet,
> but the *SourceCoordinator* receives the c*heckpoint* message, it will
> directly make a *checkpoint*, and the *ExternallyInducedSource* will not
> make a *checkpoint* immediately after receiving the *checkpoint*, but
> continues to wait for a new split. Even if a new split is generated, due to
> the behavior of closing *gateway* in *FLINK-28606*, the new split cannot be
> assigned to the *Source*, resulting in a deadlock (or forced to wait for
> checkpoint to time out).
>
> So should we also add a mechanism similar to *ExternallyInducedSource* in
> *OperatorCoordinator*: only make a checkpoint on *OperatorCoordinator* when
> *OperatorCoordinator* is ready, which allows us to delay making checkpoint?
>
> [1] https://issues.apache.org/jira/browse/FLINK-15101
> [2] https://issues.apache.org/jira/browse/FLINK-20270
> [3] https://issues.apache.org/jira/browse/FLINK-28606
>


Re: [DISCUSS] Enabling dynamic partition discovery by default in Kafka source

2023-01-14 Thread Becket Qin
Thanks for the proposal, Qingsheng.

+1 to enable auto partition discovery by default. Just a reminder, we need
a FLIP for this.

A bit more background on this.

Most Kafka users simply subscribe to a topic and let the consumer
automatically adapt to partition changes. So enabling auto partition
discovery would align with that experience. The counter argument last time,
when I proposed to enable auto partition discovery, was mainly the concern
from Flink users. There were arguments that sometimes users don't want
partition changes to be picked up automatically, but want to do this by
restarting the job manually, so they can avoid unnoticed changes in the
jobs.

Given that in the old Flink source, by default the auto partition discovery
was disabled, and there are use cases from both sides, we simply kept the
behavior unchanged. From the discussion we have here, it looks like
enabling auto partition discovery is much preferred. So I think we should
do it.

I am not worried about the performance. The new Kafka source will only have
the SplitEnumerator sending metadata requests when the feature is enabled.
It is actually much cheaper than the old Kafka source where every
subtask does that.
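
For reference, until the default changes, users opt in per job by setting a
non-negative discovery interval on the source builder; a sketch (the broker,
topic and the 5-minute interval are placeholder choices):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;

public class PartitionDiscoveryExample {
    public static void main(String[] args) {
        KafkaSource<String> source =
                KafkaSource.<String>builder()
                        .setBootstrapServers("broker:9092")
                        .setTopics("input-topic")
                        .setGroupId("example-group")
                        .setValueOnlyDeserializer(new SimpleStringSchema())
                        // a non-negative interval enables periodic partition discovery
                        .setProperty("partition.discovery.interval.ms", "300000")
                        .build();
    }
}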

Thanks,

Jiangjie (Becket) Qin



On Sat, Jan 14, 2023 at 11:46 AM Yun Tang  wrote:

> +1 for this proposal and thanks Qingsheng for driving this.
>
> Considering the interval, we also set the value as 5min, equivalent to the
> default value of metadata.max.age.ms.
>
>
> Best
> Yun Tang
> 
> From: Benchao Li 
> Sent: Friday, January 13, 2023 23:06
> To: dev@flink.apache.org 
> Subject: Re: [DISCUSS] Enabling dynamic partition discovery by default in
> Kafka source
>
> +1, we've enabled this by default (10mins) in our production for years.
>
> Jing Ge  wrote on Friday, January 13, 2023 at 22:22:
>
> > +1 for the proposal that makes users' daily work easier and therefore
> makes
> > Flink more attractive.
> >
> > Best regards,
> > Jing
> >
> >
> > On Fri, Jan 13, 2023 at 11:27 AM Qingsheng Ren  wrote:
> >
> > > Thanks everyone for joining the discussion!
> > >
> > > @Martijn:
> > >
> > > > All newly discovered partitions will be consumed from the earliest
> > offset
> > > possible.
> > >
> > > Thanks for the reminder! I checked the logic of KafkaSource and found
> > that
> > > new partitions will start from the offset initializer specified by the
> > user
> > > instead of the earliest. We need to correct this behavior to avoid
> > dropping
> > > messages from new partitions.
> > >
> > > > Job restarts from checkpoint
> > >
> > > I think the current logic guarantees the exactly-once semantic. New
> > > partitions created after the checkpoint will be re-discovered again and
> > > picked up by the source.
> > >
> > > @John:
> > >
> > > > If you want to be a little conservative with the default, 5 minutes
> > might
> > > be better than 30 seconds.
> > >
> > > Thanks for the suggestion! I tried to find the equivalent config in
> Kafka
> > > but missed it. It would be neat to align with the default value of "
> > > metadata.max.age.ms".
> > >
> > > @Gabor:
> > >
> > > > removed partition handling is not yet added
> > >
> > > There was a detailed discussion about removing partitions [1] but it
> > looks
> > > like this is not an easy task considering the potential data loss and
> > state
> > > inconsistency. I'm afraid there's no clear plan on this one and maybe
> we
> > > could trigger a new discussion thread about how to correctly handle
> > removed
> > > partitions.
> > >
> > > [1] https://lists.apache.org/thread/7r4h7v5k281w9cnbfw9lb8tp56r30lwt
> > >
> > > Best regards,
> > > Qingsheng
> > >
> > >
> > > On Fri, Jan 13, 2023 at 4:33 PM Gabor Somogyi <
> gabor.g.somo...@gmail.com
> > >
> > > wrote:
> > >
> > > > +1 on the overall direction, it's an important feature.
> > > >
> > > > I've had a look on the latest master and looks like removed partition
> > > > handling is not yet added but I think this is essential.
> > > >
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/28c3e1a3923ba560b559a216985c1abeb794ebaa/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/enumerator/KafkaSourceEnumerator.java#L305
> > > >
>

Re: [DISCUSS] FLIP-286: Fix the API stability/scope annotation inconsistency in AbstractStreamOperator

2023-01-13 Thread Becket Qin
I don't have an overview of all the holes in our public API surface at the
moment. It would be great if there were some tool to do a scan.

In addition to fixing the annotation consistency formally, I think it is
equally, if not more, important to see whether the public APIs we expose
tell a good story. For example, if we say StreamConfig should be internal,
some fair questions to ask are: why does our own AbstractStreamOperator need
it? Why doesn't a user-defined operator need it? Is there something in the
StreamConfig we should expose as a public interface, if not the entire class?

Thanks,

Jiangjie (Becket) Qin

On Sat, Jan 14, 2023 at 5:36 AM Konstantin Knauf  wrote:

> Hi Becket,
>
> > It is a basic rule of public API that anything exposed by a public
> interface should also be public.
>
> I agree with this in general. Did you get an overview of where we currently
> violate this? Is this something that the Arc42 architecture tests could
> test for so that as a first measure we don't introduce more occurrences
> (cc @Ingo).
>
> Maybe its justified to make a pass over all of these occurrences and
> resolve these occurrences one way or another either making the
> members/parameters @PublicEvoling or actually making a class/method
> @Internal even if its was @PublicEvoling before. I think, this could be the
> better option than having @PublicEvolving classes/methods that really
> aren't.
>
> Cheers,
>
> Konstantin
>
> Am Fr., 13. Jan. 2023 um 17:02 Uhr schrieb Becket Qin <
> becket@gmail.com
> >:
>
> > Hi Dawid,
> >
> > Thanks for the reply. I am currently looking at the Beam Flink runner,
> and
> > there are quite some hacks the Beam runner has to do in order to deal
> with
> > the backwards incompatible changes in AbstractStreamOperator and some of
> > the classes exposed by it. Regardless of what we think, the fact is that
> > AbstractStreamOperator is marked as PublicEvolving today, and our users
> use
> > it. It is a basic rule of public API that anything exposed by a public
> > interface should also be public. This is the direct motivation of this
> > FLIP.
> >
> > Regarding the StreamTask / StreamConfig exposure, if you look at the
> > StreamOperatorFactory which is also a PublicEvolving class, it actually
> > exposes the StreamTask, StreamConfig as well as some other classes in the
> > StreamOperatorParameters. So these classes are already exposed in
> multiple
> > public APIs.
> >
> > Keeping our public API stability guarantee is really fundamental and
> > critical to the users. With the current status of inconsistent API
> > stability annotations, I don't see how can we assure of that. From what I
> > can see, accidental backwards incompatible changes is likely going to
> keep
> > happening. So I'd say let's see how to fix forward instead of doing
> > nothing.
> >
> > BTW, Initially I thought this is just an accidental mismatch, but after
> > further exam, it looks that it is a bigger issue. I guess one of the
> > reasons we end up in this situation is that we haven't really thought it
> > through regarding the boundary between framework and user space, i.e.
> what
> > framework primitives we want to expose to the users. So instead we just
> > expose a bunch of internal things and hope users only use the "stable"
> part
> > of them.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Fri, Jan 13, 2023 at 7:28 PM Dawid Wysakowicz  >
> > wrote:
> >
> > > Hi Becket,
> > >
> > > May I ask what is the motivation for the change?
> > >
> > > I'm really skeptical about making any of those classes `Public` or
> > > `PublicEvolving`. As far as I am concerned there is no agreement in the
> > > community if StreamOperator is part of the `Public(Evolving)` API. At
> > > least I think it should not. I understand `AbstractStreamOperator` was
> > > marked with `PublicEvolving`, but I am really not convinced it was the
> > > right decision.
> > >
> > > The listed classes are not the only classes exposed to
> > > `AbstractStreamOperator` that are `Internal` that break the consistency
> > > and I am sure there is no question those should remain `Internal` such
> > > as e.g. StreamTask, StreamConfig, StreamingRuntimeContext and many
> more.
> > >
> > > As it stands I am strongly against giving any additional guarantees on
> > > API stability to the classes there unless there is a good motivation
> for
> > > a given feature and we're sure this is the best way to go forward.
>

Re: [DISCUSS] FLIP-286: Fix the API stability/scope annotation inconsistency in AbstractStreamOperator

2023-01-13 Thread Becket Qin
Hi Dawid,

Thanks for the reply. I am currently looking at the Beam Flink runner, and
there are quite some hacks the Beam runner has to do in order to deal with
the backwards incompatible changes in AbstractStreamOperator and some of
the classes exposed by it. Regardless of what we think, the fact is that
AbstractStreamOperator is marked as PublicEvolving today, and our users use
it. It is a basic rule of public API that anything exposed by a public
interface should also be public. This is the direct motivation of this FLIP.

Regarding the StreamTask / StreamConfig exposure, if you look at the
StreamOperatorFactory which is also a PublicEvolving class, it actually
exposes the StreamTask, StreamConfig as well as some other classes in the
StreamOperatorParameters. So these classes are already exposed in multiple
public APIs.

Keeping our public API stability guarantee is really fundamental and
critical to the users. With the current status of inconsistent API
stability annotations, I don't see how we can assure that. From what I
can see, accidental backwards incompatible changes are likely going to keep
happening. So I'd say let's see how to fix forward instead of doing nothing.

BTW, initially I thought this was just an accidental mismatch, but after
further examination it looks like a bigger issue. I guess one of the
reasons we ended up in this situation is that we haven't really thought
through the boundary between framework and user space, i.e. what framework
primitives we want to expose to the users. So instead we just expose a bunch
of internal things and hope users only use the "stable" part of them.

Thanks,

Jiangjie (Becket) Qin


On Fri, Jan 13, 2023 at 7:28 PM Dawid Wysakowicz 
wrote:

> Hi Becket,
>
> May I ask what is the motivation for the change?
>
> I'm really skeptical about making any of those classes `Public` or
> `PublicEvolving`. As far as I am concerned there is no agreement in the
> community if StreamOperator is part of the `Public(Evolving)` API. At
> least I think it should not. I understand `AbstractStreamOperator` was
> marked with `PublicEvolving`, but I am really not convinced it was the
> right decision.
>
> The listed classes are not the only classes exposed to
> `AbstractStreamOperator` that are `Internal` that break the consistency
> and I am sure there is no question those should remain `Internal` such
> as e.g. StreamTask, StreamConfig, StreamingRuntimeContext and many more.
>
> As it stands I am strongly against giving any additional guarantees on
> API stability to the classes there unless there is a good motivation for
> a given feature and we're sure this is the best way to go forward.
>
> Thus I'm inclined to go with -1 on any proposal on changing annotations
> for the sake of matching the one on `AbstractStreamOperator`. I might be
> convinced to support requests to give better guarantees for well
> motivated features that are now internal though.
>
> Best,
>
> Dawid
>
> On 12/01/2023 10:20, Becket Qin wrote:
> > Hi flink devs,
> >
> > I'd like to start a discussion thread for FLIP-286[1].
> >
> > As a recap, currently while AbstractStreamOperator is a class marked as
> > @PublicEvolving, some classes exposed via its methods / fields are
> > marked as @Internal. This FLIP wants to fix this inconsistency of
> > stability / scope annotation.
> >
> > Comments are welcome!
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240880841
> >
>


[DISCUSS] FLIP-286: Fix the API stability/scope annotation inconsistency in AbstractStreamOperator

2023-01-12 Thread Becket Qin
Hi flink devs,

I'd like to start a discussion thread for FLIP-286[1].

As a recap: currently, while AbstractStreamOperator is a class marked as
@PublicEvolving, some classes exposed via its methods / fields are
marked as @Internal. This FLIP aims to fix this inconsistency in
stability / scope annotations.
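
To make the mismatch concrete, here is a contrived sketch of the pattern.
The names are invented and compressed into one file; in Flink the
corresponding classes (AbstractStreamOperator and the internal classes it
exposes) are separate:

import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.PublicEvolving;

public class AnnotationMismatchExample {

    /** Stand-in for an internal runtime class with no stability guarantee. */
    @Internal
    public static class RuntimeInternals {}

    /** Stand-in for a user-facing base class that leaks the internal type. */
    @PublicEvolving
    public abstract static class AbstractExampleOperator {
        // Anything users reach through a @PublicEvolving surface effectively
        // becomes public API, which is the inconsistency the FLIP targets.
        protected RuntimeInternals runtimeInternals() {
            return new RuntimeInternals();
        }
    }
}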

Comments are welcome!

Thanks,

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240880841


Re: [ANNOUNCE] Apache Flink 1.16.0 released

2022-10-30 Thread Becket Qin
Hooray!! Congratulations to the team!

Cheers,

Jiangjie (Becket) Qin

On Mon, Oct 31, 2022 at 9:57 AM Hang Ruan  wrote:

> Congratulations!
>
> Best,
> Hang
>
> Shengkai Fang  wrote on Monday, October 31, 2022 at 09:40:
>
> > Congratulations!
> >
> > Best,
> > Shengkai
> >
> > Hangxiang Yu  wrote on Monday, October 31, 2022 at 09:38:
> >
> > > Congratulations!
> > > Thanks Chesnay, Martijn, Godfrey & Xingbo for managing the release.
> > >
> > > On Fri, Oct 28, 2022 at 7:35 PM Jing Ge  wrote:
> > >
> > > > Congrats!
> > > >
> > > > On Fri, Oct 28, 2022 at 1:22 PM 任庆盛  wrote:
> > > >
> > > >> Congratulations and a big thanks to Chesnay, Martijn, Godfrey and
> > Xingbo
> > > >> for the awesome work for 1.16!
> > > >>
> > > >> Best regards,
> > > >> Qingsheng Ren
> > > >>
> > > >> > On Oct 28, 2022, at 14:46, Xingbo Huang  wrote:
> > > >> >
> > > >> > The Apache Flink community is very happy to announce the release
> of
> > > >> Apache
> > > >> > Flink 1.16.0, which is the first release for the Apache Flink 1.16
> > > >> series.
> > > >> >
> > > >> > Apache Flink® is an open-source stream processing framework for
> > > >> > distributed, high-performing, always-available, and accurate data
> > > >> streaming
> > > >> > applications.
> > > >> >
> > > >> > The release is available for download at:
> > > >> > https://flink.apache.org/downloads.html
> > > >> >
> > > >> > Please check out the release blog post for an overview of the
> > > >> > improvements for this release:
> > > >> > https://flink.apache.org/news/2022/10/28/1.16-announcement.html
> > > >> >
> > > >> > The full release notes are available in Jira:
> > > >> >
> > > >>
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351275
> > > >> >
> > > >> > We would like to thank all contributors of the Apache Flink
> > community
> > > >> > who made this release possible!
> > > >> >
> > > >> > Regards,
> > > >> > Chesnay, Martijn, Godfrey & Xingbo
> > > >>
> > > >
> > >
> > > --
> > > Best,
> > > Hangxiang.
> > >
> >
>


Re: [VOTE] Dedicated AWS externalized connector repo

2022-10-30 Thread Becket Qin
+1 (binding)

Thanks,

Jiangjie (Becket) Qin

On Mon, Oct 31, 2022 at 10:17 AM Jark Wu  wrote:

> +1 (binding)
>
> Best,
> Jark
>
> > On October 29, 2022 at 03:11, Jing Ge  wrote:
> >
> > +1 (non-binding)
> >
> > Thanks!
> >
> > Best Regards,
> > Jing
> >
> > On Fri, Oct 28, 2022 at 5:29 PM Samrat Deb 
> wrote:
> >
> >> +1 (non binding)
> >>
> >> Thanks for driving Danny
> >>
> >> Bests
> >> Samrat
> >>
> >> On Fri, 28 Oct 2022 at 8:36 PM, Ahmed Hamdy 
> wrote:
> >>
> >>> +1 (non-binding)
> >>> Regards,
> >>> Ahmed
> >>>
> >>> On Thu, 27 Oct 2022 at 08:38, Teoh, Hong  >
> >>> wrote:
> >>>
> >>>> +1 (non-binding)
> >>>>
> >>>> Thanks for driving this, Danny!
> >>>>
> >>>> Hong
> >>>>
> >>>> On 26/10/2022, 08:14, "Martijn Visser" 
> >>> wrote:
> >>>>
> >>>>CAUTION: This email originated from outside of the organization. Do
> >>>> not click links or open attachments unless you can confirm the sender
> >> and
> >>>> know the content is safe.
> >>>>
> >>>>
> >>>>
> >>>>+1 binding
> >>>>
> >>>>Thanks Danny!
> >>>>
> >>>>On Wed, Oct 26, 2022 at 8:48 AM Danny Cranmer <
> >>> dannycran...@apache.org
> >>>>>
> >>>>wrote:
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>> As discussed in the discussion thread [1], I propose to create a
> >>>> dedicated
> >>>>> repository for AWS connectors called flink-connector-aws. This
> >> will
> >>>> house
> >>>>> 3x connectors: Amazon Kinesis Data Streams, Amazon Kinesis Data
> >>>> Firehose
> >>>>> and Amazon DynamoDB and any future AWS connectors. We will also
> >>>> externalize
> >>>>> the AWS base module from the main Flink repository [2] and
> >> create a
> >>>> parent
> >>>>> pom for version management.
> >>>>>
> >>>>> All modules within this repository will share the same version,
> >> and
> >>>> be
> >>>>> released/evolved together. We will adhere to the common Flink
> >> rules
> >>>> [3] for
> >>>>> connector development.
> >>>>>
> >>>>> Motivation: grouping AWS connectors together will reduce the
> >> number
> >>>> of
> >>>>> connector releases, simplify development, dependency management
> >> and
> >>>>> versioning for users.
> >>>>>
> >>>>> Voting schema:
> >>>>> Consensus, committers have binding votes, open for at least 72
> >>> hours.
> >>>>>
> >>>>> [1]
> >>> https://lists.apache.org/thread/swp4bs8407gtsgn2gh0k3wx1m4o3kqqp
> >>>>> [2]
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-aws-base
> >>>>> [3]
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development
> >>>>>
> >>>>
> >>>>
> >>>
> >>
>
>


Re: [DISCUSS] Reverting sink metric name changes made in 1.15

2022-10-09 Thread Becket Qin
Thanks for raising the discussion, Qingsheng,

+1 on reverting the breaking changes.

In addition, we might want to release a 1.15.3 to fix this and update the
previous release docs with this known issue, so that users can upgrade to
1.15.3 when they hit it. It would also be good to add some backwards
compatibility tests on metrics to avoid unintended breaking changes like
this in the future.

Thanks,

Jiangjie (Becket) Qin

On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren  wrote:

> Hi devs and users,
>
> I’d like to start a discussion about reverting a breaking change about sink 
> metrics made in 1.15 by FLINK-26126
> [1] and FLINK-26492 [2].
>
> TL;DR
>
>
> All sink metrics with name “numXXXOut” defined in FLIP-33 are replaced by 
> “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are 
> public APIs, this is a breaking change to end users and not backward 
> compatible. Also unfortunately this breaking change was not discussed in the 
> mailing list before.
>
> Background
>
>
> As defined previously in FLIP-33 (the FLIP page has been changed so please 
> refer to the old version [3] ), metric “numRecordsOut” is used for reporting 
> the total number of output records since the sink started (number of records 
> written to the external system), and similarly for “numRecordsOutPerSecond”, 
> “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks 
> are following this naming and definition. However, these metrics are 
> ambiguous in the new Sink API as “numXXXOut” could be used by the output of 
> SinkWriterOperator for reporting number of Committables delivered to 
> SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and 
> FLINK-26492 changed names of these metrics with “numXXXSend”.
>
> Necessity of reverting this change
>
>
> - Metric names are actually public API, as end users need to configure metric 
> collecting and alerting system with metric names. Users have to reset all 
> configurations related to affected metrics.
>
> - This could also affect custom and external sinks not maintained by Flink, 
> which might have implemented with numXXXOut metrics.
>
> - The number of records sent to external system is way more important than 
> the number of Committables sent to SinkCommitterOperator, as the latter one 
> is just an internal implementation of sink. We could have a new metric name 
> for the latter one instead.
>
> - We could avoid splitting the project by version (like “plz use numXXXOut 
> before 1.15 and use numXXXSend after”) if we revert it ASAP, considering 1.16 
> is still not released for now.
>
>
> As a consequence, I’d like to hear from devs and users about your opinion on 
> changing these metrics back to “numXXXOut”.
>
> Looking forward to your reply!
>
> [1] https://issues.apache.org/jira/browse/FLINK-26126
> [2] https://issues.apache.org/jira/browse/FLINK-26492
> [3] FLIP-33, version 18:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>
> Best,
> Qingsheng
>


Re: Re: [VOTE] FLIP-252: Amazon DynamoDB Sink Connector

2022-07-24 Thread Becket Qin
+1

On Mon, Jul 25, 2022 at 9:22 AM Grant L (Grant) 
wrote:

> +1
>
> On 2022/07/21 18:27:52 Robert Metzger wrote:
> > +1
> >
> > On Wed, Jul 20, 2022 at 10:48 PM Konstantin Knauf 
> wrote:
> >
> > > +1. Thanks!
> > >
> > > Am Mi., 20. Juli 2022 um 16:48 Uhr schrieb Tzu-Li (Gordon) Tai <
> > > tzuli...@apache.org>:
> > >
> > > > +1
> > > >
> > > > On Wed, Jul 20, 2022 at 6:13 AM Danny Cranmer 
> > > > wrote:
> > > >
> > > > > Hi there,
> > > > >
> > > > > After the discussion in [1], I’d like to open a voting thread for
> > > > FLIP-252
> > > > > [2], which proposes the addition of an Amazon DynamoDB sink based
> on
> > > the
> > > > > Async Sink [3].
> > > > >
> > > > > The vote will be open until July 23rd earliest (72h), unless there
> are
> > > > any
> > > > > binding vetos.
> > > > >
> > > > > Cheers, Danny
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/ssmf2c86n3xyd5qqmcdft22sqn4qw8mw
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-252%3A+Amazon+DynamoDB+Sink+Connector
> > > > > [3]
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-171%3A+Async+Sink
> > > > >
> > > >
> > >
> > >
> > > --
> > > https://twitter.com/snntrable
> > > https://github.com/knaufk
> > >
> >


Re: [DISCUSS] FLIP-217 Support watermark alignment of source splits

2022-07-13 Thread Becket Qin
Thanks for the explanation, Sebastian. I understand your concern now.

1. About the major concern. Personally, I'd consider the coarse grained
watermark alignment a special case for backward compatibility. In the
future, if for whatever reason we want to pause a split and that is not
supported, it seems the only thing that makes sense is throwing an
exception, instead of pausing the entire source reader. Regarding this
FLIP, if the logic that determines which split should be paused is in the
SourceOperator, the SourceOperator actually knows the reason why it pauses
a split. It also knows whether more than one split is assigned to the
source reader. So it can just fall back to the coarse grained watermark
alignment, without affecting other reasons for pausing a split, right? And
in the future, if there are more purposes for pausing / resuming a split,
the SourceOperator still needs to understand each of the reasons in order
to resume the splits after all the pausing conditions are no longer met.

2. Naming wise, would "coarse.grained.watermark.alignment.enabled" address
your concern?

The only concern I have with Option A is that people may not be able to
benefit from split-level watermark alignment until all the sources they need
have it implemented. This seems to unnecessarily delay the adoption of a new
feature, which looks like a more substantive downside compared with the
"coarse.grained.wm.alignment.enabled" option.

BTW, the SourceOperator doesn't need to invoke the pauseOrResumeSplits()
method and catch the UnsupportedOperationException every time. A flag can be
set so it doesn't attempt to pause the splits after the first time it sees
the exception.
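
A rough sketch of that flag-based fallback, written against a simplified
reader interface rather than the real SourceOperator/SourceReader classes
(the config name and all types here are assumptions taken from this thread):

import java.util.Collection;

// Simplified stand-in for the part of the reader relevant here.
interface PausableSplitReader {
    void pauseOrResumeSplits(Collection<String> splitsToPause, Collection<String> splitsToResume);
}

class SplitAlignmentHelper {
    private final PausableSplitReader reader;
    private final boolean allowCoarseGrainedAlignment; // e.g. "allow.coarse.grained.watermark.alignment"
    private boolean splitPausingSupported = true;      // flipped once after the first failed attempt

    SplitAlignmentHelper(PausableSplitReader reader, boolean allowCoarseGrainedAlignment) {
        this.reader = reader;
        this.allowCoarseGrainedAlignment = allowCoarseGrainedAlignment;
    }

    void pauseForAlignment(Collection<String> splitsToPause, Collection<String> splitsToResume) {
        if (!splitPausingSupported) {
            pauseWholeReaderInstead();
            return;
        }
        try {
            reader.pauseOrResumeSplits(splitsToPause, splitsToResume);
        } catch (UnsupportedOperationException e) {
            if (!allowCoarseGrainedAlignment) {
                throw e; // surface the incapability instead of silently degrading
            }
            splitPausingSupported = false; // remember, so we don't retry on every alignment check
            pauseWholeReaderInstead();
        }
    }

    private void pauseWholeReaderInstead() {
        // Coarse grained fallback: stop emitting from the whole reader, i.e. the pre-1.16 behavior.
    }
}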


Thanks,

Jiangjie (Becket) Qin



On Wed, Jul 13, 2022 at 5:11 PM Sebastian Mattheis 
wrote:

> Hi Becket, Hi Thomas, Hi Piotrek,
>
> Thanks for the feedback. I would like to highlight some concerns:
>
>1. Major: A configuration parameter like "allow coarse grained
>alignment" defines a semantic that mixes two contexts conditionally as
>follows: "ignore incapability to pause splits in SourceReader/SplitReader"
>IF (conditional) we "allow coarse grained watermark alignment". At the same
>time we said that there is no way to check the capability of
>SourceReader/SplitReader to pause/resume other than observing a
>UnsupportedOperationException during runtime such that we cannot disable
>the trigger for watermark split alignment in the SourceOperator. Instead,
>we can only ignore the incapability of SourceReader/SplitReader during
>execution of a pause/resume attempt which, consequently, requires to check
>the "allow coarse grained alignment " parameter value (to implement the
>conditional semantic). However, during this execution we actually don't
>know whether the attempt was executed for the purpose of watermark
>alignment or for some other purpose such that the check actually depends on
>who triggered the pause/resume attempt and hides the exception potentially
>unexpectedly for some other use case. Of course, currently there is no
>other purpose and, hence, no other trigger than watermark alignment.
>However, this breaks, in my perspective, the idea of having
>pauseOrResumeSplits (re)usable for other use cases.
>2. Minor: I'm not aware of any configuration parameter in the format
>like `allow.*` as you suggested with
>`allow.coarse.grained.watermark.alignment`. Would that still be okay to do?
>
> As we have agreed to not have a "supportsPausableSplits" method because of
> potential inconsistencies between return value of this method and the
> actual implementation (and also the difficulty to have a meaningful return
> value where the support actually depends on SourceReader AND the assigned
> SplitReaders), I don't want to bring up the discussion about the
> "supportsPauseableSplits" method again. Instead, I see the following
> options:
>
> Option A: I would drop the idea of "allow coarse grained alignment"
> semantic of the parameter but implement a parameter to "enable/disable
> split watermark alignment" which we can easily use in the SourceOperator to
> disable the trigger of split alignment. This is indeed more static and less
> flexible, because it disables split alignment unconditionally, but it is
> "context-decoupled" and more straight-forward to use. This would also
> address the use case of disabling split alignment for the purpose of
> runtime behavior evaluation, as mentioned by Thomas (if I remember
> correctly.) I would implement the parameter with a default where watermark
> split alignment is enabled which requires users to check their application
> when upgrading to 1.16 if a) there is a source that re

Re: [DISCUSS] FLIP-217 Support watermark alignment of source splits

2022-07-13 Thread Becket Qin
Hi Sebastian,

Thanks for updating the FLIP wiki.

Just to double confirm, I was thinking of a configuration like
"allow.coarse.grained.watermark.alignment". This would allow the coarse
grained watermark alignment as a fallback, instead of bubbling up an
exception, if split pausing is not supported by some Sources in a Flink job.
It would only affect the Sources that do not support split pausing, not the
Sources that do.

This seems slightly different from a flag that enables / disables split
alignment. That sounds like a global thing, and it seems not necessary to
disable split alignment entirely, as long as the coarse grained alignment
can be a fallback.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jul 13, 2022 at 2:46 PM Sebastian Mattheis 
wrote:

> Hi Piotrek,
>
> Sorry I've read it and forgot it when I was ripping out the
> supportsPauseOrResume method again. Thanks for pointing that out. I will
> add it as follows: The flag enables/disables split alignment in the
> SourceOperator where the default is that split alignment is enabled. (And I
> will add the note: "In future releases, the flag may be ignored such that
> split alignment is always enabled.")
>
> Cheers,
> Sebastian
>
> On Tue, Jul 12, 2022 at 11:14 PM Piotr Nowojski 
> wrote:
>
>> Hi Sebastian,
>>
>> Thanks for picking this up.
>>
>> > 5. There is NO configuration option to enable watermark alignment of
>> splits or disable pause/resume capabilities.
>>
>> Isn't this contradicting what we actually agreed on?
>>
>> > we are planning to have a configuration based way to revert to the
>> previous behavior
>>
>> I think what we agreed in the last couple of emails was to add a
>> configuration toggle, that would allow Flink 1.15 users, that are using
>> watermark alignment with multiple splits per source operator, to continue
>> using it with the old 1.15 semantic, even if their source doesn't support
>> pausing/resuming splits. It seems to me like the current FLIP and
>> implementation proposal would always throw an exception in that case?
>>
>> Best,
>> Piotrek
>>
>> wt., 12 lip 2022 o 10:18 Sebastian Mattheis 
>> napisał(a):
>>
>> > Hi all,
>> >
>> > I have updated FLIP-217 [1] to the proposed specification and adapted
>> the
>> > current implementation [2] respectively.
>> >
>> > This means both, FLIP and implementation, are ready for review from my
>> > side. (I would revise commit history and messages for the final PR but
>> left
>> > it as is for now and the records of this discussion.)
>> >
>> > 1. Please review the updated version of FLIP-217 [1]. If there are no
>> > further concerns, I would initiate the voting.
>> > (2. If you want to speed up things, please also have a look into the
>> > updated implementation [2].)
>> >
>> > Please consider the following updated specification in the current
>> status
>> > of FLIP-217 where the essence is as follows:
>> >
>> > 1. A method pauseOrResumeSplits is added to SourceReader with default
>> > implementation that throws UnsupportedOperationException.
>> > 2.  method pauseOrResumeSplits is added to SplitReader with default
>> > implementation that throws UnsupportedOperationException.
>> > 3. SourceOperator initiates split alignment only if more than one split
>> is
>> > assigned to the source (and, of course, only if withSplitAlignment is
>> used).
>> > 4. There is NO "supportsPauseOrResumeSplits" method at any place (to
>> > indicate if the implementation supports pause/resume capabilities).
>> > 5. There is NO configuration option to enable watermark alignment of
>> > splits or disable pause/resume capabilities.
>> >
>> > *Note:* If the SourceReader or some SplitReader do not override
>> > pauseOrResumeSplits but it is required in the application, an exception
>> is
>> > thrown at runtime when an split alignment attempt is executed (not
>> during
>> > startup or any time earlier).
>> >
>> > Also, I have revised the compatibility/migration section to describe a
>> bit
>> > of a rationale for the default implementation with exception throwing
>> > behavior.
>> >
>> > Regards,
>> > Sebastian
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-217+Support+watermark+alignment+of+source+splits
>> > [2] https://github.com/smattheis/flink/tree/flip-217-split-wm-alignment
>> >
>> > On Mon,

Re: [VOTE] Apache Flink ML Release 2.1.0, release candidate #2

2022-07-06 Thread Becket Qin
+1 (binding)

Verified the following:
- verified the checksum and signature of the source and binary distribution
- built from source code without exception. (Some unit tests failed because
rocksdb doesn't work well with the Apple silicon).
- checked the source jar and it does not contain any binary files.
- verified the pom files and the flink-ml versions are updated.

Thanks,

Jiangjie (Becket) Qin

On Tue, Jul 5, 2022 at 3:58 PM Yun Gao  wrote:

> +1 (binding)
>
> - Verified the checksum and signatures.
> - Build from sources without exceptions.
> - Checked the artifacts uploading to the mvn repo is completed.
> - Reviewed the release PR and LGTM.
>
> Best,
> Yun Gao
>
>
>
>
> --
> From:Dian Fu 
> Send Time:2022 Jul. 5 (Tue.) 14:45
> To:dev 
> Subject:Re: [VOTE] Apache Flink ML Release 2.1.0, release candidate #2
>
> +1 (binding)
>
> - Verified the checksum and signature
> - Installed the Python package and ran an example documented here [1]
> - Reviewed the website PR and LGTM overall
>
> Regards,
> Dian
>
> [1]
>
> https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/clustering/kmeans/#examples
>
> On Mon, Jul 4, 2022 at 2:01 PM Yunfeng Zhou 
> wrote:
>
> > Thanks for raising this release candidate.
> >
> > +1 (non-binding)
> >
> > - Verified that the checksums and GPG files match the corresponding
> release
> > files.
> > - Verified that the source distributions do not contain any binaries.
> > - Built the source distribution and ensured that all source files have
> > Apache headers.
> > - Verified that all POM files point to the same version.
> > - Browsed through JIRA release notes files and did not find anything
> > unexpected.
> > - Browsed through README.md files and did not find anything unexpected.
> > - Checked the source code tag "release-2.0.0-rc2" and did not find
> anything
> > unexpected.
> >
> >
> > On Fri, Jul 1, 2022 at 11:57 AM Dong Lin  wrote:
> >
> > > Thanks for the update!
> > >
> > > +1 (non-binding)
> > >
> > > Here is what I checked. All required checks are included.
> > >
> > > - Verified that the checksums and GPG files match the corresponding
> > release
> > > files.
> > > - Verified that the source distributions do not contain any binaries.
> > > - Built the source distribution and ensured that all source files have
> > > Apache headers.
> > > - Verified that all POM files point to the same version.
> > > - Browsed through JIRA release notes files and did not find anything
> > > unexpected.
> > > - Browsed through README.md files and did not find anything unexpected.
> > > - Checked the source code tag "release-2.0.0-rc2" and did not find
> > anything
> > > unexpected.
> > >
> > >
> > > On Fri, Jul 1, 2022 at 11:11 AM Zhipeng Zhang  >
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > Please review and vote on the release candidate #2 for the version
> > 2.1.0
> > > of
> > > > Apache Flink ML as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > >
> > > > **Testing Guideline**
> > > >
> > > >
> > > > You can find here [1] a page in the project wiki on instructions for
> > > > testing.
> > > >
> > > > To cast a vote, it is not necessary to perform all listed checks, but
> > > > please
> > > >
> > > > mention which checks you have performed when voting.
> > > >
> > > >
> > > > **Release Overview**
> > > >
> > > >
> > > > As an overview, the release consists of the following:
> > > >
> > > > a) Flink ML source release to be deployed to dist.apache.org
> > > >
> > > > b) Flink ML Python source distributions to be deployed to PyPI
> > > >
> > > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > >
> > > >
> > > > **Staging Areas to Review**
> > > >
> > > >
> > > > The staging areas containing the above mentioned artifacts are as
> > > follows,
> > > > for your review:
> > > >
> > > >

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-05 Thread Becket Qin
Hi Alex,

Personally I prefer the latter option, i.e. just add the
currentParallelism() method. It is easy to add more stuff to the
SourceReaderContext in the future, and it is likely that most of the stuff
in the RuntimeContext is not required by the SourceReader implementations.
For the purpose of this FLIP, adding the method is probably good enough.

That said, I don't see a consistent pattern adopted in the project to
handle similar cases. The FunctionContext wraps the RuntimeContext and only
exposes necessary stuff. CEPRuntimeContext extends the RuntimeContext and
overrides some methods that it does not want to expose with exception
throwing logic. Some internal context classes simply expose the entire
RuntimeContext with some additional methods. If we want to make things
clean, I'd imagine all these variations of context can become some specific
combination of a ReadOnlyRuntimeContext and some "write" methods. But this
may require a closer look at all these cases to make sure the
ReadOnlyRuntimeContext is generally suitable. I feel that it will take some
time and could be a bigger discussion than the data generator source
itself. So maybe we can just go with adding a method at the moment, and
evolve the SourceReaderContext to use the ReadOnlyRuntimeContext in the
future.

Thanks,

Jiangjie (Becket) Qin
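
For illustration, the ReadOnlyRuntimeContext idea discussed above could look roughly like the following. This class does not exist in Flink; the selection of exposed getters is an assumption made for this sketch.

    import org.apache.flink.api.common.functions.RuntimeContext;
    import org.apache.flink.metrics.MetricGroup;

    // Hypothetical sketch: a wrapper that exposes only read-only information from the
    // RuntimeContext and hides the "write" methods (state access, distributed cache, ...).
    public final class ReadOnlyRuntimeContext {

        private final RuntimeContext delegate;

        public ReadOnlyRuntimeContext(RuntimeContext delegate) {
            this.delegate = delegate;
        }

        public int getNumberOfParallelSubtasks() {
            return delegate.getNumberOfParallelSubtasks();
        }

        public int getIndexOfThisSubtask() {
            return delegate.getIndexOfThisSubtask();
        }

        public String getTaskName() {
            return delegate.getTaskName();
        }

        public MetricGroup getMetricGroup() {
            return delegate.getMetricGroup();
        }
    }

A SourceReaderContext could then hand out such a wrapper instead of the full RuntimeContext, which is the trade-off being weighed in this thread.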

On Tue, Jul 5, 2022 at 8:31 PM Alexander Fedulov 
wrote:

> Hi Becket,
>
> I agree with you. We could introduce a *ReadOnlyRuntimeContext* that would
> act as a holder for the *RuntimeContext* data. This would also require
> read-only wrappers for the exposed fields, such as *ExecutionConfig*.
> Alternatively, we just add the *currentParallelism()* method for now and
> see if anything else might actually be needed later on. What do you think?
>
> Best,
> Alexander Fedulov
>
> On Tue, Jul 5, 2022 at 2:30 AM Becket Qin  wrote:
>
> > Hi Alex,
> >
> > While it is true that the RuntimeContext gives access to all the stuff
> the
> > framework can provide, it seems a little like overkill for the
> SourceReader.
> > It is probably OK to expose all the read-only information in the
> > RuntimeContext to the SourceReader, but we may want to hide the "write"
> > methods, such as creating states, writing stuff to distributed cache,
> etc,
> > because these methods may not work well with the SourceReader design and
> > cause confusion. For example, users may wonder why the snapshotState()
> > method exists while they can use the state directly.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Tue, Jul 5, 2022 at 7:37 AM Alexander Fedulov <
> alexan...@ververica.com>
> > wrote:
> >
> > > Hi Becket,
> > >
> > > I updated and extended FLIP-238 accordingly.
> > >
> > > Here is also my POC branch [1].
> > > DataGeneratorSourceV3 is the class that I currently converged on [2].
> It
> > is
> > > based on the expanded SourceReaderContext.
> > > A couple more relevant classes [3] [4]
> > >
> > > Would appreciate it if you could take a quick look.
> > >
> > > [1]
> https://github.com/afedulov/flink/tree/FLINK-27919-generator-source
> > > [2]
> > >
> > >
> >
> https://github.com/afedulov/flink/blob/FLINK-27919-generator-source/flink-core/src/main/java/org/apache/flink/api/connector/source/lib/DataGeneratorSourceV3.java
> > > [3]
> > >
> > >
> >
> https://github.com/afedulov/flink/blob/FLINK-27919-generator-source/flink-core/src/main/java/org/apache/flink/api/connector/source/lib/util/MappingIteratorSourceReader.java
> > > [4]
> > >
> > >
> >
> https://github.com/afedulov/flink/blob/FLINK-27919-generator-source/flink-core/src/main/java/org/apache/flink/api/connector/source/lib/util/RateLimitedSourceReader.java
> > >
> > > Best,
> > > Alexander Fedulov
> > >
> > > On Mon, Jul 4, 2022 at 12:08 PM Alexander Fedulov <
> > alexan...@ververica.com
> > > >
> > > wrote:
> > >
> > > > Hi Becket,
> > > >
> > > > Exposing the RuntimeContext is potentially even more useful.
> > > > Do you think it is worth having both currentParallelism() and
> > > >  getRuntimeContext() methods?
> > > > One can always call getNumberOfParallelSubtasks() on the
> RuntimeContext
> > > > directly if we expose it.
> > > >
> > > > Best,
> > > > Alexander Fedulov
> > > >
> > > >
> > > > On Mon, Jul 4, 2022 at 3:44 AM Becket Qin 
> > wrote:
> > > >
> > > >

Re: [DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-04 Thread Becket Qin
Yes, that sounds reasonable to me. That said, supporting custom events
might still be preferable if that does not complicate the design too much.
It would be good to avoid having a tricky feature availability matrix when
we add new features to the project.

Thanks,

Jiangjie (Becket) Qin



On Mon, Jul 4, 2022 at 5:09 PM Zhu Zhu  wrote:

> Hi Jiangjie,
>
> Yes, you are right that the goals of watermark alignment and speculative
> execution do not conflict. For the example you gave, we can make it
> work by only aligning watermarks for executions that are pipelined
> connected (i.e. in the same execution-attempt-level pipelined region).
> Even without considering speculative execution, it looks like a
> possible improvement of watermark alignment for streaming jobs that
> contain embarrassingly parallel job vertices, so that a slow task
> does not cause unconnected tasks to be throttled.
>
> At the moment, given that it is not needed yet and to avoid further
> complicating things, I think it's fine to not support watermark
> alignment in speculative execution cases.
>
> WDYT?
>
> Thanks,
> Zhu
>
> Becket Qin  wrote on Mon, Jul 4, 2022 at 16:15:
> >
> > Hi Zhu,
> >
> > I agree that if we are talking about a single execution region with
> > blocking shuffle, watermark alignment may not be that helpful as the
> > subtasks are running independently of each other.
> >
> > That said, I don't think watermark alignment and speculative execution
> > necessarily conflict with each other. The idea of watermark alignment is
> to
> > make sure the jobs run efficiently, regardless of whether or why the job
> > has performance issues. On the other hand, the purpose of speculative
> > execution is to find out whether the jobs have performance issues due to
> > slow tasks, and fix them.
> >
> > For example, a job has one task whose watermark is always lagging behind,
> > therefore it causes the other tasks to be throttled. The speculative
> > execution identified the slow task and decided to run it in another node,
> > thus unblocking the other subtasks.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Mon, Jul 4, 2022 at 3:31 PM Zhu Zhu  wrote:
> >
> > > I had another thought and now I think watermark alignment is actually
> > > conceptually conflicted with speculative execution.
> > > This is because the idea of watermark alignment is to limit the
> progress
> > > of all sources to be around the progress of the slowest source in the
> > > watermark group. However, speculative execution's goal is to solve the
> > > slow task problem and it never wants to limit the progress of tasks
> with
> > > the progress of the slow task.
> > > Therefore, I think it's fine to not support watermark alignment.
> Instead,
> > > it should throw an error if watermark alignment is enabled in the case
> > > that speculative execution is enabled.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Zhu Zhu  wrote on Mon, Jul 4, 2022 at 14:34:
> > > >
> > > > Thanks for updating the FLIP!
> > > >
> > > > I agree that at the moment users do not need watermark alignment(in
> > > > which case ReportedWatermarkEvent would happen) in batch cases.
> > > > However, I think the concept of watermark alignment is not conflicted
> > > > with speculative execution. It can work with speculative execution
> with
> > > > a little extra effort, by sending the WatermarkAlignmentEvent to all
> > > > the current executions of each subtask.
> > > > Therefore, I prefer to support watermark alignment in case it will be
> > > > needed by batch jobs in the future.
> > > >
> > > > Thanks,
> > > > Zhu
> > > >
> > > > Jing Zhang  wrote on Fri, Jul 1, 2022 at 18:09:
> > > > >
> > > > > Hi all,
> > > > > After an offline discussion with Jiangjie (Becket) Qin, Guowei,
> Zhuzhu,
> > > > > I've updated the FLIP-245[1] to include:
> > > > > 1. Complete the fault-tolerant processing flow.
> > > > > 2. Support for SourceEvent because it's useful for some
> user-defined
> > > > > sources which have a custom event protocol between reader and
> > > enumerator.
> > > > > 3. How to handle ReportedWatermarkEvent/ReaderRegistrationEvent
> > > messages.
> > > > >
> > > > > Please review the FLIP-245[1] again, looking forward to your
> feedback.
> > > > >
> 

Re: [DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-04 Thread Becket Qin
Hi Zhu,

I agree that if we are talking about a single execution region with
blocking shuffle, watermark alignment may not be that helpful as the
subtasks are running independently of each other.

That said, I don't think watermark alignment and speculative execution
necessarily conflict with each other. The idea of watermark alignment is to
make sure the jobs run efficiently, regardless of whether or why the job
has performance issues. On the other hand, the purpose of speculative
execution is to find out whether the jobs have performance issues due to
slow tasks, and fix them.

For example, a job has one task whose watermark is always lagging behind,
therefore it causes the other tasks to be throttled. The speculative
execution identified the slow task and decided to run it in another node,
thus unblocking the other subtasks.

Thanks,

Jiangjie (Becket) Qin



On Mon, Jul 4, 2022 at 3:31 PM Zhu Zhu  wrote:

> I had another thought and now I think watermark alignment is actually
> conceptually conflicted with speculative execution.
> This is because the idea of watermark alignment is to limit the progress
> of all sources to be around the progress of the slowest source in the
> watermark group. However, speculative execution's goal is to solve the
> slow task problem and it never wants to limit the progress of tasks with
> the progress of the slow task.
> Therefore, I think it's fine to not support watermark alignment. Instead,
> it should throw an error if watermark alignment is enabled in the case
> that speculative execution is enabled.
>
> Thanks,
> Zhu
>
> Zhu Zhu  wrote on Mon, Jul 4, 2022 at 14:34:
> >
> > Thanks for updating the FLIP!
> >
> > I agree that at the moment users do not need watermark alignment(in
> > which case ReportedWatermarkEvent would happen) in batch cases.
> > However, I think the concept of watermark alignment is not conflicted
> > with speculative execution. It can work with speculative execution with
> > a little extra effort, by sending the WatermarkAlignmentEvent to all
> > the current executions of each subtask.
> > Therefore, I prefer to support watermark alignment in case it will be
> > needed by batch jobs in the future.
> >
> > Thanks,
> > Zhu
> >
> > Jing Zhang  wrote on Fri, Jul 1, 2022 at 18:09:
> > >
> > > Hi all,
> > > After an offline discussion with Jiangjie (Becket) Qin, Guowei, Zhuzhu,
> > > I've updated the FLIP-245[1] to include:
> > > 1. Complete the fault-tolerant processing flow.
> > > 2. Support for SourceEvent because it's useful for some user-defined
> > > sources which have a custom event protocol between reader and
> enumerator.
> > > 3. How to handle ReportedWatermarkEvent/ReaderRegistrationEvent
> messages.
> > >
> > > Please review the FLIP-245[1] again, looking forward to your feedback.
> > >
> > > [1]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
> > >
> > > Jing Zhang  wrote on Fri, Jul 1, 2022 at 18:02:
> > >
> > > > Hi Guowei,
> > > > Thanks a lot for your feedback.
> > > > Your advice is really helpful.  I've updated the FLIP-245[1] to
> include
> > > > these parts.
> > > > > First of all, please complete the fault-tolerant processing flow
> in the
> > > > FLIP.
> > > >
> > > > After an execution is created and a source operator becomes ready to
> > > > receive events,  subtaskReady is called,
> SpeculativeSourceCoordinator would
> > > > store the mapping of SubtaskGateway to execution attempt in
> > > > SpeculativeSourceCoordinatorContext.
> > > > Then source operator registers the reader to the coordinator,
> > > > SpeculativeSourceCoordinator would store the mapping of source
> reader to
> > > > execution attempt in SpeculativeSourceCoordinatorContext.
> > > > If the execution goes through a failover, subtaskFailed is called,
> > > > SpeculativeSourceCoordinator would clear information about this
> execution,
> > > > including source readers and SubtaskGateway.
> > > > If all the current executions of the execution vertex have failed,
> > > > subtaskReset would be called; SpeculativeSourceCoordinator would
> clear all
> > > > information about these executions and add the splits back to the split
> > > > enumerator of the source.
> > > >
> > > > > Secondly the FLIP only says that user-defined events are not
> supported,
> > > > but it does not explain how to deal with the existing

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-03 Thread Becket Qin
Hi Alex,

Yes, that is what I had in mind. We need to add the method
getRuntimeContext() to the SourceReaderContext interface as well.

Thanks,

Jiangjie (Becket) Qin

On Mon, Jul 4, 2022 at 3:01 AM Alexander Fedulov 
wrote:

> Hi Becket,
>
> thanks for your input. I like the idea of adding the parallelism to the
> SourceReaderContext. My understanding is that any change of parallelism
> causes recreation of all readers, so it should be safe to consider it
> "fixed" after the readers' initialization. In that case, it should be as
> simple as adding the following to the anonymous SourceReaderContext
> implementation
> in SourceOperator#initReader():
>
> public int currentParallelism() {
>return getRuntimeContext().getNumberOfParallelSubtasks();
> }
>
> Is that what you had in mind?
>
> Best,
> Alexander Fedulov
>
>
>
>
> On Fri, Jul 1, 2022 at 11:30 AM Becket Qin  wrote:
>
> > Hi Alex,
> >
> > In FLIP-27 source, the SourceReader can get a SourceReaderContext. This
> is
> > passed in by the TM in Source#createReader(). And supposedly the Source
> > should pass this to the SourceReader if needed.
> >
> > In the SourceReaderContext, currently only the index of the current
> subtask
> > is available, but we can probably add the current parallelism as well.
> This
> > would be a change that affects all the Sources, not only for the data
> > generator source. Perhaps we can have a simple separate FLIP.
> >
> > Regarding the semantics of rate limiting, for the rate-limited source,
> > I personally find it intuitive to keep the global rate untouched on scaling.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, Jul 1, 2022 at 4:00 AM Alexander Fedulov <
> alexan...@ververica.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > getting back to the idea of reusing FlinkConnectorRateLimiter: it is
> > > designed for the SourceFunction API and has an open() method that
> takes a
> > > RuntimeContext. Therefore, we need to add a different interface for
> > > the new Source
> > > API.
> > >
> > > This is where I see a certain limitation for the rate-limiting use
> case:
> > in
> > > the old API the individual readers were able to retrieve the current
> > > parallelism from the RuntimeContext. In the new API, this is not
> > supported,
> > > the information about the parallelism is only available in the
> > > SplitEnumeratorContext to which the readers do not have access.
> > >
> > > I see two possibilities:
> > > 1. Add an optional RateLimiter parameter to the DataGeneratorSource
> > > constructor. The RateLimiter is then "fixed" and has to be fully
> > configured
> > > by the user in the main method.
> > > 2. Piggy-back on Splits: add parallelism as a field of a Split. The
> > > initialization of this field would happen dynamically upon splits
> > creation
> > > in the createEnumerator() method where currentParallelism is available.
> > >
> > > The second approach makes implementation rather significantly more
> > > complex since we cannot simply wrap
> NumberSequenceSource.SplitSerializer
> > in
> > > that case. The advantage of this approach is that with any kind of
> > > autoscaling, the source rate will match the original configuration. But
> > I'm
> > > not sure how useful this is. I can even imagine scenarios where scaling
> > the
> > > input rate together with parallelism would be better for demo purposes.
> > >
> > > Would be glad to hear your thoughts on this.
> > >
> > > Best,
> > > Alexander Fedulov
> > >
> > > On Mon, Jun 20, 2022 at 4:31 PM David Anderson 
> > > wrote:
> > >
> > > > I'm very happy with this. +1
> > > >
> > > > A lot of SourceFunction implementations used in demos/POC
> > implementations
> > > > include a call to sleep(), so adding rate limiting is a good idea, in
> > my
> > > > opinion.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On Mon, Jun 20, 2022 at 10:10 AM Qingsheng Ren 
> > > wrote:
> > > >
> > > > > Hi Alexander,
> > > > >
> > > > > Thanks for creating this FLIP! I’d like to share some thoughts.
> > > > >
> > > > > 1. About the “generatorFunction” I’m expecting an initializer on it
> > > > > because it’s hard to requir

Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source

2022-07-01 Thread Becket Qin
Hi Alex,

In FLIP-27 source, the SourceReader can get a SourceReaderContext. This is
passed in by the TM in Source#createReader(). And supposedly the Source
should pass this to the SourceReader if needed.

In the SourceReaderContext, currently only the index of the current subtask
is available, but we can probably add the current parallelism as well. This
would be a change that affects all the Sources, not only for the data
generator source. Perhaps we can have a simple separate FLIP.

Regarding the semantics of rate limiting, for the rate-limited source,
I personally find it intuitive to keep the global rate untouched on scaling.

Thanks,

Jiangjie (Becket) Qin
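
As a small illustration of keeping the global rate untouched on scaling: if the reader context exposed the current parallelism (as suggested above), each reader could emit its share of a globally configured rate. A minimal sketch under that assumption:

    // Sketch: derive a per-reader rate from a globally configured rate so that the
    // aggregate rate stays the same when the parallelism changes. The parallelism is
    // assumed to come from the (proposed) currentParallelism() accessor on the context.
    final class RateSharing {

        private RateSharing() {}

        static double perReaderRecordsPerSecond(double globalRecordsPerSecond, int parallelism) {
            // Each of the N readers emits 1/N of the global rate; fractional per-reader
            // rates are fine for a token-bucket style limiter.
            return globalRecordsPerSecond / parallelism;
        }
    }

For example, a global rate of 1,000 records/s with parallelism 4 yields 250 records/s per reader; rescaling to parallelism 8 drops each reader to 125 records/s while the total stays at 1,000.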

On Fri, Jul 1, 2022 at 4:00 AM Alexander Fedulov 
wrote:

> Hi all,
>
> getting back to the idea of reusing FlinkConnectorRateLimiter: it is
> designed for the SourceFunction API and has an open() method that takes a
> RuntimeContext. Therefore, we need to add a different interface for
> the new Source
> API.
>
> This is where I see a certain limitation for the rate-limiting use case: in
> the old API the individual readers were able to retrieve the current
> parallelism from the RuntimeContext. In the new API, this is not supported,
> the information about the parallelism is only available in the
> SplitEnumeratorContext to which the readers do not have access.
>
> I see two possibilities:
> 1. Add an optional RateLimiter parameter to the DataGeneratorSource
> constructor. The RateLimiter is then "fixed" and has to be fully configured
> by the user in the main method.
> 2. Piggy-back on Splits: add parallelism as a field of a Split. The
> initialization of this field would happen dynamically upon splits creation
> in the createEnumerator() method where currentParallelism is available.
>
> The second approach makes implementation rather significantly more
> complex since we cannot simply wrap NumberSequenceSource.SplitSerializer in
> that case. The advantage of this approach is that with any kind of
> autoscaling, the source rate will match the original configuration. But I'm
> not sure how useful this is. I can even imagine scenarios where scaling the
> input rate together with parallelism would be better for demo purposes.
>
> Would be glad to hear your thoughts on this.
>
> Best,
> Alexander Fedulov
>
> On Mon, Jun 20, 2022 at 4:31 PM David Anderson 
> wrote:
>
> > I'm very happy with this. +1
> >
> > A lot of SourceFunction implementations used in demos/POC implementations
> > include a call to sleep(), so adding rate limiting is a good idea, in my
> > opinion.
> >
> > Best,
> > David
> >
> > On Mon, Jun 20, 2022 at 10:10 AM Qingsheng Ren 
> wrote:
> >
> > > Hi Alexander,
> > >
> > > Thanks for creating this FLIP! I’d like to share some thoughts.
> > >
> > > 1. About the “generatorFunction” I’m expecting an initializer on it
> > > because it’s hard to require all fields in the generator function are
> > > serializable in user’s implementation. Providing a function like “open”
> > in
> > > the interface could let the function to make some initializations in
> the
> > > task initializing stage.
> > >
> > > 2. As of the throttling functinality you mentioned, there’s a
> > > FlinkConnectorRateLimiter under flink-core and maybe we could reuse
> this
> > > interface. Actually I prefer to make rate limiting as a common feature
> > > provided in the Source API, but this requires another FLIP and a lot of
> > > discussions so I’m OK to have it in the DataGen source first.
> > >
> > > Best regards,
> > > Qingsheng
> > >
> > >
> > > > On Jun 17, 2022, at 01:47, Alexander Fedulov <
> alexan...@ververica.com>
> > > wrote:
> > > >
> > > > Hi Jing,
> > > >
> > > > thanks for your thorough analysis. I agree with the points you make
> and
> > > > also with the idea to approach the larger task of providing a
> universal
> > > > (DataStream + SQL) data generator base iteratively.
> > > > Regarding the name, the SourceFunction-based *DataGeneratorSource*
> > > resides
> > > > in the *org.apache.flink.streaming.api.functions.source.datagen*. I
> > think
> > > > it is OK to simply place the new one (with the same name) next to the
> > > > *NumberSequenceSource* into
> > *org.apache.flink.api.connector.source.lib*.
> > > >
> > > > One more thing I wanted to discuss:  I noticed that
> *DataGenTableSource
> > > *has
> > > > built-in throttling functionality (*rowsPer

Re: [DISCUSS] Releasing Flink ML 2.1.0

2022-06-23 Thread Becket Qin
+1.

It looks like we have some decent progress on Flink ML :)

Thanks,

Jiangjie (Becket) Qin

On Fri, Jun 24, 2022 at 8:29 AM Dong Lin  wrote:

> Hi Zhipeng and Yun,
>
> Thanks for starting the discussion. +1 for the Flink ML 2.1.0 release.
>
> Cheers,
> Dong
>
> On Thu, Jun 23, 2022 at 11:15 AM Zhipeng Zhang 
> wrote:
>
> > Hi devs,
> >
> > Yun and I would like to start a discussion for releasing Flink ML
> > <https://github.com/apache/flink-ml> 2.1.0.
> >
> > In the past few months, we focused on improving the infra (e.g. memory
> > management, benchmark infra, online training, python support) of Flink ML
> > by implementing, benchmarking, and optimizing 9 new algorithms in Flink
> ML.
> > Our results have shown that Flink ML is able to meet or exceed the
> > performance of selected algorithms in alternative popular ML libraries.
> >
> > Please see below for a detailed list of improvements:
> >
> > - A set of representative machine learning algorithms:
> > - feature engineering
> > - MinMaxScaler (
> https://issues.apache.org/jira/browse/FLINK-25552)
> > - StringIndexer (
> https://issues.apache.org/jira/browse/FLINK-25527
> > )
> > - VectorAssembler (
> > https://issues.apache.org/jira/browse/FLINK-25616
> > )
> > - StandardScaler (
> > https://issues.apache.org/jira/browse/FLINK-26626)
> > - Bucketizer (https://issues.apache.org/jira/browse/FLINK-27072)
> > - online learning:
> > - OnlineKmeans (
> https://issues.apache.org/jira/browse/FLINK-26313)
> > - OnlineLogisiticRegression (
> > https://issues.apache.org/jira/browse/FLINK-27170)
> > - regression:
> > - LinearRegression (
> > https://issues.apache.org/jira/browse/FLINK-27093)
> > - classification:
> > - LinearSVC (https://issues.apache.org/jira/browse/FLINK-27091)
> > - Evaluation:
> > - BinaryClassificationEvaluator (
> > https://issues.apache.org/jira/browse/FLINK-27294)
> > - A benchmark framework for Flink ML. (
> > https://issues.apache.org/jira/browse/FLINK-26443)
> > - A website for Flink ML users (
> > https://nightlies.apache.org/flink/flink-ml-docs-stable/)
> > - Python support for Flink ML algorithms (
> > https://issues.apache.org/jira/browse/FLINK-26268,
> > https://issues.apache.org/jira/browse/FLINK-26269)
> > - Several optimizations for FlinkML infrastructure (
> > https://issues.apache.org/jira/browse/FLINK-27096,
> > https://issues.apache.org/jira/browse/FLINK-27877)
> >
> > With the improvements and throughput benchmarks we have made, we think it
> > is time to release Flink ML 2.1.0, so that interested developers in the
> > community can try out the new Flink ML infra to develop algorithms with
> > high throughput and low latency.
> >
> > If there is any concern, please let us know.
> >
> >
> > Best,
> > Yun and Zhipeng
> >
>


Re: [VOTE] Deprecate SourceFunction API

2022-06-16 Thread Becket Qin
+1 (binding)

Thanks Alex.

Jiangjie (Becket) Qin

On Thu, Jun 16, 2022 at 7:16 PM Lijie Wang  wrote:

> +1 (non-binding)
>
> Thanks for driving this.
>
> Best,
> Lijie
>
Martijn Visser  wrote on Thu, Jun 16, 2022 at 19:07:
>
> > +1 (binding)
> >
> > Thanks again for opening this discussion Alex.
> >
> > Cheers, Martijn
> >
> > On Thu, 16 Jun 2022 at 11:36, Jing Ge  wrote:
> >
> > > +1
> > > Thanks for driving this!
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Wed, Jun 15, 2022 at 8:03 PM Alexander Fedulov <
> > alexan...@ververica.com
> > > >
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > following the discussion in [1], I would like to open up a vote for
> > > > deprecating the SourceFunction API.
> > > >
> > > > An overview of the steps required for being able to drop this
> > > > API in the next major version is maintained in the umbrella
> > > > FLINK-28045 ticket [2].
> > > >
> > > > This proposition implies marking the SourceFunction interface
> > > > itself as @Deprecated  + redirecting to the FLIP-27 Source API
> > > > right away, without waiting for all the subtasks to be completed.
> > > >
> > > > [1] https://lists.apache.org/thread/d6cwqw9b3105wcpdkwq7rr4s7x4ywqr9
> > > > [2] https://issues.apache.org/jira/browse/FLINK-28045
> > > >
> > > > Best,
> > > > Alexander Fedulov
> > > >
> > >
> >
>


About the Current22 event

2022-06-15 Thread Becket Qin
Hi my Flink fellas,

The CFP for the Current22 [1] event is about to close.

The Current event is the next generation of Kafka Summit. It expands the
scope to cover **ALL** the technologies for real-time data, not limited to
Kafka. Given Flink is a leading project in this area, the program committee
is actively looking for speakers from the Flink community.

Please don't hesitate to submit a talk [2] if you are interested!

Thanks,

Jiangjie (Becket) Qin

[1] https://2022.currentevent.io/website/39543/
[2] https://sessionize.com/current-2022/


Re: [DISCUSS] Deprecate SourceFunction APIs

2022-06-13 Thread Becket Qin
In general, I'll give a big +1 to deprecating the SourceFunction.

That said, it is indeed worth looking into what might be missing or less
easy to implement with FLIP-27 Source compared with the SourceFunction.
Maybe we can just compile a list of things to do in order to fully
deprecate the SourceFunction. As far as I am aware, there are two things
that need to be taken care of:

1. A simple high level API, as Jing mentioned, that makes simple cases that
do not involve the split enumerator easier. Ideally this should be as
simple as SourceFunction, if not simpler. Off the top of my head, I think a
default no-op split enumerator will just do the work. And the data
generator of FLIP-238 could be an implementation using this high level API.

2. FLIP-208, which allows users to stop the job upon receiving a record in
the stream.

Is there anything else that we have heard from the users / connector
developers that needs some attention?

Thanks,

Jiangjie (Becket) Qin
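
A rough sketch of the "default no-op split enumerator" mentioned in point 1 above is shown below. This class does not exist in Flink; the method signatures follow the FLIP-27 SplitEnumerator interface as of the time of this discussion and are an assumption of the sketch.

    import java.io.IOException;
    import java.util.List;

    import org.apache.flink.api.connector.source.SourceSplit;
    import org.apache.flink.api.connector.source.SplitEnumerator;

    // Hypothetical sketch: an enumerator that neither discovers nor assigns splits,
    // so that simple sources only have to implement the reader side. A real version
    // would probably still signal "no more splits" to registered readers via the
    // enumerator context so that bounded readers can finish.
    class NoOpSplitEnumerator<SplitT extends SourceSplit>
            implements SplitEnumerator<SplitT, Void> {

        @Override
        public void start() {
            // nothing to discover
        }

        @Override
        public void handleSplitRequest(int subtaskId, String requesterHostname) {
            // nothing to hand out
        }

        @Override
        public void addSplitsBack(List<SplitT> splits, int subtaskId) {
            // no splits were ever assigned
        }

        @Override
        public void addReader(int subtaskId) {
            // no per-reader bookkeeping
        }

        @Override
        public Void snapshotState(long checkpointId) {
            return null; // stateless
        }

        @Override
        public void close() throws IOException {
            // no resources to release
        }
    }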



On Fri, Jun 10, 2022 at 3:25 PM David Anderson  wrote:

> +1 for deprecating SourceFunction from me as well. And a big THANK YOU to
> Alex Fedulov for bringing forward FLIP-238.
>
> David
>
> On Fri, Jun 10, 2022 at 4:03 AM Lijie Wang 
> wrote:
>
> > Hi all,
> >
> > Sorry for my mistake. The `StreamExecutionEnvironment#readFiles` and can
> be
> > easily replaced by `FileSource#forRecordStreamFormat/forBulkFileFormat`.
> I
> > have no other concerns.
> >
> >  +1 to deprecate SourceFunction and deprecate the methods (in
> > StreamExecutionEnvironment) based on SourceFunction .
> >
> > Best,
> > Lijie
> >
> > Konstantin Knauf  wrote on Fri, Jun 10, 2022 at 05:11:
> >
> > > Hi everyone,
> > >
> > > thank you Jing for redirecting the discussion back to the topic at
> hand.
> > I
> > > agree with all of your points.
> > >
> > > +1 to deprecate SourceFunction
> > >
> > > Is there really no replacement for the
> > StreamExecutionEnvironment#readXXX.
> > > There is already a FLIP-27 based FileSource, right? What's missing to
> > > recommend using that as opposed to the the readXXX methods?
> > >
> > > Cheers,
> > >
> > > Konstantin
> > >
> > > On Thu, 9 Jun 2022 at 20:11, Alexander Fedulov <
> > > alexan...@ververica.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > It seems that there is some understandable cautiousness with regard
> to
> > > > deprecating methods and subclasses that do not have alternatives just
> > > yet.
> > > >
> > > > We should probably first agree if it is in general OK for Flink to
> use
> > > > @Deprecated
> > > > annotation for parts of the code that do not have alternatives. In
> that
> > > > case,
> > > > we could add a comment along the lines of:
> > > > "This implementation is based on a deprecated SourceFunction API that
> > > > will gradually be phased out from Flink. No direct substitute exists
> at
> > > the
> > > > moment.
> > > > If you want to have a more future-proof solution, consider helping
> the
> > > > project by
> > > > contributing an implementation based on the new Source API."
> > > >
> > > > This should clearly communicate the message that usage of these
> > > > methods/classes
> > > > is discouraged and at the same time promote contributions for
> > addressing
> > > > the gap.
> > > > What do you think?
> > > >
> > > > Best,
> > > > Alexander Fedulov
> > > >
> > > >
> > > > On Thu, Jun 9, 2022 at 6:27 PM Ingo Bürk 
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > these APIs don't expose the underlying source directly, so I don't
> > > think
> > > > > we need to worry about deprecating them as well. There's also
> nothing
> > > > > inherently wrong with using a deprecated API internally, though
> even
> > > > > just for the experience of using our own new APIs I would
> personally
> > > say
> > > > > that they should be migrated to the new Source API. It's hard to
> > reason
> > > > > that users must migrate to a new API if we don't do it internally
> as
> > > > well.
> > > > >
> > > > >
> > > > > Best
> > > > > Ingo
> > > > >
> > > > > On 09.06.22 15:41, Lijie Wang w

[ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee

2022-06-13 Thread Becket Qin
Hi all,

I'm very happy to announce that Jingsong Lee has joined the Flink PMC!

Jingsong became a Flink committer in Feb 2020 and has been continuously
contributing to the project since then, mainly in Flink SQL. He has been
quite active on the mailing list, fixing bugs, helping verify releases,
reviewing patches and FLIPs. Jingsong is also devoted to pushing Flink SQL
to new use cases. He spent a lot of time implementing the Flink
connectors for Apache Iceberg. Jingsong is also the primary driver behind
the effort of flink-table-store, which aims to provide a stream-batch
unified storage for Flink dynamic tables.

Congratulations and welcome, Jingsong!

Cheers,

Jiangjie (Becket) Qin
(On behalf of the Apache Flink PMC)


Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-04 Thread Becket Qin
Hi Jing,

Hmm, granularity and ndv still don't seem to mean the same thing to me.
Granularity basically means how detailed the data is, in other words,
whether a field/column can be further divided. For example, a field like
"age" cannot be further divided, so it is quite granular. In contrast, an
"address" field can be further divided into "street", "city", "country",
etc. Therefore "address" is less granular. NDV, on the other hand,
means how many distinct values there are in the field/column,
which is orthogonal to granularity.

Anyways, it looks like most people think NDV or its full phrase is a better
name. It probably makes sense to just use either of them.

Thanks,

Jiangjie (Becket) Qin


On Fri, Jun 3, 2022 at 9:45 PM Jark Wu  wrote:

> Hi Jing,
>
> I agree with you that "NDV is more SQL-oriented(implementation)
> and granularity is more data analytics-oriented". As you said,
> "granularity"
> may be commonly used for data modeling and business-related.
> However, TableStats is not used for data modeling but is an implementation
>  detail for SQL optimization. NDV is the terminology in the optimizer
> field,
> and Calcite also uses this word[1]. I didn't notice there any vendors are
> using "granularity" for this purpose. If I miss any, please correct me.
>
> If NDV sounds like a function to you, I'm OK to use "numDistinctVals" as
> Calcite does.
>
> Best,
> Jark
>
>
> [1]:
>
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdUtil.html#numDistinctVals(java.lang.Double,java.lang.Double)
>
> On Fri, 3 Jun 2022 at 00:14, Jing Ge  wrote:
>
> > Thanks all for your feedback! It is very informative.
> >
> > to Becket:
> >
> > At the beginning, I chose the same word because we used it in daily work.
> > Before I started this discussion, to make sure it is the right one, I did
> > some checking and it turns out that *cardinality* has a very different
> > (also very common) meaning within data modeling[1]. And on the other side
> > *granularity* is actually the right word for the meaning when we use
> > cardinality in the context of NDV[2].
> >
> > to Jark, Jingsong,
> >
> > NDV seems to me more like a function than a field defined in a class.
> > Briefly speaking, NDV is more SQL-oriented(implementation) and
> > *granularity* is more data analytics-oriented(abstraction/concept)[3][4].
> >
> > Best regards,
> > Jing
> >
> > [1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
> > [2] https://www.talon.one/glossary/granularity
> > [3] https://www.quora.com/What-is-granularity-in-database
> > [4] https://www.statisticshowto.com/data-granularity/
> >
> > On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li 
> > wrote:
> >
> > > Hi,
> > >
> > > +1 for NDV (number of distinct values) is a widely used terminology in
> > > table statistics.
> > >
> > > I've also seen the one called `distinctCount`.
> > >
> > > This name can be found in databases like oracle too. [1]
> > >
> > > So it is not good to change a completely different name.
> > >
> > > [1]
> > >
> > >
> >
> https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Thu, Jun 2, 2022 at 4:46 PM Jark Wu  wrote:
> > >
> > > > Hi Jing,
> > > >
> > > > I can see there might be developers who don't understand the meaning
> at
> > > the
> > > > first glance.
> > > > However, NDV is a widely used terminology in table statistics, see
> > > > [1][2][3].
> > > > If we use another name, it may confuse developers who are familiar
> with
> > > > stats and optimization.
> > > > I think at least, the Javadoc is needed to explain the meaning and
> full
> > > > name.
> > > > If we want to change the name, we can use the full name
> > > > "numberOfDistinctValues()".
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]:
> > > >
> > > >
> > >
> >
> https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> > > > [2]:
> > > >
> > >
> >
> https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> > > > [3]:
>

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Becket Qin
Hi Jing,

While I do agree that NDV is a little confusing at first sight, it seems
quite concise once I got the meaning. So personally I am OK with keeping it
as is, but proper documentation would be helpful. If we really want to
replace it with a more professional name, *cardinality* might be a good
alternative.

Thanks,

Jiangjie (Becket) Qin

On Thu, Jun 2, 2022 at 12:51 AM Jing Ge  wrote:

> Hi Dev,
>
> I am not really sure if it is feasible to start this discussion. According
> to the contribution guidelines, dev ml is the right place to reach
> consensus.
>
> In ColumnStats, currently ndv, which stands for "number of distinct
> values", is used. First of all, it is difficult to understand the meaning
> from the abbreviation. Second, it might be good to use a professional
> name instead.
>
>
>
> Suggestion:
>
> replace ndv with granularityNumber:
>
>
>
> The good news, afaik, is that the method getNdv() hasn't been used within
> Flink which means the renaming will have very limited impact.
>
>
>
> public class ColumnStats {
>
>     /** Number of distinct values. */
>     @Deprecated
>     private final Long ndv;
>
>     /**
>      * Granularity refers to the level of details used to sort and separate
>      * data at column level. Highly granular data is categorized or separated very
>      * precisely. For example, the granularity number of gender columns should
>      * normally be 2. The granularity number of the month column will be 12. In
>      * the SQL world, it means the number of distinct values.
>      */
>     private final Long granularityNumber;
>
>     @Deprecated
>     public Long getNdv() {
>         return ndv;
>     }
>
>     public Long getGranularityNumber() {
>         return granularityNumber;
>     }
> }
>
> Best regards,
> --
>
> Jing
>


Re: [DISCUSS] FLIP-221 Abstraction for lookup source cache and metric

2022-06-01 Thread Becket Qin
Thanks for updating the FLIP, Qingsheng. A few more comments:

1. I am still not sure what the use case is for cacheMissingKey().
More specifically, when would users want to have getCache() return a
non-empty value and cacheMissingKey() return false?

2. The builder pattern. Usually the builder pattern is used when there are
a lot of variations of constructors. For example, if a class has three
variables and all of them are optional, there could potentially be many
combinations of the variables. But in this FLIP, I don't see such a case.
What is the reason we have builders for all the classes?

3. Should the caching strategy be excluded from the top level provider API?
Technically speaking, the Flink framework should only have two interfaces
to deal with:
A) LookupFunction
B) AsyncLookupFunction
Orthogonally, we *believe* there are two different strategies people can use for
caching. Note that the Flink framework does not care what the caching
strategy is here.
a) partial caching
b) full caching

Putting them together, we end up with 3 combinations that we think are
valid:
 Aa) PartialCachingLookupFunctionProvider
 Ba) PartialCachingAsyncLookupFunctionProvider
 Ab) FullCachingLookupFunctionProvider

However, the caching strategy could actually be quite flexible. E.g. an
initial full cache load followed by some partial updates. Also, I am not
100% sure if the full caching will always use ScanTableSource. Including
the caching strategy in the top level provider API would make it harder to
extend.

One possible solution is to just have *LookupFunctionProvider* and
*AsyncLookupFunctionProvider*
as the top level API, both with a getCacheStrategy() method returning an
optional CacheStrategy. The CacheStrategy class would have the following
methods:
1. void open(Context), the context exposes some of the resources that may
be useful for the caching strategy, e.g. an ExecutorService that is
synchronized with the data processing, or a cache refresh trigger which
blocks data processing and refresh the cache.
2. void initializeCache(), a blocking method that allows users to pre-populate
the cache before processing any data if they wish.
3. void maybeCache(RowData key, Collection value), a blocking or
non-blocking method.
4. void refreshCache(), a blocking / non-blocking method that is invoked by
the Flink framework when the cache refresh trigger is pulled.

In the above design, partial caching and full caching would be
implementations of the CacheStrategy. And it is OK for users to implement
their own CacheStrategy if they want to.

Thanks,

Jiangjie (Becket) Qin
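
To make the proposal above easier to picture, the hypothetical CacheStrategy could be sketched roughly as below. None of these types exist in Flink as described; the names simply mirror the four methods listed above, and the Context is left as a placeholder.

    import java.util.Collection;

    import org.apache.flink.table.data.RowData;

    // Hypothetical sketch of the CacheStrategy proposed above; not an actual Flink API.
    interface CacheStrategy {

        /** Exposes resources such as an executor synchronized with data processing
         *  or a trigger that blocks processing while the cache is refreshed. */
        void open(Context context);

        /** Optionally pre-populates the cache before any data is processed (blocking). */
        void initializeCache();

        /** Offers a looked-up key/value pair to the cache; may be blocking or non-blocking. */
        void maybeCache(RowData key, Collection<RowData> value);

        /** Invoked by the framework when the cache refresh trigger is pulled. */
        void refreshCache();

        /** Placeholder for the context described in point 1 of the proposal. */
        interface Context {}
    }

Under this sketch, partial caching and full caching would simply be two implementations of CacheStrategy, matching the intent described in the email above.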


On Thu, Jun 2, 2022 at 12:14 PM Jark Wu  wrote:

> Thank Qingsheng for the detailed summary and updates,
>
> The changes look good to me in general. I just have one minor improvement
> comment.
> Could we add a static util method to the "FullCachingReloadTrigger"
> interface for quick usage?
>
> #periodicReloadAtFixedRate(Duration)
> #periodicReloadWithFixedDelay(Duration)
>
> I think we can also do this for LookupCache, because users may not know
> where the default
> implementations are and how to use them.
>
> Best,
> Jark
>
>
>
>
>
>
> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren  wrote:
>
> > Hi Jingsong,
> >
> > Thanks for your comments!
> >
> > > The AllCache definition is not flexible. For example, PartialCache can use
> > any custom storage while AllCache cannot; AllCache could also store to
> > memory or disk, so it also needs a flexible strategy.
> >
> > We had an offline discussion with Jark and Leonard. Basically we think
> > exposing the interface of full cache storage to connector developers
> might
> > limit our future optimizations. The storage of full caching shouldn’t
> have
> > too many variations for different lookup tables so making it pluggable
> > might not help a lot. Also I think it is not quite easy for connector
> > developers to implement such an optimized storage. We can keep optimizing
> > this storage in the future and all full caching lookup tables would
> benefit
> > from this.
> >
> > > We are more inclined to deprecate the connector `async` option when
> > discussing FLIP-234. Can we remove this option from this FLIP?
> >
> > Thanks for the reminder! This option has been removed in the latest
> > version.
> >
> > Best regards,
> >
> > Qingsheng
> >
> >
> > > On Jun 1, 2022, at 15:28, Jingsong Li  wrote:
> > >
> > > Thanks Alexander for your reply. We can discuss the new interface when
> it
> > > comes out.
> > >
> > > We are more inclined to deprecate the connector `async` option when
> > > discussing FLIP-234 [1]. We should use hint to let planner decide

  1   2   3   4   5   >