RE: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-04-09 Thread David Radley
Hi Han,
Thanks for getting back to me.

I am curious about the valid characters in a model name. I assume any
characters are valid, as it is a quoted string in SQL, so $ could already appear in the
model name. I would also think that the version would be determined when the model is
deployed (there could be other versions associated with authoring or
intermediate states of the model that never get deployed), rather than
allocated by Flink if none is provided.
I see that ONNX (https://github.com/onnx/onnx/blob/main/docs/Versioning.md) supports
plain version numbers as well as semantic versioning, and distinguishes three different types of versioning.

It would be interesting to see how champion/challenger scenarios would play out,
i.e. when you try a new version of the model that might perform better.
I suggest adding a new optional model-version keyword, which would seem to be a
cleaner way of specifying a particular version of a model.
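
Purely as an illustration of what I mean (this VERSION clause is only a strawman of
mine, not something proposed in the FLIP), it could look something like:

CREATE MODEL `my_model` VERSION '2' WITH (...);

SELECT f1, f2, label FROM
  ML_PREDICT(
    input => `my_data`,
    model => `my_cat`.`my_db`.`my_model` VERSION '2',
    args => DESCRIPTOR(f1, f2));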



 Kind regards, David.

From: Hao Li 
Date: Wednesday, 3 April 2024 at 18:58
To: dev@flink.apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL
Cross-posting David Radley's comments here from the voting thread:

> I don’t think this counts as an objection, I have some comments. I should
have put this on the discussion thread earlier but have just got to this.
> - I suggest we can put a model version in the model resource. Versions
are notoriously difficult to add later; I don’t think we want to
proliferate differently named models as a model mutates. We may want to
work with non-latest models.
> - I see that the model name is the unique identifier. I realise this
would move away from the Oracle syntax – so may not be feasible short term;
but I wonder if we can have:
> - a uuid as the main identifier and the model name as an attribute.
> or
>  - a namespace (or something like a system of origin)
> to help organise models with the same name.
> - does the model have an owner? I assume that Flink model resource is the
master of the model? I imagine in the future that a model that comes in via
a new connector could be kept up to date with the external model and would
not be allowed to be changed by anything other than the connector.

Thanks for the comments. I agree supporting the model version is important.
I think we could support versioning without changing the overall syntax by
appending version number/name to the model name. Catalog implementations
can handle the versions. For example,

CREATE MODEL `my-model$1`...

"$1" would imply it's version 1. If no version is provided, we can auto
increment the version if the model name exists already or create the first
version if the model name doesn't exist yet.

As for model ownership, I'm not entirely sure about the use case and how it
should be controlled. It could be controlled from the user side through
ACL/RBAC or in some way in the catalog, I guess. Maybe we can follow up on this
as the requirement or use case becomes clearer.

Cross-posting David Moravek's comments from the voting thread:

> My only suggestion would be to move Catalog changes into a separate
> interface to allow us to begin with lower stability guarantees. Existing
> Catalogs would be able to opt-in by implementing it. It's a minor thing
> though, overall the FLIP is solid and the direction is pretty exciting.

I think it's fine to move the model-related catalog changes to a separate
interface and let the current catalog interface extend it. As model support
will be built into Flink, the current catalog interface will need to
support model CRUD operations. For my own education, can you elaborate more
on how a separate interface will allow us to begin with lower stability
guarantees?

Thanks,
Hao


On Thu, Mar 28, 2024 at 10:14 AM Hao Li  wrote:

> Thanks Timo. I'll start a vote tomorrow if no further discussion.
>
> Thanks,
> Hao
>
> On Thu, Mar 28, 2024 at 9:33 AM Timo Walther  wrote:
>
>> Hi everyone,
>>
>> I updated the FLIP according to this discussion.
>>
>> @Hao Li: Let me know if I made a mistake somewhere. I added some
>> additional explaning comments about the new PTF syntax.
>>
>> There are no further objections from my side. If nobody objects, Hao
>> feel free to start the voting tomorrow.
>>
>> Regards,
>> Timo
>>
>>
>> On 28.03.24 16:30, Jark Wu wrote:
>> > Thanks, Hao,
>> >
>> > Sounds good to me.
>> >
>> > Best,
>> > Jark
>> >
>> > On Thu, 28 Mar 2024 at 01:02, Hao Li  wrote:
>> >
>> >> Hi Jark,
>> >>
>> >> I think we can start with supporting popular model providers such as
>> >> openai, azureml, sagemaker for remote models.
>> >>
>> >> Thanks,
>> >> Hao
>> >>
>> >> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:
>>

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-04-03 Thread Hao Li
Cross-posting David Radley's comments here from the voting thread:

> I don’t think this counts as an objection, I have some comments. I should
have put this on the discussion thread earlier but have just got to this.
> - I suggest we can put a model version in the model resource. Versions
are notoriously difficult to add later; I don’t think we want to
proliferate differently named models as a model mutates. We may want to
work with non-latest models.
> - I see that the model name is the unique identifier. I realise this
would move away from the Oracle syntax – so may not be feasible short term;
but I wonder if we can have:
> - a uuid as the main identifier and the model name as an attribute.
> or
>  - a namespace (or something like a system of origin)
> to help organise models with the same name.
> - does the model have an owner? I assume that Flink model resource is the
master of the model? I imagine in the future that a model that comes in via
a new connector could be kept up to date with the external model and would
not be allowed to be changed by anything other than the connector.

Thanks for the comments. I agree supporting the model version is important.
I think we could support versioning without changing the overall syntax by
appending version number/name to the model name. Catalog implementations
can handle the versions. For example,

CREATE MODEL `my-model$1`...

"$1" would imply it's version 1. If no version is provided, we can auto
increment the version if the model name exists already or create the first
version if the model name doesn't exist yet.
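
As a rough end-to-end sketch of that convention (WITH options elided; the exact
handling of the "$" suffix would be up to the catalog implementation):

-- explicitly create version 1
CREATE MODEL `my-model$1` WITH (...);

-- no version given: the catalog auto-increments and stores this as `my-model$2`
CREATE MODEL `my-model` WITH (...);

-- reference a specific version at inference time
SELECT f1, f2, label FROM
  ML_PREDICT(
    input => `my_data`,
    model => `my_cat`.`my_db`.`my-model$1`,
    args => DESCRIPTOR(f1, f2));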

As for model ownership, I'm not entirely sure about the use case and how it
should be controlled. It could be controlled from the user side through
ACL/RBAC or in some way in the catalog, I guess. Maybe we can follow up on this
as the requirement or use case becomes clearer.

Cross-posting David Moravek's comments from the voting thread:

> My only suggestion would be to move Catalog changes into a separate
> interface to allow us to begin with lower stability guarantees. Existing
> Catalogs would be able to opt-in by implementing it. It's a minor thing
> though, overall the FLIP is solid and the direction is pretty exciting.

I think it's fine to move the model-related catalog changes to a separate
interface and let the current catalog interface extend it. As model support
will be built into Flink, the current catalog interface will need to
support model CRUD operations. For my own education, can you elaborate more
on how a separate interface will allow us to begin with lower stability
guarantees?

Thanks,
Hao


On Thu, Mar 28, 2024 at 10:14 AM Hao Li  wrote:

> Thanks Timo. I'll start a vote tomorrow if no further discussion.
>
> Thanks,
> Hao
>
> On Thu, Mar 28, 2024 at 9:33 AM Timo Walther  wrote:
>
>> Hi everyone,
>>
>> I updated the FLIP according to this discussion.
>>
>> @Hao Li: Let me know if I made a mistake somewhere. I added some
>> additional explaning comments about the new PTF syntax.
>>
>> There are no further objections from my side. If nobody objects, Hao
>> feel free to start the voting tomorrow.
>>
>> Regards,
>> Timo
>>
>>
>> On 28.03.24 16:30, Jark Wu wrote:
>> > Thanks, Hao,
>> >
>> > Sounds good to me.
>> >
>> > Best,
>> > Jark
>> >
>> > On Thu, 28 Mar 2024 at 01:02, Hao Li  wrote:
>> >
>> >> Hi Jark,
>> >>
>> >> I think we can start with supporting popular model providers such as
>> >> openai, azureml, sagemaker for remote models.
>> >>
>> >> Thanks,
>> >> Hao
>> >>
>> >> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:
>> >>
>> >>> Thanks for the PoC and updating,
>> >>>
>> >>> The final syntax looks good to me, at least it is a nice and concise
>> >> first
>> >>> step.
>> >>>
>> >>> SELECT f1, f2, label FROM
>> >>> ML_PREDICT(
>> >>>   input => `my_data`,
>> >>>   model => `my_cat`.`my_db`.`classifier_model`,
>> >>>   args => DESCRIPTOR(f1, f2));
>> >>>
>> >>> Besides, what built-in models will we support in the FLIP? This might
>> be
>> >>> important
>> >>> because it relates to what use cases can run with the new Flink
>> version
>> >> out
>> >>> of the box.
>> >>>
>> >>> Best,
>> >>> Jark
>> >>>
>> >>> On Wed, 27 Mar 2024 at 01:10, Hao Li 
>> wrote:
>> >>>
>>  Hi Timo,
>> 
>>  Yeah. For `primary key` and `from table(...)` those are explicitly
>> >>> matched
>>  in parser: [1].
>> 
>> > SELECT f1, f2, label FROM
>>  ML_PREDICT(
>>    input => `my_data`,
>>    model => `my_cat`.`my_db`.`classifier_model`,
>>    args => DESCRIPTOR(f1, f2));
>> 
>>  This named argument syntax looks good to me. It can be supported
>> >> together
>>  with
>> 
>>  SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
>>  `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
>> 
>>  Sure. Will let you know once updated the FLIP.
>> 
>>  [1]
>> 
>> 
>> >>>
>> >>
>> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-tab

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-28 Thread Hao Li
Thanks Timo. I'll start a vote tomorrow if no further discussion.

Thanks,
Hao

On Thu, Mar 28, 2024 at 9:33 AM Timo Walther  wrote:

> Hi everyone,
>
> I updated the FLIP according to this discussion.
>
> @Hao Li: Let me know if I made a mistake somewhere. I added some
> additional explaning comments about the new PTF syntax.
>
> There are no further objections from my side. If nobody objects, Hao
> feel free to start the voting tomorrow.
>
> Regards,
> Timo
>
>
> On 28.03.24 16:30, Jark Wu wrote:
> > Thanks, Hao,
> >
> > Sounds good to me.
> >
> > Best,
> > Jark
> >
> > On Thu, 28 Mar 2024 at 01:02, Hao Li  wrote:
> >
> >> Hi Jark,
> >>
> >> I think we can start with supporting popular model providers such as
> >> openai, azureml, sagemaker for remote models.
> >>
> >> Thanks,
> >> Hao
> >>
> >> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:
> >>
> >>> Thanks for the PoC and updating,
> >>>
> >>> The final syntax looks good to me, at least it is a nice and concise
> >> first
> >>> step.
> >>>
> >>> SELECT f1, f2, label FROM
> >>> ML_PREDICT(
> >>>   input => `my_data`,
> >>>   model => `my_cat`.`my_db`.`classifier_model`,
> >>>   args => DESCRIPTOR(f1, f2));
> >>>
> >>> Besides, what built-in models will we support in the FLIP? This might
> be
> >>> important
> >>> because it relates to what use cases can run with the new Flink version
> >> out
> >>> of the box.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Wed, 27 Mar 2024 at 01:10, Hao Li  wrote:
> >>>
>  Hi Timo,
> 
>  Yeah. For `primary key` and `from table(...)` those are explicitly
> >>> matched
>  in parser: [1].
> 
> > SELECT f1, f2, label FROM
>  ML_PREDICT(
>    input => `my_data`,
>    model => `my_cat`.`my_db`.`classifier_model`,
>    args => DESCRIPTOR(f1, f2));
> 
>  This named argument syntax looks good to me. It can be supported
> >> together
>  with
> 
>  SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
>  `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
> 
>  Sure. Will let you know once updated the FLIP.
> 
>  [1]
> 
> 
> >>>
> >>
> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
> 
>  Thanks,
>  Hao
> 
>  On Tue, Mar 26, 2024 at 4:15 AM Timo Walther 
> >> wrote:
> 
> > Hi Hao,
> >
> >   > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)`
> >> doesn't
> >   > work since `TABLE` and `MODEL` are already key words
> >
> > This argument doesn't count. The parser supports introducing keywords
> > that are still non-reserved. For example, this enables using "key"
> >> for
> > both primary key and a column name:
> >
> > CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> > WITH ('connector' = 'datagen');
> >
> > SELECT i AS key FROM t;
> >
> > I'm sure we will introduce `TABLE(my_data)` eventually as this is
> >> what
> > the standard dictates. But for now, let's use the most compact syntax
> > possible which is also in sync with Oracle.
> >
> > TLDR: We allow identifiers as arguments for PTFs which are expanded
> >>> with
> > catalog and database if necessary. Those identifier arguments
> >> translate
> > to catalog lookups for table and models. The ML_ functions will make
> > sure that the arguments are of correct type model or table.
> >
> > SELECT f1, f2, label FROM
> > ML_PREDICT(
> >   input => `my_data`,
> >   model => `my_cat`.`my_db`.`classifier_model`,
> >   args => DESCRIPTOR(f1, f2));
> >
> > So this will allow us to also use in the future:
> >
> > SELECT * FROM poly_func(table1);
> >
> > Same support as Oracle [1]. Very concise.
> >
> > Let me know when you updated the FLIP for a final review before
> >> voting.
> >
> > Do others have additional objections?
> >
> > Regards,
> > Timo
> >
> > [1]
> >
> >
> 
> >>>
> >>
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> >
> >
> >
> > On 25.03.24 23:40, Hao Li wrote:
> >> Hi Timo,
> >>
> >>> Please double check if this is implementable with the current
> >>> stack. I
> >> fear the parser or validator might not like the "identifier"
> >>> argument?
> >>
> >> I checked this, currently the validator throws an exception trying
> >> to
>  get
> >> the full qualifier name for `classifier_model`. But since
> >> `SqlValidatorImpl` is implemented in Flink, we should be able to
> >> fix
> > this.
> >> The only caveator is if not full model path is provided,
> >> the qualifier is interpreted as a column. We should be able to
> >>> special
> >> handle this by rewriting the `ml_predict` function to add the
> >> catalog
>  and
> >> database nam

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-28 Thread Timo Walther

Hi everyone,

I updated the FLIP according to this discussion.

@Hao Li: Let me know if I made a mistake somewhere. I added some 
additional explanatory comments about the new PTF syntax.


There are no further objections from my side. If nobody objects, Hao, 
feel free to start the voting tomorrow.


Regards,
Timo


On 28.03.24 16:30, Jark Wu wrote:

Thanks, Hao,

Sounds good to me.

Best,
Jark

On Thu, 28 Mar 2024 at 01:02, Hao Li  wrote:


Hi Jark,

I think we can start with supporting popular model providers such as
openai, azureml, sagemaker for remote models.

Thanks,
Hao

On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:


Thanks for the PoC and updating,

The final syntax looks good to me, at least it is a nice and concise

first

step.

SELECT f1, f2, label FROM
ML_PREDICT(
  input => `my_data`,
  model => `my_cat`.`my_db`.`classifier_model`,
  args => DESCRIPTOR(f1, f2));

Besides, what built-in models will we support in the FLIP? This might be
important
because it relates to what use cases can run with the new Flink version

out

of the box.

Best,
Jark

On Wed, 27 Mar 2024 at 01:10, Hao Li  wrote:


Hi Timo,

Yeah. For `primary key` and `from table(...)` those are explicitly

matched

in parser: [1].


SELECT f1, f2, label FROM

ML_PREDICT(
  input => `my_data`,
  model => `my_cat`.`my_db`.`classifier_model`,
  args => DESCRIPTOR(f1, f2));

This named argument syntax looks good to me. It can be supported

together

with

SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
`my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));

Sure. Will let you know once updated the FLIP.

[1]





https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814


Thanks,
Hao

On Tue, Mar 26, 2024 at 4:15 AM Timo Walther 

wrote:



Hi Hao,

  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)`

doesn't

  > work since `TABLE` and `MODEL` are already key words

This argument doesn't count. The parser supports introducing keywords
that are still non-reserved. For example, this enables using "key"

for

both primary key and a column name:

CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
WITH ('connector' = 'datagen');

SELECT i AS key FROM t;

I'm sure we will introduce `TABLE(my_data)` eventually as this is

what

the standard dictates. But for now, let's use the most compact syntax
possible which is also in sync with Oracle.

TLDR: We allow identifiers as arguments for PTFs which are expanded

with

catalog and database if necessary. Those identifier arguments

translate

to catalog lookups for table and models. The ML_ functions will make
sure that the arguments are of correct type model or table.

SELECT f1, f2, label FROM
ML_PREDICT(
  input => `my_data`,
  model => `my_cat`.`my_db`.`classifier_model`,
  args => DESCRIPTOR(f1, f2));

So this will allow us to also use in the future:

SELECT * FROM poly_func(table1);

Same support as Oracle [1]. Very concise.

Let me know when you updated the FLIP for a final review before

voting.


Do others have additional objections?

Regards,
Timo

[1]







https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html




On 25.03.24 23:40, Hao Li wrote:

Hi Timo,


Please double check if this is implementable with the current

stack. I

fear the parser or validator might not like the "identifier"

argument?


I checked this, currently the validator throws an exception trying

to

get

the full qualifier name for `classifier_model`. But since
`SqlValidatorImpl` is implemented in Flink, we should be able to

fix

this.

The only caveator is if not full model path is provided,
the qualifier is interpreted as a column. We should be able to

special

handle this by rewriting the `ml_predict` function to add the

catalog

and

database name in `FlinkCalciteSqlValidator` though.


SELECT f1, f2, label FROM

 ML_PREDICT(
   TABLE `my_data`,
   my_cat.my_db.classifier_model,
   DESCRIPTOR(f1, f2))

SELECT f1, f2, label FROM
 ML_PREDICT(
   input => TABLE `my_data`,
   model => my_cat.my_db.classifier_model,
   args => DESCRIPTOR(f1, f2))

I verified these can be parsed. The problem is in validator for

qualifier

as mentioned above.


So the safest option would be the long-term solution:


SELECT f1, f2, label FROM
 ML_PREDICT(
   input => TABLE(my_data),
   model => MODEL(my_cat.my_db.classifier_model),
   args => DESCRIPTOR(f1, f2))

`TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't

work

since `TABLE` and `MODEL` are already key words in calcite used by

`CREATE

TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and

will

be

treated as a function.

So I think

SELECT f1, f2, label FROM
 ML_PREDICT(
   input => TABLE `my_data`,
   model => my_cat.my_db.classifier_model,
   args => DESCRIPTOR(f1, f2))
should be fine for now.

For the 

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-28 Thread Jark Wu
Thanks, Hao,

Sounds good to me.

Best,
Jark

On Thu, 28 Mar 2024 at 01:02, Hao Li  wrote:

> Hi Jark,
>
> I think we can start with supporting popular model providers such as
> openai, azureml, sagemaker for remote models.
>
> Thanks,
> Hao
>
> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:
>
> > Thanks for the PoC and updating,
> >
> > The final syntax looks good to me, at least it is a nice and concise
> first
> > step.
> >
> > SELECT f1, f2, label FROM
> >ML_PREDICT(
> >  input => `my_data`,
> >  model => `my_cat`.`my_db`.`classifier_model`,
> >  args => DESCRIPTOR(f1, f2));
> >
> > Besides, what built-in models will we support in the FLIP? This might be
> > important
> > because it relates to what use cases can run with the new Flink version
> out
> > of the box.
> >
> > Best,
> > Jark
> >
> > On Wed, 27 Mar 2024 at 01:10, Hao Li  wrote:
> >
> > > Hi Timo,
> > >
> > > Yeah. For `primary key` and `from table(...)` those are explicitly
> > matched
> > > in parser: [1].
> > >
> > > > SELECT f1, f2, label FROM
> > >ML_PREDICT(
> > >  input => `my_data`,
> > >  model => `my_cat`.`my_db`.`classifier_model`,
> > >  args => DESCRIPTOR(f1, f2));
> > >
> > > This named argument syntax looks good to me. It can be supported
> together
> > > with
> > >
> > > SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
> > > `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
> > >
> > > Sure. Will let you know once updated the FLIP.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
> > >
> > > Thanks,
> > > Hao
> > >
> > > On Tue, Mar 26, 2024 at 4:15 AM Timo Walther 
> wrote:
> > >
> > > > Hi Hao,
> > > >
> > > >  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)`
> doesn't
> > > >  > work since `TABLE` and `MODEL` are already key words
> > > >
> > > > This argument doesn't count. The parser supports introducing keywords
> > > > that are still non-reserved. For example, this enables using "key"
> for
> > > > both primary key and a column name:
> > > >
> > > > CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> > > > WITH ('connector' = 'datagen');
> > > >
> > > > SELECT i AS key FROM t;
> > > >
> > > > I'm sure we will introduce `TABLE(my_data)` eventually as this is
> what
> > > > the standard dictates. But for now, let's use the most compact syntax
> > > > possible which is also in sync with Oracle.
> > > >
> > > > TLDR: We allow identifiers as arguments for PTFs which are expanded
> > with
> > > > catalog and database if necessary. Those identifier arguments
> translate
> > > > to catalog lookups for table and models. The ML_ functions will make
> > > > sure that the arguments are of correct type model or table.
> > > >
> > > > SELECT f1, f2, label FROM
> > > >ML_PREDICT(
> > > >  input => `my_data`,
> > > >  model => `my_cat`.`my_db`.`classifier_model`,
> > > >  args => DESCRIPTOR(f1, f2));
> > > >
> > > > So this will allow us to also use in the future:
> > > >
> > > > SELECT * FROM poly_func(table1);
> > > >
> > > > Same support as Oracle [1]. Very concise.
> > > >
> > > > Let me know when you updated the FLIP for a final review before
> voting.
> > > >
> > > > Do others have additional objections?
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> > > >
> > > >
> > > >
> > > > On 25.03.24 23:40, Hao Li wrote:
> > > > > Hi Timo,
> > > > >
> > > > >> Please double check if this is implementable with the current
> > stack. I
> > > > > fear the parser or validator might not like the "identifier"
> > argument?
> > > > >
> > > > > I checked this, currently the validator throws an exception trying
> to
> > > get
> > > > > the full qualifier name for `classifier_model`. But since
> > > > > `SqlValidatorImpl` is implemented in Flink, we should be able to
> fix
> > > > this.
> > > > > The only caveator is if not full model path is provided,
> > > > > the qualifier is interpreted as a column. We should be able to
> > special
> > > > > handle this by rewriting the `ml_predict` function to add the
> catalog
> > > and
> > > > > database name in `FlinkCalciteSqlValidator` though.
> > > > >
> > > > >> SELECT f1, f2, label FROM
> > > > > ML_PREDICT(
> > > > >   TABLE `my_data`,
> > > > >   my_cat.my_db.classifier_model,
> > > > >   DESCRIPTOR(f1, f2))
> > > > >
> > > > > SELECT f1, f2, label FROM
> > > > > ML_PREDICT(
> > > > >   input => TABLE `my_data`,
> > > > >   model => my_cat.my_db.classifier_model,
> > > > >   args => DESCRIPTOR(f1, f2))
> > > > >
> > > > > I verified these can be parsed. The problem is in validator for
> > > qualifier
> > > > > as mentioned above.
> > > > >
> > > > >> So the safest option would be the long-term solution:
> > > > >
> > > > > SELECT f1, f2, label FROM
> > > > 

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-27 Thread Hao Li
Hi Jark,

I think we can start with supporting popular model providers such as
openai, azureml, sagemaker for remote models.
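
Just to sketch what a remote model definition could look like (the option names
'provider', 'endpoint' and 'api-key' below are placeholders for illustration, not
names finalized in the FLIP):

CREATE MODEL `classifier_model`
WITH (
  'provider' = 'openai',   -- or 'azureml', 'sagemaker'
  'endpoint' = '...',      -- provider-specific inference endpoint
  'api-key' = '...'        -- credentials for the remote provider
);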

Thanks,
Hao

On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:

> Thanks for the PoC and updating,
>
> The final syntax looks good to me, at least it is a nice and concise first
> step.
>
> SELECT f1, f2, label FROM
>ML_PREDICT(
>  input => `my_data`,
>  model => `my_cat`.`my_db`.`classifier_model`,
>  args => DESCRIPTOR(f1, f2));
>
> Besides, what built-in models will we support in the FLIP? This might be
> important
> because it relates to what use cases can run with the new Flink version out
> of the box.
>
> Best,
> Jark
>
> On Wed, 27 Mar 2024 at 01:10, Hao Li  wrote:
>
> > Hi Timo,
> >
> > Yeah. For `primary key` and `from table(...)` those are explicitly
> matched
> > in parser: [1].
> >
> > > SELECT f1, f2, label FROM
> >ML_PREDICT(
> >  input => `my_data`,
> >  model => `my_cat`.`my_db`.`classifier_model`,
> >  args => DESCRIPTOR(f1, f2));
> >
> > This named argument syntax looks good to me. It can be supported together
> > with
> >
> > SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
> > `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
> >
> > Sure. Will let you know once updated the FLIP.
> >
> > [1]
> >
> >
> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
> >
> > Thanks,
> > Hao
> >
> > On Tue, Mar 26, 2024 at 4:15 AM Timo Walther  wrote:
> >
> > > Hi Hao,
> > >
> > >  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
> > >  > work since `TABLE` and `MODEL` are already key words
> > >
> > > This argument doesn't count. The parser supports introducing keywords
> > > that are still non-reserved. For example, this enables using "key" for
> > > both primary key and a column name:
> > >
> > > CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> > > WITH ('connector' = 'datagen');
> > >
> > > SELECT i AS key FROM t;
> > >
> > > I'm sure we will introduce `TABLE(my_data)` eventually as this is what
> > > the standard dictates. But for now, let's use the most compact syntax
> > > possible which is also in sync with Oracle.
> > >
> > > TLDR: We allow identifiers as arguments for PTFs which are expanded
> with
> > > catalog and database if necessary. Those identifier arguments translate
> > > to catalog lookups for table and models. The ML_ functions will make
> > > sure that the arguments are of correct type model or table.
> > >
> > > SELECT f1, f2, label FROM
> > >ML_PREDICT(
> > >  input => `my_data`,
> > >  model => `my_cat`.`my_db`.`classifier_model`,
> > >  args => DESCRIPTOR(f1, f2));
> > >
> > > So this will allow us to also use in the future:
> > >
> > > SELECT * FROM poly_func(table1);
> > >
> > > Same support as Oracle [1]. Very concise.
> > >
> > > Let me know when you updated the FLIP for a final review before voting.
> > >
> > > Do others have additional objections?
> > >
> > > Regards,
> > > Timo
> > >
> > > [1]
> > >
> > >
> >
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> > >
> > >
> > >
> > > On 25.03.24 23:40, Hao Li wrote:
> > > > Hi Timo,
> > > >
> > > >> Please double check if this is implementable with the current
> stack. I
> > > > fear the parser or validator might not like the "identifier"
> argument?
> > > >
> > > > I checked this, currently the validator throws an exception trying to
> > get
> > > > the full qualifier name for `classifier_model`. But since
> > > > `SqlValidatorImpl` is implemented in Flink, we should be able to fix
> > > this.
> > > > The only caveator is if not full model path is provided,
> > > > the qualifier is interpreted as a column. We should be able to
> special
> > > > handle this by rewriting the `ml_predict` function to add the catalog
> > and
> > > > database name in `FlinkCalciteSqlValidator` though.
> > > >
> > > >> SELECT f1, f2, label FROM
> > > > ML_PREDICT(
> > > >   TABLE `my_data`,
> > > >   my_cat.my_db.classifier_model,
> > > >   DESCRIPTOR(f1, f2))
> > > >
> > > > SELECT f1, f2, label FROM
> > > > ML_PREDICT(
> > > >   input => TABLE `my_data`,
> > > >   model => my_cat.my_db.classifier_model,
> > > >   args => DESCRIPTOR(f1, f2))
> > > >
> > > > I verified these can be parsed. The problem is in validator for
> > qualifier
> > > > as mentioned above.
> > > >
> > > >> So the safest option would be the long-term solution:
> > > >
> > > > SELECT f1, f2, label FROM
> > > > ML_PREDICT(
> > > >   input => TABLE(my_data),
> > > >   model => MODEL(my_cat.my_db.classifier_model),
> > > >   args => DESCRIPTOR(f1, f2))
> > > >
> > > > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
> > work
> > > > since `TABLE` and `MODEL` are already key words in calcite used by
> > > `CREATE
> > > > TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and will
>

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-26 Thread Jark Wu
Thanks for the PoC and the update.

The final syntax looks good to me, at least it is a nice and concise first
step.

SELECT f1, f2, label FROM
   ML_PREDICT(
     input => `my_data`,
     model => `my_cat`.`my_db`.`classifier_model`,
     args => DESCRIPTOR(f1, f2));

Besides, what built-in models will we support in the FLIP? This might be
important
because it relates to what use cases can run with the new Flink version out
of the box.

Best,
Jark

On Wed, 27 Mar 2024 at 01:10, Hao Li  wrote:

> Hi Timo,
>
> Yeah. For `primary key` and `from table(...)` those are explicitly matched
> in parser: [1].
>
> > SELECT f1, f2, label FROM
>ML_PREDICT(
>  input => `my_data`,
>  model => `my_cat`.`my_db`.`classifier_model`,
>  args => DESCRIPTOR(f1, f2));
>
> This named argument syntax looks good to me. It can be supported together
> with
>
> SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
> `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
>
> Sure. Will let you know once updated the FLIP.
>
> [1]
>
> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
>
> Thanks,
> Hao
>
> On Tue, Mar 26, 2024 at 4:15 AM Timo Walther  wrote:
>
> > Hi Hao,
> >
> >  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
> >  > work since `TABLE` and `MODEL` are already key words
> >
> > This argument doesn't count. The parser supports introducing keywords
> > that are still non-reserved. For example, this enables using "key" for
> > both primary key and a column name:
> >
> > CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> > WITH ('connector' = 'datagen');
> >
> > SELECT i AS key FROM t;
> >
> > I'm sure we will introduce `TABLE(my_data)` eventually as this is what
> > the standard dictates. But for now, let's use the most compact syntax
> > possible which is also in sync with Oracle.
> >
> > TLDR: We allow identifiers as arguments for PTFs which are expanded with
> > catalog and database if necessary. Those identifier arguments translate
> > to catalog lookups for table and models. The ML_ functions will make
> > sure that the arguments are of correct type model or table.
> >
> > SELECT f1, f2, label FROM
> >ML_PREDICT(
> >  input => `my_data`,
> >  model => `my_cat`.`my_db`.`classifier_model`,
> >  args => DESCRIPTOR(f1, f2));
> >
> > So this will allow us to also use in the future:
> >
> > SELECT * FROM poly_func(table1);
> >
> > Same support as Oracle [1]. Very concise.
> >
> > Let me know when you updated the FLIP for a final review before voting.
> >
> > Do others have additional objections?
> >
> > Regards,
> > Timo
> >
> > [1]
> >
> >
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> >
> >
> >
> > On 25.03.24 23:40, Hao Li wrote:
> > > Hi Timo,
> > >
> > >> Please double check if this is implementable with the current stack. I
> > > fear the parser or validator might not like the "identifier" argument?
> > >
> > > I checked this, currently the validator throws an exception trying to
> get
> > > the full qualifier name for `classifier_model`. But since
> > > `SqlValidatorImpl` is implemented in Flink, we should be able to fix
> > this.
> > > The only caveator is if not full model path is provided,
> > > the qualifier is interpreted as a column. We should be able to special
> > > handle this by rewriting the `ml_predict` function to add the catalog
> and
> > > database name in `FlinkCalciteSqlValidator` though.
> > >
> > >> SELECT f1, f2, label FROM
> > > ML_PREDICT(
> > >   TABLE `my_data`,
> > >   my_cat.my_db.classifier_model,
> > >   DESCRIPTOR(f1, f2))
> > >
> > > SELECT f1, f2, label FROM
> > > ML_PREDICT(
> > >   input => TABLE `my_data`,
> > >   model => my_cat.my_db.classifier_model,
> > >   args => DESCRIPTOR(f1, f2))
> > >
> > > I verified these can be parsed. The problem is in validator for
> qualifier
> > > as mentioned above.
> > >
> > >> So the safest option would be the long-term solution:
> > >
> > > SELECT f1, f2, label FROM
> > > ML_PREDICT(
> > >   input => TABLE(my_data),
> > >   model => MODEL(my_cat.my_db.classifier_model),
> > >   args => DESCRIPTOR(f1, f2))
> > >
> > > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
> work
> > > since `TABLE` and `MODEL` are already key words in calcite used by
> > `CREATE
> > > TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and will be
> > > treated as a function.
> > >
> > > So I think
> > >
> > > SELECT f1, f2, label FROM
> > > ML_PREDICT(
> > >   input => TABLE `my_data`,
> > >   model => my_cat.my_db.classifier_model,
> > >   args => DESCRIPTOR(f1, f2))
> > > should be fine for now.
> > >
> > > For the syntax part:
> > > 1). Sounds good. We can drop model task and model kind from the
> > definition.
> > > They can be deduced from the options.
> > >
> > > 2). Sure. We can add temporary mode

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-26 Thread Hao Li
Hi Timo,

Yeah. For `primary key` and `from table(...)`, those are explicitly matched
in the parser: [1].

> SELECT f1, f2, label FROM
   ML_PREDICT(
     input => `my_data`,
     model => `my_cat`.`my_db`.`classifier_model`,
     args => DESCRIPTOR(f1, f2));

This named argument syntax looks good to me. It can be supported together
with

SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
`my_cat`.`my_db`.`classifier_model`, DESCRIPTOR(f1, f2));

Sure. Will let you know once I've updated the FLIP.

[1]
https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814

Thanks,
Hao

On Tue, Mar 26, 2024 at 4:15 AM Timo Walther  wrote:

> Hi Hao,
>
>  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
>  > work since `TABLE` and `MODEL` are already key words
>
> This argument doesn't count. The parser supports introducing keywords
> that are still non-reserved. For example, this enables using "key" for
> both primary key and a column name:
>
> CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> WITH ('connector' = 'datagen');
>
> SELECT i AS key FROM t;
>
> I'm sure we will introduce `TABLE(my_data)` eventually as this is what
> the standard dictates. But for now, let's use the most compact syntax
> possible which is also in sync with Oracle.
>
> TLDR: We allow identifiers as arguments for PTFs which are expanded with
> catalog and database if necessary. Those identifier arguments translate
> to catalog lookups for table and models. The ML_ functions will make
> sure that the arguments are of correct type model or table.
>
> SELECT f1, f2, label FROM
>ML_PREDICT(
>  input => `my_data`,
>  model => `my_cat`.`my_db`.`classifier_model`,
>  args => DESCRIPTOR(f1, f2));
>
> So this will allow us to also use in the future:
>
> SELECT * FROM poly_func(table1);
>
> Same support as Oracle [1]. Very concise.
>
> Let me know when you updated the FLIP for a final review before voting.
>
> Do others have additional objections?
>
> Regards,
> Timo
>
> [1]
>
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
>
>
>
> On 25.03.24 23:40, Hao Li wrote:
> > Hi Timo,
> >
> >> Please double check if this is implementable with the current stack. I
> > fear the parser or validator might not like the "identifier" argument?
> >
> > I checked this, currently the validator throws an exception trying to get
> > the full qualifier name for `classifier_model`. But since
> > `SqlValidatorImpl` is implemented in Flink, we should be able to fix
> this.
> > The only caveator is if not full model path is provided,
> > the qualifier is interpreted as a column. We should be able to special
> > handle this by rewriting the `ml_predict` function to add the catalog and
> > database name in `FlinkCalciteSqlValidator` though.
> >
> >> SELECT f1, f2, label FROM
> > ML_PREDICT(
> >   TABLE `my_data`,
> >   my_cat.my_db.classifier_model,
> >   DESCRIPTOR(f1, f2))
> >
> > SELECT f1, f2, label FROM
> > ML_PREDICT(
> >   input => TABLE `my_data`,
> >   model => my_cat.my_db.classifier_model,
> >   args => DESCRIPTOR(f1, f2))
> >
> > I verified these can be parsed. The problem is in validator for qualifier
> > as mentioned above.
> >
> >> So the safest option would be the long-term solution:
> >
> > SELECT f1, f2, label FROM
> > ML_PREDICT(
> >   input => TABLE(my_data),
> >   model => MODEL(my_cat.my_db.classifier_model),
> >   args => DESCRIPTOR(f1, f2))
> >
> > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't work
> > since `TABLE` and `MODEL` are already key words in calcite used by
> `CREATE
> > TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and will be
> > treated as a function.
> >
> > So I think
> >
> > SELECT f1, f2, label FROM
> > ML_PREDICT(
> >   input => TABLE `my_data`,
> >   model => my_cat.my_db.classifier_model,
> >   args => DESCRIPTOR(f1, f2))
> > should be fine for now.
> >
> > For the syntax part:
> > 1). Sounds good. We can drop model task and model kind from the
> definition.
> > They can be deduced from the options.
> >
> > 2). Sure. We can add temporary model
> >
> > 3). Make sense. We can use `show create model ` to display all
> > information and `describe model ` to show input/output schema
> >
> > Thanks,
> > Hao
> >
> > On Mon, Mar 25, 2024 at 3:21 PM Hao Li  wrote:
> >
> >> Hi Ahmed,
> >>
> >> Looks like the feature freeze time for 1.20 release is June 15th. We can
> >> definitely get the model DDL into 1.20. For predict and evaluate
> functions,
> >> if we can't get into the 1.20 release, we can get them into the 1.21
> >> release for sure.
> >>
> >> Thanks,
> >> Hao
> >>
> >>
> >>
> >> On Mon, Mar 25, 2024 at 1:25 AM Timo Walther 
> wrote:
> >>
> >>> Hi Jark and Hao,
> >>>
> >>> thanks for the information, Jark! Great that the Calcite community
> >>> already fixed the problem for us. +1 to adopt the simp

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-26 Thread Timo Walther

Hi Hao,

> `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
> work since `TABLE` and `MODEL` are already key words

This argument doesn't count. The parser supports introducing keywords 
that are still non-reserved. For example, this enables using "key" for 
both primary key and a column name:


CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
WITH ('connector' = 'datagen');

SELECT i AS key FROM t;

I'm sure we will introduce `TABLE(my_data)` eventually as this is what 
the standard dictates. But for now, let's use the most compact syntax 
possible which is also in sync with Oracle.


TLDR: We allow identifiers as arguments for PTFs which are expanded with 
catalog and database if necessary. Those identifier arguments translate 
to catalog lookups for tables and models. The ML_ functions will make 
sure that the arguments are of the correct type (model or table).


SELECT f1, f2, label FROM
  ML_PREDICT(
input => `my_data`,
model => `my_cat`.`my_db`.`classifier_model`,
args => DESCRIPTOR(f1, f2));
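
For example, an unqualified model name would simply be expanded against the
current catalog and database (assuming `my_cat`/`my_db` are the current ones):

USE CATALOG my_cat;
USE my_db;

SELECT f1, f2, label FROM
  ML_PREDICT(
    input => `my_data`,
    model => `classifier_model`,  -- resolved to `my_cat`.`my_db`.`classifier_model`
    args => DESCRIPTOR(f1, f2));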

So this will allow us to also use in the future:

SELECT * FROM poly_func(table1);

Same support as Oracle [1]. Very concise.

Let me know when you updated the FLIP for a final review before voting.

Do others have additional objections?

Regards,
Timo

[1] 
https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html




On 25.03.24 23:40, Hao Li wrote:

Hi Timo,


Please double check if this is implementable with the current stack. I

fear the parser or validator might not like the "identifier" argument?

I checked this, currently the validator throws an exception trying to get
the full qualifier name for `classifier_model`. But since
`SqlValidatorImpl` is implemented in Flink, we should be able to fix this.
The only caveator is if not full model path is provided,
the qualifier is interpreted as a column. We should be able to special
handle this by rewriting the `ml_predict` function to add the catalog and
database name in `FlinkCalciteSqlValidator` though.


SELECT f1, f2, label FROM

ML_PREDICT(
  TABLE `my_data`,
  my_cat.my_db.classifier_model,
  DESCRIPTOR(f1, f2))

SELECT f1, f2, label FROM
ML_PREDICT(
  input => TABLE `my_data`,
  model => my_cat.my_db.classifier_model,
  args => DESCRIPTOR(f1, f2))

I verified these can be parsed. The problem is in validator for qualifier
as mentioned above.


So the safest option would be the long-term solution:


SELECT f1, f2, label FROM
ML_PREDICT(
  input => TABLE(my_data),
  model => MODEL(my_cat.my_db.classifier_model),
  args => DESCRIPTOR(f1, f2))

`TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't work
since `TABLE` and `MODEL` are already key words in calcite used by `CREATE
TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and will be
treated as a function.

So I think

SELECT f1, f2, label FROM
ML_PREDICT(
  input => TABLE `my_data`,
  model => my_cat.my_db.classifier_model,
  args => DESCRIPTOR(f1, f2))
should be fine for now.

For the syntax part:
1). Sounds good. We can drop model task and model kind from the definition.
They can be deduced from the options.

2). Sure. We can add temporary model

3). Make sense. We can use `show create model ` to display all
information and `describe model ` to show input/output schema

Thanks,
Hao

On Mon, Mar 25, 2024 at 3:21 PM Hao Li  wrote:


Hi Ahmed,

Looks like the feature freeze time for 1.20 release is June 15th. We can
definitely get the model DDL into 1.20. For predict and evaluate functions,
if we can't get into the 1.20 release, we can get them into the 1.21
release for sure.

Thanks,
Hao



On Mon, Mar 25, 2024 at 1:25 AM Timo Walther  wrote:


Hi Jark and Hao,

thanks for the information, Jark! Great that the Calcite community
already fixed the problem for us. +1 to adopt the simplified syntax
asap. Maybe even before we upgrade Calcite (i.e. copy over classes), if
upgrading Calcite is too much work right now?

  > Is `DESCRIPTOR` a must in the syntax?

Yes, we should still stick to the standard as much as possible and all
vendors use DESCRIPTOR/COLUMNS for distinuishing columns vs. literal
arguments. So the final syntax of this discussion would be:


SELECT f1, f2, label FROM
ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))

SELECT * FROM
ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1, f2))

Please double check if this is implementable with the current stack. I
fear the parser or validator might not like the "identifier" argument?

Make sure that also these variations are supported:

SELECT f1, f2, label FROM
ML_PREDICT(
  TABLE `my_data`,
  my_cat.my_db.classifier_model,
  DESCRIPTOR(f1, f2))

SELECT f1, f2, label FROM
ML_PREDICT(
  input => TABLE `my_data`,
  model => my_cat.my_db.classifier_model,
  args => DESCRIPTOR(f1, f2))

It might be safer and more future proof to wrap a MODEL() func

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-25 Thread Hao Li
Hi Timo,

> Please double check if this is implementable with the current stack. I
fear the parser or validator might not like the "identifier" argument?

I checked this; currently the validator throws an exception trying to get
the full qualifier name for `classifier_model`. But since
`SqlValidatorImpl` is implemented in Flink, we should be able to fix this.
The only caveat is that if the full model path is not provided,
the qualifier is interpreted as a column. We should be able to special-case
this by rewriting the `ml_predict` function to add the catalog and
database name in `FlinkCalciteSqlValidator`, though.

> SELECT f1, f2, label FROM
   ML_PREDICT(
     TABLE `my_data`,
     my_cat.my_db.classifier_model,
     DESCRIPTOR(f1, f2))

SELECT f1, f2, label FROM
   ML_PREDICT(
     input => TABLE `my_data`,
     model => my_cat.my_db.classifier_model,
     args => DESCRIPTOR(f1, f2))

I verified these can be parsed. The problem is in validator for qualifier
as mentioned above.

> So the safest option would be the long-term solution:

SELECT f1, f2, label FROM
   ML_PREDICT(
     input => TABLE(my_data),
     model => MODEL(my_cat.my_db.classifier_model),
     args => DESCRIPTOR(f1, f2))

`TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` don't work,
since `TABLE` and `MODEL` are already keywords in Calcite used by `CREATE
TABLE` and `CREATE MODEL`. Changing to `model_name(...)` works and will be
treated as a function.

So I think

SELECT f1, f2, label FROM
   ML_PREDICT(
     input => TABLE `my_data`,
     model => my_cat.my_db.classifier_model,
     args => DESCRIPTOR(f1, f2))
should be fine for now.

For the syntax part:
1). Sounds good. We can drop model task and model kind from the definition.
They can be deduced from the options.

2). Sure. We can add temporary model

3). Makes sense. We can use `show create model ` to display all
information and `describe model ` to show the input/output schema.

Thanks,
Hao

On Mon, Mar 25, 2024 at 3:21 PM Hao Li  wrote:

> Hi Ahmed,
>
> Looks like the feature freeze time for 1.20 release is June 15th. We can
> definitely get the model DDL into 1.20. For predict and evaluate functions,
> if we can't get into the 1.20 release, we can get them into the 1.21
> release for sure.
>
> Thanks,
> Hao
>
>
>
> On Mon, Mar 25, 2024 at 1:25 AM Timo Walther  wrote:
>
>> Hi Jark and Hao,
>>
>> thanks for the information, Jark! Great that the Calcite community
>> already fixed the problem for us. +1 to adopt the simplified syntax
>> asap. Maybe even before we upgrade Calcite (i.e. copy over classes), if
>> upgrading Calcite is too much work right now?
>>
>>  > Is `DESCRIPTOR` a must in the syntax?
>>
>> Yes, we should still stick to the standard as much as possible and all
>> vendors use DESCRIPTOR/COLUMNS for distinuishing columns vs. literal
>> arguments. So the final syntax of this discussion would be:
>>
>>
>> SELECT f1, f2, label FROM
>>ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))
>>
>> SELECT * FROM
>>ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1, f2))
>>
>> Please double check if this is implementable with the current stack. I
>> fear the parser or validator might not like the "identifier" argument?
>>
>> Make sure that also these variations are supported:
>>
>> SELECT f1, f2, label FROM
>>ML_PREDICT(
>>  TABLE `my_data`,
>>  my_cat.my_db.classifier_model,
>>  DESCRIPTOR(f1, f2))
>>
>> SELECT f1, f2, label FROM
>>ML_PREDICT(
>>  input => TABLE `my_data`,
>>  model => my_cat.my_db.classifier_model,
>>  args => DESCRIPTOR(f1, f2))
>>
>> It might be safer and more future proof to wrap a MODEL() function
>> around it. This would be more in sync with the standard that actually
>> still requires to put a TABLE() around the input argument:
>>
>> ML_PREDICT(TABLE(`my_data`) PARTITIONED BY c1 ORDERED BY c1, )
>>
>> So the safest option would be the long-term solution:
>>
>> SELECT f1, f2, label FROM
>>ML_PREDICT(
>>  input => TABLE(my_data),
>>  model => MODEL(my_cat.my_db.classifier_model),
>>  args => DESCRIPTOR(f1, f2))
>>
>> But I'm fine with this if others have a strong opinion:
>>
>> SELECT f1, f2, label FROM
>>ML_PREDICT(
>>  input => TABLE `my_data`,
>>  model => my_cat.my_db.classifier_model,
>>  args => DESCRIPTOR(f1, f2))
>>
>> Some feedback for the remainder of the FLIP:
>>
>> 1) Simplify catalog objects
>>
>> I would suggest to drop:
>> CatalogModel.getModelKind()
>> CatalogModel.getModelTask()
>>
>> A catalog object should fully resemble the DDL. And since the DDL puts
>> those properties in the WITH clause, the catalog object should the same
>> (i.e. put them into the `getModelOptions()`). Btw renaming this method
>> to just `getOptions()` for consistency should be good as well.
>> Internally, we can still provide enums for these frequently used
>> classes. Similar to what we do in `FactoryUtil` for other frequently
>> used options.
>>
>> Remove `getDes

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-25 Thread Hao Li
Hi Ahmed,

Looks like the feature freeze time for the 1.20 release is June 15th. We can
definitely get the model DDL into 1.20. For the predict and evaluate functions,
if we can't get them into the 1.20 release, we can get them into the 1.21
release for sure.

Thanks,
Hao



On Mon, Mar 25, 2024 at 1:25 AM Timo Walther  wrote:

> Hi Jark and Hao,
>
> thanks for the information, Jark! Great that the Calcite community
> already fixed the problem for us. +1 to adopt the simplified syntax
> asap. Maybe even before we upgrade Calcite (i.e. copy over classes), if
> upgrading Calcite is too much work right now?
>
>  > Is `DESCRIPTOR` a must in the syntax?
>
> Yes, we should still stick to the standard as much as possible and all
> vendors use DESCRIPTOR/COLUMNS for distinuishing columns vs. literal
> arguments. So the final syntax of this discussion would be:
>
>
> SELECT f1, f2, label FROM
>ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))
>
> SELECT * FROM
>ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1, f2))
>
> Please double check if this is implementable with the current stack. I
> fear the parser or validator might not like the "identifier" argument?
>
> Make sure that also these variations are supported:
>
> SELECT f1, f2, label FROM
>ML_PREDICT(
>  TABLE `my_data`,
>  my_cat.my_db.classifier_model,
>  DESCRIPTOR(f1, f2))
>
> SELECT f1, f2, label FROM
>ML_PREDICT(
>  input => TABLE `my_data`,
>  model => my_cat.my_db.classifier_model,
>  args => DESCRIPTOR(f1, f2))
>
> It might be safer and more future proof to wrap a MODEL() function
> around it. This would be more in sync with the standard that actually
> still requires to put a TABLE() around the input argument:
>
> ML_PREDICT(TABLE(`my_data`) PARTITIONED BY c1 ORDERED BY c1, )
>
> So the safest option would be the long-term solution:
>
> SELECT f1, f2, label FROM
>ML_PREDICT(
>  input => TABLE(my_data),
>  model => MODEL(my_cat.my_db.classifier_model),
>  args => DESCRIPTOR(f1, f2))
>
> But I'm fine with this if others have a strong opinion:
>
> SELECT f1, f2, label FROM
>ML_PREDICT(
>  input => TABLE `my_data`,
>  model => my_cat.my_db.classifier_model,
>  args => DESCRIPTOR(f1, f2))
>
> Some feedback for the remainder of the FLIP:
>
> 1) Simplify catalog objects
>
> I would suggest to drop:
> CatalogModel.getModelKind()
> CatalogModel.getModelTask()
>
> A catalog object should fully resemble the DDL. And since the DDL puts
> those properties in the WITH clause, the catalog object should the same
> (i.e. put them into the `getModelOptions()`). Btw renaming this method
> to just `getOptions()` for consistency should be good as well.
> Internally, we can still provide enums for these frequently used
> classes. Similar to what we do in `FactoryUtil` for other frequently
> used options.
>
> Remove `getDescription()` and `getDetailedDescription()`. They were a
> mistake for CatalogTable and should actually be deprecated. They got
> replaced by `getComment()` which is sufficient.
>
> 2) CREATE TEMPORARY MODEL is not supported.
>
> This is an unnecessary restriction. We should support temporary versions
> of these catalog objects as well for consistency. Adding support for
> this should be straightforward.
>
> 3) DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name
>
> I would suggest we support `SHOW CREATE MODEL` instead. Similar to `SHOW
> CREATE TABLE`, this should show all properties. If we support `DESCRIBE
> MODEL` it should only list the input parameters similar to `DESCRIBE
> TABLE` only shows the columns (not the WITH clause).
>
> Regards,
> Timo
>
>
> On 23.03.24 13:17, Ahmed Hamdy wrote:
> > Hi everyone,
> > +1 for this proposal, I believe it is very useful to the minimum, It
> would
> > be great even having  "ML_PREDICT" and "ML_EVALUATE" as built-in PTFs in
> > this FLIP as discussed.
> > IIUC this will be included in the 1.20 roadmap?
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Fri, 22 Mar 2024 at 23:54, Hao Li  wrote:
> >
> >> Hi Timo and Jark,
> >>
> >> I agree Oracle's syntax seems concise and more descriptive. For the
> >> built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark we
> can
> >> support them as built-in PTF using `SqlTableFunction` for this FLIP. We
> can
> >> have a different FLIP discussing user defined PTF and adopt that later
> for
> >> model functions later. To summarize, the current proposed syntax is
> >>
> >> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >> `classifier_model`, f1, f2))
> >>
> >> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> f1,
> >> f2))
> >>
> >> Is `DESCRIPTOR` a must in the syntax? If so, it becomes
> >>
> >> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
> >>
> >> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> >> DESCRIPTOR(f1), DESCRI

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-25 Thread Timo Walther

Hi Jark and Hao,

thanks for the information, Jark! Great that the Calcite community 
already fixed the problem for us. +1 to adopt the simplified syntax 
asap. Maybe even before we upgrade Calcite (i.e. copy over classes), if 
upgrading Calcite is too much work right now?


> Is `DESCRIPTOR` a must in the syntax?

Yes, we should still stick to the standard as much as possible, and all 
vendors use DESCRIPTOR/COLUMNS for distinguishing columns vs. literal 
arguments. So the final syntax of this discussion would be:



SELECT f1, f2, label FROM
  ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))

SELECT * FROM
  ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1, f2))

Please double check if this is implementable with the current stack. I 
fear the parser or validator might not like the "identifier" argument?


Make sure that also these variations are supported:

SELECT f1, f2, label FROM
  ML_PREDICT(
TABLE `my_data`,
my_cat.my_db.classifier_model,
DESCRIPTOR(f1, f2))

SELECT f1, f2, label FROM
  ML_PREDICT(
input => TABLE `my_data`,
model => my_cat.my_db.classifier_model,
args => DESCRIPTOR(f1, f2))

It might be safer and more future-proof to wrap a MODEL() function 
around it. This would be more in sync with the standard, which actually 
still requires putting a TABLE() around the input argument:


ML_PREDICT(TABLE(`my_data`) PARTITIONED BY c1 ORDERED BY c1, )

So the safest option would be the long-term solution:

SELECT f1, f2, label FROM
  ML_PREDICT(
input => TABLE(my_data),
model => MODEL(my_cat.my_db.classifier_model),
args => DESCRIPTOR(f1, f2))

But I'm fine with this if others have a strong opinion:

SELECT f1, f2, label FROM
  ML_PREDICT(
input => TABLE `my_data`,
model => my_cat.my_db.classifier_model,
args => DESCRIPTOR(f1, f2))

Some feedback for the remainder of the FLIP:

1) Simplify catalog objects

I would suggest to drop:
CatalogModel.getModelKind()
CatalogModel.getModelTask()

A catalog object should fully resemble the DDL. And since the DDL puts 
those properties in the WITH clause, the catalog object should do the same 
(i.e. put them into `getModelOptions()`). Btw, renaming this method 
to just `getOptions()` for consistency would be good as well. 
Internally, we can still provide enums for these frequently used 
classes. Similar to what we do in `FactoryUtil` for other frequently 
used options.


Remove `getDescription()` and `getDetailedDescription()`. They were a 
mistake for CatalogTable and should actually be deprecated. They got 
replaced by `getComment()` which is sufficient.


2) CREATE TEMPORARY MODEL is not supported.

This is an unnecessary restriction. We should support temporary versions 
of these catalog objects as well for consistency. Adding support for 
this should be straightforward.
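
The temporary variant would then simply mirror the permanent DDL (sketch only;
the option key is an illustrative placeholder):

CREATE TEMPORARY MODEL `classifier_model`
WITH (
  'provider' = 'openai'  -- illustrative key
);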


3) { DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name

I would suggest we support `SHOW CREATE MODEL` instead. Similar to `SHOW 
CREATE TABLE`, this should show all properties. If we support `DESCRIBE 
MODEL`, it should only list the input parameters, similar to how `DESCRIBE 
TABLE` only shows the columns (not the WITH clause).
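
In other words, the intended split could look like this (sketch; the model
identifier is illustrative):

-- full definition, including the WITH clause:
SHOW CREATE MODEL `classifier_model`;

-- only the input/output parameters, analogous to DESCRIBE TABLE:
DESCRIBE MODEL `classifier_model`;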


Regards,
Timo


On 23.03.24 13:17, Ahmed Hamdy wrote:

Hi everyone,
+1 for this proposal, I believe it is very useful. At a minimum, it would
be great to even have "ML_PREDICT" and "ML_EVALUATE" as built-in PTFs in
this FLIP as discussed.
IIUC this will be included in the 1.20 roadmap?
Best Regards
Ahmed Hamdy


On Fri, 22 Mar 2024 at 23:54, Hao Li  wrote:


Hi Timo and Jark,

I agree Oracle's syntax seems concise and more descriptive. For the
built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark we can
support them as built-in PTF using `SqlTableFunction` for this FLIP. We can
have a different FLIP discussing user defined PTF and adopt that later for
model functions later. To summarize, the current proposed syntax is

SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
`classifier_model`, f1, f2))

SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`, f1,
f2))

Is `DESCRIPTOR` a must in the syntax? If so, it becomes

SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
`classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))

SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
DESCRIPTOR(f1), DESCRIPTOR(f2)))

If Calcite supports dropping outer table keyword, it becomes

SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`, `classifier_model`,
DESCRIPTOR(f1), DESCRIPTOR(f2))

SELECT * FROM ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
DESCRIPTOR(
f1), DESCRIPTOR(f2))

Thanks,
Hao



On Fri, Mar 22, 2024 at 9:16 AM Jark Wu  wrote:


Sorry, I mean we can bump the Calcite version if needed in Flink 1.20.

On Fri, 22 Mar 2024 at 22:19, Jark Wu  wrote:


Hi Timo,

Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
But I think the ML model FLIP is not blocked by this, because we
can intro

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-23 Thread Ahmed Hamdy
Hi everyone,
+1 for this proposal, I believe it is very useful. At a minimum, it would
be great to even have "ML_PREDICT" and "ML_EVALUATE" as built-in PTFs in
this FLIP as discussed.
IIUC this will be included in the 1.20 roadmap?
Best Regards
Ahmed Hamdy


On Fri, 22 Mar 2024 at 23:54, Hao Li  wrote:

> Hi Timo and Jark,
>
> I agree Oracle's syntax seems concise and more descriptive. For the
> built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark we can
> support them as built-in PTF using `SqlTableFunction` for this FLIP. We can
> have a different FLIP discussing user defined PTF and adopt that later for
> model functions later. To summarize, the current proposed syntax is
>
> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> `classifier_model`, f1, f2))
>
> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`, f1,
> f2))
>
> Is `DESCRIPTOR` a must in the syntax? If so, it becomes
>
> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
>
> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> DESCRIPTOR(f1), DESCRIPTOR(f2)))
>
> If Calcite supports dropping outer table keyword, it becomes
>
> SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`, `classifier_model`,
> DESCRIPTOR(f1), DESCRIPTOR(f2))
>
> SELECT * FROM ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> DESCRIPTOR(
> f1), DESCRIPTOR(f2))
>
> Thanks,
> Hao
>
>
>
> On Fri, Mar 22, 2024 at 9:16 AM Jark Wu  wrote:
>
> > Sorry, I mean we can bump the Calcite version if needed in Flink 1.20.
> >
> > On Fri, 22 Mar 2024 at 22:19, Jark Wu  wrote:
> >
> > > Hi Timo,
> > >
> > > Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
> > > But I think the ML model FLIP is not blocked by this, because we
> > > can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> > > just like TUMBLE/HOP. And support user-defined ML functions as
> > > a future FLIP.
> > >
> > > Regarding the simplified PTF syntax which reduces the outer TABLE()
> > > keyword,
> > > it seems it was just supported[1] by the Calcite community last month
> and
> > > will be
> > > released in the next version (v1.37). The Calcite community is
> preparing
> > > the
> > > 1.37 release, so we can bump the version if needed in Flink 1.19.
> > >
> > > Best,
> > > Jark
> > >
> > > [1]: https://issues.apache.org/jira/browse/CALCITE-6254
> > >
> > > On Fri, 22 Mar 2024 at 21:46, Timo Walther  wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> this is a very important change to the Flink SQL syntax but we can't
> > >> wait until the SQL standard is ready for this. So I'm +1 on
> introducing
> > >> the MODEL concept as a first class citizen in Flink.
> > >>
> > >> For your information: Over the past months I have already spent a
> > >> significant amount of time thinking about how we can introduce PTFs in
> > >> Flink. I reserved FLIP-440[1] for this purpose and I will share a
> > >> version of this in the next 1-2 weeks.
> > >>
> > >> For a good implementation of FLIP-440 and also FLIP-437, we should
> > >> evolve the PTF syntax in collaboration with Apache Calcite.
> > >>
> > >> There are different syntax versions out there:
> > >>
> > >> 1) Flink
> > >>
> > >> SELECT * FROM
> > >>TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10'
> MINUTES));
> > >>
> > >> 2) SQL standard
> > >>
> > >> SELECT * FROM
> > >>TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10'
> > MINUTES));
> > >>
> > >> 3) Oracle
> > >>
> > >> SELECT * FROM
> > >> TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
> > >>
> > >> As you can see above, Flink does not follow the standard correctly as
> it
> > >> would need to use `TABLE()` but this is not provided by Calcite yet.
> > >>
> > >> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> > >> keywords to a minimum. Personally, I would like to discuss this syntax
> > >> in a separate FLIP and hope I will find supporters for:
> > >>
> > >>
> > >> SELECT * FROM
> > >>TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
> > >>
> > >> If we go entirely with the Oracle syntax, as you can see in the
> example,
> > >> Oracle allows for passing identifiers directly. This would solve our
> > >> problems for the MODEL as well:
> > >>
> > >> SELECT f1, f2, label FROM ML_PREDICT(
> > >>data => `my_data`,
> > >>model => `classifier_model`,
> > >>input => DESCRIPTOR(f1, f2));
> > >>
> > >> Or we completely adopt the Oracle syntax:
> > >>
> > >> SELECT f1, f2, label FROM ML_PREDICT(
> > >>data => `my_data`,
> > >>model => `classifier_model`,
> > >>input => COLUMNS(f1, f2));
> > >>
> > >>
> > >> What do you think?
> > >>
> > >> Happy to create a FLIP for just this syntax question and collaborate
> > >> with the Calcite community on this. Supporting the syntax of Oracle
> > >> shouldn't be too hard to convince at least as parser parameter.
> > >>
> > >> Regards,
> > >> 

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-22 Thread Hao Li
Hi Timo and Jark,

I agree Oracle's syntax seems concise and more descriptive. For the
built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark we can
support them as built-in PTFs using `SqlTableFunction` for this FLIP. We can
have a different FLIP discussing user-defined PTFs and adopt that for
model functions later. To summarize, the current proposed syntax is

SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
`classifier_model`, f1, f2))

SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`, f1,
f2))

Is `DESCRIPTOR` a must in the syntax? If so, it becomes

SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
`classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))

SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
DESCRIPTOR(f1), DESCRIPTOR(f2)))

If Calcite supports dropping outer table keyword, it becomes

SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`, `classifier_model`,
DESCRIPTOR(f1), DESCRIPTOR(f2))

SELECT * FROM ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(
f1), DESCRIPTOR(f2))

Thanks,
Hao



On Fri, Mar 22, 2024 at 9:16 AM Jark Wu  wrote:

> Sorry, I mean we can bump the Calcite version if needed in Flink 1.20.
>
> On Fri, 22 Mar 2024 at 22:19, Jark Wu  wrote:
>
> > Hi Timo,
> >
> > Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
> > But I think the ML model FLIP is not blocked by this, because we
> > can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> > just like TUMBLE/HOP. And support user-defined ML functions as
> > a future FLIP.
> >
> > Regarding the simplified PTF syntax which reduces the outer TABLE()
> > keyword,
> > it seems it was just supported[1] by the Calcite community last month and
> > will be
> > released in the next version (v1.37). The Calcite community is preparing
> > the
> > 1.37 release, so we can bump the version if needed in Flink 1.19.
> >
> > Best,
> > Jark
> >
> > [1]: https://issues.apache.org/jira/browse/CALCITE-6254
> >
> > On Fri, 22 Mar 2024 at 21:46, Timo Walther  wrote:
> >
> >> Hi everyone,
> >>
> >> this is a very important change to the Flink SQL syntax but we can't
> >> wait until the SQL standard is ready for this. So I'm +1 on introducing
> >> the MODEL concept as a first class citizen in Flink.
> >>
> >> For your information: Over the past months I have already spent a
> >> significant amount of time thinking about how we can introduce PTFs in
> >> Flink. I reserved FLIP-440[1] for this purpose and I will share a
> >> version of this in the next 1-2 weeks.
> >>
> >> For a good implementation of FLIP-440 and also FLIP-437, we should
> >> evolve the PTF syntax in collaboration with Apache Calcite.
> >>
> >> There are different syntax versions out there:
> >>
> >> 1) Flink
> >>
> >> SELECT * FROM
> >>TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> >>
> >> 2) SQL standard
> >>
> >> SELECT * FROM
> >>TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10'
> MINUTES));
> >>
> >> 3) Oracle
> >>
> >> SELECT * FROM
> >> TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
> >>
> >> As you can see above, Flink does not follow the standard correctly as it
> >> would need to use `TABLE()` but this is not provided by Calcite yet.
> >>
> >> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> >> keywords to a minimum. Personally, I would like to discuss this syntax
> >> in a separate FLIP and hope I will find supporters for:
> >>
> >>
> >> SELECT * FROM
> >>TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
> >>
> >> If we go entirely with the Oracle syntax, as you can see in the example,
> >> Oracle allows for passing identifiers directly. This would solve our
> >> problems for the MODEL as well:
> >>
> >> SELECT f1, f2, label FROM ML_PREDICT(
> >>data => `my_data`,
> >>model => `classifier_model`,
> >>input => DESCRIPTOR(f1, f2));
> >>
> >> Or we completely adopt the Oracle syntax:
> >>
> >> SELECT f1, f2, label FROM ML_PREDICT(
> >>data => `my_data`,
> >>model => `classifier_model`,
> >>input => COLUMNS(f1, f2));
> >>
> >>
> >> What do you think?
> >>
> >> Happy to create a FLIP for just this syntax question and collaborate
> >> with the Calcite community on this. Supporting the syntax of Oracle
> >> shouldn't be too hard to convince at least as parser parameter.
> >>
> >> Regards,
> >> Timo
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
> >> [2]
> >>
> >>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
> >> [3]
> https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
> >>
> >>
> >>
> >> On 20.03.24 17:22, Mingge Deng wrote:
> >> > Thanks Jark for all the insightful comments.
> >> >
> >> > We have updated the proposal per our offline discussions:
> >> > 1. Model will

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-22 Thread Jark Wu
Sorry, I mean we can bump the Calcite version if needed in Flink 1.20.

On Fri, 22 Mar 2024 at 22:19, Jark Wu  wrote:

> Hi Timo,
>
> Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
> But I think the ML model FLIP is not blocked by this, because we
> can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> just like TUMBLE/HOP. And support user-defined ML functions as
> a future FLIP.
>
> Regarding the simplified PTF syntax which reduces the outer TABLE()
> keyword,
> it seems it was just supported[1] by the Calcite community last month and
> will be
> released in the next version (v1.37). The Calcite community is preparing
> the
> 1.37 release, so we can bump the version if needed in Flink 1.19.
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/CALCITE-6254
>
> On Fri, 22 Mar 2024 at 21:46, Timo Walther  wrote:
>
>> Hi everyone,
>>
>> this is a very important change to the Flink SQL syntax but we can't
>> wait until the SQL standard is ready for this. So I'm +1 on introducing
>> the MODEL concept as a first class citizen in Flink.
>>
>> For your information: Over the past months I have already spent a
>> significant amount of time thinking about how we can introduce PTFs in
>> Flink. I reserved FLIP-440[1] for this purpose and I will share a
>> version of this in the next 1-2 weeks.
>>
>> For a good implementation of FLIP-440 and also FLIP-437, we should
>> evolve the PTF syntax in collaboration with Apache Calcite.
>>
>> There are different syntax versions out there:
>>
>> 1) Flink
>>
>> SELECT * FROM
>>TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>>
>> 2) SQL standard
>>
>> SELECT * FROM
>>TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>>
>> 3) Oracle
>>
>> SELECT * FROM
>> TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
>>
>> As you can see above, Flink does not follow the standard correctly as it
>> would need to use `TABLE()` but this is not provided by Calcite yet.
>>
>> I really like the Oracle syntax[2][3] a lot. It reduces necessary
>> keywords to a minimum. Personally, I would like to discuss this syntax
>> in a separate FLIP and hope I will find supporters for:
>>
>>
>> SELECT * FROM
>>TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
>>
>> If we go entirely with the Oracle syntax, as you can see in the example,
>> Oracle allows for passing identifiers directly. This would solve our
>> problems for the MODEL as well:
>>
>> SELECT f1, f2, label FROM ML_PREDICT(
>>data => `my_data`,
>>model => `classifier_model`,
>>input => DESCRIPTOR(f1, f2));
>>
>> Or we completely adopt the Oracle syntax:
>>
>> SELECT f1, f2, label FROM ML_PREDICT(
>>data => `my_data`,
>>model => `classifier_model`,
>>input => COLUMNS(f1, f2));
>>
>>
>> What do you think?
>>
>> Happy to create a FLIP for just this syntax question and collaborate
>> with the Calcite community on this. Supporting the syntax of Oracle
>> shouldn't be too hard to convince at least as parser parameter.
>>
>> Regards,
>> Timo
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
>> [2]
>>
>> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
>> [3] https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
>>
>>
>>
>> On 20.03.24 17:22, Mingge Deng wrote:
>> > Thanks Jark for all the insightful comments.
>> >
>> > We have updated the proposal per our offline discussions:
>> > 1. Model will be treated as a new relation in FlinkSQL.
>> > 2. Include the common ML predict and evaluate functions into the open
>> > source flink to complete the user journey.
>> >  And we should be able to extend the calcite SqlTableFunction to
>> support
>> > these two ML functions.
>> >
>> > Best,
>> > Mingge
>> >
>> > On Mon, Mar 18, 2024 at 7:05 PM Jark Wu  wrote:
>> >
>> >> Hi Hao,
>> >>
>> >>> I meant how the table name
>> >> in window TVF gets translated to `SqlCallingBinding`. Probably we need
>> to
>> >> fetch the table definition from the catalog somewhere. Do we treat
>> those
>> >> window TVF specially in parser/planner so that catalog is looked up
>> when
>> >> they are seen?
>> >>
>> >> The table names are resolved and validated by Calcite SqlValidator.  We
>> >> don' need to fetch from catalog manually.
>> >> The specific checking logic of cumulate window happens in
>> >> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
>> >> The return type of SqlCumulateTableFunction is defined in
>> >> #getRowTypeInference() method.
>> >> Both are public interfaces provided by Calcite and it seems it's not
>> >> specially handled in parser/planner.
>> >>
>> >> I didn't try that, but my gut feeling is that the framework is ready to
>> >> extend a customized TVF.
>> >>
>> >>> For what model is, I'm wondering if it has to be datatype or relation.
>> >> Can

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-22 Thread Jark Wu
Hi Timo,

Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
But I think the ML model FLIP is not blocked by this, because we
can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
just like TUMBLE/HOP. And support user-defined ML functions as
a future FLIP.
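
For comparison, a sketch of how the built-in model PTF could line up with the
existing window PTFs (identifiers are illustrative, and the argument shape
follows the syntax summarized elsewhere in this thread):

SELECT * FROM
  TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));

SELECT * FROM
  TABLE(ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2)));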

Regarding the simplified PTF syntax which reduces the outer TABLE()
keyword,
it seems it was just supported[1] by the Calcite community last month and
will be
released in the next version (v1.37). The Calcite community is preparing
the
1.37 release, so we can bump the version if needed in Flink 1.19.

Best,
Jark

[1]: https://issues.apache.org/jira/browse/CALCITE-6254

On Fri, 22 Mar 2024 at 21:46, Timo Walther  wrote:

> Hi everyone,
>
> this is a very important change to the Flink SQL syntax but we can't
> wait until the SQL standard is ready for this. So I'm +1 on introducing
> the MODEL concept as a first class citizen in Flink.
>
> For your information: Over the past months I have already spent a
> significant amount of time thinking about how we can introduce PTFs in
> Flink. I reserved FLIP-440[1] for this purpose and I will share a
> version of this in the next 1-2 weeks.
>
> For a good implementation of FLIP-440 and also FLIP-437, we should
> evolve the PTF syntax in collaboration with Apache Calcite.
>
> There are different syntax versions out there:
>
> 1) Flink
>
> SELECT * FROM
>TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>
> 2) SQL standard
>
> SELECT * FROM
>TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>
> 3) Oracle
>
> SELECT * FROM
> TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
>
> As you can see above, Flink does not follow the standard correctly as it
> would need to use `TABLE()` but this is not provided by Calcite yet.
>
> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> keywords to a minimum. Personally, I would like to discuss this syntax
> in a separate FLIP and hope I will find supporters for:
>
>
> SELECT * FROM
>TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
>
> If we go entirely with the Oracle syntax, as you can see in the example,
> Oracle allows for passing identifiers directly. This would solve our
> problems for the MODEL as well:
>
> SELECT f1, f2, label FROM ML_PREDICT(
>data => `my_data`,
>model => `classifier_model`,
>input => DESCRIPTOR(f1, f2));
>
> Or we completely adopt the Oracle syntax:
>
> SELECT f1, f2, label FROM ML_PREDICT(
>data => `my_data`,
>model => `classifier_model`,
>input => COLUMNS(f1, f2));
>
>
> What do you think?
>
> Happy to create a FLIP for just this syntax question and collaborate
> with the Calcite community on this. Supporting the syntax of Oracle
> shouldn't be too hard to convince at least as parser parameter.
>
> Regards,
> Timo
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
> [2]
>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
> [3] https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
>
>
>
> On 20.03.24 17:22, Mingge Deng wrote:
> > Thanks Jark for all the insightful comments.
> >
> > We have updated the proposal per our offline discussions:
> > 1. Model will be treated as a new relation in FlinkSQL.
> > 2. Include the common ML predict and evaluate functions into the open
> > source flink to complete the user journey.
> >  And we should be able to extend the calcite SqlTableFunction to
> support
> > these two ML functions.
> >
> > Best,
> > Mingge
> >
> > On Mon, Mar 18, 2024 at 7:05 PM Jark Wu  wrote:
> >
> >> Hi Hao,
> >>
> >>> I meant how the table name
> >> in window TVF gets translated to `SqlCallingBinding`. Probably we need
> to
> >> fetch the table definition from the catalog somewhere. Do we treat those
> >> window TVF specially in parser/planner so that catalog is looked up when
> >> they are seen?
> >>
> >> The table names are resolved and validated by Calcite SqlValidator.  We
> >> don' need to fetch from catalog manually.
> >> The specific checking logic of cumulate window happens in
> >> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> >> The return type of SqlCumulateTableFunction is defined in
> >> #getRowTypeInference() method.
> >> Both are public interfaces provided by Calcite and it seems it's not
> >> specially handled in parser/planner.
> >>
> >> I didn't try that, but my gut feeling is that the framework is ready to
> >> extend a customized TVF.
> >>
> >>> For what model is, I'm wondering if it has to be datatype or relation.
> >> Can
> >> it be another kind of citizen parallel to datatype/relation/function/db?
> >> Redshift also supports `show models` operation, so it seems it's treated
> >> specially as well?
> >>
> >> If it is an entity only used in catalog scope (e.g., show xxx, create
> xxx,
> >> drop xxx),

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-22 Thread Timo Walther

Hi everyone,

this is a very important change to the Flink SQL syntax but we can't 
wait until the SQL standard is ready for this. So I'm +1 on introducing 
the MODEL concept as a first class citizen in Flink.


For your information: Over the past months I have already spent a 
significant amount of time thinking about how we can introduce PTFs in 
Flink. I reserved FLIP-440[1] for this purpose and I will share a 
version of this in the next 1-2 weeks.


For a good implementation of FLIP-440 and also FLIP-437, we should 
evolve the PTF syntax in collaboration with Apache Calcite.


There are different syntax versions out there:

1) Flink

SELECT * FROM
  TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));

2) SQL standard

SELECT * FROM
  TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));

3) Oracle

SELECT * FROM
   TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES);

As you can see above, Flink does not follow the standard correctly as it 
would need to use `TABLE()` but this is not provided by Calcite yet.


I really like the Oracle syntax[2][3] a lot. It reduces necessary 
keywords to a minimum. Personally, I would like to discuss this syntax 
in a separate FLIP and hope I will find supporters for:



SELECT * FROM
  TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);

If we go entirely with the Oracle syntax, as you can see in the example, 
Oracle allows for passing identifiers directly. This would solve our 
problems for the MODEL as well:


SELECT f1, f2, label FROM ML_PREDICT(
  data => `my_data`,
  model => `classifier_model`,
  input => DESCRIPTOR(f1, f2));

Or we completely adopt the Oracle syntax:

SELECT f1, f2, label FROM ML_PREDICT(
  data => `my_data`,
  model => `classifier_model`,
  input => COLUMNS(f1, f2));


What do you think?

Happy to create a FLIP for just this syntax question and collaborate 
with the Calcite community on this. Convincing the Calcite community to 
support the Oracle syntax shouldn't be too hard, at least as a parser parameter.


Regards,
Timo

[1] 
https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
[2] 
https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072

[3] https://oracle-base.com/articles/18c/polymorphic-table-functions-18c



On 20.03.24 17:22, Mingge Deng wrote:

Thanks Jark for all the insightful comments.

We have updated the proposal per our offline discussions:
1. Model will be treated as a new relation in FlinkSQL.
2. Include the common ML predict and evaluate functions into the open
source flink to complete the user journey.
 And we should be able to extend the calcite SqlTableFunction to support
these two ML functions.

Best,
Mingge

On Mon, Mar 18, 2024 at 7:05 PM Jark Wu  wrote:


Hi Hao,


I meant how the table name

in window TVF gets translated to `SqlCallingBinding`. Probably we need to
fetch the table definition from the catalog somewhere. Do we treat those
window TVF specially in parser/planner so that catalog is looked up when
they are seen?

The table names are resolved and validated by Calcite SqlValidator.  We
don' need to fetch from catalog manually.
The specific checking logic of cumulate window happens in
SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
The return type of SqlCumulateTableFunction is defined in
#getRowTypeInference() method.
Both are public interfaces provided by Calcite and it seems it's not
specially handled in parser/planner.

I didn't try that, but my gut feeling is that the framework is ready to
extend a customized TVF.


For what model is, I'm wondering if it has to be datatype or relation.

Can
it be another kind of citizen parallel to datatype/relation/function/db?
Redshift also supports `show models` operation, so it seems it's treated
specially as well?

If it is an entity only used in catalog scope (e.g., show xxx, create xxx,
drop xxx), it is fine to introduce it.
We have introduced such one before, called Module: "load module", "show
modules" [1].
But if we want to use Model in TVF parameters, it means it has to be a
relation or datatype, because
that is what it only accepts now.

Thanks for sharing the reason of preferring TVF instead of Redshift way. It
sounds reasonable to me.

Best,
Jark

  [1]:

https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/

On Fri, 15 Mar 2024 at 13:41, Hao Li  wrote:


Hi Jark,

Thanks for the pointer. Sorry for the confusion: I meant how the table

name

in window TVF gets translated to `SqlCallingBinding`. Probably we need to
fetch the table definition from the catalog somewhere. Do we treat those
window TVF specially in parser/planner so that catalog is looked up when
they are seen?

For what model is, I'm wondering if it has to be datatype or relation.

Can

it be another kind of citizen parallel to datatype/relation/function/db?
Redshift also supports `show models` opera

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-20 Thread Mingge Deng
Thanks Jark for all the insightful comments.

We have updated the proposal per our offline discussions:
1. Model will be treated as a new relation in FlinkSQL.
2. Include the common ML predict and evaluate functions into open-source
Flink to complete the user journey.
And we should be able to extend the Calcite SqlTableFunction to support
these two ML functions.

Best,
Mingge

On Mon, Mar 18, 2024 at 7:05 PM Jark Wu  wrote:

> Hi Hao,
>
> > I meant how the table name
> in window TVF gets translated to `SqlCallingBinding`. Probably we need to
> fetch the table definition from the catalog somewhere. Do we treat those
> window TVF specially in parser/planner so that catalog is looked up when
> they are seen?
>
> The table names are resolved and validated by Calcite SqlValidator.  We
> don' need to fetch from catalog manually.
> The specific checking logic of cumulate window happens in
> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> The return type of SqlCumulateTableFunction is defined in
> #getRowTypeInference() method.
> Both are public interfaces provided by Calcite and it seems it's not
> specially handled in parser/planner.
>
> I didn't try that, but my gut feeling is that the framework is ready to
> extend a customized TVF.
>
> > For what model is, I'm wondering if it has to be datatype or relation.
> Can
> it be another kind of citizen parallel to datatype/relation/function/db?
> Redshift also supports `show models` operation, so it seems it's treated
> specially as well?
>
> If it is an entity only used in catalog scope (e.g., show xxx, create xxx,
> drop xxx), it is fine to introduce it.
> We have introduced such one before, called Module: "load module", "show
> modules" [1].
> But if we want to use Model in TVF parameters, it means it has to be a
> relation or datatype, because
> that is what it only accepts now.
>
> Thanks for sharing the reason of preferring TVF instead of Redshift way. It
> sounds reasonable to me.
>
> Best,
> Jark
>
>  [1]:
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/
>
> On Fri, 15 Mar 2024 at 13:41, Hao Li  wrote:
>
> > Hi Jark,
> >
> > Thanks for the pointer. Sorry for the confusion: I meant how the table
> name
> > in window TVF gets translated to `SqlCallingBinding`. Probably we need to
> > fetch the table definition from the catalog somewhere. Do we treat those
> > window TVF specially in parser/planner so that catalog is looked up when
> > they are seen?
> >
> > For what model is, I'm wondering if it has to be datatype or relation.
> Can
> > it be another kind of citizen parallel to datatype/relation/function/db?
> > Redshift also supports `show models` operation, so it seems it's treated
> > specially as well? The reasons I don't like Redshift's syntax are:
> > 1. It's a bit verbose, you need to think of a model name as well as a
> > function name and the function name also needs to be unique.
> > 2. More importantly, prediction function isn't the only function that can
> > operate on models. There could be a set of inference functions [1] and
> > evaluation functions [2] which can operate on models. It's hard to
> specify
> > all of them in model creation.
> >
> > [1]:
> >
> >
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> > [2]:
> >
> >
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
> >
> > Thanks,
> > Hao
> >
> > On Thu, Mar 14, 2024 at 8:18 PM Jark Wu  wrote:
> >
> > > Hi Hao,
> > >
> > > > Can you send me some pointers
> > > where the function gets the table information?
> > >
> > > Here is the code of cumulate window type checking [1].
> > >
> > > > Also is it possible to support  in
> > > window functions in addiction to table?
> > >
> > > Yes. It is not allowed in TVF.
> > >
> > > Thanks for the syntax links of other systems. The reason I prefer the
> > > Redshift way is
> > > that it avoids introducing Model as a relation or datatype (referenced
> > as a
> > > parameter in TVF).
> > > Model is not a relation because it can be queried directly (e.g.,
> SELECT
> > *
> > > FROM model).
> > > I'm also confused about making Model as a datatype, because I don't
> know
> > > what class the
> > > model parameter of the eval method of TableFunction/ScalarFunction
> should
> > > be. By defining
> > > the function with the model, users can directly invoke the function
> > without
> > > reference to the model name.
> > >
> > > Best,
> > > Jark
> > >
> > > [1]:
> > >
> > >
> >
> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
> > >
> > > On Fri, 15 Mar 2024 at 02:48, Hao Li  wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > Thanks for the pointers. It's very helpful.
> > > >
> > > > 1. Looks like `tumble`, `hopping` are keywords in calcite parser. And
> > the
> > > > syntax `c

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-18 Thread Jark Wu
Hi Hao,

> I meant how the table name
in window TVF gets translated to `SqlCallingBinding`. Probably we need to
fetch the table definition from the catalog somewhere. Do we treat those
window TVF specially in parser/planner so that catalog is looked up when
they are seen?

The table names are resolved and validated by the Calcite SqlValidator. We
don't need to fetch from the catalog manually.
The specific checking logic of cumulate window happens in
SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
The return type of SqlCumulateTableFunction is defined in
#getRowTypeInference() method.
Both are public interfaces provided by Calcite and it seems it's not
specially handled in parser/planner.

I didn't try that, but my gut feeling is that the framework is ready to
extend a customized TVF.

> For what model is, I'm wondering if it has to be datatype or relation. Can
it be another kind of citizen parallel to datatype/relation/function/db?
Redshift also supports `show models` operation, so it seems it's treated
specially as well?

If it is an entity only used in catalog scope (e.g., show xxx, create xxx,
drop xxx), it is fine to introduce it.
We have introduced one such entity before, the Module: "load module", "show
modules" [1].
But if we want to use Model in TVF parameters, it has to be a
relation or datatype, because those are the only argument kinds
a TVF accepts today.
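
As a sketch, catalog-scope statements for models could mirror the existing
module statements (the statement shapes and the 'provider' option key are
illustrative only):

SHOW MODELS;
CREATE MODEL `classifier_model` WITH ('provider' = 'openai');
DROP MODEL `classifier_model`;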

Thanks for sharing the reason of preferring TVF instead of Redshift way. It
sounds reasonable to me.

Best,
Jark

 [1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/

On Fri, 15 Mar 2024 at 13:41, Hao Li  wrote:

> Hi Jark,
>
> Thanks for the pointer. Sorry for the confusion: I meant how the table name
> in window TVF gets translated to `SqlCallingBinding`. Probably we need to
> fetch the table definition from the catalog somewhere. Do we treat those
> window TVF specially in parser/planner so that catalog is looked up when
> they are seen?
>
> For what model is, I'm wondering if it has to be datatype or relation. Can
> it be another kind of citizen parallel to datatype/relation/function/db?
> Redshift also supports `show models` operation, so it seems it's treated
> specially as well? The reasons I don't like Redshift's syntax are:
> 1. It's a bit verbose, you need to think of a model name as well as a
> function name and the function name also needs to be unique.
> 2. More importantly, prediction function isn't the only function that can
> operate on models. There could be a set of inference functions [1] and
> evaluation functions [2] which can operate on models. It's hard to specify
> all of them in model creation.
>
> [1]:
>
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> [2]:
>
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
>
> Thanks,
> Hao
>
> On Thu, Mar 14, 2024 at 8:18 PM Jark Wu  wrote:
>
> > Hi Hao,
> >
> > > Can you send me some pointers
> > where the function gets the table information?
> >
> > Here is the code of cumulate window type checking [1].
> >
> > > Also is it possible to support  in
> > window functions in addiction to table?
> >
> > Yes. It is not allowed in TVF.
> >
> > Thanks for the syntax links of other systems. The reason I prefer the
> > Redshift way is
> > that it avoids introducing Model as a relation or datatype (referenced
> as a
> > parameter in TVF).
> > Model is not a relation because it can be queried directly (e.g., SELECT
> *
> > FROM model).
> > I'm also confused about making Model as a datatype, because I don't know
> > what class the
> > model parameter of the eval method of TableFunction/ScalarFunction should
> > be. By defining
> > the function with the model, users can directly invoke the function
> without
> > reference to the model name.
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> >
> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
> >
> > On Fri, 15 Mar 2024 at 02:48, Hao Li  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for the pointers. It's very helpful.
> > >
> > > 1. Looks like `tumble`, `hopping` are keywords in calcite parser. And
> the
> > > syntax `cumulate(Table my_table, ...)` needs to get table information
> > from
> > > catalog somewhere for type validation etc. Can you send me some
> pointers
> > > where the function gets the table information?
> > > 2. The ideal syntax for model function I think would be
> `ML_PREDICT(MODEL
> > > , {table  | (query_stmt) })`. I think with
> > special
> > > handling of the `ML_PREDICT` function in parser/planner, maybe we can
> do
> > > this like window functions. But to support `MODEL` keyword, we need
> > calcite
> > > parser change I guess. Also is it possible to support  in
> > > window functions in addiction to table?
> > >
> > > For the redshift syntax, I'm not sure the purpose of 

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-14 Thread Hao Li
Hi Jark,

Thanks for the pointer. Sorry for the confusion: I meant how the table name
in window TVF gets translated to `SqlCallingBinding`. Probably we need to
fetch the table definition from the catalog somewhere. Do we treat those
window TVF specially in parser/planner so that catalog is looked up when
they are seen?

For what model is, I'm wondering if it has to be datatype or relation. Can
it be another kind of citizen parallel to datatype/relation/function/db?
Redshift also supports `show models` operation, so it seems it's treated
specially as well? The reasons I don't like Redshift's syntax are:
1. It's a bit verbose, you need to think of a model name as well as a
function name and the function name also needs to be unique.
2. More importantly, prediction function isn't the only function that can
operate on models. There could be a set of inference functions [1] and
evaluation functions [2] which can operate on models. It's hard to specify
all of them in model creation.

[1]:
https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
[2]:
https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate

Thanks,
Hao

On Thu, Mar 14, 2024 at 8:18 PM Jark Wu  wrote:

> Hi Hao,
>
> > Can you send me some pointers
> where the function gets the table information?
>
> Here is the code of cumulate window type checking [1].
>
> > Also is it possible to support  in
> window functions in addiction to table?
>
> Yes. It is not allowed in TVF.
>
> Thanks for the syntax links of other systems. The reason I prefer the
> Redshift way is
> that it avoids introducing Model as a relation or datatype (referenced as a
> parameter in TVF).
> Model is not a relation because it can be queried directly (e.g., SELECT *
> FROM model).
> I'm also confused about making Model as a datatype, because I don't know
> what class the
> model parameter of the eval method of TableFunction/ScalarFunction should
> be. By defining
> the function with the model, users can directly invoke the function without
> reference to the model name.
>
> Best,
> Jark
>
> [1]:
>
> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
>
> On Fri, 15 Mar 2024 at 02:48, Hao Li  wrote:
>
> > Hi Jark,
> >
> > Thanks for the pointers. It's very helpful.
> >
> > 1. Looks like `tumble`, `hopping` are keywords in calcite parser. And the
> > syntax `cumulate(Table my_table, ...)` needs to get table information
> from
> > catalog somewhere for type validation etc. Can you send me some pointers
> > where the function gets the table information?
> > 2. The ideal syntax for model function I think would be `ML_PREDICT(MODEL
> > , {table  | (query_stmt) })`. I think with
> special
> > handling of the `ML_PREDICT` function in parser/planner, maybe we can do
> > this like window functions. But to support `MODEL` keyword, we need
> calcite
> > parser change I guess. Also is it possible to support  in
> > window functions in addiction to table?
> >
> > For the redshift syntax, I'm not sure the purpose of defining the
> function
> > name with the model. Is it to define the function input/output schema? We
> > have the schema in our create model syntax and the `ML_PREDICT` can
> handle
> > it by getting model definition. I think our syntax is more concise to
> have
> > a generic prediction function. I also did some research and it's the
> syntax
> > used by Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
> > `predict` [3].
> >
> > [1]:
> >
> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> > [2]:
> >
> >
> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> > [3]:
> >
> >
> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
> >
> > Thanks,
> > Hao
> >
> > On Wed, Mar 13, 2024 at 8:57 PM Jark Wu  wrote:
> >
> > > Hi Mingge, Hao,
> > >
> > > Thanks for your replies.
> > >
> > > > PTF is actually the ideal approach for model functions, and we do
> have
> > > the plans to use PTF for
> > > all model functions (including prediction, evaluation etc..) once the
> PTF
> > > is supported in FlinkSQL
> > > confluent extension.
> > >
> > > It sounds that PTF is the ideal way and table function is a temporary
> > > solution which will be dropped in the future.
> > > I'm not sure whether we can implement it using PTF in Flink SQL. But we
> > > have implemented window
> > > functions using PTF[1]. And introduced a new window function (called
> > > CUMULATE[2]) in Flink SQL based
> > > on this. I think it might work to use PTF and implement model function
> > > syntax like this:
> > >
> > > SELECT * FROM TABLE(ML_PREDICT(
> > >   TABLE my_table,
> > >   my_model,

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-14 Thread Jark Wu
Hi Hao,

> Can you send me some pointers
where the function gets the table information?

Here is the code of cumulate window type checking [1].

> Also is it possible to support  in
window functions in addiction to table?

Yes. It is not allowed in TVF.

Thanks for the syntax links of other systems. The reason I prefer the
Redshift way is
that it avoids introducing Model as a relation or datatype (referenced as a
parameter in TVF).
Model is not a relation because it cannot be queried directly (e.g., SELECT *
FROM model).
I'm also confused about making Model a datatype, because I don't know
what class the
model parameter of the eval method of TableFunction/ScalarFunction should
be. By defining
the function with the model, users can directly invoke the function without
reference to the model name.

Best,
Jark

[1]:
https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53

On Fri, 15 Mar 2024 at 02:48, Hao Li  wrote:

> Hi Jark,
>
> Thanks for the pointers. It's very helpful.
>
> 1. Looks like `tumble`, `hopping` are keywords in calcite parser. And the
> syntax `cumulate(Table my_table, ...)` needs to get table information from
> catalog somewhere for type validation etc. Can you send me some pointers
> where the function gets the table information?
> 2. The ideal syntax for model function I think would be `ML_PREDICT(MODEL
> , {table  | (query_stmt) })`. I think with special
> handling of the `ML_PREDICT` function in parser/planner, maybe we can do
> this like window functions. But to support `MODEL` keyword, we need calcite
> parser change I guess. Also is it possible to support  in
> window functions in addiction to table?
>
> For the redshift syntax, I'm not sure the purpose of defining the function
> name with the model. Is it to define the function input/output schema? We
> have the schema in our create model syntax and the `ML_PREDICT` can handle
> it by getting model definition. I think our syntax is more concise to have
> a generic prediction function. I also did some research and it's the syntax
> used by Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
> `predict` [3].
>
> [1]:
> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> [2]:
>
> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> [3]:
>
> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
>
> Thanks,
> Hao
>
> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu  wrote:
>
> > Hi Mingge, Hao,
> >
> > Thanks for your replies.
> >
> > > PTF is actually the ideal approach for model functions, and we do have
> > the plans to use PTF for
> > all model functions (including prediction, evaluation etc..) once the PTF
> > is supported in FlinkSQL
> > confluent extension.
> >
> > It sounds that PTF is the ideal way and table function is a temporary
> > solution which will be dropped in the future.
> > I'm not sure whether we can implement it using PTF in Flink SQL. But we
> > have implemented window
> > functions using PTF[1]. And introduced a new window function (called
> > CUMULATE[2]) in Flink SQL based
> > on this. I think it might work to use PTF and implement model function
> > syntax like this:
> >
> > SELECT * FROM TABLE(ML_PREDICT(
> >   TABLE my_table,
> >   my_model,
> >   col1,
> >   col2
> > ));
> >
> > Besides, did you consider following the way of AWS Redshift which defines
> > model function with the model itself together?
> > IIUC, a model is a black-box which defines input parameters and output
> > parameters which can be modeled into functions.
> >
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> > [2]:
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> > [3]:
> >
> >
> https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
> >
> >
> >
> >
> > On Wed, 13 Mar 2024 at 15:00, Hao Li  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for your questions. These are good questions!
> > >
> > > 1. The polymorphism table function I was referring to takes a table as
> > > input and outputs a table. So the syntax would be like
> > > ```
> > > SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> > > ```
> > > As far as I know, this is not supported yet on Flink. So before it's
> > > supported, one option for the predict function is using table function
> > > which can output multiple columns
> > > ```
> > > SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', co

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-14 Thread Hao Li
Hi Jark,

Thanks for the pointers. It's very helpful.

1. Looks like `tumble`, `hopping` are keywords in calcite parser. And the
syntax `cumulate(Table my_table, ...)` needs to get table information from
catalog somewhere for type validation etc. Can you send me some pointers
where the function gets the table information?
2. The ideal syntax for the model function I think would be `ML_PREDICT(MODEL
<model_name>, {TABLE <table_name> | (query_stmt)})`. I think with special
handling of the `ML_PREDICT` function in the parser/planner, maybe we can do
this like window functions. But to support the `MODEL` keyword, we need a Calcite
parser change I guess. Also, is it possible to support `(query_stmt)` in
window functions in addition to a table?

For the redshift syntax, I'm not sure the purpose of defining the function
name with the model. Is it to define the function input/output schema? We
have the schema in our create model syntax and the `ML_PREDICT` can handle
it by getting model definition. I think our syntax is more concise to have
a generic prediction function. I also did some research and it's the syntax
used by Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
`predict` [3].

[1]:
https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
[2]:
https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
[3]:
https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current

Thanks,
Hao

On Wed, Mar 13, 2024 at 8:57 PM Jark Wu  wrote:

> Hi Mingge, Hao,
>
> Thanks for your replies.
>
> > PTF is actually the ideal approach for model functions, and we do have
> the plans to use PTF for
> all model functions (including prediction, evaluation etc..) once the PTF
> is supported in FlinkSQL
> confluent extension.
>
> It sounds that PTF is the ideal way and table function is a temporary
> solution which will be dropped in the future.
> I'm not sure whether we can implement it using PTF in Flink SQL. But we
> have implemented window
> functions using PTF[1]. And introduced a new window function (called
> CUMULATE[2]) in Flink SQL based
> on this. I think it might work to use PTF and implement model function
> syntax like this:
>
> SELECT * FROM TABLE(ML_PREDICT(
>   TABLE my_table,
>   my_model,
>   col1,
>   col2
> ));
>
> Besides, did you consider following the way of AWS Redshift which defines
> model function with the model itself together?
> IIUC, a model is a black-box which defines input parameters and output
> parameters which can be modeled into functions.
>
>
> Best,
> Jark
>
> [1]:
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> [2]:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> [3]:
>
> https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
>
>
>
>
> On Wed, 13 Mar 2024 at 15:00, Hao Li  wrote:
>
> > Hi Jark,
> >
> > Thanks for your questions. These are good questions!
> >
> > 1. The polymorphism table function I was referring to takes a table as
> > input and outputs a table. So the syntax would be like
> > ```
> > SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> > ```
> > As far as I know, this is not supported yet on Flink. So before it's
> > supported, one option for the predict function is using table function
> > which can output multiple columns
> > ```
> > SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', col1, col2))
> > ```
> >
> > 2. Good question. Type inference is hard for the `ML_PREDICT` function
> > because it takes a model name string as input. I can think of three ways
> of
> > doing type inference for it.
> >1). Treat `ML_PREDICT` function as something special and during sql
> > parsing or planning time, if it's encountered, we need to look up the
> model
> > from the first argument which is a model name from catalog. Then we can
> > infer the input/output for the function.
> >2). We can define a `model` keyword and use that in the predict
> function
> > to indicate the argument refers to a model. So it's like
> `ML_PREDICT(model
> > 'my_model', col1, col2))`
> >3). We can create a special type of table function maybe called
> > `ModelFunction` which can resolve the model type inference by special
> > handling it during parsing or planning time.
> > 1) is hacky, 2) isn't supported in Flink for function, 3) might be a
> > good option.
> >
> > 3. I sketched the `ML_PREDICT` function for inference. But there are
> > limitations of the function mentioned in 1 and 2. So maybe we don't need
> to
> > introduce them as built-in functions until polymorphism table function
> and
> > we can properly deal with type inference.
> > After that, defini

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-13 Thread Jark Wu
Hi Mingge, Hao,

Thanks for your replies.

> PTF is actually the ideal approach for model functions, and we do have
the plans to use PTF for
all model functions (including prediction, evaluation etc..) once the PTF
is supported in FlinkSQL
confluent extension.

It sounds like PTF is the ideal way and the table function is a temporary
solution which will be dropped in the future.
I'm not sure whether we can implement it using PTF in Flink SQL. But we
have implemented window
functions using PTF[1]. And introduced a new window function (called
CUMULATE[2]) in Flink SQL based
on this. I think it might work to use PTF and implement model function
syntax like this:

SELECT * FROM TABLE(ML_PREDICT(
  TABLE my_table,
  my_model,
  col1,
  col2
));

Besides, did you consider following the way of AWS Redshift, which defines
the model function together with the model itself?
IIUC, a model is a black-box which defines input parameters and output
parameters which can be modeled into functions.


Best,
Jark

[1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
[2]:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
[3]:
https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model




On Wed, 13 Mar 2024 at 15:00, Hao Li  wrote:

> Hi Jark,
>
> Thanks for your questions. These are good questions!
>
> 1. The polymorphism table function I was referring to takes a table as
> input and outputs a table. So the syntax would be like
> ```
> SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> ```
> As far as I know, this is not supported yet on Flink. So before it's
> supported, one option for the predict function is using table function
> which can output multiple columns
> ```
> SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', col1, col2))
> ```
>
> 2. Good question. Type inference is hard for the `ML_PREDICT` function
> because it takes a model name string as input. I can think of three ways of
> doing type inference for it.
>1). Treat `ML_PREDICT` function as something special and during sql
> parsing or planning time, if it's encountered, we need to look up the model
> from the first argument which is a model name from catalog. Then we can
> infer the input/output for the function.
>2). We can define a `model` keyword and use that in the predict function
> to indicate the argument refers to a model. So it's like `ML_PREDICT(model
> 'my_model', col1, col2))`
>3). We can create a special type of table function maybe called
> `ModelFunction` which can resolve the model type inference by special
> handling it during parsing or planning time.
> 1) is hacky, 2) isn't supported in Flink for function, 3) might be a
> good option.
>
> 3. I sketched the `ML_PREDICT` function for inference. But there are
> limitations of the function mentioned in 1 and 2. So maybe we don't need to
> introduce them as built-in functions until polymorphism table function and
> we can properly deal with type inference.
> After that, defining a user-defined model function should also be
> straightforward.
>
> 4. For model types, do you mean 'remote', 'import', 'native' models or
> other things?
>
> 5. We could support popular providers such as 'azureml', 'vertexai',
> 'googleai' as long as we support the `ML_PREDICT` function. Users should be
> able to implement 3rd-party providers if they can implement a function
> handling the input/output for the provider.
>
> I think for the model functions, there are still dependencies or hacks we
> need to sort out as a built-in function. Maybe we can separate that as a
> follow up if we want to have it built-in and focus on the model syntax for
> this FLIP?
>
> Thanks,
> Hao
>
> On Tue, Mar 12, 2024 at 10:33 PM Jark Wu  wrote:
>
> > Hi Minge, Chris, Hao,
> >
> > Thanks for proposing this interesting idea. I think this is a nice step
> > towards
> > the AI world for Apache Flink. I don't know much about AI/ML, so I may
> have
> > some stupid questions.
> >
> > 1. Could you tell us more about why polymorphic table functions (PTF)
> > don't work, and do we have a plan to use PTFs as model functions?
> >
> > 2. What kind of object does the model map to in SQL? A relation or a data
> > type?
> > It looks like a data type because we use it as a parameter of the table
> > function.
> > If it is a data type, how does it cooperate with type inference[1]?
> >
> > 3. What built-in model functions will we support? How to define a
> > user-defined model function?
> >
> > 4. What built-in model types will we support? How to define a
> user-defined
> > model type?
> >
> > 5. Regarding the remote model, what providers will we support? Can users
> > implement
> > 3rd-party providers other than OpenAI?
> >
> > Best,
> > Jark
> >
> > [1]:
> > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-13 Thread Hao Li
Hi Jark,

Thanks for your questions. These are good questions!

1. The polymorphic table function I was referring to takes a table as
input and outputs a table. So the syntax would be like
```
SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
```
As far as I know, this is not supported yet in Flink. So before it's
supported, one option for the predict function is using a table function,
which can output multiple columns:
```
SELECT * FROM my_table, LATERAL TABLE(ML_PREDICT('model', col1, col2))
```

2. Good question. Type inference is hard for the `ML_PREDICT` function
because it takes a model name string as input. I can think of three ways of
doing type inference for it:
   1). Treat the `ML_PREDICT` function as something special: during SQL
parsing or planning time, when it's encountered, we look up the model named
by the first argument in the catalog. Then we can infer the input/output
types for the function.
   2). Define a `model` keyword and use it in the predict function to
indicate that the argument refers to a model, e.g. `ML_PREDICT(model
'my_model', col1, col2)`.
   3). Create a special type of table function, maybe called
`ModelFunction`, which resolves the model type inference by handling it
specially during parsing or planning time.
1) is hacky, 2) isn't supported in Flink for functions, and 3) might be a
good option (see the sketch below for how each call shape would look).
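
To make the three options concrete, here is a rough sketch of how the call
site could look for each of them (the MODEL keyword and the LATERAL TABLE
form are illustrative, not settled syntax):

```
-- Option 1 (hypothetical): plain string argument; the planner special-cases
-- ML_PREDICT and looks 'my_model' up in the catalog to infer input/output types
SELECT * FROM my_table, LATERAL TABLE(ML_PREDICT('my_model', col1, col2));

-- Option 2 (hypothetical): a MODEL keyword marks the argument as a model reference
SELECT * FROM my_table, LATERAL TABLE(ML_PREDICT(MODEL 'my_model', col1, col2));

-- Option 3 (hypothetical): same call shape as option 1, but ML_PREDICT is a
-- dedicated ModelFunction kind, so the planner resolves the model through the
-- function type instead of special-casing the function name
SELECT * FROM my_table, LATERAL TABLE(ML_PREDICT('my_model', col1, col2));
```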

3. I sketched the `ML_PREDICT` function for inference. But there are
limitations of the function, as mentioned in 1 and 2. So maybe we shouldn't
introduce them as built-in functions until polymorphic table functions are
supported and we can properly deal with type inference.
After that, defining a user-defined model function should also be
straightforward.

4. For model types, do you mean 'remote', 'import', 'native' models or
other things?

5. We could support popular providers such as 'azureml', 'vertexai', and
'googleai' as long as we support the `ML_PREDICT` function. Users should be
able to add 3rd-party providers by implementing a function that handles the
input/output for that provider.
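
As a rough illustration of how a remote model with a provider could be
defined and then used (the clause names and option keys below are
placeholders, not the exact FLIP-437 spelling):

```
-- Hypothetical remote model pointing at a third-party provider endpoint
CREATE MODEL `my_remote_model`
INPUT (f1 INT, f2 STRING)
OUTPUT (label STRING)
WITH (
  'provider' = 'azureml',
  'endpoint' = 'https://example.inference.ml.azure.com/score',
  'api-key' = '{{secret}}'
);

-- ML_PREDICT would then route the request to that provider
SELECT * FROM my_table, LATERAL TABLE(ML_PREDICT('my_remote_model', f1, f2));
```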

I think for the model functions there are still dependencies or hacks we
need to sort out before they can be built-in. Maybe we can separate that
into a follow-up if we want to have them built-in, and focus on the model
syntax in this FLIP?

Thanks,
Hao

On Tue, Mar 12, 2024 at 10:33 PM Jark Wu  wrote:

> Hi Minge, Chris, Hao,
>
> Thanks for proposing this interesting idea. I think this is a nice step
> towards
> the AI world for Apache Flink. I don't know much about AI/ML, so I may have
> some stupid questions.
>
> 1. Could you tell us more about why polymorphic table functions (PTF) don't
> work, and do we have a plan to use PTFs as model functions?
>
> 2. What kind of object does the model map to in SQL? A relation or a data
> type?
> It looks like a data type because we use it as a parameter of the table
> function.
> If it is a data type, how does it cooperate with type inference[1]?
>
> 3. What built-in model functions will we support? How to define a
> user-defined model function?
>
> 4. What built-in model types will we support? How to define a user-defined
> model type?
>
> 5. Regarding the remote model, what providers will we support? Can users
> implement
> 3rd-party providers other than OpenAI?
>
> Best,
> Jark
>
> [1]:
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference
>
>
>
>
> On Wed, 13 Mar 2024 at 05:55, Hao Li  wrote:
>
> > Hi, Dev
> >
> >
> > Mingge, Chris and I would like to start a discussion about FLIP-437:
> > Support ML Models in Flink SQL.
> >
> > This FLIP proposes supporting machine learning models in Flink SQL
> > syntax so that users can CRUD models with Flink SQL and use them on Flink
> > to run predictions on Flink data. The FLIP also proposes new model
> > entities and changes to the catalog interface to support model CRUD
> > operations in the catalog.
> >
> > For more details, see FLIP-437 [1]. Looking forward to your feedback.
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
> >
> > Thanks,
> > Minge, Chris & Hao
> >
>


Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-12 Thread Jark Wu
Hi Minge, Chris, Hao,

Thanks for proposing this interesting idea. I think this is a nice step
towards
the AI world for Apache Flink. I don't know much about AI/ML, so I may have
some stupid questions.

1. Could you tell us more about why polymorphic table functions (PTF) don't
work, and do we have a plan to use PTFs as model functions?

2. What kind of object does the model map to in SQL? A relation or a data
type?
It looks like a data type because we use it as a parameter of the table
function.
If it is a data type, how does it cooperate with type inference[1]?

3. What built-in model functions will we support? How to define a
user-defined model function?

4. What built-in model types will we support? How to define a user-defined
model type?

5. Regarding the remote model, what providers will we support? Can users
implement
3rd-party providers other than OpenAI?

Best,
Jark

[1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference




On Wed, 13 Mar 2024 at 05:55, Hao Li  wrote:

> Hi, Dev
>
>
> Mingge, Chris and I would like to start a discussion about FLIP-437:
> Support ML Models in Flink SQL.
>
> This FLIP proposes supporting machine learning models in Flink SQL
> syntax so that users can CRUD models with Flink SQL and use them on Flink
> to run predictions on Flink data. The FLIP also proposes new model entities
> and changes to the catalog interface to support model CRUD operations in
> the catalog.
>
> For more details, see FLIP-437 [1]. Looking forward to your feedback.
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
>
> Thanks,
> Minge, Chris & Hao
>