Hi all,

thanks for the healthy discussion. It is already a very long discussion with a lot of text, so I will just give my opinion on a couple of statements:

> Hive built-in functions are not part of Flink built-in functions, they are catalog functions

That is not entirely true. Correct me if I'm wrong, but I think Hive built-in functions are also not catalog functions. They are not stored in every freshly created Hive metastore catalog; they are a set of functions that is listed somewhere in Hive and made available.

> ambiguous functions reference just shouldn't be resolved to a different catalog

I agree. They should not be resolved to a different catalog. That's why I am suggesting to split the concept of built-in functions from catalog lookup semantics.

> I don't know if any other databases handle built-in functions like that

What I called "module" is:
- Extension in Postgres [1]
- Plugin in Presto [2]

Btw, Presto even mentions example modules that are similar to the ones we will introduce in the near future, both for ML and for System XYZ compatibility: "See either the presto-ml module for machine learning functions or the presto-teradata-functions module for Teradata-compatible functions, both in the root of the Presto source."
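To make this concrete, here is a minimal sketch of what such a pluggable function module could look like on our side. All names (FunctionModule, HiveFunctionsModule, ...) are hypothetical and only illustrate the shape of the SPI, not an existing API:

import java.util.Map;
import java.util.Optional;
import java.util.Set;

import org.apache.flink.table.functions.FunctionDefinition;

/** Hypothetical SPI: a self-contained set of built-in functions (core, Geo, ML, Hive, ...). */
public interface FunctionModule {

    /** Names of all functions provided by this module. */
    Set<String> listFunctions();

    /** Returns the definition for the given name, if this module provides it. */
    Optional<FunctionDefinition> getFunctionDefinition(String name);
}

/** Example: a module exposing Hive-compatible built-in functions. */
class HiveFunctionsModule implements FunctionModule {

    private final Map<String, FunctionDefinition> functions;

    HiveFunctionsModule(Map<String, FunctionDefinition> functions) {
        this.functions = functions;
    }

    @Override
    public Set<String> listFunctions() {
        return functions.keySet();
    }

    @Override
    public Optional<FunctionDefinition> getFunctionDefinition(String name) {
        return Optional.ofNullable(functions.get(name));
    }
}

A data platform team would then assemble exactly the set of modules it wants to expose (e.g. core plus Geo functions) without touching the catalog API.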

> functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use

Regarding "built-in already", of course we can add a lot of functions as built-ins but we will end-up in a dependency hell in the near future if we don't introduce a pluggable approach. Library functions is what you also suggest but storing them in a catalog means to always fully qualify them or modifying the existing catalog design that was inspired by the standard.

I don't think "it brings in even more complicated scenarios to the design", it just does clear separation of concerns. Integrating the functionality into the current design makes the catalog API more complicated.

> why would users name a temporary function the same as a built-in function then?

Because you never know what users do. If they don't, my suggested resolution order should not be a problem, right?

> I don't think hive functions deserves be a function module

Our goal is not to create a Hive clone. We need to think forward, and Hive is just one of many systems that we can support. Not every built-in function behaves, or will behave, exactly like Hive's.

> regarding temporary functions, there are few systems that support it

IMHO Spark and Hive are not always the best examples of consistent design. Systems like Postgres, Presto, or SQL Server should be used as a reference. I don't think a user can overwrite a built-in function there.

Regards,
Timo

[1] https://www.postgresql.org/docs/10/extend-extensions.html
[2] https://prestodb.github.io/docs/current/develop/functions.html


On 04.09.19 13:44, Jark Wu wrote:
Hi all,

Regarding #1 temp function <> built-in function and naming.
I'm fine with temp functions preceding built-in functions and being able to override built-in functions (we already support overriding built-in functions in 1.9).
If we don't allow the same name as a built-in function, I'm afraid we will have compatibility issues in the future.
Say users register a user-defined function named "explode" in 1.9, and we add a built-in "explode" function in 1.10. Then the users' jobs that call the registered "explode" function in 1.9 will all fail in 1.10 because of the naming conflict.
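A rough illustration of that scenario; the registration call is the 1.9-style API as I remember it, and the built-in "explode" in 1.10 is purely hypothetical:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class ExplodeCompatExample {

    /** User-defined stand-in for a function Flink does not ship (yet). */
    public static class MyExplode extends ScalarFunction {
        public String eval(String input) {
            return input; // toy behavior; the body is irrelevant to the naming point
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().build());

        // Flink 1.9: the user registers a function named "explode".
        tEnv.registerFunction("explode", new MyExplode());

        // Today, queries calling explode(...) resolve to the user's function.
        // If a later release ships a built-in "explode" AND built-ins precede
        // user-registered functions, the same queries would either fail on the
        // name clash or silently switch to the new built-in's behavior.
    }
}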

Regarding #2 "External" built-in functions.
I think if we store external built-in functions in catalog, then
"hive1::sqrt" is a good way to go.
However, I would prefer to support a discovery mechanism (e.g. SPI) for
built-in functions as Timo suggested above.
This gives us the flexibility to add Hive or MySQL or Geo or whatever
function set as built-in functions in an easy way.
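A minimal sketch of what such SPI-based discovery could look like, reusing the hypothetical FunctionModule interface from the module idea above; none of this is an existing API:

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

import org.apache.flink.table.functions.FunctionDefinition;

/** Hypothetical discovery of function modules via Java's ServiceLoader. */
public final class FunctionModules {

    private final List<FunctionModule> modules = new ArrayList<>();

    /** Loads every module found on the classpath (via META-INF/services entries). */
    public static FunctionModules discover() {
        FunctionModules result = new FunctionModules();
        for (FunctionModule module : ServiceLoader.load(FunctionModule.class)) {
            result.modules.add(module);
        }
        return result;
    }

    /** First module that provides the function wins; the order could be made configurable. */
    public Optional<FunctionDefinition> lookup(String name) {
        for (FunctionModule module : modules) {
            Optional<FunctionDefinition> definition = module.getFunctionDefinition(name);
            if (definition.isPresent()) {
                return definition;
            }
        }
        return Optional.empty();
    }
}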

Best,
Jark

On Wed, 4 Sep 2019 at 17:47, Xuefu Z <usxu...@gmail.com> wrote:

Hi Dawid,

Thank you for sharing your findings. It seems to me that there is no SQL standard regarding temporary functions. There are a few systems that support it. Here is what I have found:

1. Hive: no DB qualifier allowed. Can overwrite built-in.
2. Spark: basically follows Hive (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html)

Because of the lack of a standard, it's perfectly fine for Flink to define whatever it sees appropriate. Thus, your proposal (no overwriting and must have a DB as holder) is one option. The advantage is simplicity. The downside is the deviation from Hive, which is popular and the de facto standard in the big data world.

However, I don't think we have to follow Hive. More importantly, we need a
consensus. I have no objection if your proposal is generally agreed upon.

Thanks,
Xuefu

On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

Hi all,

Just an opinion on the built-in <> temporary functions resolution and NAMING issue. I think we should not allow overriding the built-in functions, as this may pose serious issues, and to be honest it is rather not feasible and would require major rework. What happens if a user wants to override CAST? Calls to that function are generated at different layers of the stack that unfortunately do not always go through the Catalog API (at least yet). Moreover, from what I've checked, no other systems allow overriding the built-in functions. All the systems I've checked so far register temporary functions in a database/schema (either a special database for temporary functions, or just the current database). What I would suggest is to always register temporary functions with a 3-part identifier, the same way as tables, views, etc. This effectively means you cannot override built-in functions. With such an approach it is natural that the temporary functions end up a step lower in the resolution order (a rough sketch follows the list below):

1. built-in functions (1 part, maybe 2? - this is still under discussion)

2. temporary functions (always 3 part path)

3. catalog functions (always 3 part path)
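A rough, purely illustrative sketch of that lookup order; all helper names are hypothetical:

import java.util.Optional;

import org.apache.flink.table.functions.FunctionDefinition;

/** Illustrative only: lookup where temporary functions always use a 3-part path. */
class FunctionResolutionSketch {

    Optional<FunctionDefinition> resolve(String[] nameParts) {
        if (nameParts.length == 1) {
            // 1. A simple name can only refer to a built-in function.
            return lookupBuiltIn(nameParts[0]);
        }
        // Fill missing parts from the current catalog/database.
        String[] full = qualify(nameParts);
        // 2. Temporary functions are keyed by the full catalog.db.name path ...
        Optional<FunctionDefinition> temp = lookupTemporary(full[0], full[1], full[2]);
        if (temp.isPresent()) {
            return temp;
        }
        // 3. ... and only then do we consult the catalog's persistent functions.
        return lookupCatalog(full[0], full[1], full[2]);
    }

    String[] qualify(String[] parts) {
        // 2-part names get the current catalog; 3-part names pass through.
        return parts.length == 3
                ? parts
                : new String[] {"current_catalog", parts[0], parts[1]};
    }

    // Hypothetical helpers with placeholder bodies.
    Optional<FunctionDefinition> lookupBuiltIn(String name) { return Optional.empty(); }
    Optional<FunctionDefinition> lookupTemporary(String cat, String db, String name) { return Optional.empty(); }
    Optional<FunctionDefinition> lookupCatalog(String cat, String db, String name) { return Optional.empty(); }
}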

Let me know what you think.

Best,

Dawid

On 04/09/2019 06:13, Bowen Li wrote:
Hi,

I agree with Xuefu that the controversial points are mainly the two places. My thoughts on them:

1) Determinism of referencing Hive built-in functions. We can either remove Hive built-in functions from ambiguous function resolution and require users to use special syntax for their qualified names, or add a config flag to the catalog constructor/yaml for turning Hive built-in functions on and off, with the flag set to 'false' by default and proper docs added to help users make their decisions.

2) Flink temp functions vs. Flink built-in functions in ambiguous function resolution order. We believe Flink temp functions should precede Flink built-in functions, and I have presented my reasons. In case we cannot reach an agreement, I propose forbidding users from registering temp functions with the same name as a built-in function, like MySQL's approach, for the moment. It won't have any performance concern, since built-in functions are all in memory and thus the cost of a name check will be really trivial.
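A minimal sketch of such a name check, assuming the built-in names can be taken from the existing BuiltInFunctionDefinitions list; the guard class itself is hypothetical:

import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.flink.table.functions.BuiltInFunctionDefinitions;

/** Hypothetical guard: reject temporary functions that shadow a built-in name. */
class TemporaryFunctionGuard {

    // Built-in names are a small in-memory set, so the check is O(1) per registration.
    private static final Set<String> BUILT_IN_NAMES =
            BuiltInFunctionDefinitions.getDefinitions().stream()
                    .map(d -> d.getName().toLowerCase(Locale.ROOT))
                    .collect(Collectors.toSet());

    void checkName(String name) {
        if (BUILT_IN_NAMES.contains(name.toLowerCase(Locale.ROOT))) {
            throw new IllegalArgumentException(
                    "Temporary function '" + name + "' clashes with a built-in function.");
        }
    }
}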


On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <usxu...@gmail.com> wrote:

From what I have seen, there are a couple of focal disagreements:

1. Resolution order: temp function --> flink built-in function --> catalog function vs. flink built-in function --> temp function --> catalog function.
2. "External" built-in functions: how to treat built-in functions in external systems and how users reference them

For #1, I agree with Bowen that temp functions need to be at the highest priority, because that's how a user might overwrite a built-in function without referencing a persistent, overwriting catalog function with a fully qualified name. Putting built-in functions at the highest priority eliminates that usage.

For #2, I saw a general agreement that referencing "external" built-in functions such as those in Hive needs to be explicit and deterministic, even though different approaches are proposed. To limit the scope and simplify the usage, it seems to make sense to me to introduce special syntax for users to explicitly reference an external built-in function, such as hive1::sqrt or hive1._built_in.sqrt. This is a DML syntax that matches nicely with the Catalog API call hive1.getFunction(ObjectPath functionName), where the database name is absent for built-in functions available in that catalog hive1. I understand that Bowen's original proposal was trying to avoid this, but this could turn out to be a clean and simple solution.
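Purely to illustrate that mapping: the "::" syntax and the reserved pseudo-database are hypothetical here, and only Catalog#getFunction(ObjectPath) is the existing API:

import org.apache.flink.table.catalog.Catalog;
import org.apache.flink.table.catalog.CatalogFunction;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.exceptions.FunctionNotExistException;

/** Hypothetical resolution of a "catalog::function" reference such as hive1::sqrt. */
class ExternalBuiltInResolver {

    // Built-in functions have no database, so a reserved marker would stand in for it.
    private static final String BUILT_IN_DB = "_built_in"; // hypothetical

    CatalogFunction resolve(Catalog catalog, String functionName)
            throws FunctionNotExistException {
        // "hive1::sqrt" would be parsed into (catalog = hive1, name = sqrt) and then
        // looked up without a user database in between.
        return catalog.getFunction(new ObjectPath(BUILT_IN_DB, functionName));
    }
}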

(Timo's modular approach is a great way to "expand" Flink's built-in function set, which seems orthogonal and complementary to this and could be tackled in future work.)

I'd be happy to hear further thoughts on the two points.

Thanks,
Xuefu

On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <ykt...@gmail.com> wrote:

Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the same as Bowen's. But after thinking about it, I currently lean toward Timo's suggestion.

The reason is backward compatibility. If we follow Bowen's approach, let's say we first look a function up in Flink's built-in functions, and then in Hive's built-ins. For example, `foo` is not supported by Flink, but Hive has such a built-in function. So the user will get Hive's behavior for function `foo`. In the next release, Flink realizes this is a very popular function and adds it to Flink's built-in functions, but with different behavior than Hive's. So in the next release, the behavior changes.

With Timo's approach, IIUC the user has to tell the framework explicitly what kind of built-in functions he would like to use. He can just tell the framework to abandon Flink's built-in functions and use Hive's instead. The user can only choose between them, but not use them at the same time. I think this approach is more predictable.

Best,
Kurt


On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <bowenl...@gmail.com> wrote:

Hi all,

Thanks for the feedback. Just a kind reminder that the [Proposal] section in the google doc was updated; please take a look first and let me know if you have more questions.

On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <bowenl...@gmail.com>
wrote:
Hi Timo,

Re> 1) We should not have the restriction "hive built-in functions can only be used when current catalog is hive catalog". Switching a catalog should only have implications on the cat.db.object resolution but not on functions. It would be quite convenient for users to use Hive built-ins even if they use a Confluent schema registry or just the in-memory catalog.

There might be a misunderstanding here.

First of all, Hive built-in functions are not part of Flink built-in functions, they are catalog functions; thus if the current catalog is not a HiveCatalog but, say, a schema registry catalog, ambiguous function references just shouldn't be resolved to a different catalog.

Second, Hive built-in functions can potentially be referenced across catalogs, but they don't have a db namespace and we currently just don't have a SQL syntax for it. It can be enabled when such a SQL syntax is defined, e.g. "catalog::function", but it's out of scope of this FLIP.

2) I would propose to have separate concepts for catalog and built-in functions. In particular it would be nice to modularize built-in functions. Some built-in functions are very crucial (like AS, CAST, MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe we add more experimental functions in the future or functions for some special application area (Geo functions, ML functions). A data platform team might not want to make every built-in function available. Or a function module like ML functions is in a different Maven module.

I think this is orthogonal to this FLIP, especially since we don't have the "external built-in functions" anymore and currently the built-in function category remains untouched.

But just to share some thoughts on the proposal, I'm not sure about it:
- I don't know if any other databases handle built-in functions like that. Maybe you can give some examples? IMHO, built-in functions are system info and should be deterministic, not depending on loaded libraries. Geo functions should be either built-in already or just library functions, and library functions can be adapted to catalog APIs or to some other syntax to use
- I don't know if all use cases stand, and many can be achieved by other approaches too. E.g. experimental functions can be taken good care of by documentation, annotations, etc.
- the proposal basically introduces some concept like a pluggable built-in function catalog, despite the already existing catalog APIs
- it brings in even more complicated scenarios to the design. E.g. how do you handle built-in functions in different modules but with different names?
In short, I'm not sure if it really stands, and it looks like overkill to me. I'd rather not go down that route. Related discussion can be on its own thread.

3) Following the suggestion above, we can have a separate discovery mechanism for built-in functions. Instead of just going through a static list like in BuiltInFunctionDefinitions, a platform team should be able to select function modules like catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, HiveFunctions) or via service discovery.

Same as above. I'll leave it to its own thread.

re > 3) Dawid and I discussed the resolution order again. I agree with Kurt that we should unify built-in functions (external or internal) under a common layer. However, the resolution order should be:
   1. built-in functions
   2. temporary functions
   3. regular catalog resolution logic
Otherwise a temporary function could cause clashes with Flink's built-in functions. If you take a look at other vendors, like SQL Server, they also do not allow overwriting built-in functions.
"I agree with Kurt that we should unify built-in functions (external or internal) under a common layer." <- I don't think this is what Kurt means. Kurt and I are in favor of unifying built-in functions of external systems and catalog functions. Did you make a typo?

Besides, I'm not sure about the resolution order you proposed. Temporary functions have a lifespan of a session and are only visible to the session owner; they are unique to each user, and users create them on purpose to have the highest priority in order to overwrite system info (built-in functions in this case).

In your case, why would users name a temporary function the same as a built-in function then? Since using that name in an ambiguous function reference will always be resolved to built-in functions, creating a same-named temp function would be meaningless in the end.


On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <bowenl...@gmail.com>
wrote:
Hi Jingsong,

Re> 1. Hive built-in functions are an intermediate solution. So we should not introduce interfaces to influence the framework. To make Flink itself more powerful, we should implement the functions we need to add.
Yes, please see the doc.

Re> 2. With non-flink built-in functions, it is easy for user-facing behavior to change. If we support some flink built-in functions in the future but they act differently from the non-flink built-ins, this will lead to changes in user behavior.
There's no such concept as "external built-in functions" anymore. Built-in functions of external systems will be treated as special catalog functions.

Re> Another question is, does this fallback include all hive built-in functions? As far as I know, some hive functions have some hacky behavior. If possible, can we start with a whitelist? Once we implement some functions as flink built-ins, we can also update the whitelist.
Yes, that's something we thought of too. I don't think it's super critical to the scope of this FLIP, thus I'd like to leave it to future efforts as a nice-to-have feature.


On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <bowenl...@gmail.com>
wrote:
Hi Kurt,

Re: > What I want to propose is that we can merge #3 and #4, putting them both under the "catalog" concept, by extending catalog functions to have the ability to include built-in catalog functions. Some benefits I can see from this approach:
1. We don't have to introduce a new concept like external built-in functions. Actually I don't see a full story about how to treat built-in functions, and it seems a little bit disruptive to the catalog. As a result, you have to make some restrictions like "hive built-in functions can only be used when current catalog is hive catalog".

Yes, I've unified #3 and #4 but it seems I didn't update some parts of the doc. I've modified those sections, and they are up to date now.
In short, built-in functions of external systems are now defined as a special kind of catalog function in Flink, and handled by Flink as follows:
- An external built-in function must be associated with a catalog for the purpose of decoupling flink-table and external systems.
- It always resides in front of catalog functions in ambiguous function reference order, just like in its own external system
- It is a special catalog function that doesn't have a schema/database namespace
- It goes through the same instantiation logic as other user-defined catalog functions in the external system

Please take another look at the doc, and let me know if you have more questions.


On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <twal...@apache.org>
wrote:
Hi Kurt,

it should not affect the functions and operations we currently have in SQL. It just categorizes the available built-in functions. It is kind of an orthogonal concept to the catalog API, but built-in functions deserve this special kind of treatment. CatalogFunction still fits perfectly in there because the regular catalog object resolution logic is not affected. So tables and functions are resolved in the same way, but with built-in functions that have priority as in the original design.

Regards,
Timo


On 03.09.19 15:26, Kurt Young wrote:
This only affects the functions and operations we currently have in SQL and has no effect on tables, right? Looks like this is an orthogonal concept to Catalog?
If the answers are both yes, then won't the catalog function be a weird concept?

Best,
Kurt


On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <yuzhao....@gmail.com> wrote:
The way you proposed is basically the same as what Calcite does; I think we are on the same page.

Best,
Danny Chan
On Sep 3, 2019 at 7:57 PM +0800, Timo Walther <twal...@apache.org> wrote:
This sounds exactly like the module approach I mentioned, no?

Regards,
Timo

On 03.09.19 13:42, Danny Chan wrote:
Thanks Bowen for bringing up this topic, I think it's a useful refactoring to make our function usage more user friendly.
For the topic of how to organize the built-in operators and the operators of Hive, here is a solution from Apache Calcite: the Calcite way is to make every dialect's operators a "Library"; users can specify which libraries they want to use for a SQL query. The built-in operators always come first as first-class objects, and the others are used in the order they appear.
Maybe you can take it as a reference.
[1]
https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
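If I remember Calcite's mechanics correctly, the library selection is exposed through the "fun" connection property; a rough sketch, with the property values and the NVL example to be treated as assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CalciteLibrariesExample {
    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        // Selects which operator libraries are available, in order;
        // the standard library always comes first.
        info.setProperty("fun", "standard,oracle");

        try (Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
                Statement statement = connection.createStatement();
                // NVL comes from the Oracle library, not from standard SQL.
                ResultSet rs = statement.executeQuery(
                        "VALUES NVL(CAST(NULL AS VARCHAR(10)), 'x')")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}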
Best,
Danny Chan
On Aug 28, 2019 at 2:50 AM +0800, Bowen Li <bowenl...@gmail.com> wrote:
Hi folks,

I'd like to kick off a discussion on reworking Flink's FunctionCatalog. It's critically helpful to improve function usability in SQL.
https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
In short, it:
- adds support for precise function reference with fully/partially qualified names
- redefines function resolution order for ambiguous function references
- adds support for Hive's rich built-in functions (support for Hive user-defined functions was already added in 1.9.0)
- clarifies the concept of temporary functions

Would love to hear your thoughts.

Bowen
--
Xuefu Zhang

"In Honey We Trust!"


