Hi all,

thanks for the healthy discussion. It is already a very long discussion with a lot of text, so I will just give my opinion on a couple of statements:

> Hive built-in functions are not part of Flink built-in functions, they are catalog functions

That is not entirely true. Correct me if I'm wrong, but I think Hive built-in functions are also not catalog functions. They are not stored in every freshly created Hive metastore catalog; they are a set of functions that is listed somewhere in Hive and made available.

> ambiguous functions reference just shouldn't be resolved to a different catalog

I agree. They should not be resolved to a different catalog. That's why I am suggesting to split the concept of built-in functions from catalog lookup semantics.

> I don't know if any other databases handle built-in functions like that

What I called "module" is:
- Extension in Postgres [1]
- Plugin in Presto [2]

Btw, Presto even mentions example modules that are similar to the ones we will introduce in the near future, both for ML and for System XYZ compatibility: "See either the presto-ml module for machine learning functions or the presto-teradata-functions module for Teradata-compatible functions, both in the root of the Presto source."
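To make this concrete, here is a minimal sketch of what such a pluggable function module could look like on our side. All names (FunctionModule, HiveFunctionsModule, ...) are hypothetical and only illustrate the shape of the SPI, not an existing API:

import java.util.Map;
import java.util.Optional;
import java.util.Set;

import org.apache.flink.table.functions.FunctionDefinition;

/** Hypothetical SPI: a self-contained set of built-in functions (core, Geo, ML, Hive, ...). */
public interface FunctionModule {

    /** Names of all functions provided by this module. */
    Set<String> listFunctions();

    /** Returns the definition for the given name, if this module provides it. */
    Optional<FunctionDefinition> getFunctionDefinition(String name);
}

/** Example: a module exposing Hive-compatible built-in functions. */
class HiveFunctionsModule implements FunctionModule {

    private final Map<String, FunctionDefinition> functions;

    HiveFunctionsModule(Map<String, FunctionDefinition> functions) {
        this.functions = functions;
    }

    @Override
    public Set<String> listFunctions() {
        return functions.keySet();
    }

    @Override
    public Optional<FunctionDefinition> getFunctionDefinition(String name) {
        return Optional.ofNullable(functions.get(name));
    }
}

A data platform team would then assemble exactly the set of modules it wants to expose (e.g. core plus Geo functions) without touching the catalog API.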

> functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use

Regarding "built-in already", of course we can add a lot of functions as built-ins but we will end-up in a dependency hell in the near future if we don't introduce a pluggable approach. Library functions is what you also suggest but storing them in a catalog means to always fully qualify them or modifying the existing catalog design that was inspired by the standard.

I don't think "it brings in even more complicated scenarios to the design", it just does clear separation of concerns. Integrating the functionality into the current design makes the catalog API more complicated.

> why would users name a temporary function the same as a built-in function then?

Because you never know what users do. If they don't, my suggested resolution order should not be a problem, right?

> I don't think hive functions deserves be a function module

Our goal is not to create a Hive clone. We need to think forward, and Hive is just one of many systems that we can support. Not every built-in function behaves, or will behave, exactly like Hive's.

> regarding temporary functions, there are few systems that support it

IMHO Spark and Hive are not always the best examples of consistent design. Systems like Postgres, Presto, or SQL Server should be used as a reference. I don't think a user can overwrite a built-in function there.

Regards,
Timo

[1] https://www.postgresql.org/docs/10/extend-extensions.html
[2] https://prestodb.github.io/docs/current/develop/functions.html


On 04.09.19 13:44, Jark Wu wrote:
Hi all,

Regarding #1 temp function <> built-in function and naming.
I'm fine with temp functions preceding built-in functions and being able to override built-in functions (we already support overriding built-in functions in 1.9).
If we don't allow the same name as a built-in function, I'm afraid we will have compatibility issues in the future.
Say users register a user-defined function named "explode" in 1.9, and we add a built-in "explode" function in 1.10. Then the users' jobs that call the registered "explode" function in 1.9 will all fail in 1.10 because of the naming conflict.
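A rough illustration of that scenario; the registration call is the 1.9-style API as I remember it, and the built-in "explode" in 1.10 is purely hypothetical:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class ExplodeCompatExample {

    /** User-defined stand-in for a function Flink does not ship (yet). */
    public static class MyExplode extends ScalarFunction {
        public String eval(String input) {
            return input; // toy behavior; the body is irrelevant to the naming point
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().build());

        // Flink 1.9: the user registers a function named "explode".
        tEnv.registerFunction("explode", new MyExplode());

        // Today, queries calling explode(...) resolve to the user's function.
        // If a later release ships a built-in "explode" AND built-ins precede
        // user-registered functions, the same queries would either fail on the
        // name clash or silently switch to the new built-in's behavior.
    }
}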

Regarding #2 "External" built-in functions.
I think if we store external built-in functions in catalog, then
"hive1::sqrt" is a good way to go.
However, I would prefer to support a discovery mechanism (e.g. SPI) for
built-in functions as Timo suggested above.
This gives us the flexibility to add Hive or MySQL or Geo or whatever
function set as built-in functions in an easy way.
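A minimal sketch of what such SPI-based discovery could look like, reusing the hypothetical FunctionModule interface from the module idea above; none of this is an existing API:

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

import org.apache.flink.table.functions.FunctionDefinition;

/** Hypothetical discovery of function modules via Java's ServiceLoader. */
public final class FunctionModules {

    private final List<FunctionModule> modules = new ArrayList<>();

    /** Loads every module found on the classpath (via META-INF/services entries). */
    public static FunctionModules discover() {
        FunctionModules result = new FunctionModules();
        for (FunctionModule module : ServiceLoader.load(FunctionModule.class)) {
            result.modules.add(module);
        }
        return result;
    }

    /** First module that provides the function wins; the order could be made configurable. */
    public Optional<FunctionDefinition> lookup(String name) {
        for (FunctionModule module : modules) {
            Optional<FunctionDefinition> definition = module.getFunctionDefinition(name);
            if (definition.isPresent()) {
                return definition;
            }
        }
        return Optional.empty();
    }
}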

Best,
Jark

On Wed, 4 Sep 2019 at 17:47, Xuefu Z <usxu...@gmail.com> wrote:

Hi Dawid,

Thank you for sharing your findings. It seems to me that there is no SQL standard regarding temporary functions. There are a few systems that support it. Here is what I have found:

1. Hive: no DB qualifier allowed. Can overwrite built-in.
2. Spark: basically follows Hive (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html)

Because of the lack of a standard, it's perfectly fine for Flink to define whatever it sees appropriate. Thus, your proposal (no overwriting and must have a DB as holder) is one option. The advantage is simplicity. The downside is the deviation from Hive, which is popular and the de facto standard in the big data world.

However, I don't think we have to follow Hive. More importantly, we need a
consensus. I have no objection if your proposal is generally agreed upon.

Thanks,
Xuefu

On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

Hi all,

Just an opinion on the built-in <> temporary functions resolution and NAMING issue. I think we should not allow overriding the built-in functions, as this may pose serious issues, and to be honest it is rather not feasible and would require major rework. What happens if a user wants to override CAST? Calls to that function are generated at different layers of the stack that unfortunately do not always go through the Catalog API (at least yet). Moreover, from what I've checked, no other systems allow overriding the built-in functions. All the systems I've checked so far register temporary functions in a database/schema (either a special database for temporary functions, or just the current database). What I would suggest is to always register temporary functions with a 3-part identifier, the same way as tables, views, etc. This effectively means you cannot override built-in functions. With such an approach it is natural that the temporary functions end up a step lower in the resolution order (a rough sketch follows the list below):

1. built-in functions (1 part, maybe 2? - this is still under discussion)

2. temporary functions (always 3 part path)

3. catalog functions (always 3 part path)
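A rough, purely illustrative sketch of that lookup order; all helper names are hypothetical:

import java.util.Optional;

import org.apache.flink.table.functions.FunctionDefinition;

/** Illustrative only: lookup where temporary functions always use a 3-part path. */
class FunctionResolutionSketch {

    Optional<FunctionDefinition> resolve(String[] nameParts) {
        if (nameParts.length == 1) {
            // 1. A simple name can only refer to a built-in function.
            return lookupBuiltIn(nameParts[0]);
        }
        // Fill missing parts from the current catalog/database.
        String[] full = qualify(nameParts);
        // 2. Temporary functions are keyed by the full catalog.db.name path ...
        Optional<FunctionDefinition> temp = lookupTemporary(full[0], full[1], full[2]);
        if (temp.isPresent()) {
            return temp;
        }
        // 3. ... and only then do we consult the catalog's persistent functions.
        return lookupCatalog(full[0], full[1], full[2]);
    }

    String[] qualify(String[] parts) {
        // 2-part names get the current catalog; 3-part names pass through.
        return parts.length == 3
                ? parts
                : new String[] {"current_catalog", parts[0], parts[1]};
    }

    // Hypothetical helpers with placeholder bodies.
    Optional<FunctionDefinition> lookupBuiltIn(String name) { return Optional.empty(); }
    Optional<FunctionDefinition> lookupTemporary(String cat, String db, String name) { return Optional.empty(); }
    Optional<FunctionDefinition> lookupCatalog(String cat, String db, String name) { return Optional.empty(); }
}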

Let me know what you think.

Best,

Dawid

On 04/09/2019 06:13, Bowen Li wrote:
Hi,

I agree with Xuefu that the controversial points are mainly the two places. My thoughts on them:

1) Determinism of referencing Hive built-in functions. We can either remove Hive built-in functions from ambiguous function resolution and require users to use special syntax for their qualified names, or add a config flag to the catalog constructor/yaml for turning Hive built-in functions on and off, with the flag set to 'false' by default and proper docs added to help users make their decisions.

2) Flink temp functions vs. Flink built-in functions in ambiguous function resolution order. We believe Flink temp functions should precede Flink built-in functions, and I have presented my reasons. In case we cannot reach an agreement, I propose forbidding users from registering temp functions with the same name as a built-in function, like MySQL's approach, for the moment. It won't have any performance concern, since built-in functions are all in memory and thus the cost of a name check will be really trivial.
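A minimal sketch of such a name check, assuming the built-in names can be taken from the existing BuiltInFunctionDefinitions list; the guard class itself is hypothetical:

import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.flink.table.functions.BuiltInFunctionDefinitions;

/** Hypothetical guard: reject temporary functions that shadow a built-in name. */
class TemporaryFunctionGuard {

    // Built-in names are a small in-memory set, so the check is O(1) per registration.
    private static final Set<String> BUILT_IN_NAMES =
            BuiltInFunctionDefinitions.getDefinitions().stream()
                    .map(d -> d.getName().toLowerCase(Locale.ROOT))
                    .collect(Collectors.toSet());

    void checkName(String name) {
        if (BUILT_IN_NAMES.contains(name.toLowerCase(Locale.ROOT))) {
            throw new IllegalArgumentException(
                    "Temporary function '" + name + "' clashes with a built-in function.");
        }
    }
}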


On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <usxu...@gmail.com> wrote:

From what I have seen, there are a couple of focal disagreements:

1. Resolution order: temp function --> flink built-in function --> catalog function vs. flink built-in function --> temp function --> catalog function.
2. "External" built-in functions: how to treat built-in functions in external systems and how users reference them

For #1, I agree with Bowen that temp functions need to be at the highest priority, because that's how a user might overwrite a built-in function without referencing a persistent, overwriting catalog function with a fully qualified name. Putting built-in functions at the highest priority eliminates that usage.

For #2, I saw a general agreement that referencing "external" built-in functions such as those in Hive needs to be explicit and deterministic, even though different approaches are proposed. To limit the scope and simplify the usage, it seems to make sense to me to introduce special syntax for users to explicitly reference an external built-in function, such as hive1::sqrt or hive1._built_in.sqrt. This is a DML syntax that matches nicely with the Catalog API call hive1.getFunction(ObjectPath functionName), where the database name is absent for built-in functions available in that catalog hive1. I understand that Bowen's original proposal was trying to avoid this, but this could turn out to be a clean and simple solution.
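Purely to illustrate that mapping: the "::" syntax and the reserved pseudo-database are hypothetical here, and only Catalog#getFunction(ObjectPath) is the existing API:

import org.apache.flink.table.catalog.Catalog;
import org.apache.flink.table.catalog.CatalogFunction;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.exceptions.FunctionNotExistException;

/** Hypothetical resolution of a "catalog::function" reference such as hive1::sqrt. */
class ExternalBuiltInResolver {

    // Built-in functions have no database, so a reserved marker would stand in for it.
    private static final String BUILT_IN_DB = "_built_in"; // hypothetical

    CatalogFunction resolve(Catalog catalog, String functionName)
            throws FunctionNotExistException {
        // "hive1::sqrt" would be parsed into (catalog = hive1, name = sqrt) and then
        // looked up without a user database in between.
        return catalog.getFunction(new ObjectPath(BUILT_IN_DB, functionName));
    }
}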

(Timo's modular approach is a great way to "expand" Flink's built-in function set, which seems orthogonal and complementary to this and could be tackled in future work.)

I'd be happy to hear further thoughts on the two points.

Thanks,
Xuefu

On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <ykt...@gmail.com> wrote:

Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the same as Bowen's. But after thinking about it, I currently lean toward Timo's suggestion.

The reason is backward compatibility. If we follow Bowen's approach, let's say we first look a function up in Flink's built-in functions, and then in Hive's built-ins. For example, `foo` is not supported by Flink, but Hive has such a built-in function. So the user will get Hive's behavior for function `foo`. In the next release, Flink realizes this is a very popular function and adds it to Flink's built-in functions, but with different behavior than Hive's. So in the next release, the behavior changes.

With Timo's approach, IIUC the user has to tell the framework explicitly what kind of built-in functions he would like to use. He can just tell the framework to abandon Flink's built-in functions and use Hive's instead. The user can only choose between them, but not use them at the same time. I think this approach is more predictable.

Best,
Kurt


On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <bowenl...@gmail.com> wrote:

Hi all,

Thanks for the feedback. Just a kind reminder that the [Proposal] section in the google doc was updated; please take a look first and let me know if you have more questions.

On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <bowenl...@gmail.com>
wrote:
Hi Timo,

Re> 1) We should not have the restriction "hive built-in functions can only be used when current catalog is hive catalog". Switching a catalog should only have implications on the cat.db.object resolution but not on functions. It would be quite convenient for users to use Hive built-ins even if they use a Confluent schema registry or just the in-memory catalog.

There might be a misunderstanding here.

First of all, Hive built-in functions are not part of Flink built-in functions, they are catalog functions; thus if the current catalog is not a HiveCatalog but, say, a schema registry catalog, ambiguous function references just shouldn't be resolved to a different catalog.

Second, Hive built-in functions can potentially be referenced across catalogs, but they don't have a db namespace and we currently just don't have a SQL syntax for it. It can be enabled when such a SQL syntax is defined, e.g. "catalog::function", but it's out of scope of this FLIP.

2) I would propose to have separate concepts for catalog and built-in functions. In particular it would be nice to modularize built-in functions. Some built-in functions are very crucial (like AS, CAST, MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe we add more experimental functions in the future or functions for some special application area (Geo functions, ML functions). A data platform team might not want to make every built-in function available. Or a function module like ML functions is in a different Maven module.

I think this is orthogonal to this FLIP, especially since we don't have the "external built-in functions" anymore and currently the built-in function category remains untouched.

But just to share some thoughts on the proposal, I'm not sure about it:
- I don't know if any other databases handle built-in functions like that. Maybe you can give some examples? IMHO, built-in functions are system info and should be deterministic, not depending on loaded libraries. Geo functions should be either built-in already or just library functions, and library functions can be adapted to catalog APIs or to some other syntax to use
- I don't know if all use cases stand, and many can be achieved by other approaches too. E.g. experimental functions can be taken good care of by documentation, annotations, etc.
- the proposal basically introduces some concept like a pluggable built-in function catalog, despite the already existing catalog APIs
- it brings in even more complicated scenarios to the design. E.g. how do you handle built-in functions in different modules but with different names?
In short, I'm not sure if it really stands, and it looks like overkill to me. I'd rather not go down that route. Related discussion can be on its own thread.

3) Following the suggestion above, we can have a separate discovery mechanism for built-in functions. Instead of just going through a static list like in BuiltInFunctionDefinitions, a platform team should be able to select function modules like catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, HiveFunctions) or via service discovery.

Same as above. I'll leave it to its own thread.

re > 3) Dawid and I discussed the resolution order again. I agree with Kurt that we should unify built-in functions (external or internal) under a common layer. However, the resolution order should be:
   1. built-in functions
   2. temporary functions
   3. regular catalog resolution logic
Otherwise a temporary function could cause clashes with Flink's built-in functions. If you take a look at other vendors, like SQL Server, they also do not allow overwriting built-in functions.
"I agree with Kurt that we should unify built-in functions (external or internal) under a common layer." <- I don't think this is what Kurt means. Kurt and I are in favor of unifying built-in functions of external systems and catalog functions. Did you make a typo?

Besides, I'm not sure about the resolution order you proposed. Temporary functions have a lifespan of a session and are only visible to the session owner; they are unique to each user, and users create them on purpose to have the highest priority in order to overwrite system info (built-in functions in this case).

In your case, why would users name a temporary function the same as a built-in function then? Since using that name in an ambiguous function reference will always be resolved to built-in functions, creating a same-named temp function would be meaningless in the end.


On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <bowenl...@gmail.com>
wrote:
Hi Jingsong,

Re> 1. Hive built-in functions are an intermediate solution. So we should not introduce interfaces to influence the framework. To make Flink itself more powerful, we should implement the functions we need to add.
Yes, please see the doc.

Re> 2. With non-flink built-in functions, it is easy for user-facing behavior to change. If we support some flink built-in functions in the future but they act differently from the non-flink built-ins, this will lead to changes in user behavior.
There's no such concept as "external built-in functions" anymore. Built-in functions of external systems will be treated as special catalog functions.

Re> Another question is, does this fallback include all hive built-in functions? As far as I know, some hive functions have some hacky behavior. If possible, can we start with a whitelist? Once we implement some functions as flink built-ins, we can also update the whitelist.
Yes, that's something we thought of too. I don't think it's super critical to the scope of this FLIP, thus I'd like to leave it to future efforts as a nice-to-have feature.


On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <bowenl...@gmail.com>
wrote:
Hi Kurt,

Re: > What I want to propose is that we can merge #3 and #4, putting them both under the "catalog" concept, by extending catalog functions to have the ability to include built-in catalog functions. Some benefits I can see from this approach:
1. We don't have to introduce a new concept like external built-in functions. Actually I don't see a full story about how to treat built-in functions, and it seems a little bit disruptive to the catalog. As a result, you have to make some restrictions like "hive built-in functions can only be used when current catalog is hive catalog".

Yes, I've unified #3 and #4 but it seems I didn't update some parts of the doc. I've modified those sections, and they are up to date now.
In short, built-in functions of external systems are now defined as a special kind of catalog function in Flink, and handled by Flink as follows:
- An external built-in function must be associated with a catalog for the purpose of decoupling flink-table and external systems.
- It always resides in front of catalog functions in ambiguous function reference order, just like in its own external system
- It is a special catalog function that doesn't have a schema/database namespace
- It goes through the same instantiation logic as other user-defined catalog functions in the external system

Please take another look at the doc, and let me know if you have more questions.


On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <twal...@apache.org>
wrote:
Hi Kurt,

it should not affect the functions and operations we currently have in SQL. It just categorizes the available built-in functions. It is kind of an orthogonal concept to the catalog API, but built-in functions deserve this special kind of treatment. CatalogFunction still fits perfectly in there because the regular catalog object resolution logic is not affected. So tables and functions are resolved in the same way, but with built-in functions that have priority as in the original design.

Regards,
Timo


On 03.09.19 15:26, Kurt Young wrote:
This only affects the functions and operations we currently have in SQL and has no effect on tables, right? Looks like this is an orthogonal concept to Catalog?
If the answers are both yes, then won't the catalog function be a weird concept?

Best,
Kurt


On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <yuzhao....@gmail.com> wrote:
The way you proposed is basically the same as what Calcite does; I think we are on the same page.

Best,
Danny Chan
On Sep 3, 2019 at 7:57 PM +0800, Timo Walther <twal...@apache.org> wrote:
This sounds exactly like the module approach I mentioned, no?

Regards,
Timo

On 03.09.19 13:42, Danny Chan wrote:
Thanks Bowen for bringing up this topic, I think it's a useful refactoring to make our function usage more user friendly.
For the topic of how to organize the built-in operators and the operators of Hive, here is a solution from Apache Calcite: the Calcite way is to make every dialect's operators a "Library"; users can specify which libraries they want to use for a SQL query. The built-in operators always come first as first-class objects, and the others are used in the order they appear.
Maybe you can take it as a reference.
[1]
https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
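If I remember Calcite's mechanics correctly, the library selection is exposed through the "fun" connection property; a rough sketch, with the property values and the NVL example to be treated as assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CalciteLibrariesExample {
    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        // Selects which operator libraries are available, in order;
        // the standard library always comes first.
        info.setProperty("fun", "standard,oracle");

        try (Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
                Statement statement = connection.createStatement();
                // NVL comes from the Oracle library, not from standard SQL.
                ResultSet rs = statement.executeQuery(
                        "VALUES NVL(CAST(NULL AS VARCHAR(10)), 'x')")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}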
Best,
Danny Chan
On Aug 28, 2019 at 2:50 AM +0800, Bowen Li <bowenl...@gmail.com> wrote:
Hi folks,

I'd like to kick off a discussion on reworking Flink's FunctionCatalog. It's critically helpful to improve function usability in SQL.
https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
In short, it:
- adds support for precise function reference with fully/partially qualified names
- redefines function resolution order for ambiguous function references
- adds support for Hive's rich built-in functions (support for Hive user-defined functions was already added in 1.9.0)
- clarifies the concept of temporary functions

Would love to hear your thoughts.

Bowen
--
Xuefu Zhang

"In Honey We Trust!"


