Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Dawid Wysakowicz Wed, 11 Sep 2019 08:09:46 -0700

Hi Fabian,
Thank you for your response.
Regarding the temporary function, just wanted to clarify one thing: the
3-part identifier does not mean the user always has to provide the catalog
& database explicitly. The same way user does not have to provide them in
e.g. when creating permanent table, view etc. It means though functions are
always stored within a database. The same way as all the permanent objects
and other temporary objects(tables, views). If not given explicitly the
current catalog & database would be used, both in the create statement or
when using the function.


Point taken though your preference would be to support overriding built-in
functions.

Best,
Dawid

On Wed, 11 Sep 2019, 21:14 Fabian Hueske, <[email protected]> wrote:

> Hi all,
>
> I'd like to add my opinion on this topic as well ;-)
>
> In general, I think overriding built-in function with temp functions has a
> couple of benefits but also a few challenges:
>
> * Users can reimplement the behavior of a built-in functions of a different
> system, e.g., for backward compatibility after a migration.
> * I don't think that "accidental" overrides and surprising semantics are an
> issue or dangerous. The user registered the temp function in the same
> session and should therefore be aware of the changed semantics.
> * I see that not all built-in functions can be overridden, like the CAST
> example that Dawid gave. However, I think these should be a small fraction
> and such functions could be blacklisted. Sure, that's not super consistent,
> but should (IMO) not be a big issue in practice.
> * Temp functions should be easy to use. Requiring a 3-part addressing makes
> them a lot less user friendly, IMO. Users need to think about what catalog
> and db to choose when registering them. Also using a temp function in a
> query becomes less convenient. Moreover, I agree with Bowen's concerns that
> a 3-part addressing scheme reduces the temporal appearance of the function.
>
> From the three possible solutions, my preference order is
> 1) 1-part address with override of built-in
> 2) 1-part address without override of built-in
> 3) 3-part address
>
> Regarding the issue of external built-in functions, I don't think that
> Timo's proposal of modules is fully orthogonal to this discussion.
> A Hive function module could be an alternative to offering Hive functions
> as part of Hive's catalog.
> From a user's point of view, I think that modules would be a "cleaner"
> integration ("Why do I need a Hive catalog if all I want to do is apply a
> Hive function on a Kafka table?").
> However, the module approach clearly has the problem of dealing with
> same-named functions in different modules (e.g., a Hive function and a
> Flink built-in function).
> The catalog approach as the benefit that functions can be addressed like
> hiveCat::func (or a similar path).
>
> I'm not sure what's the best solution here.
>
> Cheers,
> Fabian
>
>
> Am Mo., 9. Sept. 2019 um 06:30 Uhr schrieb Bowen Li <[email protected]>:
>
> > Hi,
> >
> > W.r.t temp functions, I feel both options have their benefits and can
> > theoretically achieve similar functionalities one way or another. In the
> > end, it's more about use cases, users habits, and trade-offs.
> >
> > Re> Not always users are in full control of the catalog functions. There
> is
> > also the case where different teams manage the catalog & use the catalog.
> >
> > Temp functions live within a session, and not within a catalog. Having
> > 3-part paths may implies temp functions are tied to a catalog in two
> > aspects.
> > 1) it may indicate each catalog manages their temp functions, which is
> not
> > true as we seem all agree they should reside at a central place, either
> in
> > FunctionCatalog or CatalogManager
> > 2) it may indicate there's some access control. When users are forbidden
> to
> > manipulate some objects in the catalog that's managed by other teams, but
> > are allowed to manipulate some other objects (temp functions in this
> case)
> > belonging to the catalog in namespaces, users may think we introduced
> extra
> > complexity and confusion with some kind of access control into the
> problem.
> > It doesn't feel intuitive enough for end users.
> >
> > Thus, I'd be in favor of 1-part path for temporary functions, and other
> > temp objects.
> >
> > Thanks,
> > Bowen
> >
> >
> >
> > On Fri, Sep 6, 2019 at 2:16 AM Dawid Wysakowicz <[email protected]>
> > wrote:
> >
> > > I agree the consequences of the decision are substantial. Let's see
> what
> > > others think.
> > >
> > > -- Catalog functions are defined by users, and we suppose they can
> > > drop/alter it in any way they want. Thus, overwriting a catalog
> function
> > > doesn't seem to be a strong use case that we should be concerned about.
> > > Rather, there are known use case for overwriting built-in functions.
> > >
> > > Not always users are in full control of the catalog functions. There is
> > > also the case where different teams manage the catalog & use the
> catalog.
> > > As for overriding built-in functions with 3-part approach user can
> always
> > > use an equally named function from a catalog. E.g. to override
> > >
> > > *    SELECT explode(arr) FROM ...*
> > >
> > > user can always write:
> > >
> > > *    SELECT db.explode(arr) FROM ...*
> > >
> > > Best,
> > >
> > > Dawid
> > > On 06/09/2019 10:54, Xuefu Z wrote:
> > >
> > > Hi Dawid,
> > >
> > > Thank you for your summary. While the only difference in the two
> > proposals
> > > is one- or three-part in naming, the consequence would be substantial.
> > >
> > > To me, there are two major use cases of temporary functions compared to
> > > persistent ones:
> > > 1. Temporary in nature and auto managed by the session. More often than
> > > not, admin doesn't even allow user to create persistent functions.
> > > 2. Provide an opportunity to overwriting system built-in functions.
> > >
> > > Since built-in functions has one-part name, requiring three-part name
> for
> > > temporary functions eliminates the overwriting opportunity.
> > >
> > > One-part naming essentially puts all temp functions under a single
> > > namespace and simplifies function resolution, such as we don't need to
> > > consider the case of a temp function and a persistent function with the
> > > same name under the same database.
> > >
> > > I agree having three-parts does have its merits, such as consistency
> with
> > > other temporary objects (table) and minor difference between temp vs
> > > catalog functions. However, there is a slight difference between tables
> > and
> > > function in that there is no built-in table in SQL so there is no need
> to
> > > overwrite it.
> > >
> > > I'm not sure if I fully agree the benefits you listed as the advantages
> > of
> > > the three-part naming of temp functions.
> > >   -- Allowing overwriting built-in functions is a benefit and the
> > solution
> > > for disallowing certain overwriting shouldn't be totally banning it.
> > >   -- Catalog functions are defined by users, and we suppose they can
> > > drop/alter it in any way they want. Thus, overwriting a catalog
> function
> > > doesn't seem to be a strong use case that we should be concerned about.
> > > Rather, there are known use case for overwriting built-in functions.
> > >
> > > Thus, personally I would prefer one-part name for temporary functions.
> In
> > > lack of SQL standard on this, I certainly like to get opinions from
> > others
> > > to see if a consensus can be eventually reached.
> > >
> > > (To your point on modular approach to support external built-in
> > functions,
> > > we saw the value and are actively looking into it. Thanks for sharing
> > your
> > > opinion on that.)
> > >
> > > Thanks,
> > > Xuefu
> > >
> > > On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz <
> [email protected]>
> > <[email protected]>
> > > wrote:
> > >
> > >
> > > Hi Xuefu,
> > >
> > > Thank you for your answers.
> > >
> > > Let me summarize my understanding. In principle we differ only in
> regards
> > > to the fact if a temporary function can be only 1-part or only 3-part
> > > identified. I can reconfirm that if the community decides it prefers
> the
> > > 1-part approach I will commit to that, with the assumption that we will
> > > force ONLY 1-part function names. (We will parse identifier and throw
> > > exception if a user tries to register e.g. db.temp_func).
> > >
> > > My preference is though the 3-part approach:
> > >
> > >    - there are some functions that it makes no sense to override, e.g.
> > >    CAST, moreover I'm afraid that allowing overriding such will lead to
> > high
> > >    inconsistency, similar to those that I mentioned spark has
> > >    - you cannot shadow a fully-qualified function. (If a user fully
> > >    qualifies his/her objects in a SQL query, which is often considered
> a
> > good
> > >    practice)
> > >    - it does not differentiate between functions & temporary functions.
> > >    Temporary functions just differ with regards to their life-cycle.
> The
> > >    registration & usage is exactly the same.
> > >
> > > As it can be seen, the proposed concept regarding temp function and
> > > function resolution is quite simple.
> > >
> > > Both approaches are equally simple. I would even say the 3-part
> approach
> > > is slightly simpler as it does not have to care about some special
> > built-in
> > > functions such as CAST.
> > >
> > > I don't want to express my opinion on the differentiation between
> > built-in
> > > functions and "external" built-in functions in this thread as it is
> > rather
> > > orthogonal, but I also like the modular approach and I definitely don't
> > > like the special syntax "cat::function". I think it's better to stick
> to
> > a
> > > standard or at least other proved solutions from other systems.
> > >
> > > Best,
> > >
> > > Dawid
> > > On 05/09/2019 10:12, Xuefu Z wrote:
> > >
> > > Hi David,
> > >
> > > Thanks for sharing your thoughts and  request for clarifications. I
> > believe
> > > that I fully understood your proposal, which does has its merit.
> However,
> > > it's different from ours. Here are the answers to your questions:
> > >
> > > Re #1: yes, the temp functions in the proposal are global and have just
> > > one-part names, similar to built-in functions. Two- or three-part names
> > are
> > > not allowed.
> > >
> > > Re #2: not applicable as two- or three-part names are disallowed.
> > >
> > > Re #3: same as above. Referencing external built-in functions is
> achieved
> > > either implicitly (only the built-in functions in the current catalogs
> > are
> > > considered) or via special syntax such as cat::function. However, we
> are
> > > looking into the modular approach that Time suggested with other
> feedback
> > > received from the community.
> > >
> > > Re #4: the resolution order goes like the following in our proposal:
> > >
> > > 1. temporary functions
> > > 2. bulit-in functions (including those augmented by add-on modules)
> > > 3. built-in functions in current catalog (this will not be needed if
> the
> > > special syntax "cat::function" is required)
> > > 4. functions in current catalog and db.
> > >
> > > If we go with the modular approach and make external built-in functions
> > as
> > > an add-on module, the 2 and 3 above will be combined. In essence, the
> > > resolution order is equivalent in the two approaches.
> > >
> > > By the way, resolution order matters only for simple name reference.
> For
> > > names such as db.function (interpreted as current_cat/db/function) or
> > > cat.db.function, the reference is unambiguous, so on resolution is
> > needed.
> > >
> > > As it can be seen, the proposed concept regarding temp function and
> > > function resolution is quite simple. Additionally, the proposed
> > resolution
> > > order allows temp function to shadow a built-in function, which is
> > > important (though not decisive) in our opinion.
> > >
> > > I started liking the modular approach as the resolution order will only
> > > include 1, 2, and 4, which is simpler and more generic. That's why I
> > > suggested we look more into this direction.
> > >
> > > Please let me know if there are further questions.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > >
> > >
> > >
> > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz <
> [email protected]>
> > <[email protected]> <[email protected]> <
> [email protected]>
> > > wrote:
> > >
> > >
> > > Hi Xuefu,
> > >
> > > Just wanted to summarize my opinion on the one topic (temporary
> > functions).
> > >
> > > My preference would be to make temporary functions always 3-part
> > qualified
> > > (as a result that would prohibit overriding built-in functions). Having
> > > said that if the community decides that it's better to allow overriding
> > > built-in functions I am fine with it and can commit to that decision.
> > >
> > > I wanted to ask if you could clarify a few points for me around that
> > > option.
> > >
> > >    1. Would you enforce temporary functions to be always just a single
> > >    name (without db & cat) as hive does, or would you allow also 3 or
> > even 2
> > >    part identifiers?
> > >    2. Assuming 2/3-part paths. How would you register a function from a
> > >    following statement: CREATE TEMPORARY FUNCTION db.func? Would that
> > shadow
> > >    all functions named 'func' in all databases named 'db' in all
> > catalogs? Or
> > >    would you shadow only function 'func' in database 'db' in current
> > catalog?
> > >    3. This point is still under discussion, but was mentioned a few
> > >    times, that maybe we want to enable syntax cat.func for "external
> > built-in
> > >    functions". How would that affect statement from previous point?
> Would
> > >    'db.func' shadow "external built-in function" in 'db' catalog or
> user
> > >    functions as in point 2? Or maybe both?
> > >    4. Lastly in fact to summarize the previous points. Assuming
> 2/3-part
> > >    paths. Would the function resolution be actually as follows?:
> > >       1. temporary functions (1-part path)
> > >       2. built-in functions
> > >       3. temporary functions (2-part path)
> > >       4. 2-part catalog functions a.k.a. "external built-in functions"
> > >       (cat + func) - this is still under discussion, if we want that in
> > the other
> > >       focal point
> > >       5. temporary functions (3-part path)
> > >       6. 3-part catalog functions a.k.a. user functions
> > >
> > > I would be really grateful if you could explain me those questions,
> > thanks.
> > >
> > > BTW, Thank you all for a healthy discussion.
> > >
> > > Best,
> > >
> > > Dawid
> > > On 04/09/2019 23:25, Xuefu Z wrote:
> > >
> > > Thank all for the sharing thoughts. I think we have gathered some
> useful
> > > initial feedback from this long discussion with a couple of focal
> points
> > > sticking out.
> > >
> > >  We will go back to do more research and adapt our proposal. Once it's
> > > ready, we will ask for a new round of review. If there is any
> > disagreement,
> > > we will start a new discussion thread on each rather than having a mega
> > > discussion like this.
> > >
> > > Thanks to everyone for participating.
> > >
> > > Regards,
> > > Xuefu
> > >
> > >
> > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> wrote:
> > >
> > >
> > > Let me try to summarize and conclude the long thread so far:
> > >
> > > 1. For order of temp function v.s. built-in function:
> > >
> > > I think Dawid's point that temp function should be of fully qualified
> > path
> > > is a better reasoning to back the newly proposed order, and i agree we
> > > don't need to follow Hive/Spark.
> > >
> > > However, I'd rather not change fundamentals of temporary functions in
> > this
> > > FLIP. It belongs to a bigger story of how temporary objects should be
> > > redefined and be handled uniformly - currently temporary tables and
> views
> > > (those registered from TableEnv#registerTable()) behave different than
> > what
> > > Dawid propose for temp functions, and we need a FLIP to just unify
> their
> > > APIs and behaviors.
> > >
> > > I agree that backward compatibility is not an issue w.r.t Jark's
> points.
> > >
> > > ***Seems we do have consensus that it's acceptable to prevent users
> > > registering a temp function in the same name as a built-in function. To
> > > help us move forward, I'd like to propose setting such a restraint on
> > temp
> > > functions in this FLIP to simplify the design and avoid disputes.*** It
> > > will also leave rooms for improvements in the future.
> > >
> > >
> > > 2. For Hive built-in function:
> > >
> > > Thanks Timo for providing the Presto and Postgres examples. I feel
> > modular
> > > built-in functions can be a good fit for the geo and ml example as a
> > native
> > > Flink extension, but not sure if it fits well with external
> integrations.
> > > Anyway, I think modular built-in functions is a bigger story and can be
> > on
> > > its own thread too, and our proposal doesn't prevent Flink from doing
> > that
> > > in the future.
> > >
> > > ***Seems we have consensus that users should be able to use built-in
> > > functions of Hive or other external systems in SQL explicitly and
> > > deterministically regardless of Flink built-in functions and the
> > potential
> > > modular built-in functions, via some new syntax like "mycat::func"? If
> > so,
> > > I'd like to propose removing Hive built-in functions from ambiguous
> > > function resolution order, and empower users with such a syntax. This
> way
> > > we sacrifice a little convenience for certainty***
> > >
> > >
> > > What do you think?
> > >
> > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <
> [email protected]>
> > <[email protected]> <[email protected]> <
> [email protected]>
> > <[email protected]> <[email protected]> <
> [email protected]>
> > <[email protected]>
> > > wrote:
> > >
> > >
> > > Hi,
> > >
> > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just
> > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they
> > >
> > > are
> > >
> > > very inconsistent in that manner (spark being way worse on that).
> > >
> > > Hive:
> > >
> > > You cannot overwrite all the built-in functions. I could overwrite most
> > >
> > > of
> > >
> > > the functions I tried e.g. length, e, pi, round, rtrim, but there are
> > > functions I cannot overwrite e.g. CAST, ARRAY I get:
> > >
> > >
> > > *    ParseException line 1:29 cannot recognize input near 'array' 'AS'
> *
> > >
> > > What is interesting is that I cannot ovewrite *array*, but I can
> ovewrite
> > > *map* or *struct*. Though hive behaves reasonable well if I manage to
> > > overwrite a function. When I drop the temporary function the native
> > > function is still available.
> > >
> > > Spark:
> > >
> > > Spark's behavior imho is super bad.
> > >
> > > Theoretically I could overwrite all functions. I was able e.g. to
> > > overwrite CAST function. I had to use though CREATE OR REPLACE
> TEMPORARY
> > > FUNCTION syntax. Otherwise I get an exception that a function already
> > > exists. However when I used the CAST function in a query it used the
> > > native, built-in one.
> > >
> > > When I overwrote current_date() function, it was used in a query, but
> it
> > > completely replaces the built-in function and I can no longer use the
> > > native function in any way. I cannot also drop the temporary function.
> I
> > > get:
> > >
> > > *    Error in query: Cannot drop native function 'current_date';*
> > >
> > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS
> > > with a database. Temporary functions are always represented as a single
> > > name.
> > >
> > > In my opinion neither of the systems have consistent behavior.
> Generally
> > > speaking I think overwriting any system provided functions is just
> > > dangerous.
> > >
> > > Regarding Jark's concerns. Such functions would be registered in a
> > >
> > > current
> > >
> > > catalog/database schema, so a user could still use its own function,
> but
> > > would have to fully qualify the function (because built-in functions
> take
> > > precedence). Moreover users would have the same problem with permanent
> > > functions. Imagine a user have a permanent function 'cat.db.explode'.
> In
> > > 1.9 the user could use just the 'explode' function as long as the
> 'cat' &
> > > 'db' were the default catalog & database. If we introduce 'explode'
> > > built-in function in 1.10, the user has to fully qualify the function.
> > >
> > > Best,
> > >
> > > Dawid
> > > On 04/09/2019 15:19, Timo Walther wrote:
> > >
> > > Hi all,
> > >
> > > thanks for the healthy discussion. It is already a very long discussion
> > > with a lot of text. So I will just post my opinion to a couple of
> > > statements:
> > >
> > >
> > > Hive built-in functions are not part of Flink built-in functions, they
> > >
> > > are catalog functions
> > >
> > > That is not entirely true. Correct me if I'm wrong but I think Hive
> > > built-in functions are also not catalog functions. They are not stored
> in
> > > every Hive metastore catalog that is freshly created but are a set of
> > > functions that are listed somewhere and made available.
> > >
> > >
> > > ambiguous functions reference just shouldn't be resolved to a different
> > >
> > > catalog
> > >
> > > I agree. They should not be resolved to a different catalog. That's
> why I
> > > am suggesting to split the concept of built-in functions and catalog
> > >
> > > lookup
> > >
> > > semantics.
> > >
> > >
> > > I don't know if any other databases handle built-in functions like that
> > >
> > > What I called "module" is:
> > > - Extension in Postgres [1]
> > > - Plugin in Presto [2]
> > >
> > > Btw. Presto even mentions example modules that are similar to the ones
> > > that we will introduce in the near future both for ML and System XYZ
> > > compatibility:
> > > "See either the presto-ml module for machine learning functions or the
> > > presto-teradata-functions module for Teradata-compatible functions,
> both
> > >
> > > in
> > >
> > > the root of the Presto source."
> > >
> > >
> > > functions should be either built-in already or just libraries
> > >
> > > functions,
> > >
> > > and library functions can be adapted to catalog APIs or of some other
> > > syntax to use
> > >
> > > Regarding "built-in already", of course we can add a lot of functions
> as
> > > built-ins but we will end-up in a dependency hell in the near future if
> > >
> > > we
> > >
> > > don't introduce a pluggable approach. Library functions is what you
> also
> > > suggest but storing them in a catalog means to always fully qualify
> them
> > >
> > > or
> > >
> > > modifying the existing catalog design that was inspired by the
> standard.
> > >
> > > I don't think "it brings in even more complicated scenarios to the
> > > design", it just does clear separation of concerns. Integrating the
> > > functionality into the current design makes the catalog API more
> > > complicated.
> > >
> > >
> > > why would users name a temporary function the same as a built-in
> > >
> > > function then?
> > >
> > > Because you never know what users do. If they don't, my suggested
> > > resolution order should not be a problem, right?
> > >
> > >
> > > I don't think hive functions deserves be a function module
> > >
> > > Our goal is not to create a Hive clone. We need to think forward and
> Hive
> > > is just one of many systems that we can support. Not every built-in
> > > function behaves and will behave exactly like Hive.
> > >
> > >
> > > regarding temporary functions, there are few systems that support it
> > >
> > > IMHO Spark and Hive are not always the best examples for consistent
> > > design. Systems like Postgres, Presto, or SQL Server should be used as
> a
> > > reference. I don't think that a user can overwrite a built-in function
> > > there.
> > >
> > > Regards,
> > > Timo
> > >
> > > [1] https://www.postgresql.org/docs/10/extend-extensions.html
> > > [2] https://prestodb.github.io/docs/current/develop/functions.html
> > >
> > >
> > > On 04.09.19 13:44, Jark Wu wrote:
> > >
> > > Hi all,
> > >
> > > Regarding #1 temp function <> built-in function and naming.
> > > I'm fine with temp functions should precede built-in function and can
> > > override built-in functions (we already support to override built-in
> > > function in 1.9).
> > > If we don't allow the same name as a built-in function, I'm afraid we
> > >
> > > will
> > >
> > > have compatibility issues in the future.
> > > Say users register a user defined function named "explode" in 1.9, and
> we
> > > support a built-in "explode" function in 1.10.
> > > Then the user's jobs which call the registered "explode" function in
> 1.9
> > > will all fail in 1.10 because of naming conflict.
> > >
> > > Regarding #2 "External" built-in functions.
> > > I think if we store external built-in functions in catalog, then
> > > "hive1::sqrt" is a good way to go.
> > > However, I would prefer to support a discovery mechanism (e.g. SPI) for
> > > built-in functions as Timo suggested above.
> > > This gives us the flexibility to add Hive or MySQL or Geo or whatever
> > > function set as built-in functions in an easy way.
> > >
> > > Best,
> > > Jark
> > >
> > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> wrote:
> > >
> > > Hi David,
> > >
> > > Thank you for sharing your findings. It seems to me that there is no
> SQL
> > > standard regarding temporary functions. There are few systems that
> > >
> > > support
> > >
> > > it. Here are what I have found:
> > >
> > > 1. Hive: no DB qualifier allowed. Can overwrite built-in.
> > > 2. Spark: basically follows Hive (
> > >
> > >
> > >
> >
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
> > >
> > > )
> > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of
> overwriting
> > > behavior. (
> > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html
> > >
> > > )
> > >
> > > Because of lack of standard, it's perfectly fine for Flink to define
> > > whatever it sees appropriate. Thus, your proposal (no overwriting and
> > >
> > > must
> > >
> > > have DB as holder) is one option. The advantage is simplicity, The
> > > downside
> > > is the deviation from Hive, which is popular and de facto standard in
> big
> > > data world.
> > >
> > > However, I don't think we have to follow Hive. More importantly, we
> need
> > >
> > > a
> > >
> > > consensus. I have no objection if your proposal is generally agreed
> upon.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <
> [email protected]
> > <[email protected]> <[email protected]> <
> [email protected]>
> > <[email protected]> <[email protected]> <
> [email protected]>
> > <[email protected]> <[email protected]>
> > > wrote:
> > >
> > > Hi all,
> > >
> > > Just an opinion on the built-in <> temporary functions resolution and
> > > NAMING issue. I think we should not allow overriding the built-in
> > > functions, as this may pose serious issues and to be honest is rather
> > > not feasible and would require major rework. What happens if a user
> > > wants to override CAST? Calls to that function are generated at
> > > different layers of the stack that unfortunately does not always go
> > > through the Catalog API (at least yet). Moreover from what I've checked
> > > no other systems allow overriding the built-in functions. All the
> > > systems I've checked so far register temporary functions in a
> > > database/schema (either special database for temporary functions, or
> > > just current database). What I would suggest is to always register
> > > temporary functions with a 3 part identifier. The same way as tables,
> > > views etc. This effectively means you cannot override built-in
> > > functions. With such approach it is natural that the temporary
> functions
> > > end up a step lower in the resolution order:
> > >
> > > 1. built-in functions (1 part, maybe 2? - this is still under
> discussion)
> > >
> > > 2. temporary functions (always 3 part path)
> > >
> > > 3. catalog functions (always 3 part path)
> > >
> > > Let me know what do you think.
> > >
> > > Best,
> > >
> > > Dawid
> > >
> > > On 04/09/2019 06:13, Bowen Li wrote:
> > >
> > > Hi,
> > >
> > > I agree with Xuefu that the main controversial points are mainly the
> > >
> > > two
> > >
> > > places. My thoughts on them:
> > >
> > > 1) Determinism of referencing Hive built-in functions. We can either
> > >
> > > remove
> > >
> > > Hive built-in functions from ambiguous function resolution and require
> > > users to use special syntax for their qualified names, or add a config
> > >
> > > flag
> > >
> > > to catalog constructor/yaml for turning on and off Hive built-in
> > >
> > > functions
> > >
> > > with the flag set to 'false' by default and proper doc added to help
> > >
> > > users
> > >
> > > make their decisions.
> > >
> > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous
> > >
> > > function
> > >
> > > resolution order. We believe Flink temp functions should precede Flink
> > > built-in functions, and I have presented my reasons. Just in case if we
> > > cannot reach an agreement, I propose forbid users registering temp
> > > functions in the same name as a built-in function, like MySQL's
> > >
> > > approach,
> > >
> > > for the moment. It won't have any performance concern, since built-in
> > > functions are all in memory and thus cost of a name check will be
> > >
> > > really
> > >
> > > trivial.
> > >
> > >
> > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> wrote:
> > >
> > >  From what I have seen, there are a couple of focal disagreements:
> > >
> > > 1. Resolution order: temp function --> flink built-in function -->
> > >
> > > catalog
> > >
> > > function vs flink built-in function --> temp function -> catalog
> > >
> > > function.
> > >
> > > 2. "External" built-in functions: how to treat built-in functions in
> > > external system and how users reference them
> > >
> > > For #1, I agree with Bowen that temp function needs to be at the
> > >
> > > highest
> > >
> > > priority because that's how a user might overwrite a built-in function
> > > without referencing a persistent, overwriting catalog function with a
> > >
> > > fully
> > >
> > > qualified name. Putting built-in functions at the highest priority
> > > eliminates that usage.
> > >
> > > For #2, I saw a general agreement on referencing "external" built-in
> > > functions such as those in Hive needs to be explicit and deterministic
> > >
> > > even
> > >
> > > though different approaches are proposed. To limit the scope and
> > >
> > > simply
> > >
> > > the
> > >
> > > usage, it seems making sense to me to introduce special syntax for
> > >
> > > user  to
> > >
> > > explicitly reference an external built-in function such as hive1::sqrt
> > >
> > > or
> > >
> > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API
> > >
> > > call
> > >
> > > hive1.getFunction(ObjectPath functionName) where the database name is
> > > absent for bulit-in functions available in that catalog hive1. I
> > >
> > > understand
> > >
> > > that Bowen's original proposal was trying to avoid this, but this
> > >
> > > could
> > >
> > > turn out to be a clean and simple solution.
> > >
> > > (Timo's modular approach is great way to "expand" Flink's built-in
> > >
> > > function
> > >
> > > set, which seems orthogonal and complementary to this, which could be
> > > tackled in further future work.)
> > >
> > > I'd be happy to hear further thoughts on the two points.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> [email protected]>
> > <[email protected]> <[email protected]> <[email protected]><
> [email protected]>
> > <[email protected]> <[email protected]> <[email protected]> <
> [email protected]>
> > <[email protected]> <[email protected]> <[email protected]> wrote:
> > >
> > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is
> > >
> > > the
> > >
> > > same
> > > as Bowen's. But after thinking about it, I'm currently lean to Timo's
> > > suggestion.
> > >
> > > The reason is backward compatibility. If we follow Bowen's approach,
> > >
> > > let's
> > >
> > > say we
> > > first find function in Flink's built-in functions, and then hive's
> > > built-in. For example, `foo`
> > > is not supported by Flink, but hive has such built-in function. So
> > >
> > > user
> > >
> > > will have hive's
> > > behavior for function `foo`. And in next release, Flink realize this
> > >
> > > is a
> > >
> > > very popular function
> > > and add it into Flink's built-in functions, but with different
> > >
> > > behavior
> > >
> > > as
> > >
> > > hive's. So in next
> > > release, the behavior changes.
> > >
> > > With Timo's approach, IIUC user have to tell the framework explicitly
> > >
> > > what
> > >
> > > kind of
> > > built-in functions he would like to use. He can just tell framework
> > >
> > > to
> > >
> > > abandon Flink's built-in
> > > functions, and use hive's instead. User can only choose between them,
> > >
> > > but
> > >
> > > not use
> > > them at the same time. I think this approach is more predictable.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Thanks for the feedback. Just a kindly reminder that the [Proposal]
> > >
> > > section
> > >
> > > in the google doc was updated, please take a look first and let me
> > >
> > > know
> > >
> > > if
> > >
> > > you have more questions.
> > >
> > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]>
> > >
> > > wrote:
> > >
> > > Hi Timo,
> > >
> > > Re> 1) We should not have the restriction "hive built-in functions
> > >
> > > can
> > >
> > > only
> > >
> > > be used when current catalog is hive catalog". Switching a catalog
> > > should only have implications on the cat.db.object resolution but
> > >
> > > not
> > >
> > > functions. It would be quite convinient for users to use Hive
> > >
> > > built-ins
> > >
> > > even if they use a Confluent schema registry or just the in-memory
> > >
> > > catalog.
> > >
> > > There might be a misunderstanding here.
> > >
> > > First of all, Hive built-in functions are not part of Flink
> > >
> > > built-in
> > >
> > > functions, they are catalog functions, thus if the current catalog
> > >
> > > is
> > >
> > > not a
> > >
> > > HiveCatalog but, say, a schema registry catalog, ambiguous
> > >
> > > functions
> > >
> > > reference just shouldn't be resolved to a different catalog.
> > >
> > > Second, Hive built-in functions can potentially be referenced
> > >
> > > across
> > >
> > > catalog, but it doesn't have db namespace and we currently just
> > >
> > > don't
> > >
> > > have
> > >
> > > a SQL syntax for it. It can be enabled when such a SQL syntax is
> > >
> > > defined,
> > >
> > > e.g. "catalog::function", but it's out of scope of this FLIP.
> > >
> > > 2) I would propose to have separate concepts for catalog and
> > >
> > > built-in
> > >
> > > functions. In particular it would be nice to modularize built-in
> > > functions. Some built-in functions are very crucial (like AS, CAST,
> > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and
> > >
> > > maybe
> > >
> > > we add more experimental functions in the future or function for
> > >
> > > some
> > >
> > > special application area (Geo functions, ML functions). A data
> > >
> > > platform
> > >
> > > team might not want to make every built-in function available. Or a
> > > function module like ML functions is in a different Maven module.
> > >
> > > I think this is orthogonal to this FLIP, especially we don't have
> > >
> > > the
> > >
> > > "external built-in functions" anymore and currently the built-in
> > >
> > > function
> > >
> > > category remains untouched.
> > >
> > > But just to share some thoughts on the proposal, I'm not sure about
> > >
> > > it:
> > >
> > > - I don't know if any other databases handle built-in functions
> > >
> > > like
> > >
> > > that.
> > >
> > > Maybe you can give some examples? IMHO, built-in functions are
> > >
> > > system
> > >
> > > info
> > >
> > > and should be deterministic, not depending on loaded libraries. Geo
> > > functions should be either built-in already or just libraries
> > >
> > > functions,
> > >
> > > and library functions can be adapted to catalog APIs or of some
> > >
> > > other
> > >
> > > syntax to use
> > > - I don't know if all use cases stand, and many can be achieved by
> > >
> > > other
> > >
> > > approaches too. E.g. experimental functions can be taken good care
> > >
> > > of
> > >
> > > by
> > >
> > > documentations, annotations, etc
> > > - the proposal basically introduces some concept like a pluggable
> > >
> > > built-in
> > >
> > > function catalog, despite the already existing catalog APIs
> > > - it brings in even more complicated scenarios to the design. E.g.
> > >
> > > how
> > >
> > > do
> > >
> > > you handle built-in functions in different modules but different
> > >
> > > names?
> > >
> > > In short, I'm not sure if it really stands and it looks like an
> > >
> > > overkill
> > >
> > > to me. I'd rather not go to that route. Related discussion can be
> > >
> > > on
> > >
> > > its
> > >
> > > own thread.
> > >
> > > 3) Following the suggestion above, we can have a separate discovery
> > > mechanism for built-in functions. Instead of just going through a
> > >
> > > static
> > >
> > > list like in BuiltInFunctionDefinitions, a platform team should be
> > >
> > > able
> > >
> > > to select function modules like
> > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> > > HiveFunctions) or via service discovery;
> > >
> > > Same as above. I'll leave it to its own thread.
> > >
> > > re > 3) Dawid and I discussed the resulution order again. I agree
> > >
> > > with
> > >
> > > Kurt
> > >
> > > that we should unify built-in function (external or internal)
> > >
> > > under a
> > >
> > > common layer. However, the resolution order should be:
> > >    1. built-in functions
> > >    2. temporary functions
> > >    3. regular catalog resolution logic
> > > Otherwise a temporary function could cause clashes with Flink's
> > >
> > > built-in
> > >
> > > functions. If you take a look at other vendors, like SQL Server
> > >
> > > they
> > >
> > > also do not allow to overwrite built-in functions.
> > >
> > > ”I agree with Kurt that we should unify built-in function (external
> > >
> > > or
> > >
> > > internal) under a common layer.“ <- I don't think this is what Kurt
> > >
> > > means.
> > >
> > > Kurt and I are in favor of unifying built-in functions of external
> > >
> > > systems
> > >
> > > and catalog functions. Did you type a mistake?
> > >
> > > Besides, I'm not sure about the resolution order you proposed.
> > >
> > > Temporary
> > >
> > > functions have a lifespan over a session and are only visible to
> > >
> > > the
> > >
> > > session owner, they are unique to each user, and users create them
> > >
> > > on
> > >
> > > purpose to be the highest priority in order to overwrite system
> > >
> > > info
> > >
> > > (built-in functions in this case).
> > >
> > > In your case, why would users name a temporary function the same
> > >
> > > as a
> > >
> > > built-in function then? Since using that name in ambiguous function
> > > reference will always be resolved to built-in functions, creating a
> > > same-named temp function would be meaningless in the end.
> > >
> > >
> > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]>
> > >
> > > wrote:
> > >
> > > Hi Jingsong,
> > >
> > > Re> 1.Hive built-in functions is an intermediate solution. So we
> > >
> > > should
> > >
> > > not introduce interfaces to influence the framework. To make
> > > Flink itself more powerful, we should implement the functions
> > > we need to add.
> > >
> > > Yes, please see the doc.
> > >
> > > Re> 2.Non-flink built-in functions are easy for users to change
> > >
> > > their
> > >
> > > behavior. If we support some flink built-in functions in the
> > > future but act differently from non-flink built-in, this will
> > >
> > > lead
> > >
> > > to
> > >
> > > changes in user behavior.
> > >
> > > There's no such concept as "external built-in functions" any more.
> > > Built-in functions of external systems will be treated as special
> > >
> > > catalog
> > >
> > > functions.
> > >
> > > Re> Another question is, does this fallback include all
> > >
> > > hive built-in functions? As far as I know, some hive functions
> > > have some hacky. If possible, can we start with a white list?
> > > Once we implement some functions to flink built-in, we can
> > > also update the whitelist.
> > >
> > > Yes, that's something we thought of too. I don't think it's super
> > > critical to the scope of this FLIP, thus I'd like to leave it to
> > >
> > > future
> > >
> > > efforts as a nice-to-have feature.
> > >
> > >
> > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]>
> > >
> > > wrote:
> > >
> > > Hi Kurt,
> > >
> > > Re: > What I want to propose is we can merge #3 and #4, make them
> > >
> > > both
> > >
> > > under
> > >
> > > "catalog" concept, by extending catalog function to make it have
> > >
> > > ability to
> > >
> > > have built-in catalog functions. Some benefits I can see from
> > >
> > > this
> > >
> > > approach:
> > >
> > > 1. We don't have to introduce new concept like external built-in
> > >
> > > functions.
> > >
> > > Actually I don't see a full story about how to treat a built-in
> > >
> > > functions, and it
> > >
> > > seems a little bit disrupt with catalog. As a result, you have
> > >
> > > to
> > >
> > > make
> > >
> > > some restriction
> > >
> > > like "hive built-in functions can only be used when current
> > >
> > > catalog
> > >
> > > is
> > >
> > > hive catalog".
> > >
> > > Yes, I've unified #3 and #4 but it seems I didn't update some
> > >
> > > part
> > >
> > > of
> > >
> > > the doc. I've modified those sections, and they are up to date
> > >
> > > now.
> > >
> > > In short, now built-in function of external systems are defined
> > >
> > > as
> > >
> > > a
> > >
> > > special kind of catalog function in Flink, and handled by Flink
> > >
> > > as
> > >
> > > following:
> > > - An external built-in function must be associated with a catalog
> > >
> > > for
> > >
> > > the purpose of decoupling flink-table and external systems.
> > > - It always resides in front of catalog functions in ambiguous
> > >
> > > function
> > >
> > > reference order, just like in its own external system
> > > - It is a special catalog function that doesn’t have a
> > >
> > > schema/database
> > >
> > > namespace
> > > - It goes thru the same instantiation logic as other user defined
> > > catalog functions in the external system
> > >
> > > Please take another look at the doc, and let me know if you have
> > >
> > > more
> > >
> > > questions.
> > >
> > >
> > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]><[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]> <
> > [email protected]> <[email protected]> <[email protected]>
> > >
> > > wrote:
> > >
> > > Hi Kurt,
> > >
> > > it should not affect the functions and operations we currently
> > >
> > > have
> > >
> > > in
> > >
> > > SQL. It just categorizes the available built-in functions. It is
> > >
> > > kind
> > >
> > > of
> > > an orthogonal concept to the catalog API but built-in functions
> > >
> > > deserve
> > >
> > > this special kind of treatment. CatalogFunction still fits
> > >
> > > perfectly
> > >
> > > in
> > >
> > > there because the regular catalog object resolution logic is not
> > > affected. So tables and functions are resolved in the same way
> > >
> > > but
> > >
> > > with
> > >
> > > built-in functions that have priority as in the original design.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > On 03.09.19 15:26, Kurt Young wrote:
> > >
> > > Does this only affect the functions and operations we currently
> > >
> > > have
> > >
> > > in SQL
> > >
> > > and
> > > have no effect on tables, right? Looks like this is an
> > >
> > > orthogonal
> > >
> > > concept
> > >
> > > with Catalog?
> > > If the answer are both yes, then the catalog function will be a
> > >
> > > weird
> > >
> > > concept?
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[email protected]
> > >
> > > wrote:
> > >
> > > The way you proposed are basically the same as what Calcite
> > >
> > > does, I
> > >
> > > think
> > >
> > > we are in the same line.
> > >
> > > Best,
> > > Danny Chan
> > > 在 2019年9月3日 +0800 PM7:57，Timo Walther <[email protected]
> > >
> > > ，写道：
> > >
> > > This sounds exactly as the module approach I mentioned, no?
> > >
> > > Regards,
> > > Timo
> > >
> > > On 03.09.19 13:42, Danny Chan wrote:
> > >
> > > Thanks Bowen for bring up this topic, I think it’s a useful
> > >
> > > refactoring to make our function usage more user friendly.
> > >
> > > For the topic of how to organize the builtin operators and
> > >
> > > operators
> > >
> > > of Hive, here is a solution from Apache Calcite, the Calcite
> > >
> > > way
> > >
> > > is
> > >
> > > to make
> > >
> > > every dialect operators a “Library”, user can specify which
> > >
> > > libraries they
> > >
> > > want to use for a sql query. The builtin operators always
> > >
> > > comes
> > >
> > > as
> > >
> > > the
> > >
> > > first class objects and the others are used from the order
> > >
> > > they
> > >
> > > appears.
> > >
> > > Maybe you can take a reference.
> > >
> > > [1]
> > >
> > >
> > >
> >
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> > >
> > > Best,
> > > Danny Chan
> > > 在 2019年8月28日 +0800 AM2:50，Bowen Li <[email protected]
> > >
> > > ，写道：
> > >
> > > Hi folks,
> > >
> > > I'd like to kick off a discussion on reworking Flink's
> > >
> > > FunctionCatalog.
> > >
> > > It's critically helpful to improve function usability in
> > >
> > > SQL.
> > >
> > >
> > >
> >
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> > >
> > > In short, it:
> > > - adds support for precise function reference with
> > >
> > > fully/partially
> > >
> > > qualified name
> > > - redefines function resolution order for ambiguous
> > >
> > > function
> > >
> > > reference
> > >
> > > - adds support for Hive's rich built-in functions (support
> > >
> > > for
> > >
> > > Hive
> > >
> > > user
> > >
> > > defined functions was already added in 1.9.0)
> > > - clarifies the concept of temporary functions
> > >
> > > Would love to hear your thoughts.
> > >
> > > Bowen
> > >
> > > --
> > > Xuefu Zhang
> > >
> > > "In Honey We Trust!"
> > >
> > >
> > > --
> > > Xuefu Zhang
> > >
> > > "In Honey We Trust!"
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Reply via email to