Hi Bowen,

thanks for your feedback. We should not change the Google doc anymore; any additional comments should go into the wiki page. I will also add a bit more explanation to some parts so that people understand certain design decisions.

Regards,
Timo


On 08.01.19 at 22:54, Bowen Li wrote:
Thank you, Xuefu and Timo, for putting together the FLIP! I like that both
its scope and implementation plan are clear. Looking forward to feedback from
the group.

I also added a few more complementary details in the doc.

Thanks,
Bowen


On Mon, Jan 7, 2019 at 8:37 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:

Thanks, Timo!

I have started putting the content from the google doc into FLIP-30 [1].
However, please still keep the discussion along this thread.

Thanks,
Xuefu

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs


------------------------------------------------------------------
From:Timo Walther <twal...@apache.org>
Sent At:2019 Jan. 7 (Mon.) 05:59
To:dev <dev@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi everyone,

Xuefu and I have gone through multiple iterations of the catalog design document
[1]. I believe that it is in good shape now to be converted into a FLIP.
Maybe we need a bit more explanation in some places, but the general
design is ready now.

The design document covers the following changes:
- Unify external catalog interface and Flink's internal catalog in
TableEnvironment
- Clearly define a hierarchy of reference objects namely:
"catalog.database.table"
- Enable a tight integration with Hive and Hive data connectors, as well as
a broad integration with existing TableFactories and the discovery mechanism
- Make the catalog interfaces more feature complete by adding views and
functions
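
To make the reference hierarchy concrete, a fully qualified table reference under the proposed "catalog.database.table" scheme could look as follows (a sketch only; the catalog, database, and table names are invented for illustration):

```sql
-- Hypothetical names: 'hive_cat' is a registered catalog,
-- 'sales_db' a database inside it, 'orders' a table in that database.
SELECT * FROM hive_cat.sales_db.orders;

-- With a default catalog and database set, the shorter forms
-- sales_db.orders or just orders would resolve via the defaults.
```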

If you have any further feedback, it would be great to give it now
before we convert it into a FLIP.

Thanks,
Timo

[1]

https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#



On 07.01.19 at 13:51, Timo Walther wrote:
Hi Eron,

thank you very much for the contributions. I merged the first little
bug fixes. For the remaining PRs I think we can review and merge them
soon. As you said, the code is agnostic to the details of the
ExternalCatalog interface and I don't expect bigger merge conflicts in
the near future.

However, exposing the current external catalog interfaces to SQL
Client users would make it even more difficult to change the
interfaces in the future. So I would rather wait until the
general catalog discussion is over and the FLIP has been created. This
should happen shortly.

We should definitely coordinate the efforts better in the future to
avoid duplicate work.

Thanks,
Timo


On 07.01.19 at 00:24, Eron Wright wrote:
Thanks Timo for merging a couple of the PRs. Are you also able to
review the others that I mentioned? Xuefu, I would like to incorporate
your feedback too.

Check out this short demonstration of using a catalog in SQL Client:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

Thanks again!

On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwri...@gmail.com
<mailto:eronwri...@gmail.com>> wrote:

      Would a couple of folks raise their hands to make a review pass through
      the 6 PRs listed above?  It is a lovely stack of PRs that is 'all
      green' at the moment.  I would be happy to open follow-on PRs to
      rapidly align with other efforts.

     Note that the code is agnostic to the details of the
     ExternalCatalog interface; the code would not be obsolete if/when
     the catalog interface is enhanced as per the design doc.



     On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwri...@gmail.com
     <mailto:eronwri...@gmail.com>> wrote:

          I propose that the community review and merge the PRs that I
          posted, and then evolve the design through 1.8 and beyond.  I
          think having a basic infrastructure in place now will
          accelerate the effort; do you agree?

         Thanks again!

         On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
         <xuef...@alibaba-inc.com <mailto:xuef...@alibaba-inc.com>>
wrote:

             Hi Eron,

             Happy New Year!

              Thank you very much for your contribution, especially
              during the holidays. While I'm encouraged by your work, I'd
              also like to share my thoughts on how to move forward.

              First, please note that the design discussion is still
              being finalized, and we expect some moderate changes,
              especially around TableFactories. Another pending change
              is our decision to shy away from Scala, which will
              impact our work.

              Secondly, while your work seems to be about plugging
              catalog definitions into the execution environment, which
              is less impacted by the TableFactory change, I did notice
              some duplication between your work and ours. This is no
              big deal, but going forward, we should probably
              communicate better on work assignments so as to avoid any
              possible duplication of effort. On the other hand, I think
              some of your work is interesting and valuable for
              inclusion once we finalize the overall design.

              Thus, please continue your research and experiments, and
              let us know when you start working on anything so we can
              coordinate better.

             Thanks again for your interest and contributions.

             Thanks,
             Xuefu



------------------------------------------------------------------
                 From:Eron Wright <eronwri...@gmail.com
                 <mailto:eronwri...@gmail.com>>
                 Sent At:2019 Jan. 1 (Tue.) 18:39
                 To:dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>; Xuefu
                 <xuef...@alibaba-inc.com
<mailto:xuef...@alibaba-inc.com>>
                 Cc:Xiaowei Jiang <xiaow...@gmail.com
                 <mailto:xiaow...@gmail.com>>; twalthr
                 <twal...@apache.org <mailto:twal...@apache.org>>;
                 piotr <pi...@data-artisans.com
                 <mailto:pi...@data-artisans.com>>; Fabian Hueske
                 <fhue...@gmail.com <mailto:fhue...@gmail.com>>;
                 suez1224 <suez1...@gmail.com
                 <mailto:suez1...@gmail.com>>; Bowen Li
                 <bowenl...@gmail.com <mailto:bowenl...@gmail.com>>
                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
                 Hive ecosystem

                  Hi folks, there are clearly some incremental steps to
                  be taken to introduce catalog support to SQL Client,
                  complementary to what is proposed in the Flink-Hive
                  Metastore design doc.  I was quietly working on this
                  over the holidays.  I posted some new sub-tasks, PRs,
                  and sample code to FLINK-10744.

                 What inspired me to get involved is that the catalog
                 interface seems like a great way to encapsulate a
                 'library' of Flink tables and functions. For example,
                 the NYC Taxi dataset (TaxiRides, TaxiFares, various
                 UDFs) may be nicely encapsulated as a catalog
                 (TaxiData).  Such a library should be fully consumable
                 in SQL Client.

                 I implemented the above. Some highlights:
                 1. A fully-worked example of using the Taxi dataset in
                 SQL Client via an environment file.
                 - an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

                  - the corresponding environment file (will be even
                  more concise once 'FLINK-10696 Catalog UDFs' is merged):
                  https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
                  - the typed API for standalone table applications:
                  https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
                 2. Implementation of the core catalog descriptor and
                 factory.  I realize that some renames may later occur
                 as per the design doc, and would be happy to do that
                 as a follow-up.
                 https://github.com/apache/flink/pull/7390

                 3. Implementation of a connect-style API on
                 TableEnvironment to use catalog descriptor.
                 https://github.com/apache/flink/pull/7392

                 4. Integration into SQL-Client's environment file:
                 https://github.com/apache/flink/pull/7393

                 I realize that the overall Hive integration is still
                 evolving, but I believe that these PRs are a good
                 stepping stone. Here's the list (in bottom-up order):
                 - https://github.com/apache/flink/pull/7386
                 - https://github.com/apache/flink/pull/7388
                 - https://github.com/apache/flink/pull/7389
                 - https://github.com/apache/flink/pull/7390
                 - https://github.com/apache/flink/pull/7392
                 - https://github.com/apache/flink/pull/7393

                 Thanks and enjoy 2019!
                 Eron W


                 On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
                 <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>> wrote:
                 Hi Xiaowei,

                  Thanks for bringing up the question. In the current
                  design, the properties for meta objects are meant to
                  cover anything that's specific to a particular catalog
                  and agnostic to Flink. Anything that is common (such
                  as the schema for tables, the query text for views,
                  and the UDF class name) is abstracted as members of
                  the respective classes. However, this is still under
                  discussion, and Timo and I will go over it and
                  provide an update.

                 Please note that UDF is a little more involved than
                 what the current design doc shows. I'm still refining
                 this part.

                 Thanks,
                 Xuefu


------------------------------------------------------------------
                 Sender:Xiaowei Jiang <xiaow...@gmail.com
                 <mailto:xiaow...@gmail.com>>
                 Sent at:2018 Nov 18 (Sun) 15:17
                 Recipient:dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>
                 Cc:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>; twalthr
                 <twal...@apache.org <mailto:twal...@apache.org>>;
                 piotr <pi...@data-artisans.com
                 <mailto:pi...@data-artisans.com>>; Fabian Hueske
                 <fhue...@gmail.com <mailto:fhue...@gmail.com>>;
                 suez1224 <suez1...@gmail.com
<mailto:suez1...@gmail.com>>
                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
                 Hive ecosystem

                  Thanks Xuefu for the detailed design doc! One question
                  on the properties associated with the catalog objects:
                  are we going to leave them completely free-form, or
                  are we going to set some standard for them? I think
                  the answer may depend on whether we want to explore
                  catalog-specific optimization opportunities. In any
                  case, I think it might be helpful to standardize as
                  much as possible into strongly typed classes and
                  leave these properties for catalog-specific things.
                  But I think we can do it in steps.

                 Xiaowei
                 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
                 <bowenl...@gmail.com <mailto:bowenl...@gmail.com>>
wrote:
                  Thanks for continuing to improve the overall design,
                  Xuefu! It looks quite
                  good to me now.

                  It would be nice if the cc-ed Flink committers could
                  help review and confirm!



                   One minor suggestion: since the last section of the
                  design doc already touches on
                   some new SQL statements, shall we add another section
                  to our doc and
                   formalize the new SQL statements in SQL Client and
                  TableEnvironment that
                   will come along naturally with our design? Here
                  are some that the
                   design doc mentioned and some that I came up with:

                  To be added:

                     - USE <catalog> - set default catalog
                     - USE <catalog.schema> - set default schema
                     - SHOW CATALOGS - show all registered catalogs
                     - SHOW SCHEMAS [FROM catalog] - list schemas in
                 the current default
                     catalog or the specified catalog
                     - DESCRIBE VIEW view - show the view's definition
                 in CatalogView
                     - SHOW VIEWS [FROM schema/catalog.schema] - show
                 views from current or a
                     specified schema.

                     (DDLs that can be addressed by either our design
                 or Shuyi's DDL design)

                     - CREATE/DROP/ALTER SCHEMA schema
                     - CREATE/DROP/ALTER CATALOG catalog

                  To be modified:

                     - SHOW TABLES [FROM schema/catalog.schema] - show
                 tables from current or
                     a specified schema. Add 'from schema' to existing
                 'SHOW TABLES' statement
                     - SHOW FUNCTIONS [FROM schema/catalog.schema] -
                 show functions from
                     current or a specified schema. Add 'from schema'
                  to the existing 'SHOW FUNCTIONS'
                   statement
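
Put together, a hypothetical SQL Client session using the statements above might look like this (a sketch only; the catalog, schema, and view names are invented for illustration):

```sql
SHOW CATALOGS;                 -- list all registered catalogs
USE hive_cat;                  -- set 'hive_cat' as the default catalog
SHOW SCHEMAS;                  -- list schemas in the default catalog
USE hive_cat.sales_db;         -- set 'sales_db' as the default schema
SHOW TABLES;                   -- tables in the current default schema
SHOW TABLES FROM other_schema; -- tables in a specified schema
DESCRIBE VIEW daily_orders;    -- show the view's definition
```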


                  Thanks, Bowen



                  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
                 <xuef...@alibaba-inc.com
<mailto:xuef...@alibaba-inc.com>>
                  wrote:

                  > Thanks, Bowen, for catching the error. I have
                 granted comment permission
                  > with the link.
                  >
                  > I also updated the doc with the latest class
                 definitions. Everyone is
                  > encouraged to review and comment.
                  >
                  > Thanks,
                  > Xuefu
                  >
                  >
------------------------------------------------------------------
                  > Sender:Bowen Li <bowenl...@gmail.com
                 <mailto:bowenl...@gmail.com>>
                  > Sent at:2018 Nov 14 (Wed) 06:44
                  > Recipient:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > Cc:piotr <pi...@data-artisans.com
                 <mailto:pi...@data-artisans.com>>; dev
                 <dev@flink.apache.org <mailto:dev@flink.apache.org>>;
                 Shuyi
                  > Chen <suez1...@gmail.com <mailto:suez1...@gmail.com

                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
                 Hive ecosystem
                  >
                  > Hi Xuefu,
                  >
                   > Currently the new design doc
                   > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
                   > is on "view only" mode, and people cannot leave
                  comments. Can you please
                   > change it to "can comment" or "can edit" mode?
                  >
                  > Thanks, Bowen
                  >
                  >
                  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu
                 <xuef...@alibaba-inc.com
<mailto:xuef...@alibaba-inc.com>>
                  > wrote:
                  > Hi Piotr
                  >
                   > I have extracted the API portion of the design and
                  the google doc is here
                   > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
                   > Please review and provide your feedback.
                  >
                  > Thanks,
                  > Xuefu
                  >
                  >
------------------------------------------------------------------
                  > Sender:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > Sent at:2018 Nov 12 (Mon) 12:43
                  > Recipient:Piotr Nowojski <pi...@data-artisans.com
                 <mailto:pi...@data-artisans.com>>; dev <
                  > dev@flink.apache.org <mailto:dev@flink.apache.org>>
                  > Cc:Bowen Li <bowenl...@gmail.com
                 <mailto:bowenl...@gmail.com>>; Shuyi Chen
                 <suez1...@gmail.com <mailto:suez1...@gmail.com>>
                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
                 Hive ecosystem
                  >
                  > Hi Piotr,
                  >
                   > That sounds good to me. Let's close all the open
                  questions (there are a
                   > couple of them) in the Google doc, and I should be
                  able to quickly split
                   > it into the three proposals as you suggested.
                  >
                  > Thanks,
                  > Xuefu
                  >
                  >
------------------------------------------------------------------
                  > Sender:Piotr Nowojski <pi...@data-artisans.com
                 <mailto:pi...@data-artisans.com>>
                  > Sent at:2018 Nov 9 (Fri) 22:46
                  > Recipient:dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>; Xuefu
                 <xuef...@alibaba-inc.com
<mailto:xuef...@alibaba-inc.com>>
                  > Cc:Bowen Li <bowenl...@gmail.com
                 <mailto:bowenl...@gmail.com>>; Shuyi Chen
                 <suez1...@gmail.com <mailto:suez1...@gmail.com>>
                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
                 Hive ecosystem
                  >
                  > Hi,
                  >
                  >
                   > Yes, it seems like the best solution. Maybe someone
                  else can also suggest whether we can split it
                  further: maybe changes in the interface in one doc,
                  reading from the Hive metastore in another, and
                  finally storing our meta information in the Hive
                  metastore?
                  >
                  > Piotrek
                  >
                  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
                 <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>> wrote:
                  > >
                  > > Hi Piotr,
                  > >
                   > > That seems to be a good idea!
                  > >
                  >
                  > > Since the google doc for the design is currently
                 under extensive review, I will leave it as it is for
                 now. However, I'll convert it to two different FLIPs
                 when the time comes.
                  > >
                  > > How does it sound to you?
                  > >
                  > > Thanks,
                  > > Xuefu
                  > >
                  > >
                  > >
------------------------------------------------------------------
                  > > Sender:Piotr Nowojski <pi...@data-artisans.com
                 <mailto:pi...@data-artisans.com>>
                  > > Sent at:2018 Nov 9 (Fri) 02:31
                  > > Recipient:dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>
                  > > Cc:Bowen Li <bowenl...@gmail.com
                 <mailto:bowenl...@gmail.com>>; Xuefu
                 <xuef...@alibaba-inc.com
<mailto:xuef...@alibaba-inc.com>
                  > >; Shuyi Chen <suez1...@gmail.com
                 <mailto:suez1...@gmail.com>>
                  > > Subject:Re: [DISCUSS] Integrate Flink SQL well
                 with Hive ecosystem
                  > >
                  > > Hi,
                  > >
                  >
                   > > Maybe we should split this topic (and the design
                  doc) into a couple of smaller ones, hopefully
                  independent. The questions that you have asked
                  Fabian, for example, have very little to do with
                  reading metadata from the Hive Meta Store.
                  > >
                  > > Piotrek
                  > >
                  > >> On 7 Nov 2018, at 14:27, Fabian Hueske
                 <fhue...@gmail.com <mailto:fhue...@gmail.com>> wrote:
                  > >>
                  > >> Hi Xuefu and all,
                  > >>
                  > >> Thanks for sharing this design document!
                  >
                  > >> I'm very much in favor of restructuring /
                 reworking the catalog handling in
                  > >> Flink SQL as outlined in the document.
                  >
                  > >> Most changes described in the design document
                 seem to be rather general and
                  > >> not specifically related to the Hive integration.
                  > >>
                  >
                  > >> IMO, there are some aspects, especially those at
                 the boundary of Hive and
                  > >> Flink, that need a bit more discussion. For
example
                  > >>
                  > >> * What does it take to make Flink schema
                 compatible with Hive schema?
                  > >> * How will Flink tables (descriptors) be stored
                 in HMS?
                  > >> * How do both Hive catalogs differ? Could they
                  be integrated into a
                  > >> single one? When to use which one?
                  >
                  > >> * What meta information is provided by HMS? What
                 of this can be leveraged
                  > >> by Flink?
                  > >>
                  > >> Thank you,
                  > >> Fabian
                  > >>
                   > >> On Fri, 2 Nov 2018 at 00:31, Bowen Li
                  <bowenl...@gmail.com <mailto:bowenl...@gmail.com>>
                   > >> wrote:
                  > >>
                  > >>> After taking a look at how other discussion
                 threads work, I think it's
                   > >>> actually fine to just keep our discussion here.
                 It's up to you, Xuefu.
                  > >>>
                  > >>> The google doc LGTM. I left some minor comments.
                  > >>>
                  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
                 <bowenl...@gmail.com <mailto:bowenl...@gmail.com>>
wrote:
                  > >>>
                  > >>>> Hi all,
                  > >>>>
                  > >>>> As Xuefu has published the design doc on
                 google, I agree with Shuyi's
                  >
                  > >>>> suggestion that we probably should start a new
                 email thread like "[DISCUSS]
                  >
                   > >>>> ... Hive integration design ..." on the dev
                  mailing list only, for community
                  > >>>> devs to review. The current thread sends to
                 both dev and user list.
                  > >>>>
                  >
                  > >>>> This email thread is more like validating the
                 general idea and direction
                  >
                  > >>>> with the community, and it's been pretty long
                 and crowded so far. Since
                  >
                   > >>>> everyone is in favor of the idea, we can move
                 forward with another thread to
                  > >>>> discuss and finalize the design.
                  > >>>>
                  > >>>> Thanks,
                  > >>>> Bowen
                  > >>>>
                  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
                  > xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > >>>> wrote:
                  > >>>>
                   > >>>>> Hi Shuyi,
                  > >>>>>
                  >
                  > >>>>> Good idea. Actually the PDF was converted
                 from a google doc. Here is its
                  > >>>>> link:
                  > >>>>>
                  > >>>>>
                  >

https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
                  > >>>>> Once we reach an agreement, I can convert it
                 to a FLIP.
                  > >>>>>
                  > >>>>> Thanks,
                  > >>>>> Xuefu
                  > >>>>>
                  > >>>>>
                  > >>>>>
                  > >>>>>
------------------------------------------------------------------
                  > >>>>> Sender:Shuyi Chen <suez1...@gmail.com
                 <mailto:suez1...@gmail.com>>
                  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
                  > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > >>>>> Cc:vino yang <yanghua1...@gmail.com
                 <mailto:yanghua1...@gmail.com>>; Fabian Hueske <
                  > fhue...@gmail.com <mailto:fhue...@gmail.com>>;
                  > >>>>> dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>; user
                 <u...@flink.apache.org <mailto:u...@flink.apache.org>>
                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                 well with Hive ecosystem
                  > >>>>>
                  > >>>>> Hi Xuefu,
                  > >>>>>
                  >
                   > >>>>> Thanks a lot for driving this big effort. I
                  would suggest converting your
                   >
                   > >>>>> proposal and design doc into a google doc,
                  and sharing it on the dev mailing
                  >
                  > >>>>> list for the community to review and comment
                 with title like "[DISCUSS] ...
                  >
                  > >>>>> Hive integration design ..." . Once
                 approved,  we can document it as a FLIP
                  >
                  > >>>>> (Flink Improvement Proposals), and use JIRAs
                 to track the implementations.
                  > >>>>> What do you think?
                  > >>>>>
                  > >>>>> Shuyi
                  > >>>>>
                  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
                  > xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > >>>>> wrote:
                  > >>>>> Hi all,
                  > >>>>>
                  > >>>>> I have also shared a design doc on Hive
                 metastore integration that is
                  >
                  > >>>>> attached here and also to FLINK-10556[1].
                 Please kindly review and share
                  > >>>>> your feedback.
                  > >>>>>
                  > >>>>>
                  > >>>>> Thanks,
                  > >>>>> Xuefu
                  > >>>>>
                  > >>>>> [1]
https://issues.apache.org/jira/browse/FLINK-10556
                  > >>>>>
------------------------------------------------------------------
                  > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
                  > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>; Shuyi Chen <
                  > >>>>> suez1...@gmail.com <mailto:suez1...@gmail.com

                  > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com
                 <mailto:yanghua1...@gmail.com>>; Fabian Hueske <
                  > fhue...@gmail.com <mailto:fhue...@gmail.com>>;
                  > >>>>> dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>; user
                 <u...@flink.apache.org <mailto:u...@flink.apache.org>>
                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                 well with Hive ecosystem
                  > >>>>>
                  > >>>>> Hi all,
                  > >>>>>
                  > >>>>> To wrap up the discussion, I have attached a
                 PDF describing the
                  >
                  > >>>>> proposal, which is also attached to
                 FLINK-10556 [1]. Please feel free to
                  > >>>>> watch that JIRA to track the progress.
                  > >>>>>
                  > >>>>> Please also let me know if you have
                 additional comments or questions.
                  > >>>>>
                  > >>>>> Thanks,
                  > >>>>> Xuefu
                  > >>>>>
                  > >>>>> [1]
https://issues.apache.org/jira/browse/FLINK-10556
                  > >>>>>
                  > >>>>>
                  > >>>>>
------------------------------------------------------------------
                  > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>
                  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
                  > >>>>> Recipient:Shuyi Chen <suez1...@gmail.com
                 <mailto:suez1...@gmail.com>>
                  > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com
                 <mailto:yanghua1...@gmail.com>>; Fabian Hueske <
                  > fhue...@gmail.com <mailto:fhue...@gmail.com>>;
                  > >>>>> dev <dev@flink.apache.org
                 <mailto:dev@flink.apache.org>>; user
                 <u...@flink.apache.org <mailto:u...@flink.apache.org>>
                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
                 well with Hive ecosystem
                  > >>>>>
                  > >>>>> Hi Shuyi,
                  > >>>>>
                  >
                   > >>>>> Thank you for your input. Yes, I agree with
                  a phased approach and would like
                  >
                   > >>>>> to move forward fast. :) We did some work
                  internally on DDL utilizing the Babel
                   > >>>>> parser in Calcite. While Babel makes
                  Calcite's grammar extensible, at
                   > >>>>> first impression it still seems too
                  cumbersome for a project when too
                   >
                   > >>>>> many extensions are made. It's even
                  challenging to find where the extension
                  >
                   > >>>>> is needed! It would certainly be better if
                  Calcite could magically support
                  >
                  > >>>>> Hive QL by just turning on a flag, such as
                 that for MYSQL_5. I can also
                  >
                  > >>>>> see that this could mean a lot of work on
                 Calcite. Nevertheless, I will
                  >
                   > >>>>> bring up the discussion over there to see
                 what their community thinks.

Would you mind sharing more info about the proposal on DDL that you mentioned? We can certainly collaborate on this.

Thanks,
Xuefu

------------------------------------------------------------------
Sender: Shuyi Chen <suez1...@gmail.com>
Sent at: 2018 Oct 14 (Sun) 08:30
Recipient: Xuefu <xuef...@alibaba-inc.com>
Cc: yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into two stages: making Flink support Hive features, and making Hive work with Flink. I agree with Timo on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress, and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks,
Shuyi

On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
Hi Fabian/Vino,

Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but it can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction is true as well.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.
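To make area 5 a bit more concrete: a first cut at mapping Hive's primitive types onto Flink SQL types might look like the sketch below. This is plain Python and purely illustrative — the mapping table and function are assumptions for the sake of example, not Flink's actual implementation:

```python
# Hypothetical Hive -> Flink SQL primitive type mapping (illustration only).
HIVE_TO_FLINK = {
    "TINYINT": "TINYINT", "SMALLINT": "SMALLINT", "INT": "INT",
    "BIGINT": "BIGINT", "FLOAT": "FLOAT", "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN", "STRING": "VARCHAR", "BINARY": "VARBINARY",
    "TIMESTAMP": "TIMESTAMP", "DATE": "DATE",
}

def to_flink_type(hive_type: str) -> str:
    """Translate a Hive primitive type name to a Flink SQL type name.
    Parameterized types such as DECIMAL(10,2) or CHAR(8) keep their
    precision/length unchanged; unknown types raise ValueError."""
    base, sep, params = hive_type.upper().partition("(")
    base = base.strip()
    if base in ("DECIMAL", "VARCHAR", "CHAR"):
        return base + sep + params  # pass precision/length through
    if base in HIVE_TO_FLINK:
        return HIVE_TO_FLINK[base]
    raise ValueError(f"no Flink mapping defined for Hive type {hive_type!r}")
```

Composite types (ARRAY, MAP, STRUCT) would need a recursive translation on top of this, which is where most of the real effort in area 5 would lie.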

As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu

------------------------------------------------------------------
Sender: vino yang <yanghua1...@gmail.com>
Sent at: 2018 Oct 11 (Thu) 09:45
Recipient: Fabian Hueske <fhue...@gmail.com>
Cc: dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <u...@flink.apache.org>
Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.

Thanks, vino.
Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian
                  > >>>>>
                  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb
                 Zhang, Xuefu <
                  > >>>>> xuef...@alibaba-inc.com
                 <mailto:xuef...@alibaba-inc.com>>:
Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but a lot of effort is still needed.

We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with the Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.

I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.

Regards,

Xuefu

References:

[1] https://issues.apache.org/jira/browse/HIVE-10712

[2] Xuefu Zhang is a long-time open source veteran who has worked or is working on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where the projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.

--
"So you have to trust that the dots will somehow connect in your future."



