Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Eron Wright Thu, 03 Jan 2019 09:37:51 -0800

Would a couple folks raise their hand to make a review pass thru the 6 PRs
listed above?  It is a lovely stack of PRs that is 'all green' at the
moment.   I would be happy to open follow-on PRs to rapidly align with
other efforts.


Note that the code is agnostic to the details of the ExternalCatalog
interface; the code would not be obsolete if/when the catalog interface is
enhanced as per the design doc.
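
To illustrate what "agnostic" means here, a minimal sketch follows. It is
not the code in the PRs: every name in it (CatalogDescriptor,
register_factory, Environment, the "catalog.type" property key, the
"taxi-data" type and its path option) is hypothetical and chosen only for
illustration. The point is that the wiring code turns a descriptor into
string properties, picks a factory by type, and stores whatever the factory
returns without calling any method on it.

```python
from typing import Callable, Dict

# Hypothetical factory registry, keyed by the "catalog.type" property.
CatalogFactory = Callable[[Dict[str, str]], object]
FACTORIES: Dict[str, CatalogFactory] = {}


def register_factory(catalog_type: str, factory: CatalogFactory) -> None:
    """Make a catalog factory discoverable under the given type name."""
    FACTORIES[catalog_type] = factory


class CatalogDescriptor:
    """Flattens a catalog definition into plain string properties."""

    def __init__(self, catalog_type: str, **options: str) -> None:
        self.catalog_type = catalog_type
        self.options = options

    def to_properties(self) -> Dict[str, str]:
        props = {"catalog.type": self.catalog_type}
        for key, value in self.options.items():
            props["catalog." + key] = value
        return props


class Environment:
    """Stands in for the table environment; it never inspects the catalog."""

    def __init__(self) -> None:
        self.catalogs: Dict[str, object] = {}

    def register_catalog(self, name: str, descriptor: CatalogDescriptor) -> None:
        props = descriptor.to_properties()
        factory = FACTORIES.get(props["catalog.type"])
        if factory is None:
            raise KeyError("no catalog factory for type " + props["catalog.type"])
        # The catalog is stored as an opaque object, so changes to the
        # catalog interface do not affect this wiring code.
        self.catalogs[name] = factory(props)


class TaxiDataCatalog:
    """Hypothetical catalog implementation; its interface is irrelevant here."""

    def __init__(self, props: Dict[str, str]) -> None:
        self.props = dict(props)


register_factory("taxi-data", TaxiDataCatalog)
env = Environment()
env.register_catalog("nyc", CatalogDescriptor("taxi-data", path="/data/taxi"))
```

Under these assumptions, enhancing the catalog interface would only touch
classes like TaxiDataCatalog, while the descriptor, the factory lookup, and
the registration call stay as they are.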



On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <[email protected]> wrote:

> I propose that the community review and merge the PRs that I posted, and
> then evolve the design thru 1.8 and beyond.   I think having a basic
> infrastructure in place now will accelerate the effort, do you agree?
>
> Thanks again!
>
> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <[email protected]>
> wrote:
>
>> Hi Eron,
>>
>> Happy New Year!
>>
>> Thank you very much for your contribution, especially during the
>> holidays. While I'm encouraged by your work, I'd also like to share my
>> thoughts on how to move forward.
>>
>> First, please note that the design discussion is still being finalized,
>> and we expect some moderate changes, especially around TableFactories.
>> Another pending change is our decision to move away from Scala, which
>> will also impact our work.
>>
>> Secondly, while your work seems to be about plugging catalog definitions
>> into the execution environment, which is less affected by the TableFactory
>> change, I did notice some duplication between your work and ours. This is
>> no big deal, but going forward we should probably communicate better about
>> work assignments to avoid any further duplication. On the other hand, I
>> think some of your work is interesting and valuable for inclusion once we
>> finalize the overall design.
>>
>> Thus, please continue your research and experiments, and let us know when
>> you start working on anything so we can better coordinate.
>>
>> Thanks again for your interest and contributions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> From:Eron Wright <[email protected]>
>> Sent At:2019 Jan. 1 (Tue.) 18:39
>> To:dev <[email protected]>; Xuefu <[email protected]>
>> Cc:Xiaowei Jiang <[email protected]>; twalthr <[email protected]>;
>> piotr <[email protected]>; Fabian Hueske <[email protected]>;
>> suez1224 <[email protected]>; Bowen Li <[email protected]>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi folks, there are clearly some incremental steps to be taken to
>> introduce catalog support to SQL Client, complementary to what is proposed
>> in the Flink-Hive Metastore design doc.  I was quietly working on this over
>> the holidays.   I posted some new sub-tasks, PRs, and sample code
>> to FLINK-10744.
>>
>> What inspired me to get involved is that the catalog interface seems like
>> a great way to encapsulate a 'library' of Flink tables and functions.  For
>> example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be
>> nicely encapsulated as a catalog (TaxiData).   Such a library should be
>> fully consumable in SQL Client.
>>
>> I implemented the above.  Some highlights:
>>
>> 1. A fully-worked example of using the Taxi dataset in SQL Client via an
>> environment file.
>> - an ASCII video showing the SQL Client in action:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>
>> - the corresponding environment file (will be even more concise once
>> 'FLINK-10696 Catalog UDFs' is merged):
>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>>
>> - the typed API for standalone table applications:
>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
>>
>> 2. Implementation of the core catalog descriptor and factory.  I realize
>> that some renames may later occur as per the design doc, and would be happy
>> to do that as a follow-up.
>> https://github.com/apache/flink/pull/7390
>>
>> 3. Implementation of a connect-style API on TableEnvironment to use
>> catalog descriptor.
>> https://github.com/apache/flink/pull/7392
>>
>> 4. Integration into SQL-Client's environment file:
>> https://github.com/apache/flink/pull/7393
>>
>> I realize that the overall Hive integration is still evolving, but I
>> believe that these PRs are a good stepping stone. Here's the list (in
>> bottom-up order):
>> - https://github.com/apache/flink/pull/7386
>> - https://github.com/apache/flink/pull/7388
>> - https://github.com/apache/flink/pull/7389
>> - https://github.com/apache/flink/pull/7390
>> - https://github.com/apache/flink/pull/7392
>> - https://github.com/apache/flink/pull/7393
>>
>> Thanks and enjoy 2019!
>> Eron W
>>
>>
>> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <[email protected]>
>> wrote:
>> Hi Xiaowei,
>>
>> Thanks for bringing up the question. In the current design, the
>> properties for meta objects are meant to cover anything that's specific to
>> a particular catalog and agnostic to Flink. Anything that is common (such
>> as the schema for tables, the query text for views, and the UDF class name)
>> is abstracted as a member of the respective class. However, this is still
>> under discussion, and Timo and I will go over this and provide an update.
>>
>> Please note that UDF is a little more involved than what the current
>> design doc shows. I'm still refining this part.
>>
>> Thanks,
>> Xuefu
>>
>>
>> ------------------------------------------------------------------
>> Sender:Xiaowei Jiang <[email protected]>
>> Sent at:2018 Nov 18 (Sun) 15:17
>> Recipient:dev <[email protected]>
>> Cc:Xuefu <[email protected]>; twalthr <[email protected]>;
>> piotr <[email protected]>; Fabian Hueske <[email protected]>;
>> suez1224 <[email protected]>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Thanks Xuefu for the detailed design doc! One question on the properties
>> associated with the catalog objects: are we going to leave them completely
>> free-form, or are we going to set some standard for them? I think that the
>> answer may depend on whether we want to explore catalog-specific
>> optimization opportunities. In any case, I think it would be helpful to
>> standardize as much as possible into strongly typed classes and leave
>> these properties for catalog-specific things. But I think that we can do
>> it in steps.
>>
>> Xiaowei
>> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <[email protected]> wrote:
>> Thanks for keeping on improving the overall design, Xuefu! It looks quite
>>  good to me now.
>>
>>  It would be nice if the cc'ed Flink committers could help review and
>>  confirm!
>>
>>
>>
>>  One minor suggestion: since the last section of the design doc already
>>  touches on some new SQL statements, shall we add another section to our
>>  doc and formalize the new SQL statements in SQL Client and
>>  TableEnvironment that will come along naturally with our design? Here are
>>  some that the design doc mentioned and some that I came up with:
>>
>>  To be added:
>>
>>     - USE <catalog> - set default catalog
>>     - USE <catalog.schema> - set default schema
>>     - SHOW CATALOGS - show all registered catalogs
>>     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>>     catalog or the specified catalog
>>     - DESCRIBE VIEW view - show the view's definition in CatalogView
>>     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current
>>     or a specified schema.
>>
>>     (DDLs that can be addressed by either our design or Shuyi's DDL
>>     design)
>>
>>     - CREATE/DROP/ALTER SCHEMA schema
>>     - CREATE/DROP/ALTER CATALOG catalog
>>
>>  To be modified:
>>
>>     - SHOW TABLES [FROM schema/catalog.schema] - show tables from current
>>     or a specified schema. Add 'FROM schema' to the existing 'SHOW TABLES'
>>     statement
>>     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
>>     current or a specified schema. Add 'FROM schema' to the existing 'SHOW
>>     FUNCTIONS' statement
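
A minimal sketch of how a default catalog and schema could drive name
resolution for the USE and SHOW statements listed above. This is a toy
model, not Flink code: the CatalogRegistry class, its method names, and the
sample catalogs and schemas are hypothetical, and resetting the default
schema on a plain USE <catalog> is an assumption that the thread does not
settle.

```python
from typing import Dict, List, Optional


class CatalogRegistry:
    """Toy model: catalog name -> schema name -> list of table names."""

    def __init__(self, catalogs: Dict[str, Dict[str, List[str]]]) -> None:
        self.catalogs = catalogs
        self.default_catalog: Optional[str] = None
        self.default_schema: Optional[str] = None

    def use(self, path: str) -> None:
        """USE <catalog> or USE <catalog.schema>."""
        catalog, _, schema = path.partition(".")
        if catalog not in self.catalogs:
            raise KeyError("unknown catalog: " + catalog)
        if schema and schema not in self.catalogs[catalog]:
            raise KeyError("unknown schema: " + path)
        self.default_catalog = catalog
        # Assumption: USE <catalog> alone clears the default schema.
        self.default_schema = schema or None

    def show_catalogs(self) -> List[str]:
        """SHOW CATALOGS."""
        return sorted(self.catalogs)

    def show_schemas(self, catalog: Optional[str] = None) -> List[str]:
        """SHOW SCHEMAS [FROM catalog]."""
        catalog = catalog or self.default_catalog
        if catalog is None:
            raise ValueError("no default catalog set")
        return sorted(self.catalogs[catalog])

    def show_tables(self, source: Optional[str] = None) -> List[str]:
        """SHOW TABLES [FROM schema | catalog.schema]."""
        if source is None:
            catalog, schema = self.default_catalog, self.default_schema
        elif "." in source:
            catalog, schema = source.split(".", 1)
        else:
            catalog, schema = self.default_catalog, source
        if catalog is None or schema is None:
            raise ValueError("no default catalog or schema set")
        return sorted(self.catalogs[catalog][schema])


registry = CatalogRegistry({
    "hive": {"default": ["orders", "users"], "staging": ["orders_raw"]},
    "taxidata": {"nyc": ["TaxiFares", "TaxiRides"]},
})
registry.use("hive.default")
print(registry.show_tables())                # tables in hive.default
print(registry.show_tables("staging"))       # schema in the default catalog
print(registry.show_tables("taxidata.nyc"))  # fully qualified schema
```

SHOW VIEWS and SHOW FUNCTIONS would presumably resolve their optional FROM
clause the same way.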
>>
>>
>>  Thanks, Bowen
>>
>>
>>
>>  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <[email protected]>
>>  wrote:
>>
>>  > Thanks, Bowen, for catching the error. I have granted comment
>>  > permission with the link.
>>  >
>>  > I also updated the doc with the latest class definitions. Everyone is
>>  > encouraged to review and comment.
>>  >
>>  > Thanks,
>>  > Xuefu
>>  >
>>  > ------------------------------------------------------------------
>>  > Sender:Bowen Li <[email protected]>
>>  > Sent at:2018 Nov 14 (Wed) 06:44
>>  > Recipient:Xuefu <[email protected]>
>>  > Cc:piotr <[email protected]>; dev <[email protected]>; Shuyi
>>  > Chen <[email protected]>
>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  >
>>  > Hi Xuefu,
>>  >
>>  > Currently the new design doc
>>  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
>>  > is in "view only" mode, and people cannot leave comments. Can you
>>  > please change it to "can comment" or "can edit" mode?
>>  >
>>  > Thanks, Bowen
>>  >
>>  >
>>  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <[email protected]>
>>  > wrote:
>>  > Hi Piotr
>>  >
>>  > I have extracted the API portion of the design and the Google doc is
>>  > here
>>  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
>>  > Please review and provide your feedback.
>>  >
>>  > Thanks,
>>  > Xuefu
>>  >
>>  > ------------------------------------------------------------------
>>  > Sender:Xuefu <[email protected]>
>>  > Sent at:2018 Nov 12 (Mon) 12:43
>>  > Recipient:Piotr Nowojski <[email protected]>;
>>  > dev <[email protected]>
>>  > Cc:Bowen Li <[email protected]>; Shuyi Chen <[email protected]>
>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  >
>>  > Hi Piotr,
>>  >
>>  > That sounds good to me. Let's close all the open questions (there are
>>  > a couple of them) in the Google doc and I should be able to quickly
>>  > split it into the three proposals as you suggested.
>>  >
>>  > Thanks,
>>  > Xuefu
>>  >
>>  > ------------------------------------------------------------------
>>  > Sender:Piotr Nowojski <[email protected]>
>>  > Sent at:2018 Nov 9 (Fri) 22:46
>>  > Recipient:dev <[email protected]>; Xuefu <[email protected]>
>>  > Cc:Bowen Li <[email protected]>; Shuyi Chen <[email protected]>
>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  >
>>  > Hi,
>>  >
>>  >
>>  > Yes, it seems like the best solution. Maybe someone else can also
>>  > suggest whether we can split it further? Maybe changes in the interface
>>  > in one doc, reading from the Hive metastore in another, and finally
>>  > storing our meta information in the Hive metastore in a third?
>>  >
>>  > Piotrek
>>  >
>>  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <[email protected]> wrote:
>>  > >
>>  > > Hi Piotr,
>>  > >
>>  > > That seems to be a good idea!
>>  > >
>>  > > Since the Google doc for the design is currently under extensive
>>  > > review, I will leave it as it is for now. However, I'll convert it to
>>  > > two different FLIPs when the time comes.
>>  > >
>>  > > How does that sound to you?
>>  > >
>>  > > Thanks,
>>  > > Xuefu
>>  > >
>>  > >
>>  > > ------------------------------------------------------------------
>>  > > Sender:Piotr Nowojski <[email protected]>
>>  > > Sent at:2018 Nov 9 (Fri) 02:31
>>  > > Recipient:dev <[email protected]>
>>  > > Cc:Bowen Li <[email protected]>; Xuefu <[email protected]>;
>>  > > Shuyi Chen <[email protected]>
>>  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >
>>  > > Hi,
>>  > >
>>  > > Maybe we should split this topic (and the design doc) into a couple of
>>  > > smaller ones, hopefully independent. The questions that you have
>>  > > asked, Fabian, have for example very little to do with reading
>>  > > metadata from the Hive metastore.
>>  > >
>>  > > Piotrek
>>  > >
>>  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <[email protected]> wrote:
>>  > >>
>>  > >> Hi Xuefu and all,
>>  > >>
>>  > >> Thanks for sharing this design document!
>>  > >> I'm very much in favor of restructuring / reworking the catalog
>>  > >> handling in Flink SQL as outlined in the document.
>>  > >> Most changes described in the design document seem to be rather
>>  > >> general and not specifically related to the Hive integration.
>>  > >>
>>  > >> IMO, there are some aspects, especially those at the boundary of
>>  > >> Hive and Flink, that need a bit more discussion. For example
>>  > >>
>>  > >> * What does it take to make Flink schema compatible with Hive
>>  > >> schema?
>>  > >> * How will Flink tables (descriptors) be stored in HMS?
>>  > >> * How do both Hive catalogs differ? Could they be integrated into
>>  > >> a single one? When to use which one?
>>  > >> * What meta information is provided by HMS? What of this can be
>>  > >> leveraged by Flink?
>>  > >>
>>  > >> Thank you,
>>  > >> Fabian
>>  > >>
>>  > >> On Fri, Nov 2, 2018 at 12:31 AM Bowen Li <[email protected]> wrote:
>>  > >>
>>  > >>> After taking a look at how other discussion threads work, I think
>>  > >>> it's actually fine to just keep our discussion here. It's up to you,
>>  > >>> Xuefu.
>>  > >>>
>>  > >>> The google doc LGTM. I left some minor comments.
>>  > >>>
>>  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <[email protected]> wrote:
>>  > >>>
>>  > >>>> Hi all,
>>  > >>>>
>>  > >>>> As Xuefu has published the design doc on Google, I agree with
>>  > >>>> Shuyi's suggestion that we probably should start a new email thread
>>  > >>>> like "[DISCUSS] ... Hive integration design ..." on only the dev
>>  > >>>> mailing list for community devs to review. The current thread is
>>  > >>>> sent to both the dev and user lists.
>>  > >>>>
>>  > >>>> This email thread is more about validating the general idea and
>>  > >>>> direction with the community, and it's been pretty long and crowded
>>  > >>>> so far. Since everyone is in favor of the idea, we can move forward
>>  > >>>> with another thread to discuss and finalize the design.
>>  > >>>>
>>  > >>>> Thanks,
>>  > >>>> Bowen
>>  > >>>>
>>  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <[email protected]>
>>  > >>>> wrote:
>>  > >>>>
>>  > >>>>> Hi Shuyi,
>>  > >>>>>
>>  > >>>>> Good idea. Actually the PDF was converted from a Google doc. Here
>>  > >>>>> is its link:
>>  > >>>>>
>>  > >>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> ------------------------------------------------------------------
>>  > >>>>> Sender:Shuyi Chen <[email protected]>
>>  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>  > >>>>> Recipient:Xuefu <[email protected]>
>>  > >>>>> Cc:vino yang <[email protected]>; Fabian Hueske <[email protected]>;
>>  > >>>>> dev <[email protected]>; user <[email protected]>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >>>>>
>>  > >>>>> Hi Xuefu,
>>  > >>>>>
>>  > >>>>> Thanks a lot for driving this big effort. I would suggest converting
>>  > >>>>> your proposal and design doc into a Google doc and sharing it on the
>>  > >>>>> dev mailing list for the community to review and comment, with a
>>  > >>>>> title like "[DISCUSS] ... Hive integration design ...". Once
>>  > >>>>> approved, we can document it as a FLIP (Flink Improvement Proposal)
>>  > >>>>> and use JIRAs to track the implementation.
>>  > >>>>> What do you think?
>>  > >>>>>
>>  > >>>>> Shuyi
>>  > >>>>>
>>  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <[email protected]>
>>  > >>>>> wrote:
>>  > >>>>> Hi all,
>>  > >>>>>
>>  > >>>>> I have also shared a design doc on Hive metastore integration that
>>  > >>>>> is attached here and also to FLINK-10556 [1]. Please kindly review
>>  > >>>>> and share your feedback.
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>  > >>>>>
>>  > >>>>> ------------------------------------------------------------------
>>  > >>>>> Sender:Xuefu <[email protected]>
>>  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>  > >>>>> Recipient:Xuefu <[email protected]>; Shuyi Chen <[email protected]>
>>  > >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <[email protected]>;
>>  > >>>>> dev <[email protected]>; user <[email protected]>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >>>>>
>>  > >>>>> Hi all,
>>  > >>>>>
>>  > >>>>> To wrap up the discussion, I have attached a PDF describing the
>>  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel
>>  > >>>>> free to watch that JIRA to track the progress.
>>  > >>>>>
>>  > >>>>> Please also let me know if you have additional comments or
>>  > >>>>> questions.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> ------------------------------------------------------------------
>>  > >>>>> Sender:Xuefu <[email protected]>
>>  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>  > >>>>> Recipient:Shuyi Chen <[email protected]>
>>  > >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <[email protected]>;
>>  > >>>>> dev <[email protected]>; user <[email protected]>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >>>>>
>>  > >>>>> Hi Shuyi,
>>  > >>>>>
>>  > >>>>> Thank you for your input. Yes, I agree with a phased approach and
>>  > >>>>> would like to move forward fast. :) We did some work internally on
>>  > >>>>> DDL utilizing the babel parser in Calcite. While babel makes
>>  > >>>>> Calcite's grammar extensible, at first impression it still seems too
>>  > >>>>> cumbersome for a project when too many extensions are made. It's
>>  > >>>>> even challenging to find where the extension is needed! It would
>>  > >>>>> certainly be better if Calcite could magically support Hive QL by
>>  > >>>>> just turning on a flag, such as that for MYSQL_5. I can also see
>>  > >>>>> that this could mean a lot of work on Calcite. Nevertheless, I will
>>  > >>>>> bring up the discussion over there to see what their community
>>  > >>>>> thinks.
>>  > >>>>>
>>  > >>>>> Would you mind sharing more info about the proposal on DDL that you
>>  > >>>>> mentioned? We can certainly collaborate on this.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> ------------------------------------------------------------------
>>  > >>>>> Sender:Shuyi Chen <[email protected]>
>>  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>  > >>>>> Recipient:Xuefu <[email protected]>
>>  > >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <[email protected]>;
>>  > >>>>> dev <[email protected]>; user <[email protected]>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >>>>>
>>  > >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>  > >>>>> think the proposal can be divided into 2 stages: making Flink
>>  > >>>>> support Hive features, and making Hive work with Flink. I agree with
>>  > >>>>> Timo on starting with a smaller scope, so we can make progress
>>  > >>>>> faster. As for [6], a proposal for DDL is already in progress, and
>>  > >>>>> will come after the unified SQL connector API is done. For
>>  > >>>>> supporting Hive syntax, we might need to work with the Calcite
>>  > >>>>> community, and a recent effort called babel
>>  > >>>>> (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite
>>  > >>>>> might help here.
>>  > >>>>>
>>  > >>>>> Thanks
>>  > >>>>> Shuyi
>>  > >>>>>
>>  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <[email protected]>
>>  > >>>>> wrote:
>>  > >>>>> Hi Fabian/Vino,
>>  > >>>>>
>>  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I
>>  > >>>>> didn't see Fabian's email until I read Vino's response just now.
>>  > >>>>> (Somehow Fabian's went to the spam folder.)
>>  > >>>>>
>>  > >>>>> My proposal contains long-term and short-term goals. Nevertheless,
>>  > >>>>> the effort will focus on the following areas, including Fabian's
>>  > >>>>> list:
>>  > >>>>>
>>  > >>>>> 1. Hive metastore connectivity - This covers both read and write
>>  > >>>>> access, which means Flink can make full use of Hive's metastore as
>>  > >>>>> its catalog (at least for batch, but this can be extended to
>>  > >>>>> streaming as well).
>>  > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions,
>>  > >>>>> etc.) created by Hive can be understood by Flink, and the reverse is
>>  > >>>>> also true.
>>  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>>  > >>>>> consumed by Flink and vice versa.
>>  > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
>>  > >>>>> provides its own implementation or makes Hive's implementation work
>>  > >>>>> in Flink. Further, for user-created UDFs in Hive, Flink SQL should
>>  > >>>>> provide a mechanism allowing users to import them into Flink without
>>  > >>>>> any code change required.
>>  > >>>>> 5. Data types - Flink SQL should support all data types that are
>>  > >>>>> available in Hive.
>>  > >>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
>>  > >>>>> SQL2003) with extensions to support Hive's syntax and language
>>  > >>>>> features, around DDL, DML, and SELECT queries.
>>  > >>>>> 7. SQL CLI - This is currently being developed in Flink, but more
>>  > >>>>> effort is needed.
>>  > >>>>> 8. Server - Provide a server that's compatible with Hive's
>>  > >>>>> HiveServer2 in its Thrift APIs, such that HiveServer2 users can
>>  > >>>>> reuse their existing client (such as beeline) but connect to Flink's
>>  > >>>>> Thrift server instead.
>>  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers
>>  > >>>>> for other applications to use to connect to its Thrift server.
>>  > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
>>  > >>>>> storage handlers, etc.
>>  > >>>>> 11. Better task failure tolerance and task scheduling in the Flink
>>  > >>>>> runtime.
>>  > >>>>>
>>  > >>>>> As you can see, achieving all of those requires significant effort
>>  > >>>>> across all layers in Flink. However, a short-term goal could include
>>  > >>>>> only core areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller
>>  > >>>>> scope (such as #3, #6).
>>  > >>>>>
>>  > >>>>> Please share your further thoughts. If we generally agree that this
>>  > >>>>> is the right direction, I could come up with a formal proposal
>>  > >>>>> quickly and then we can follow up with broader discussions.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> ------------------------------------------------------------------
>>  > >>>>> Sender:vino yang <[email protected]>
>>  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>  > >>>>> Recipient:Fabian Hueske <[email protected]>
>>  > >>>>> Cc:dev <[email protected]>; Xuefu <[email protected]>;
>>  > >>>>> user <[email protected]>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >>>>>
>>  > >>>>> Hi Xuefu,
>>  > >>>>>
>>  > >>>>> I appreciate this proposal and, like Fabian, think it would be
>>  > >>>>> better if you could give more details of the plan.
>>  > >>>>>
>>  > >>>>> Thanks, vino.
>>  > >>>>>
>>  > >>>>> On Wed, Oct 10, 2018 at 5:27 PM Fabian Hueske <[email protected]> wrote:
>>  > >>>>> Hi Xuefu,
>>  > >>>>>
>>  > >>>>> Welcome to the Flink community and thanks for starting this
>>  > >>>>> discussion!
>>  > >>>>> Better Hive integration would be really great!
>>  > >>>>> Can you go into details of what you are proposing? I can think of a
>>  > >>>>> couple of ways to improve Flink in that regard:
>>  > >>>>>
>>  > >>>>> * Support for Hive UDFs
>>  > >>>>> * Support for Hive metadata catalog
>>  > >>>>> * Support for HiveQL syntax
>>  > >>>>> * ???
>>  > >>>>>
>>  > >>>>> Best, Fabian
>>  > >>>>>
>>  > >>>>> On Tue, Oct 9, 2018 at 7:22 PM Zhang, Xuefu <[email protected]>
>>  > >>>>> wrote:
>>  > >>>>> Hi all,
>>  > >>>>>
>>  > >>>>> Along with the community's effort, inside Alibaba we have explored
>>  > >>>>> Flink's potential as an execution engine not just for stream
>>  > >>>>> processing but also for batch processing. We are encouraged by our
>>  > >>>>> findings and have initiated our effort to make Flink's SQL
>>  > >>>>> capabilities full-fledged. When comparing what's available in Flink
>>  > >>>>> to the offerings from competitive data processing engines, we
>>  > >>>>> identified a major gap in Flink: good integration with the Hive
>>  > >>>>> ecosystem. This is crucial to the success of Flink SQL and batch due
>>  > >>>>> to the well-established data ecosystem around Hive. Therefore, we
>>  > >>>>> have done some initial work in this direction, but a lot of effort
>>  > >>>>> is still needed.
>>  > >>>>>
>>  > >>>>> We have two strategies in mind. The first one is to make Flink SQL
>>  > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is
>>  > >>>>> similar to the approach that Spark SQL adopted. The second strategy
>>  > >>>>> is to make Hive itself work with Flink, similar to the proposal in
>>  > >>>>> [1]. Each approach has its pros and cons, but they don't need to be
>>  > >>>>> mutually exclusive, with each targeting different users and use
>>  > >>>>> cases. We believe that both will promote a much greater adoption of
>>  > >>>>> Flink beyond stream processing.
>>  > >>>>>
>>  > >>>>> We have been focused on the first approach and would like to
>>  > >>>>> showcase Flink's batch and SQL capabilities with Flink SQL. However,
>>  > >>>>> we have also planned to start strategy #2 as a follow-up effort.
>>  > >>>>>
>>  > >>>>> I'm completely new to Flink (a short bio [2] is below), though many
>>  > >>>>> of my colleagues here at Alibaba are long-time contributors.
>>  > >>>>> Nevertheless, I'd like to share our thoughts and invite your early
>>  > >>>>> feedback. At the same time, I am working on a detailed proposal on
>>  > >>>>> Flink SQL's integration with the Hive ecosystem, which will also be
>>  > >>>>> shared when ready.
>>  > >>>>>
>>  > >>>>> While the ideas are simple, each approach will demand significant
>>  > >>>>> effort, more than what we can afford. Thus, input and contributions
>>  > >>>>> from the communities are very welcome and appreciated.
>>  > >>>>>
>>  > >>>>> Regards,
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>> References:
>>  > >>>>>
>>  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>>  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or
>>  > >>>>> is working on many projects under the Apache Foundation, of which he
>>  > >>>>> is also an honored member. About 10 years ago he worked in the
>>  > >>>>> Hadoop team at Yahoo, where the projects had just gotten started.
>>  > >>>>> Later he worked at Cloudera, initiating and leading the development
>>  > >>>>> of the Hive on Spark project in the communities and across many
>>  > >>>>> organizations. Prior to joining Alibaba, he worked at Uber, where he
>>  > >>>>> promoted Hive on Spark to all of Uber's SQL on Hadoop workload and
>>  > >>>>> significantly improved Uber's cluster efficiency.
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> --
>>  > >>>>> "So you have to trust that the dots will somehow connect in your
>>  > >>>>> future."
>>  > >>>>>
>>  >
>>  >
>>
>>
