Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Eron Wright Tue, 01 Jan 2019 18:40:17 -0800

Hi folks, there are clearly some incremental steps to be taken to introduce
catalog support to SQL Client, complementary to what is proposed in the
Flink-Hive Metastore design doc.  I was quietly working on this over the
holidays and have posted some new sub-tasks, PRs, and sample code
to FLINK-10744.


What inspired me to get involved is that the catalog interface seems like a
great way to encapsulate a 'library' of Flink tables and functions.  For
example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be
nicely encapsulated as a catalog (TaxiData).   Such a library should be
fully consumable in SQL Client.
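
To make the 'library' idea concrete, a toy model may help.  The Python sketch
below is not Flink's catalog API: the class ToyCatalog, its methods, the
column names, and the bounding box used by the UDF are all hypothetical and
chosen only for illustration.  The point is simply that one named object can
bundle tables and functions and be registered as a unit.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyCatalog:
    """A named bundle of table schemas and functions (illustration only)."""
    name: str
    tables: Dict[str, List[str]] = field(default_factory=dict)
    functions: Dict[str, Callable] = field(default_factory=dict)

    def list_tables(self) -> List[str]:
        return sorted(self.tables)

    def list_functions(self) -> List[str]:
        return sorted(self.functions)


def is_in_nyc(lon: float, lat: float) -> bool:
    """Hypothetical UDF: a rough bounding-box check with assumed coordinates."""
    return -74.05 <= lon <= -73.7 and 40.5 <= lat <= 41.0


# Column names are placeholders, not the dataset's real schema.
taxi_data = ToyCatalog(
    name="TaxiData",
    tables={
        "TaxiRides": ["rideId", "startLon", "startLat"],
        "TaxiFares": ["rideId", "totalFare"],
    },
    functions={"isInNYC": is_in_nyc},
)

# A session registers the whole library under a single name.
catalogs = {taxi_data.name: taxi_data}
```

Looking up catalogs["TaxiData"] then gives access to every table and function
of the library at once, which is the property that makes a catalog a
convenient packaging unit.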

I implemented the above.  Some highlights:

1. A fully-worked example of using the Taxi dataset in SQL Client via an
environment file.
- an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

- the corresponding environment file (will be even more concise once
'FLINK-10696 Catalog UDFs' is merged):
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml

- the typed API for standalone table applications:
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50

2. Implementation of the core catalog descriptor and factory.  I realize
that some renames may later occur as per the design doc, and would be happy
to do that as a follow-up.
https://github.com/apache/flink/pull/7390

3. Implementation of a connect-style API on TableEnvironment to use catalog
descriptor.
https://github.com/apache/flink/pull/7392

4. Integration into SQL-Client's environment file:
https://github.com/apache/flink/pull/7393
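
Items 2 to 4 can be pictured as one flow: a descriptor is flattened into
string properties, and a factory is selected by matching on those properties,
which is also what would let an environment file declare a catalog.  The
Python sketch below is a toy model of that flow, assuming the catalog
descriptor follows the same pattern as Flink's existing table descriptors and
factories.  CatalogDescriptor, create_catalog, the "type" key, and the "taxi"
factory are hypothetical names, not the ones used in the PRs.

```python
from typing import Callable, Dict, Optional


class CatalogDescriptor:
    """Hypothetical descriptor: typed on the outside, string properties inside."""

    def __init__(self, catalog_type: str,
                 options: Optional[Dict[str, str]] = None) -> None:
        self.catalog_type = catalog_type
        self.options = dict(options or {})

    def to_properties(self) -> Dict[str, str]:
        props = {"type": self.catalog_type}
        props.update(self.options)
        return props


CatalogFactory = Callable[[Dict[str, str]], Dict[str, str]]

# Registry of factories keyed by the (assumed) "type" property.
FACTORIES: Dict[str, CatalogFactory] = {}


def create_catalog(properties: Dict[str, str]) -> Dict[str, str]:
    """Pick the factory whose type matches and let it build the catalog."""
    catalog_type = properties.get("type")
    factory = FACTORIES.get(catalog_type)
    if factory is None:
        raise ValueError(f"no catalog factory registered for type {catalog_type!r}")
    return factory(properties)


def taxi_factory(properties: Dict[str, str]) -> Dict[str, str]:
    # A real factory would validate the properties and build a catalog object;
    # a plain dict stands in for the catalog here.
    return {
        "name": "TaxiData",
        "default-database": properties.get("default-database", "default"),
    }


FACTORIES["taxi"] = taxi_factory

# Programmatic path: descriptor -> properties -> factory.
from_api = create_catalog(
    CatalogDescriptor("taxi", {"default-database": "nyc"}).to_properties()
)

# Environment-file path: a parsed YAML entry is already a property map.
from_env_file = create_catalog({"type": "taxi", "default-database": "nyc"})
```

Both paths end in the same property map, so the programmatic API and the
environment file can share a single factory lookup.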

I realize that the overall Hive integration is still evolving, but I
believe that these PRs are a good stepping stone. Here's the list (in
bottom-up order):
- https://github.com/apache/flink/pull/7386
- https://github.com/apache/flink/pull/7388
- https://github.com/apache/flink/pull/7389
- https://github.com/apache/flink/pull/7390
- https://github.com/apache/flink/pull/7392
- https://github.com/apache/flink/pull/7393

Thanks and enjoy 2019!
Eron W


On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <[email protected]>
wrote:

> Hi Xiaowei,
>
> Thanks for bringing up the question. In the current design, the properties
> for meta objects are meant to cover anything that's specific to a
> particular catalog and agnostic to Flink. Anything that is common (such as
> schema for tables, query text for views, and UDF class name) is abstracted
> as a member of the respective class. However, this is still in discussion,
> and Timo and I will go over this and provide an update.
>
> Please note that UDF is a little more involved than what the current
> design doc shows. I'm still refining this part.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender:Xiaowei Jiang <[email protected]>
> Sent at:2018 Nov 18 (Sun) 15:17
> Recipient:dev <[email protected]>
> Cc:Xuefu <[email protected]>; twalthr <[email protected]>; piotr <
> [email protected]>; Fabian Hueske <[email protected]>; suez1224 <
> [email protected]>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Thanks Xuefu for the detailed design doc! One question on the properties
> associated with the catalog objects. Are we going to leave them completely
> free form or are we going to set some standard for that? I think that the
> answer may depend on whether we want to explore catalog-specific
> optimization opportunities. In any case, I think that it might be helpful to
> standardize as much as possible into strongly typed classes and leave
> these properties for catalog-specific things. But I think that we can do it
> in steps.
>
> Xiaowei
> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <[email protected]> wrote:
> Thanks for keeping on improving the overall design, Xuefu! It looks quite
>  good to me now.
>
>  It would be nice if the cc'ed Flink committers could help review and confirm!
>
>
>
>  One minor suggestion: Since the last section of design doc already touches
>  some new sql statements, shall we add another section in our doc and
>  formalize the new sql statements in SQL Client and TableEnvironment that
>  are gonna come along naturally with our design? Here are some that the
>  design doc mentioned and some that I came up with:
>
>  To be added:
>
>     - USE <catalog> - set default catalog
>     - USE <catalog.schema> - set default schema
>     - SHOW CATALOGS - show all registered catalogs
>     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>     catalog or the specified catalog
>     - DESCRIBE VIEW view - show the view's definition in CatalogView
>     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or
>     a specified schema.
>
>     (DDLs that can be addressed by either our design or Shuyi's DDL design)
>
>     - CREATE/DROP/ALTER SCHEMA schema
>     - CREATE/DROP/ALTER CATALOG catalog
>
>  To be modified:
>
>     - SHOW TABLES [FROM schema/catalog.schema] - show tables from current
>     or a specified schema. Add 'from schema' to existing 'SHOW TABLES'
>     statement
>     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
>     current or a specified schema. Add 'from schema' to existing 'SHOW
>     FUNCTIONS' statement
>
>
>  Thanks, Bowen
>
>
>
>  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <[email protected]>
>  wrote:
>
>  > Thanks, Bowen, for catching the error. I have granted comment permission
>  > with the link.
>  >
>  > I also updated the doc with the latest class definitions. Everyone is
>  > encouraged to review and comment.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Bowen Li <[email protected]>
>  > Sent at:2018 Nov 14 (Wed) 06:44
>  > Recipient:Xuefu <[email protected]>
>  > Cc:piotr <[email protected]>; dev <[email protected]>; Shuyi
>  > Chen <[email protected]>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi Xuefu,
>  >
>  > Currently the new design doc
>  > <
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit
> >
>  > is in "view only" mode, and people cannot leave comments. Can you please
>  > change it to "can comment" or "can edit" mode?
>  >
>  > Thanks, Bowen
>  >
>  >
>  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <[email protected]>
>  > wrote:
>  > Hi Piotr
>  >
>  > I have extracted the API portion of the design and the google doc is
>  > here
>  > <
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing
> >.
>  > Please review and provide your feedback.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Xuefu <[email protected]>
>  > Sent at:2018 Nov 12 (Mon) 12:43
>  > Recipient:Piotr Nowojski <[email protected]>; dev <
>  > [email protected]>
>  > Cc:Bowen Li <[email protected]>; Shuyi Chen <[email protected]>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi Piotr,
>  >
>  > That sounds good to me. Let's close all the open questions (there are a
>  > couple of them) in the Google doc and I should be able to quickly split
>  > it into the three proposals as you suggested.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Piotr Nowojski <[email protected]>
>  > Sent at:2018 Nov 9 (Fri) 22:46
>  > Recipient:dev <[email protected]>; Xuefu <[email protected]>
>  > Cc:Bowen Li <[email protected]>; Shuyi Chen <[email protected]>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi,
>  >
>  >
>  > Yes, it seems like the best solution. Maybe someone else can also
>  > suggest whether we can split it further? Maybe changes to the interface in
>  > one doc, reading from Hive Metastore in another, and finally storing our
>  > meta information in Hive Metastore in a third?
>  >
>  > Piotrek
>  >
>  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <[email protected]>
> wrote:
>  > >
>  > > Hi Piotr,
>  > >
>  > > That seems to be a good idea!
>  > >
>  >
>  > > Since the google doc for the design is currently under extensive
> review, I will leave it as it is for now. However, I'll convert it to two
> different FLIPs when the time comes.
>  > >
>  > > How does it sound to you?
>  > >
>  > > Thanks,
>  > > Xuefu
>  > >
>  > >
>  > > ------------------------------------------------------------------
>  > > Sender:Piotr Nowojski <[email protected]>
>  > > Sent at:2018 Nov 9 (Fri) 02:31
>  > > Recipient:dev <[email protected]>
>  > > Cc:Bowen Li <[email protected]>; Xuefu <[email protected]
>  > >; Shuyi Chen <[email protected]>
>  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >
>  > > Hi,
>  > >
>  >
>  > > Maybe we should split this topic (and the design doc) into a couple of
>  > > smaller ones, hopefully independent. The questions that you have asked,
>  > > Fabian, have for example very little to do with reading metadata from
>  > > Hive Meta Store.
>  > >
>  > > Piotrek
>  > >
>  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <[email protected]> wrote:
>  > >>
>  > >> Hi Xuefu and all,
>  > >>
>  > >> Thanks for sharing this design document!
>  >
>  > >> I'm very much in favor of restructuring / reworking the catalog
> handling in
>  > >> Flink SQL as outlined in the document.
>  >
>  > >> Most changes described in the design document seem to be rather
> general and
>  > >> not specifically related to the Hive integration.
>  > >>
>  >
>  > >> IMO, there are some aspects, especially those at the boundary of
> Hive and
>  > >> Flink, that need a bit more discussion. For example
>  > >>
>  > >> * What does it take to make Flink schema compatible with Hive schema?
>  > >> * How will Flink tables (descriptors) be stored in HMS?
>  > >> * How do both Hive catalogs differ? Could they be integrated into
>  > >> a single one? When to use which one?
>  >
>  > >> * What meta information is provided by HMS? What of this can be
> leveraged
>  > >> by Flink?
>  > >>
>  > >> Thank you,
>  > >> Fabian
>  > >>
>  > >> On Fri, Nov 2, 2018 at 00:31, Bowen Li <[email protected]> wrote:
>  > >>
>  > >>> After taking a look at how other discussion threads work, I think
>  > >>> it's actually fine to just keep our discussion here. It's up to you,
>  > >>> Xuefu.
>  > >>>
>  > >>> The google doc LGTM. I left some minor comments.
>  > >>>
>  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <[email protected]>
> wrote:
>  > >>>
>  > >>>> Hi all,
>  > >>>>
>  > >>>> As Xuefu has published the design doc on google, I agree with
>  > >>>> Shuyi's suggestion that we probably should start a new email thread
>  > >>>> like "[DISCUSS] ... Hive integration design ..." on the dev mailing
>  > >>>> list only, for community devs to review. The current thread goes to
>  > >>>> both the dev and user lists.
>  > >>>>
>  >
>  > >>>> This email thread is more about validating the general idea and
>  > >>>> direction with the community, and it's been pretty long and crowded
>  > >>>> so far. Since everyone is in favor of the idea, we can move forward
>  > >>>> with another thread to discuss and finalize the design.
>  > >>>>
>  > >>>> Thanks,
>  > >>>> Bowen
>  > >>>>
>  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>  > [email protected]>
>  > >>>> wrote:
>  > >>>>
>  > >>>>> Hi Shuyi,
>  > >>>>>
>  >
>  > >>>>> Good idea. Actually the PDF was converted from a google doc. Here
>  > >>>>> is its link:
>  > >>>>>
>  > >>>>>
>  >
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Shuyi Chen <[email protected]>
>  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>  > >>>>> Recipient:Xuefu <[email protected]>
>  > >>>>> Cc:vino yang <[email protected]>; Fabian Hueske <
>  > [email protected]>;
>  > >>>>> dev <[email protected]>; user <[email protected]>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Thanks a lot for driving this big effort. I would suggest converting
>  > >>>>> your proposal and design doc into a google doc, and sharing it on the
>  > >>>>> dev mailing list for the community to review and comment on, with a
>  > >>>>> title like "[DISCUSS] ... Hive integration design ...". Once approved,
>  > >>>>> we can document it as a FLIP (Flink Improvement Proposal), and use
>  > >>>>> JIRAs to track the implementation.
>  > >>>>> What do you think?
>  > >>>>>
>  > >>>>> Shuyi
>  > >>>>>
>  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>  > [email protected]>
>  > >>>>> wrote:
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> I have also shared a design doc on Hive metastore integration
>  > >>>>> that is attached here and also to FLINK-10556 [1]. Please kindly
>  > >>>>> review and share your feedback.
>  > >>>>>
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Xuefu <[email protected]>
>  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>  > >>>>> Recipient:Xuefu <[email protected]>; Shuyi Chen <
>  > >>>>> [email protected]>
>  > >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <
>  > [email protected]>;
>  > >>>>> dev <[email protected]>; user <[email protected]>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> To wrap up the discussion, I have attached a PDF describing the
>  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel
>  > >>>>> free to watch that JIRA to track the progress.
>  > >>>>>
>  > >>>>> Please also let me know if you have additional comments or
>  > >>>>> questions.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Xuefu <[email protected]>
>  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>  > >>>>> Recipient:Shuyi Chen <[email protected]>
>  > >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <
>  > [email protected]>;
>  > >>>>> dev <[email protected]>; user <[email protected]>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Shuyi,
>  > >>>>>
>  >
>  > >>>>> Thank you for your input. Yes, I agree with a phased approach and
>  > >>>>> would like to move forward fast. :) We did some work internally on
>  > >>>>> DDL utilizing the babel parser in Calcite. While babel makes
>  > >>>>> Calcite's grammar extensible, at first impression it still seems too
>  > >>>>> cumbersome for a project when too many extensions are made. It's even
>  > >>>>> challenging to find where the extension is needed! It would certainly
>  > >>>>> be better if Calcite could magically support Hive QL by just turning
>  > >>>>> on a flag, such as that for MYSQL_5. I can also see that this could
>  > >>>>> mean a lot of work on Calcite. Nevertheless, I will bring up the
>  > >>>>> discussion over there to see what their community thinks.
>  > >>>>>
>  > >>>>> Would you mind sharing more info about the proposal on DDL that you
>  > >>>>> mentioned? We can certainly collaborate on this.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Shuyi Chen <[email protected]>
>  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>  > >>>>> Recipient:Xuefu <[email protected]>
>  > >>>>> Cc:yanghua1127 <[email protected]>; Fabian Hueske <
>  > [email protected]>;
>  > >>>>> dev <[email protected]>; user <[email protected]>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>  > >>>>> think the proposal can be divided into 2 stages: making Flink support
>  > >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
>  > >>>>> starting with a smaller scope, so we can make progress faster. As for
>  > >>>>> [6], a proposal for DDL is already in progress, and will come after
>  > >>>>> the unified SQL connector API is done. For supporting Hive syntax, we
>  > >>>>> might need to work with the Calcite community, and a recent effort
>  > >>>>> called babel (https://issues.apache.org/jira/browse/CALCITE-2280) in
>  > >>>>> Calcite might help here.
>  > >>>>>
>  > >>>>> Thanks
>  > >>>>> Shuyi
>  > >>>>>
>  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>  > [email protected]>
>  > >>>>> wrote:
>  > >>>>> Hi Fabian/Vino,
>  > >>>>>
>  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I
>  > >>>>> didn't see Fabian's email until I read Vino's response just now.
>  > >>>>> (Somehow Fabian's went to the spam folder.)
>  > >>>>>
>  > >>>>> My proposal contains long-term and short-term goals. Nevertheless,
>  > >>>>> the effort will focus on the following areas, including Fabian's list:
>  > >>>>>
>  > >>>>> 1. Hive metastore connectivity - This covers both read/write access,
>  > >>>>> which means Flink can make full use of Hive's metastore as its
>  > >>>>> catalog (at least for batch, but this can extend to streaming as well).
>  > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions,
>  > >>>>> etc.) created by Hive can be understood by Flink, and the reverse
>  > >>>>> direction is true also.
>  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>  > >>>>> consumed by Flink and vice versa.
>  > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
>  > >>>>> provides its own implementation or makes Hive's implementation work
>  > >>>>> in Flink. Further, for user-created UDFs in Hive, Flink SQL should
>  > >>>>> provide a mechanism allowing users to import them into Flink without
>  > >>>>> any code change required.
>  > >>>>> 5. Data types - Flink SQL should support all data types that are
>  > >>>>> available in Hive.
>  > >>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
>  > >>>>> SQL2003) with extensions to support Hive's syntax and language
>  > >>>>> features, around DDL, DML, and SELECT queries.
>  > >>>>> 7. SQL CLI - this is currently being developed in Flink but more
>  > >>>>> effort is needed.
>  > >>>>> 8. Server - provide a server that's compatible with Hive's
>  > >>>>> HiveServer2 in thrift APIs, such that HiveServer2 users can reuse
>  > >>>>> their existing client (such as beeline) but connect to Flink's thrift
>  > >>>>> server instead.
>  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers
>  > >>>>> for other applications to use to connect to its thrift server.
>  > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
>  > >>>>> storage handlers, etc.
>  > >>>>> 11. Better task failure tolerance and task scheduling at Flink
>  > >>>>> runtime.
>  > >>>>>
>  > >>>>> As you can see, achieving all of those requires significant effort
>  > >>>>> across all layers in Flink. However, a short-term goal could
>  > >>>>> include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a
>  > >>>>> smaller scope (such as #3, #6).
>  > >>>>>
>  > >>>>> Please share your further thoughts. If we generally agree that this
>  > >>>>> is the right direction, I could come up with a formal proposal
>  > >>>>> quickly and then we can follow up with broader discussions.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:vino yang <[email protected]>
>  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>  > >>>>> Recipient:Fabian Hueske <[email protected]>
>  > >>>>> Cc:dev <[email protected]>; Xuefu <[email protected]
>  > >; user <
>  > >>>>> [email protected]>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> I appreciate this proposal and, like Fabian, think it would be
>  > >>>>> better if you could give more details of the plan.
>  > >>>>>
>  > >>>>> Thanks, vino.
>  > >>>>>
>  > >>>>> On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <[email protected]> wrote:
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Welcome to the Flink community and thanks for starting this
>  > >>>>> discussion! Better Hive integration would be really great!
>  > >>>>> Can you go into details of what you are proposing? I can think of
>  > >>>>> a couple of ways to improve Flink in that regard:
>  > >>>>>
>  > >>>>> * Support for Hive UDFs
>  > >>>>> * Support for Hive metadata catalog
>  > >>>>> * Support for HiveQL syntax
>  > >>>>> * ???
>  > >>>>>
>  > >>>>> Best, Fabian
>  > >>>>>
>  > >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <[email protected]> wrote:
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> Along with the community's effort, inside Alibaba we have explored
>  > >>>>> Flink's potential as an execution engine not just for stream
>  > >>>>> processing but also for batch processing. We are encouraged by our
>  > >>>>> findings and have initiated our effort to make Flink's SQL
>  > >>>>> capabilities full-fledged. When comparing what's available in Flink
>  > >>>>> to the offerings from competing data processing engines, we
>  > >>>>> identified a major gap in Flink: good integration with the Hive
>  > >>>>> ecosystem. This is crucial to the success of Flink SQL and batch due
>  > >>>>> to the well-established data ecosystem around Hive. Therefore, we
>  > >>>>> have done some initial work along this direction but there is still
>  > >>>>> a lot of effort needed.
>  > >>>>>
>  > >>>>> We have two strategies in mind. The first one is to make Flink SQL
>  > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is
>  > >>>>> similar to the approach Spark SQL adopted. The second strategy is to
>  > >>>>> make Hive itself work with Flink, similar to the proposal in [1].
>  > >>>>> Each approach has its pros and cons, but they don't need to be
>  > >>>>> mutually exclusive, with each targeting different users and use
>  > >>>>> cases. We believe that both will promote a much greater adoption of
>  > >>>>> Flink beyond stream processing.
>  > >>>>>
>  > >>>>> We have been focused on the first approach and would like to
>  > >>>>> showcase Flink's batch and SQL capabilities with Flink SQL. However,
>  > >>>>> we have also planned to start strategy #2 as the follow-up effort.
>  > >>>>>
>  >
>  > >>>>> I'm completely new to Flink (a short bio [2] is below), though many
>  > >>>>> of my colleagues here at Alibaba are long-time contributors.
>  > >>>>> Nevertheless, I'd like to share our thoughts and invite your early
>  > >>>>> feedback. At the same time, I am working on a detailed proposal on
>  > >>>>> Flink SQL's integration with the Hive ecosystem, which will also be
>  > >>>>> shared when ready.
>  > >>>>>
>  > >>>>> While the ideas are simple, each approach will demand significant
>  > >>>>> effort, more than what we can afford. Thus, the input and
>  > >>>>> contributions from the communities are greatly welcome and
>  > >>>>> appreciated.
>  > >>>>>
>  > >>>>> Regards,
>  > >>>>>
>  > >>>>>
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> References:
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>  >
>  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or
>  > >>>>> is working on many projects under the Apache Foundation, of which he
>  > >>>>> is also an honored member. About 10 years ago he worked in the Hadoop
>  > >>>>> team at Yahoo, where the projects had just gotten started. Later he
>  > >>>>> worked at Cloudera, initiating and leading the development of the
>  > >>>>> Hive on Spark project in the communities and across many
>  > >>>>> organizations. Prior to joining Alibaba, he worked at Uber, where he
>  > >>>>> promoted Hive on Spark to all of Uber's SQL on Hadoop workload and
>  > >>>>> significantly improved Uber's cluster efficiency.
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> --
>  >
>  > >>>>> "So you have to trust that the dots will somehow connect in your
> future."
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  >
>  >
>
>
