Hi Eron,

Happy New Year!

Thank you very much for your contribution, especially during the holidays. While 
I'm encouraged by your work, I'd also like to share my thoughts on how to move 
forward.

First, please note that the design discussion is still being finalized, and we 
expect some moderate changes, especially around TableFactories. Another pending 
change is our decision to move away from Scala, which will also impact our 
work.

Secondly, your work seems to be about plugging catalog definitions into the 
execution environment, which is less impacted by the TableFactory change. 
However, I did notice some duplication between your work and ours. This is no 
big deal, but going forward we should probably communicate better on work 
assignments so as to avoid any duplication of effort. On the other hand, I 
think some of your work is interesting and valuable for inclusion once we 
finalize the overall design.

Thus, please continue your research and experiments, and let us know when you 
start working on anything so we can better coordinate.

Thanks again for your interest and contributions.

Thanks,
Xuefu




------------------------------------------------------------------
From:Eron Wright <eronwri...@gmail.com>
Sent At:2019 Jan. 1 (Tue.) 18:39
To:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>
Cc:Xiaowei Jiang <xiaow...@gmail.com>; twalthr <twal...@apache.org>; piotr 
<pi...@data-artisans.com>; Fabian Hueske <fhue...@gmail.com>; suez1224 
<suez1...@gmail.com>; Bowen Li <bowenl...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi folks, there are clearly some incremental steps to be taken to introduce 
catalog support to SQL Client, complementary to what is proposed in the 
Flink-Hive Metastore design doc. I was quietly working on this over the 
holidays, and I posted some new sub-tasks, PRs, and sample code to FLINK-10744.

What inspired me to get involved is that the catalog interface seems like a 
great way to encapsulate a 'library' of Flink tables and functions.  For 
example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be 
nicely encapsulated as a catalog (TaxiData).   Such a library should be fully 
consumable in SQL Client.

I implemented the above.  Some highlights:
1. A fully-worked example of using the Taxi dataset in SQL Client via an 
environment file.
- an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

- the corresponding environment file (will be even more concise once 
'FLINK-10696 Catalog UDFs' is merged):
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml

- the typed API for standalone table applications:
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
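For reference, a catalog entry in such an environment file might look roughly like the sketch below. The key names and the factory class are assumptions made for illustration only; the final schema depends on the pending catalog descriptor work in the design doc.

```yaml
# sql-client-defaults.yaml (sketch; keys are illustrative, not the final schema)
catalogs:
  - name: taxidata                    # name used to reference the catalog in SQL
    catalog:
      type: custom                    # resolved through a catalog factory
      class: com.example.TaxiDataCatalogFactory   # hypothetical factory class
```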

2. Implementation of the core catalog descriptor and factory.  I realize that 
some renames may later occur as per the design doc, and would be happy to do 
that as a follow-up.
https://github.com/apache/flink/pull/7390

3. Implementation of a connect-style API on TableEnvironment to use catalog 
descriptor.
https://github.com/apache/flink/pull/7392
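As a rough illustration of the connect-style idea, independent of Flink's actual classes (all names below are assumptions for this sketch, not the real API): the table environment takes a catalog descriptor and then registers it under a name, mirroring the existing connect(...) pattern for tables.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch ONLY: these class and method names are assumptions,
// not Flink's actual API.
class CatalogDescriptor {
    final Map<String, String> properties = new HashMap<>();

    // collect catalog-specific properties, e.g. the factory type
    CatalogDescriptor property(String key, String value) {
        properties.put(key, value);
        return this;
    }
}

class TableEnvironmentSketch {
    private final Map<String, CatalogDescriptor> catalogs = new HashMap<>();

    // second step of the fluent call: give the catalog a name
    interface ConnectStep {
        void registerCatalog(String name);
    }

    // connect-style entry point: hand in a descriptor, then name the catalog
    ConnectStep connect(CatalogDescriptor descriptor) {
        return name -> catalogs.put(name, descriptor);
    }

    boolean hasCatalog(String name) {
        return catalogs.containsKey(name);
    }

    public static void main(String[] args) {
        TableEnvironmentSketch env = new TableEnvironmentSketch();
        env.connect(new CatalogDescriptor().property("type", "custom"))
           .registerCatalog("taxidata");
        System.out.println(env.hasCatalog("taxidata")); // prints: true
    }
}
```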

4. Integration into SQL-Client's environment file:
https://github.com/apache/flink/pull/7393

I realize that the overall Hive integration is still evolving, but I believe 
that these PRs are a good stepping stone. Here's the list (in bottom-up order):
- https://github.com/apache/flink/pull/7386
- https://github.com/apache/flink/pull/7388
- https://github.com/apache/flink/pull/7389
- https://github.com/apache/flink/pull/7390
- https://github.com/apache/flink/pull/7392
- https://github.com/apache/flink/pull/7393

Thanks and enjoy 2019!
Eron W


On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
Hi Xiaowei,

 Thanks for bringing up the question. In the current design, the properties for 
meta objects are meant to cover anything that's specific to a particular 
catalog and agnostic to Flink. Anything that is common (such as the schema for 
tables, the query text for views, and the UDF classname) is abstracted into 
members of the respective classes. However, this is still under discussion, and 
Timo and I will go over it and provide an update.
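To make that split concrete, here is a minimal, hypothetical sketch (class and field names are illustrative, not the design doc's actual classes) of a meta object that keeps common, Flink-agnostic attributes strongly typed while leaving catalog-specific settings in a free-form property map:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the schema is a strongly typed member common to all
// catalogs, while catalog-specific settings live in an opaque property map.
class CatalogTableSketch {
    private final Map<String, String> schema;      // column name -> type
    private final Map<String, String> properties;  // catalog-specific, opaque to Flink

    CatalogTableSketch(Map<String, String> schema, Map<String, String> properties) {
        this.schema = new LinkedHashMap<>(schema);
        this.properties = new HashMap<>(properties);
    }

    Map<String, String> getSchema() {
        return schema;
    }

    // Catalog-specific lookups fall back to a default instead of failing,
    // since Flink does not interpret these keys.
    String getProperty(String key, String defaultValue) {
        return properties.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("rideId", "BIGINT");
        schema.put("fare", "DOUBLE");
        Map<String, String> props = new HashMap<>();
        props.put("hive.storage.format", "ORC"); // hypothetical catalog-specific key
        CatalogTableSketch t = new CatalogTableSketch(schema, props);
        System.out.println(t.getSchema().size() + " columns, format="
                + t.getProperty("hive.storage.format", "TEXT"));
        // prints: 2 columns, format=ORC
    }
}
```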

 Please note that UDFs are a little more involved than what the current design 
doc shows. I'm still refining this part.

 Thanks,
 Xuefu


 ------------------------------------------------------------------
 Sender:Xiaowei Jiang <xiaow...@gmail.com>
 Sent at:2018 Nov 18 (Sun) 15:17
 Recipient:dev <dev@flink.apache.org>
 Cc:Xuefu <xuef...@alibaba-inc.com>; twalthr <twal...@apache.org>; piotr 
<pi...@data-artisans.com>; Fabian Hueske <fhue...@gmail.com>; suez1224 
<suez1...@gmail.com>
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

 Thanks Xuefu for the detailed design doc! One question on the properties 
associated with the catalog objects: are we going to leave them completely 
free-form, or are we going to set some standard for them? I think the answer 
may depend on whether we want to explore catalog-specific optimization 
opportunities. In any case, I think it might be helpful to standardize as much 
as possible into strongly typed classes and leave these properties for 
catalog-specific things. But I think we can do it in steps.

 Xiaowei
 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bowenl...@gmail.com> wrote:
 Thanks for continuing to improve the overall design, Xuefu! It looks quite
  good to me now.

  It would be nice if the cc-ed Flink committers could help review and confirm!



  One minor suggestion: Since the last section of the design doc already touches
  on some new SQL statements, shall we add another section to our doc and
  formalize the new SQL statements in SQL Client and TableEnvironment that
  will naturally come along with our design? Here are some that the
  design doc mentioned and some that I came up with:

  To be added:

     - USE <catalog> - set default catalog
     - USE <catalog.schema> - set default schema
     - SHOW CATALOGS - show all registered catalogs
     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
     catalog or the specified catalog
     - DESCRIBE VIEW view - show the view's definition in CatalogView
     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or a
     specified schema.

     (DDLs that can be addressed by either our design or Shuyi's DDL design)

     - CREATE/DROP/ALTER SCHEMA schema
     - CREATE/DROP/ALTER CATALOG catalog

  To be modified:

     - SHOW TABLES [FROM schema/catalog.schema] - show tables from current or
     a specified schema. Add 'from schema' to existing 'SHOW TABLES' statement
     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
     current or a specified schema. Add 'from schema' to the existing 'SHOW
     FUNCTIONS' statement
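  To make these concrete, a SQL Client session using the statements above might look like the following sketch (these statements are only proposed, and the catalog, schema, and view names are made-up examples):

```sql
-- proposed syntax only; none of these statements exist in Flink yet
SHOW CATALOGS;                   -- list all registered catalogs
USE myhive;                      -- set the default catalog
SHOW SCHEMAS FROM myhive;        -- list schemas in a specific catalog
USE myhive.sales;                -- set the default schema
SHOW TABLES FROM myhive.sales;   -- tables from a specific schema
SHOW VIEWS FROM myhive.sales;    -- views from a specific schema
DESCRIBE VIEW daily_totals;      -- show the view's definition
```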


  Thanks, Bowen



  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xuef...@alibaba-inc.com>
  wrote:

  > Thanks, Bowen, for catching the error. I have granted comment permission
  > with the link.
  >
  > I also updated the doc with the latest class definitions. Everyone is
  > encouraged to review and comment.
  >
  > Thanks,
  > Xuefu
  >
  > ------------------------------------------------------------------
  > Sender:Bowen Li <bowenl...@gmail.com>
  > Sent at:2018 Nov 14 (Wed) 06:44
  > Recipient:Xuefu <xuef...@alibaba-inc.com>
  > Cc:piotr <pi...@data-artisans.com>; dev <dev@flink.apache.org>; Shuyi
  > Chen <suez1...@gmail.com>
  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  >
  > Hi Xuefu,
  >
  > Currently the new design doc
  > 
<https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
  > is on “view only" mode, and people cannot leave comments. Can you please
  > change it to "can comment" or "can edit" mode?
  >
  > Thanks, Bowen
  >
  >
  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xuef...@alibaba-inc.com>
  > wrote:
  > Hi Piotr
  >
  > I have extracted the API portion of  the design and the google doc is here
  > 
<https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
  > Please review and provide your feedback.
  >
  > Thanks,
  > Xuefu
  >
  > ------------------------------------------------------------------
  > Sender:Xuefu <xuef...@alibaba-inc.com>
  > Sent at:2018 Nov 12 (Mon) 12:43
  > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
  > dev@flink.apache.org>
  > Cc:Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com>
  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  >
  > Hi Piotr,
  >
  > That sounds good to me. Let's close all the open questions (there are a
  > couple of them) in the Google doc, and I should be able to quickly split
  > it into the three proposals as you suggested.
  >
  > Thanks,
  > Xuefu
  >
  > ------------------------------------------------------------------
  > Sender:Piotr Nowojski <pi...@data-artisans.com>
  > Sent at:2018 Nov 9 (Fri) 22:46
  > Recipient:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>
  > Cc:Bowen Li <bowenl...@gmail.com>; Shuyi Chen <suez1...@gmail.com>
  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  >
  > Hi,
  >
  >
  > Yes, it seems like the best solution. Maybe someone else can also suggest 
whether we can split it further? Perhaps changes to the interfaces in one doc, reading 
from the Hive metastore in another, and finally storing our meta information in the 
Hive metastore?
  >
  > Piotrek
  >
  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
  > >
  > > Hi Piotr,
  > >
  > > That seems to be a good idea!
  > >
  >
  > > Since the google doc for the design is currently under extensive review, 
I will leave it as it is for now. However, I'll convert it to two different 
FLIPs when the time comes.
  > >
  > > How does it sound to you?
  > >
  > > Thanks,
  > > Xuefu
  > >
  > >
  > > ------------------------------------------------------------------
  > > Sender:Piotr Nowojski <pi...@data-artisans.com>
  > > Sent at:2018 Nov 9 (Fri) 02:31
  > > Recipient:dev <dev@flink.apache.org>
  > > Cc:Bowen Li <bowenl...@gmail.com>; Xuefu <xuef...@alibaba-inc.com
  > >; Shuyi Chen <suez1...@gmail.com>
  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >
  > > Hi,
  > >
  >
  > > Maybe we should split this topic (and the design doc) into a couple of 
smaller ones, hopefully independent ones. The questions that you have asked Fabian 
have, for example, very little to do with reading metadata from the Hive Metastore?
  > >
  > > Piotrek
  > >
  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fhue...@gmail.com> wrote:
  > >>
  > >> Hi Xuefu and all,
  > >>
  > >> Thanks for sharing this design document!
  >
  > >> I'm very much in favor of restructuring / reworking the catalog handling 
in
  > >> Flink SQL as outlined in the document.
  >
  > >> Most changes described in the design document seem to be rather general 
and
  > >> not specifically related to the Hive integration.
  > >>
  >
  > >> IMO, there are some aspects, especially those at the boundary of Hive and
  > >> Flink, that need a bit more discussion. For example
  > >>
  > >> * What does it take to make Flink schema compatible with Hive schema?
  > >> * How will Flink tables (descriptors) be stored in HMS?
  > >> * How do both Hive catalogs differ? Could they be integrated into a
  > >> single one? When to use which one?
  >
  > >> * What meta information is provided by HMS? What of this can be leveraged
  > >> by Flink?
  > >>
  > >> Thank you,
  > >> Fabian
  > >>
  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bowenl...@gmail.com
  > >:
  > >>
  > >>> After taking a look at how other discussion threads work, I think it's
  > >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
  > >>>
  > >>> The google doc LGTM. I left some minor comments.
  > >>>
  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bowenl...@gmail.com> wrote:
  > >>>
  > >>>> Hi all,
  > >>>>
  > >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
  >
  > >>>> suggestion that we probably should start a new email thread like 
"[DISCUSS]
  >
  > >>>> ... Hive integration design ..." on only the dev mailing list for community
  > >>>> devs to review. The current thread goes to both the dev and user lists.
  > >>>>
  >
  > >>>> This email thread is more like validating the general idea and 
direction
  >
  > >>>> with the community, and it's been pretty long and crowded so far. Since
  >
  > >>>> everyone is in favor of the idea, we can move forward with another thread 
to
  > >>>> discuss and finalize the design.
  > >>>>
  > >>>> Thanks,
  > >>>> Bowen
  > >>>>
  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
  > xuef...@alibaba-inc.com>
  > >>>> wrote:
  > >>>>
  > >>>>> Hi Shuyi,
  > >>>>>
  >
  > >>>>> Good idea. Actually the PDF was converted from a google doc. Here is 
its
  > >>>>> link:
  > >>>>>
  > >>>>>
  > 
https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Shuyi Chen <suez1...@gmail.com>
  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
  > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com>
  > >>>>> Cc:vino yang <yanghua1...@gmail.com>; Fabian Hueske <
  > fhue...@gmail.com>;
  > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi Xuefu,
  > >>>>>
  >
  > >>>>> Thanks a lot for driving this big effort. I would suggest converting your
  >
  > >>>>> proposal and design doc into a google doc, and share it on the dev 
mailing
  >
  > >>>>> list for the community to review and comment with title like 
"[DISCUSS] ...
  >
  > >>>>> Hive integration design ..." . Once approved,  we can document it as 
a FLIP
  >
  > >>>>> (Flink Improvement Proposals), and use JIRAs to track the 
implementations.
  > >>>>> What do you think?
  > >>>>>
  > >>>>> Shuyi
  > >>>>>
  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
  > xuef...@alibaba-inc.com>
  > >>>>> wrote:
  > >>>>> Hi all,
  > >>>>>
  > >>>>> I have also shared a design doc on Hive metastore integration that is
  >
  > >>>>> attached here and also to FLINK-10556[1]. Please kindly review and 
share
  > >>>>> your feedback.
  > >>>>>
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com>
  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
  > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com>; Shuyi Chen <
  > >>>>> suez1...@gmail.com>
  > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <
  > fhue...@gmail.com>;
  > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi all,
  > >>>>>
  > >>>>> To wrap up the discussion, I have attached a PDF describing the
  >
  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free 
to
  > >>>>> watch that JIRA to track the progress.
  > >>>>>
  > >>>>> Please also let me know if you have additional comments or questions.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
  > >>>>>
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Xuefu <xuef...@alibaba-inc.com>
  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
  > >>>>> Recipient:Shuyi Chen <suez1...@gmail.com>
  > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <
  > fhue...@gmail.com>;
  > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi Shuyi,
  > >>>>>
  >
  > >>>>> Thank you for your input. Yes, I agree with a phased approach and would like
like
  >
  > >>>>> to move forward fast. :) We did some work internally on DDL utilizing 
babel
  > >>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
  > >>>>> first impression it still seems too cumbersome for a project when too
  >
  > >>>>> much extensions are made. It's even challenging to find where the 
extension
  >
  > >>>>> is needed! It would certainly be better if Calcite could magically 
support
  >
  > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can 
also
  >
  > >>>>> see that this could mean a lot of work on Calcite. Nevertheless, I 
will
  >
  > >>>>> bring up the discussion over there and see what their community 
thinks.
  > >>>>>
  > >>>>> Would you mind sharing more info about the proposal on DDL that you
  > >>>>> mentioned? We can certainly collaborate on this.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Shuyi Chen <suez1...@gmail.com>
  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
  > >>>>> Recipient:Xuefu <xuef...@alibaba-inc.com>
  > >>>>> Cc:yanghua1127 <yanghua1...@gmail.com>; Fabian Hueske <
  > fhue...@gmail.com>;
  > >>>>> dev <dev@flink.apache.org>; user <u...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
  >
  > >>>>> think the proposal can be divided into 2 stages: making Flink 
support
  >
  > >>>>> Hive features, and making Hive work with Flink. I agree with Timo 
on
  >
  > >>>>> starting with a smaller scope, so we can make progress faster. As for 
[6],
  >
  > >>>>> a proposal for DDL is already in progress, and will come after the 
unified
  >
  > >>>>> SQL connector API is done. For supporting Hive syntax, we might need 
to
  > >>>>> work with the Calcite community, and a recent effort called babel (
  > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
  > >>>>> help here.
  > >>>>>
  > >>>>> Thanks
  > >>>>> Shuyi
  > >>>>>
  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
  > xuef...@alibaba-inc.com>
  > >>>>> wrote:
  > >>>>> Hi Fabian/Vno,
  > >>>>>
  >
  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I 
didn't
  >
  > >>>>> see Fabian's email until I read Vino's response just now. (Somehow 
Fabian's
  > >>>>> went to the spam folder.)
  > >>>>>
  >
  > >>>>> My proposal contains long-term and short-term goals. Nevertheless, 
the
  > >>>>> effort will focus on the following areas, including Fabian's list:
  > >>>>>
  > >>>>> 1. Hive metastore connectivity - This covers both read/write access,
  >
  > >>>>> which means Flink can make full use of Hive's metastore as its 
catalog (at
  > >>>>> least for batch, but it can be extended to streaming as well).
  >
  > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, 
etc)
  >
  > >>>>> created by Hive can be understood by Flink, and the reverse 
is
  > >>>>> also true.
  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
  > >>>>> consumed by Flink and vice versa.
  >
  > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either 
provides
  > >>>>> its own implementation or makes Hive's implementation work in Flink.
  > >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
  >
  > >>>>> mechanism allowing users to import them into Flink without any code 
change
  > >>>>> required.
  > >>>>> 5. Data types -  Flink SQL should support all data types that are
  > >>>>> available in Hive.
  > >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
  >
  > >>>>> SQL2003) with extension to support Hive's syntax and language 
features,
  > >>>>> around DDL, DML, and SELECT queries.
  >
  > >>>>> 7. SQL CLI - this is currently being developed in Flink but more effort is
  > >>>>> needed.
  >
  > >>>>> 8. Server - provide a server that's compatible with Hive's 
HiveServer2
  >
  > >>>>> in thrift APIs, such that HiveServer2 users can reuse their existing 
client
  > >>>>> (such as beeline) but connect to Flink's thrift server instead.
  >
  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
  > >>>>> other applications to use to connect to its thrift server.
  > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
  > >>>>> storage handlers, etc.
  >
  > >>>>> 11. Better task failure tolerance and task scheduling at Flink 
runtime.
  > >>>>>
  > >>>>> As you can see, achieving all of those requires significant effort
  >
  > >>>>> across all layers in Flink. However, a short-term goal could include 
only
  >
  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope 
(such as
  > >>>>> #3, #6).
  > >>>>>
  >
  > >>>>> Please share your further thoughts. If we generally agree that this is
  >
  > >>>>> the right direction, I could come up with a formal proposal quickly 
and
  > >>>>> then we can follow up with broader discussions.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:vino yang <yanghua1...@gmail.com>
  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
  > >>>>> Recipient:Fabian Hueske <fhue...@gmail.com>
  > >>>>> Cc:dev <dev@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com
  > >; user <
  > >>>>> u...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi Xuefu,
  > >>>>>
  >
  > >>>>> Appreciate this proposal, and like Fabian, I think it would be better
  > >>>>> if you could give more details of the plan.
  > >>>>>
  > >>>>> Thanks, vino.
  > >>>>>
  > >>>>> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
  > >>>>> Hi Xuefu,
  > >>>>>
  >
  > >>>>> Welcome to the Flink community and thanks for starting this 
discussion!
  > >>>>> Better Hive integration would be really great!
  > >>>>> Can you go into details of what you are proposing? I can think of a
  > >>>>> couple of ways to improve Flink in that regard:
  > >>>>>
  > >>>>> * Support for Hive UDFs
  > >>>>> * Support for Hive metadata catalog
  > >>>>> * Support for HiveQL syntax
  > >>>>> * ???
  > >>>>>
  > >>>>> Best, Fabian
  > >>>>>
  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
  > >>>>> xuef...@alibaba-inc.com>:
  > >>>>> Hi all,
  > >>>>>
  > >>>>> Along with the community's effort, inside Alibaba we have explored
  >
  > >>>>> Flink's potential as an execution engine not just for stream 
processing but
  > >>>>> also for batch processing. We are encouraged by our findings and have
  >
  > >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. 
When
  >
  > >>>>> comparing what's available in Flink to the offerings from competitive 
data
  >
  > >>>>> processing engines, we identified a major gap in Flink: good 
integration
  >
  > >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and 
batch
  >
  > >>>>> due to the well-established data ecosystem around Hive. Therefore, we 
have
  >
  > >>>>> done some initial work in this direction, but a lot of
  > >>>>> effort is still needed.
  > >>>>>
  > >>>>> We have two strategies in mind. The first one is to make Flink SQL
  >
  > >>>>> full-fledged and well-integrated with Hive ecosystem. This is a 
similar
  >
  > >>>>> approach to what Spark SQL adopted. The second strategy is to make 
Hive
  >
  > >>>>> itself work with Flink, similar to the proposal in [1]. Each approach 
bears
  >
  > >>>>> its pros and cons, but they don’t need to be mutually exclusive with 
each
  > >>>>> targeting different users and use cases. We believe that both will
  > >>>>> promote a much greater adoption of Flink beyond stream processing.
  > >>>>>
  > >>>>> We have been focused on the first approach and would like to showcase
  >
  > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have 
also
  > >>>>> planned to start strategy #2 as the follow-up effort.
  > >>>>>
  >
  > >>>>> I'm completely new to Flink (with a short bio [2] below), though many
  >
  > >>>>> of my colleagues here at Alibaba are long-time contributors. 
Nevertheless,
  >
  > >>>>> I'd like to share our thoughts and invite your early feedback. At the 
same
  >
  > >>>>> time, I am working on a detailed proposal on Flink SQL's integration 
with
  > >>>>> Hive ecosystem, which will be also shared when ready.
  > >>>>>
  > >>>>> While the ideas are simple, each approach will demand significant
  >
  > >>>>> effort, more than what we can afford. Thus, the input and 
contributions
  > >>>>> from the community are greatly welcome and appreciated.
  > >>>>>
  > >>>>> Regards,
  > >>>>>
  > >>>>>
  > >>>>> Xuefu
  > >>>>>
  > >>>>> References:
  > >>>>>
  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
  >
  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is working 
on
  > >>>>> many projects under the Apache Foundation, of which he is also an honored
  >
  > >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo 
where the
  >
  > >>>>> projects just got started. Later he worked at Cloudera, initiating and
  >
  > >>>>> leading the development of the Hive on Spark project in the community 
and
  >
  > >>>>> across many organizations. Prior to joining Alibaba, he worked at Uber
  >
  > >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload 
and
  > >>>>> significantly improved Uber's cluster efficiency.
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>> --
  >
  > >>>>> "So you have to trust that the dots will somehow connect in your 
future."
  > >>>>>
  > >>>>>
  >
  >
