Would it maybe make sense to provide Flink as an engine on Hive ("Flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, and 10. This could be more loosely coupled than integrating Hive into all possible Flink core modules, which would introduce a very tight dependency on Hive in the core. 1, 2, and 3 could be achieved via a connector based on the Flink Table API. Just as a proposal: start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. If, in the more distant future, the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
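[Editor's note: to make the loose-coupling argument above concrete, here is a hypothetical, engine-agnostic Java sketch. All names (`Catalog`, `HiveMetastoreCatalog`, `ConnectorSketch`) are invented for illustration and are not real Flink or Hive APIs; the point is that the engine core only sees a narrow interface while Hive-specific details stay in the connector module.]

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a loosely coupled connector exposes Hive metadata
// to the engine through a narrow catalog interface, so the engine core
// never depends on Hive classes directly.
public class ConnectorSketch {

    // The only contract the engine core would know about.
    interface Catalog {
        List<String> listTables(String database);
    }

    // Hive-specific details live entirely in the connector module.
    static class HiveMetastoreCatalog implements Catalog {
        private final Map<String, List<String>> metastore;

        HiveMetastoreCatalog(Map<String, List<String>> metastore) {
            // A real connector would open a thrift connection to the Hive
            // metastore here instead of taking an in-memory map.
            this.metastore = metastore;
        }

        @Override
        public List<String> listTables(String database) {
            return metastore.getOrDefault(database, List.of());
        }
    }

    public static void main(String[] args) {
        Catalog catalog = new HiveMetastoreCatalog(
                Map.of("default", List.of("orders", "customers")));
        // The engine works against Catalog only; swapping the backing
        // store never touches core code.
        System.out.println(catalog.listTables("default"));
    }
}
```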
What is meant by 11?

> On 11.10.2018, at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction is true as well.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without requiring any code change.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently under development in Flink, but more effort is needed.
> 8. Server - Provide a server that is compatible with Hive's HiveServer2 in its thrift APIs, such that HiveServer2 users can reuse their existing clients (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as #3, #6).
>
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly, and then we can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: vino yang <yanghua1...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 09:45
> Recipient: Fabian Hueske <fhue...@gmail.com>
> Cc: dev <d...@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, it would be great if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018, at 5:27 PM:
>
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
> Can you go into the details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for the Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, Oct 9, 2018, at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged.
> When comparing what's available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing because of the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
>
> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we also plan to start on strategy #2 as a follow-up effort.
>
> I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.
>
> Regards,
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he moved all of Uber's SQL-on-Hadoop workload to Hive on Spark and significantly improved Uber's cluster efficiency.
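[Editor's note: as a rough illustration of item 4 in Xuefu's list (importing existing user UDFs without any code change), here is a hypothetical, engine-agnostic Java sketch. Hive's simple UDFs do by convention expose an `evaluate` method, but everything else here — the `UdfBridge` name, the reflective dispatch — is invented for illustration and is not a real Flink or Hive API.]

```java
import java.lang.reflect.Method;

// Hypothetical sketch: how an engine could wrap a Hive-style UDF
// (a class exposing an `evaluate` method, per Hive's simple UDF
// convention) without requiring any change to the user's code.
public class UdfBridge {

    // A user's existing Hive-style UDF, completely unchanged.
    public static class UpperUdf {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    // Reflectively locate and invoke `evaluate`, as an importing engine
    // might do when registering the function in its own catalog.
    public static Object call(Object udf, Object... args) throws Exception {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate")
                    && m.getParameterCount() == args.length) {
                return m.invoke(udf, args);
            }
        }
        throw new IllegalArgumentException("no matching evaluate() found");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(call(new UpperUdf(), "flink on hive"));
    }
}
```

A real bridge would additionally map Hive's type system onto the engine's types (item 5 in the list), which this sketch deliberately leaves out.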