Would it maybe make sense to provide Flink as an engine on Hive ("Flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, and 10. This could be more loosely coupled than integrating Hive into all possible Flink core modules, which would introduce a very tight dependency on Hive in the core. 1, 2, and 3 could be achieved via a connector based on the Flink Table API. Just as a proposal: start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. If, in the more distant future, the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
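[Editor's note: to make the loose-coupling argument above concrete, here is a hypothetical, engine-agnostic Java sketch. All names (`Catalog`, `HiveMetastoreCatalog`, `ConnectorSketch`) are invented for illustration and are not real Flink or Hive APIs; the point is that the engine core only sees a narrow interface while Hive-specific details stay in the connector module.]

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a loosely coupled connector exposes Hive metadata
// to the engine through a narrow catalog interface, so the engine core
// never depends on Hive classes directly.
public class ConnectorSketch {

    // The only contract the engine core would know about.
    interface Catalog {
        List<String> listTables(String database);
    }

    // Hive-specific details live entirely in the connector module.
    static class HiveMetastoreCatalog implements Catalog {
        private final Map<String, List<String>> metastore;

        HiveMetastoreCatalog(Map<String, List<String>> metastore) {
            // A real connector would open a thrift connection to the Hive
            // metastore here instead of taking an in-memory map.
            this.metastore = metastore;
        }

        @Override
        public List<String> listTables(String database) {
            return metastore.getOrDefault(database, List.of());
        }
    }

    public static void main(String[] args) {
        Catalog catalog = new HiveMetastoreCatalog(
                Map.of("default", List.of("orders", "customers")));
        // The engine works against Catalog only; swapping the backing
        // store never touches core code.
        System.out.println(catalog.listTables("default"));
    }
}
```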
What is meant by 11?

> On 11.10.2018, at 05:01, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction is true as well.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without requiring any code change.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently under development in Flink, but more effort is needed.
> 8. Server - Provide a server that is compatible with Hive's HiveServer2 in its thrift APIs, such that HiveServer2 users can reuse their existing clients (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start with a smaller scope (such as #3, #6).
>
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly, and then we can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender: vino yang <yanghua1...@gmail.com>
> Sent at: 2018 Oct 11 (Thu) 09:45
> Recipient: Fabian Hueske <fhue...@gmail.com>
> Cc: dev <d...@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user <user@flink.apache.org>
> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, it would be great if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fhue...@gmail.com> wrote on Wed, Oct 10, 2018, at 5:27 PM:
>
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
> Can you go into the details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for the Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, Oct 9, 2018, at 19:22, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged.
> When comparing what's available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing because of the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
>
> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we also plan to start on strategy #2 as a follow-up effort.
>
> I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.
>
> Regards,
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he moved all of Uber's SQL-on-Hadoop workload to Hive on Spark and significantly improved Uber's cluster efficiency.
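[Editor's note: as a rough illustration of item 4 in Xuefu's list (importing existing user UDFs without any code change), here is a hypothetical, engine-agnostic Java sketch. Hive's simple UDFs do by convention expose an `evaluate` method, but everything else here — the `UdfBridge` name, the reflective dispatch — is invented for illustration and is not a real Flink or Hive API.]

```java
import java.lang.reflect.Method;

// Hypothetical sketch: how an engine could wrap a Hive-style UDF
// (a class exposing an `evaluate` method, per Hive's simple UDF
// convention) without requiring any change to the user's code.
public class UdfBridge {

    // A user's existing Hive-style UDF, completely unchanged.
    public static class UpperUdf {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    // Reflectively locate and invoke `evaluate`, as an importing engine
    // might do when registering the function in its own catalog.
    public static Object call(Object udf, Object... args) throws Exception {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate")
                    && m.getParameterCount() == args.length) {
                return m.invoke(udf, args);
            }
        }
        throw new IllegalArgumentException("no matching evaluate() found");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(call(new UpperUdf(), "flink on hive"));
    }
}
```

A real bridge would additionally map Hive's type system onto the engine's types (item 5 in the list), which this sketch deliberately leaves out.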