(Forwarding this also to the User mailing list as I made a typo when replying to this email thread)
---------- Forwarded message ---------
From: Martijn Visser <martijnvis...@apache.org>
Date: Wed, 9 Mar 2022 at 20:57
Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
To: dev <d...@flink.apache.org>, Francesco Guardiani <france...@ververica.com>, Timo Walther <twal...@apache.org>, <us...@flink.apache.org>

Hi everyone,

Thank you all very much for your input. From my perspective, I consider batch as a special case of streaming. So with Flink SQL, we can support both batch and streaming use cases and I think we should use Flink SQL as our target.

To reply on some of the comments:

@Jing on your remark:
> Since Flink has a clear vision of unified batch and stream processing, supporting batch jobs will be one of the critical core features to help us reach the vision and let Flink have an even bigger impact in the industry.

I fully agree with that statement. I do think that having Hive syntax support doesn't help with that unified batch and stream processing. We're making it easier for batch users to run their Hive batch jobs on Flink, but that doesn't fit the "unified" part since it's focussed on batch, while Flink SQL focusses on batch and streaming. I would rather have invested time in making batch improvements to Flink and Flink SQL than in Hive syntax support. I do understand from the given replies that Hive syntax support is valuable for those that are already running batch processing and would like to run these queries on Flink. I do think that's limited to mostly Chinese companies at the moment.

@Jark I think you've provided great input and are spot on with:
> Regarding the maintenance concern you raised, I think that's a good point and they are in the plan. The Hive dialect is already a plugin and an option now, and the implementation is located in the hive-connector module. We still need some work to make the Hive dialect rely purely on public APIs, and the Hive connector should be decoupled from the table planner.
> At that time, we can move the whole Hive connector into a separate repository (I guess this is also in the externalize connectors plan).

I'm looping in Francesco and Timo who can elaborate more in depth on the current maintenance issues. I think we need to have a proper plan on how this tech debt/maintenance can be addressed and to get commitment that this will be resolved in Flink 1.16, since we indeed need to move out all previously agreed connectors before Flink 1.16 is released.

> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.

I do think there is a conflict with Flink SQL; you can't use both of them at the same time, so you don't have access to all features in Flink. That increases feature sparsity and user friction. It also puts a bigger burden on the Flink community, because having both options available means more maintenance work. For example, an upgrade of Calcite is more impactful. The Flink codebase is already rather large and CI build times are already too long. More code means more risk of bugs. If a user at some point wants to change their Hive batch job to a streaming Flink SQL job, there's still migration work for the user; it just needs to happen at a later stage.

@Jingsong I think you have a good argument that migrating SQL for Batch ETL is indeed an expensive effort.

Last but not least, no one has yet commented on the supported Hive versions and security issues. I've reached out to the Hive community and the info I've received so far is that only Hive 3.1.x and Hive 2.3.x are still supported. The older Hive versions are no longer maintained and also don't receive security updates.
This is important because many companies scan the Flink project for vulnerabilities and won't allow using it when these types of vulnerabilities are included.

My summary would be the following:
* Like Jark said, in the short term, Hive syntax compatibility is the ticket for us to have a seat in batch processing. Having improved Hive syntax support for that in Flink can help in this.
* In the long term, we can and should drop it and focus on Flink SQL itself both for batch and stream processing.
* The Hive maintainers/volunteers should come up with a plan on how the tech debt/maintenance with regards to Hive query syntax can be addressed and will be resolved for Flink 1.16. This includes stuff like using public APIs and decoupling it from the planner. This is also extremely important since we want to move out connectors with Flink 1.16 (next Flink release). I'm hoping that those who can help out with this will chip in.
* We follow the Hive versions that are still supported, which means we drop support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive 3.1 to the latest version.

Thanks again for your input and looking forward to your thoughts on this.

Best regards,

Martijn

On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <luoyuxia.luoyu...@alibaba-inc.com> wrote:

> Hi Martijn,
> Thanks for driving this discussion.
>
> About your concerns, I would like to share my opinion.
>
> Actually, to be more exact, FLIP-152 [1] does not extend Flink SQL to support Hive query syntax; it provides a Hive dialect option that lets users switch to the Hive dialect. Judging from the commits for the corresponding FLINK-21529, it doesn't involve much of Flink itself.
>
> - About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I think it won't be much of a bother.
>
> - Although Apache Hive is less popular, it has been widely used as an open source database over the years.
>   There still exist many Hive SQL jobs in many companies.
>
> - As I said, the current implementation for Hive SQL syntax is more like a plugin, so we can also add support for Snowflake and the others as long as it's necessary.
>
> - As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.
>
> - The current implementation for Hive SQL syntax uses a pluggable HiveParser [3] to parse the SQL statement. I think supporting Hive syntax won't bring much complexity to Flink.
>
> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529
> [3] https://issues.apache.org/jira/browse/FLINK-21531
>
> Best regards,
> Yuxia.
>
> ------------------ Original Message ------------------
> *From:* Martijn Visser <martijnvis...@apache.org>
> *Sent:* Mon Mar 7 19:23:15 2022
> *To:* dev <d...@flink.apache.org>, User <user@flink.apache.org>
> *Subject:* [DISCUSS] Flink's supported APIs and Hive query syntax
>
>> Hi everyone,
>>
>> Flink currently has 4 APIs with multiple language support which can be used to develop applications:
>>
>> * DataStream API, both Java and Scala
>> * Table API, both Java and Scala
>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>> * Python API
>>
>> Since FLIP-152 [1] the Flink SQL support has been extended to also support the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address more syntax compatibility issues.
>>
>> I would like to open a discussion on Flink directly supporting the Hive query syntax. I have some concerns about whether 100% Hive query syntax support is indeed something that we should aim for in Flink.
>>
>> I can understand that having Hive query syntax support in Flink could help users due to interoperability and being able to migrate. However:
>>
>> - Adding full Hive query syntax support will mean that we go from 6 fully supported API/language combinations to 7. I think we are currently already struggling with maintaining the existing combinations, let alone one more.
>>
>> - Apache Hive is/appears to be a project that's not that actively developed anymore. The last release was made in January 2021. Its popularity is rapidly declining in Europe and the United States, also due to Hadoop becoming less popular.
>> - Related to the previous topic, other software like Snowflake, Trino/Presto, Databricks are becoming more and more popular. If we add full support for the Hive query syntax, then why not add support for Snowflake and the others?
>> - We are supporting Hive versions that are no longer supported by the Hive community with known security vulnerabilities. This makes Flink also vulnerable to those types of vulnerabilities.
>> - The current Hive implementation is done by using a lot of internals of Flink, making Flink hard to maintain, with lots of tech debt and making things overly complex.
>>
>> From my perspective, I think it would be better to not have Hive query syntax compatibility directly in Flink itself. Of course we should have a proper Hive connector and a proper Hive catalog to make connectivity with Hive (the versions that are still supported by the Hive community) itself possible. Alternatively, if Hive query syntax is so important, it should not rely on internals but be available as a dialect/pluggable option. That could also open up the possibility to add more syntax support for others in the future, but I really think we should just focus on Flink SQL itself. That's already hard enough to maintain and improve on.
>>
>> I'm looking forward to the thoughts of both Developers and Users, so I'm cross-posting to both mailing lists.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>
>