Reply: Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

罗宇侠(莫辞) Tue, 15 Mar 2022 08:15:55 -0700

Hi, thanks for the input.
I wrote a Google Doc, “Plan for decoupling Hive connector with Flink planner” 
[1], that shows how to decouple the Hive connector from the planner.

[1] https://docs.google.com/document/d/1LMQ_mWfB_mkYkEBCUa2DgCO2YdtiZV7YRs2mpXyjdP4/edit?usp=sharing

Best, 
Yuxia.

------------------------------------------------------------------
From: Jark Wu<[email protected]>
Date: 2022-03-10 19:59:30
To: Francesco Guardiani<[email protected]>; dev<[email protected]>
Cc: Martijn Visser<[email protected]>; User<[email protected]>; 
罗宇侠(莫辞)<[email protected]>
Subject: Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Hi Francesco,

Yes. The Hive syntax is a syntax plugin provided by the Hive connector.

> But right now I don't think it's a good idea adding new features on top, 
> as it will create only more maintenance burden both for Hive developers and for 
> table developers.

We are not adding new Hive features, only fixing compatibility and behavior bugs. 
Almost all of these fixes touch only the Hive connector code and have nothing to 
do with the table planner.

I agree we should investigate, ASAP, how to decouple the Hive connector from the 
planner and how much work that would take. 
We will come up with a Google Doc soon. But AFAIK, this may not be a huge amount 
of work, and it should not conflict with the bugfix work.

Best,
Jark
On Thu, 10 Mar 2022 at 17:03, Francesco Guardiani <[email protected]> 
wrote:

> We still need some work to make the Hive dialect purely rely on public APIs, 
> and the Hive connector should be decoupled from the table planner.

From the table perspective, I think this is the big pain point at the moment. 
First of all, when we talk about the Hive syntax, we're really talking about 
the Hive connector, as my understanding is that without the Hive connector in 
the classpath you can't use the Hive syntax [1].

The Hive connector relies heavily on internals [2], and this is a significant 
struggle for the table project, as it sometimes impedes and slows down 
development of new features and creates a huge maintenance burden for table 
developers [3]. The planner itself has some classes specific to Hive [4], 
making the codebase of the planner more complex than it already is. Some of 
these are just legacy; others exist because some abstractions are missing on 
the table planner side, but those just need some work.

So I agree with Jark: when the two Hive modules (connector-hive and 
sql-parser-hive) reach a point where they don't depend at all on 
flink-table-planner, like every other connector (except for testing of course), 
we should be good to move them into a separate repo and continue committing to 
them. But right now I don't think it's a good idea to add new features on top, 
as it will only create more maintenance burden for both Hive developers and 
table developers.

My concern with this plan is: how realistic is it to fix all the planner 
internal leaks in the existing Hive connector/parser? To me this seems like a 
huge task, including a non-trivial amount of work to stabilize and design new 
entry points in the Table API.

[1] HiveParser
[2] HiveParserCalcitePlanner
[3] Just talking about code coupling, not even mentioning problems like 
dependencies and security updates
[4] HiveAggSqlFunction
On Thu, Mar 10, 2022 at 9:05 AM Martijn Visser <[email protected]> wrote:

Thank you Yuxia for volunteering, that's very much appreciated. It would be 
great if you could create an umbrella ticket for that.

It would be great to get some insight from current Flink and Hive users into 
which versions are being used.
@Jark I would indeed deprecate the old Hive versions in Flink 1.15 and then 
drop them in Flink 1.16. That would also remove some tech debt and make it less 
work with regards to externalizing connectors.

Best regards,

Martijn
On Thu, 10 Mar 2022 at 07:39, Jark Wu <[email protected]> wrote:

Thanks Martijn for the reply and summary. 

I totally agree with your plan, and thank Yuxia for volunteering to work on the 
Hive tech debt issue. 
I think we can create an umbrella issue for this and target version 1.16. We 
can discuss details and create subtasks there.

Regarding dropping old Hive versions, I'm also fine with that. But I would like 
to survey some Hive users first to see whether it's acceptable at this point. 
My first thought was that we can deprecate the old Hive versions in 1.15, and 
then discuss dropping them in 1.16 or 1.17.

Best,
Jark

On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞) <[email protected]> wrote:

Thanks Martijn for your insights.

About the tech debt/maintenance with regards to Hive query syntax, I would like 
to chip in, and I expect it can be resolved for Flink 1.16.

Best regards,

Yuxia

 ------------------ Original Message ------------------
From: Martijn Visser <[email protected]>
Sent: Thu Mar 10 04:03:34 2022
To: User <[email protected]>
Subject: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

(Forwarding this also to the User mailing list as I made a typo when replying 
to this email thread)

---------- Forwarded message ---------
From: Martijn Visser <[email protected]>
Date: Wed, 9 Mar 2022 at 20:57
Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
To: dev <[email protected]>, Francesco Guardiani <[email protected]>, 
Timo Walther <[email protected]>, <[email protected]>

Hi everyone,

Thank you all very much for your input. From my perspective, I consider batch 
as a special case of streaming. So with Flink SQL, we can support both batch 
and streaming use cases and I think we should use Flink SQL as our target. 

To reply on some of the comments:

@Jing on your remark:
> Since Flink has a clear vision of unified batch and stream processing, 
> supporting batch jobs will be one of the critical core features to help us 
> reach the vision and let Flink have an even bigger impact in the industry.

I fully agree with that statement. However, I don't think that having Hive 
syntax support helps with unified batch and stream processing. We're making it 
easier for batch users to run their Hive batch jobs on Flink, but that doesn't 
fit the "unified" part since it's focussed on batch, while Flink SQL focusses 
on batch and streaming. I would rather have invested time in making batch 
improvements to Flink and Flink SQL than in Hive syntax support. I do 
understand from the given replies that Hive syntax support is valuable for 
those that are already running batch processing and would like to run these 
queries on Flink. I do think that's limited to mostly Chinese companies at the 
moment.

@Jark I think you've provided great input and are spot on with: 
> Regarding the maintenance concern you raised, I think that's a good point and 
> they are in the plan. The Hive dialect has already been a plugin and option 
> now, and the implementation is located in hive-connector module. We still 
> need some work to make the Hive dialect purely rely on public APIs, and the 
> Hive connector should be decoupled from the table planner. At that time, we can 
> move the whole Hive connector into a separate repository (I guess this is 
> also in the externalize connectors plan).

I'm looping in Francesco and Timo who can elaborate more in depth on the 
current maintenance issues. I think we need to have a proper plan on how this 
tech debt/maintenance can be addressed and to get commitment that this will be 
resolved in Flink 1.16, since we indeed need to move out all previously agreed 
connectors before Flink 1.16 is released.

> From my perspective, Hive is still widely used and there exist many running 
> Hive SQL jobs, so why not provide users a better experience to help them 
> migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's 
> just a dialect option. 

I do think there is a conflict with Flink SQL; you can't use both of them at 
the same time, so you don't have access to all features in Flink. That 
increases feature sparsity and user friction. It also puts a bigger burden on 
the Flink community, because having both options available means more 
maintenance work. For example, an upgrade of Calcite is more impactful. The 
Flink codebase is already rather large and CI build times are already too long. 
More code means more risk of bugs. If a user at some point wants to change their 
Hive batch job to a streaming Flink SQL job, there's still migration work for 
the user; it just needs to happen at a later stage.

@Jingsong I think you have a good argument that migrating SQL for Batch ETL is 
indeed an expensive effort. 

Last but not least, no one has yet commented on the supported Hive versions and 
security issues. I've reached out to the Hive community, and the info I've 
received so far is that only Hive 3.1.x and Hive 2.3.x are still supported. The 
older Hive versions are no longer maintained and also don't receive security 
updates. This is important because many companies scan the Flink project for 
vulnerabilities and won't allow using it when these types of vulnerabilities 
are included.

My summary would be the following:
* Like Jark said, in the short term, Hive syntax compatibility is our ticket to 
a seat at the batch processing table. Improved Hive syntax support in Flink can 
help with this.
* In the long term, we can and should drop it and focus on Flink SQL itself 
both for batch and stream processing.
* The Hive maintainers/volunteers should come up with a plan for how the tech 
debt/maintenance with regards to Hive query syntax can be addressed and 
resolved for Flink 1.16. This includes things like using public APIs and 
decoupling it from the planner. This is also extremely important since we want 
to move out connectors with Flink 1.16 (the next Flink release). I'm hoping 
that those who can help out with this will chip in.
* We follow the Hive versions that are still supported, which means we drop 
support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive 3.1 to the 
latest version. 

Thanks again for your input and looking forward to your thoughts on this.

Best regards,

Martijn 
On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <[email protected]> wrote:

Hi Martijn,
Thanks for driving this discussion. 

About your concerns, I would like to share my opinion.
To be more precise, FLIP-152 [1] does not extend Flink SQL to support Hive 
query syntax; it provides a Hive dialect option that lets users switch to the 
Hive dialect. Judging from the commits for the corresponding FLINK-21529 [2], 
it doesn't involve much of Flink itself.

- About the struggle with maintenance: the current implementation just 
provides an option for users to use the Hive dialect. I don't think it will be 
much of a bother.

- Although Apache Hive has become less popular, it has been widely used as an 
open source database for years. There still exist many Hive SQL jobs in many 
companies.

- As I said, the current implementation of Hive SQL syntax is more of a 
pluggable one; we could also support Snowflake and the others if it becomes 
necessary.

- As for the known security vulnerabilities of Hive, maybe that's not a 
critical problem for this discussion.

- The current implementation of Hive SQL syntax uses a pluggable 
HiveParser [3] to parse SQL statements. I don't think supporting Hive syntax 
brings much complexity to Flink.

From my perspective, Hive is still widely used and there exist many running 
Hive SQL jobs, so why not provide users a better experience to help them 
migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's 
just a dialect option. 

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529
[3] https://issues.apache.org/jira/browse/FLINK-21531
Best regards,
Yuxia.

 ------------------ Original Message ------------------
From: Martijn Visser <[email protected]>
Sent: Mon Mar 7 19:23:15 2022
To: dev <[email protected]>, User <[email protected]>
Subject: [DISCUSS] Flink's supported APIs and Hive query syntax

Hi everyone,

Flink currently has 4 APIs with multiple language support which can be used
to develop applications:

* DataStream API, both Java and Scala
* Table API, both Java and Scala
* Flink SQL, both in Flink query syntax and Hive query syntax (partially)
* Python API

Since FLIP-152 [1] the Flink SQL support has been extended to also support
the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
more syntax compatibility issues.

I would like to open a discussion on Flink directly supporting the Hive
query syntax. I have some concerns about whether 100% Hive query syntax
compatibility is indeed something that we should aim for in Flink.

I can understand that having Hive query syntax support in Flink could help
users due to interoperability and being able to migrate. However:

- Adding full Hive query syntax support will mean that we go from 6 fully
supported API/language combinations to 7. I think we are currently already
struggling with maintaining the existing combinations, let alone one
more.
- Apache Hive is/appears to be a project that's not that actively developed
anymore. The last release was made in January 2021. Its popularity is
rapidly declining in Europe and the United States, also due to Hadoop becoming
less popular.
- Related to the previous topic, other software like Snowflake,
Trino/Presto, Databricks are becoming more and more popular. If we add full
support for the Hive query syntax, then why not add support for Snowflake
and the others?
- We are supporting Hive versions that are no longer supported by the Hive
community and have known security vulnerabilities. This also makes Flink
vulnerable to those types of vulnerabilities.
- The current Hive implementation uses a lot of Flink internals,
making Flink hard to maintain, with lots of tech debt, and making
things overly complex.

From my perspective, I think it would be better to not have Hive query
syntax compatibility directly in Flink itself. Of course we should have a
proper Hive connector and a proper Hive catalog to make connectivity with
Hive (the versions that are still supported by the Hive community) itself
possible. Alternatively, if Hive query syntax is so important, it should
not rely on internals but be available as a dialect/pluggable option. That
could also open up the possibility to add more syntax support for others in
the future, but I really think we should just focus on Flink SQL itself.
That's already hard enough to maintain and improve on.

I'm looking forward to the thoughts of both Developers and Users, so I'm
cross-posting to both mailing lists.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529
