Hive integration Improvement

18717838093 Thu, 15 Jul 2021 05:15:19 -0700


Hi, experts.

Currently, Hudi SQL statements for DML are in most cases executed by the Hive
Driver using concatenated SQL strings. Concatenating SQL this way is hard to
maintain, and the code breaks easily. In addition, multiple versions of Hive
cannot be supported at the moment, which causes a lot of headaches for users.
So I would like to refactor and refine these two things to get a better design
that is more convenient for users.

For example, the following functions use the Driver to execute SQL:

HiveSyncTool#syncHoodieTable creates a database through the Driver.
HoodieHiveClient#createTable creates a table through the Driver.
HoodieHiveClient#addPartitionsToTable adds partitions through the Driver.
HoodieHiveClient#updatePartitionsToTable updates partitions through the Driver.
HoodieHiveClient#updateTableDefinition alters a table through the Driver.
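
To make the maintenance concern concrete, here is a minimal Java sketch of the
concatenation style described above. It is not Hudi's actual code: the class
ConcatenatedDdlExample, the method addPartitionsSql, and the table and
partition names are all made up for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical illustration only; this is not taken from Hudi.
final class ConcatenatedDdlExample {

    // Builds an ADD PARTITION statement by gluing strings together.
    // Quoting and escaping are handled by hand, which is the fragile part.
    static String addPartitionsSql(String db, String table, List<String> partitionClauses) {
        StringJoiner sql = new StringJoiner(" ");
        sql.add("ALTER TABLE `" + db + "`.`" + table + "` ADD IF NOT EXISTS");
        for (String clause : partitionClauses) {
            sql.add("PARTITION (" + clause + ")");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        List<String> clauses = Collections.singletonList("dt='2021-07-15'");
        System.out.println(addPartitionsSql("default", "trips", clauses));
    }
}
```

Every quoting and escaping rule lives in hand-written string code like this,
and the resulting string still has to be parsed by whichever Hive version
receives it.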

Other than that, metadata operations such as
HoodieHiveClient#updateTableProperties, HoodieHiveClient#scanTablePartitions,
and HoodieHiveClient#doesTableExist use the client API rather than the Driver.
From a design point of view, the two pieces are not aligned. So I think we
need to abstract a unified interface for everything that talks to HMS, and
stop using the Driver to execute DML. To support multiple versions of Hive, we
can add a shim layer that supports different versions of HMS.
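
As a rough sketch of what a unified interface plus a shim layer could look
like, consider the Java below. Every name in it (MetastoreOperations,
InMemoryShim, ShimLoader, the method names, and the version numbers) is an
assumption made for illustration and is not taken from Hudi or from RFC-31.
The shim keeps its state in memory so that the example runs without Hive; a
real shim would delegate to the metastore client of the matching HMS version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Hudi.
final class MetastoreShimSketch {

    // One interface for every operation that talks to HMS, with no SQL strings.
    interface MetastoreOperations {
        void createDatabase(String db);
        void createTable(String db, String table);
        void addPartitions(String db, String table, List<String> partitions);
        boolean tableExists(String db, String table);
        List<String> listPartitions(String db, String table);
    }

    // Stand-in for a version-specific shim; state is kept in memory only.
    static final class InMemoryShim implements MetastoreOperations {
        private final Set<String> databases = new HashSet<>();
        private final Map<String, List<String>> partitionsByTable = new HashMap<>();

        public void createDatabase(String db) {
            databases.add(db);
        }

        public void createTable(String db, String table) {
            if (!databases.contains(db)) {
                throw new IllegalStateException("Unknown database: " + db);
            }
            partitionsByTable.putIfAbsent(db + "." + table, new ArrayList<>());
        }

        public void addPartitions(String db, String table, List<String> partitions) {
            List<String> existing = partitionsByTable.get(db + "." + table);
            if (existing == null) {
                throw new IllegalStateException("Unknown table: " + db + "." + table);
            }
            for (String partition : partitions) {
                if (!existing.contains(partition)) {
                    existing.add(partition); // skip partitions that are already registered
                }
            }
        }

        public boolean tableExists(String db, String table) {
            return partitionsByTable.containsKey(db + "." + table);
        }

        public List<String> listPartitions(String db, String table) {
            List<String> found = partitionsByTable.get(db + "." + table);
            return found == null ? Collections.<String>emptyList() : new ArrayList<>(found);
        }
    }

    // Picks a shim from the HMS major version; callers only see the interface.
    static final class ShimLoader {
        static MetastoreOperations forVersion(String hmsVersion) {
            String major = hmsVersion.split("\\.")[0];
            switch (major) {
                case "2":
                case "3":
                    // A real loader would return a different shim class per version.
                    return new InMemoryShim();
                default:
                    throw new IllegalArgumentException("Unsupported HMS version: " + hmsVersion);
            }
        }
    }

    public static void main(String[] args) {
        MetastoreOperations ops = ShimLoader.forVersion("3.1.2");
        ops.createDatabase("default");
        ops.createTable("default", "trips");
        ops.addPartitions("default", "trips", Collections.singletonList("dt=2021-07-15"));
        System.out.println(ops.listPartitions("default", "trips"));
    }
}
```

With this shape, the sync code depends only on the interface, and supporting
another HMS version means adding one more shim rather than touching the
callers.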

I have a preliminary design in RFC-31
(https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment).
I hope everyone can help review it and offer suggestions.
Thank you very much.

- Looking forward to your reply.

minglei

