Hi, experts.


Currently, Hudi executes most of its Hive SQL statements through the Hive 
Driver, building each statement by string concatenation. Concatenated SQL is 
hard to maintain, and the code breaks easily. In addition, multiple versions 
of Hive cannot be supported at the moment, which causes a lot of headaches for 
users. I would therefore like to refactor both areas to get a cleaner design 
that is also more convenient for users.


For example, the following functions use the Driver to execute SQL:

HiveSyncTool#syncHoodieTable creates a database through the Driver.
HoodieHiveClient#createTable creates a table through the Driver.
HoodieHiveClient#addPartitionsToTable adds partitions through the Driver.
HoodieHiveClient#updatePartitionsToTable updates partitions through the Driver.
HoodieHiveClient#updateTableDefinition alters the table through the Driver.
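
To make the maintenance concern concrete, here is a minimal sketch of what building a partition statement by concatenation tends to look like. It is not the actual HoodieHiveClient code; the class name, method name, and parameters are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT the real HoodieHiveClient code.
final class ConcatenatedDdlSketch {

    // Builds an "ALTER TABLE ... ADD PARTITION" statement by string concatenation.
    // Every quote, backtick, and space has to be placed by hand, and a partition
    // value that itself contains a single quote would produce invalid SQL.
    static String addPartitionSql(String table, String partitionKey,
                                  List<String> values, String basePath) {
        StringBuilder sql = new StringBuilder("ALTER TABLE `" + table + "` ADD IF NOT EXISTS");
        for (String value : values) {
            sql.append(" PARTITION (`").append(partitionKey).append("`='").append(value).append("')")
               .append(" LOCATION '").append(basePath).append("/").append(value).append("'");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Assumed sample table name, partition key, and base path.
        System.out.println(addPartitionSql("trips", "dt",
                Arrays.asList("2021-03-01", "2021-03-02"), "/data/trips"));
    }
}
```

Any change to quoting or clause order has to be repeated in every method that assembles a statement this way, which is where the breakage tends to come from.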




By contrast, metadata operations such as 
HoodieHiveClient#updateTableProperties, HoodieHiveClient#scanTablePartitions, 
and HoodieHiveClient#doesTableExist go through the client API instead. From a 
design standpoint, the two approaches are not aligned. I think we should 
abstract a single unified interface for everything that talks to HMS and stop 
using the Driver to execute SQL. To support multiple Hive versions, we can 
add a shim layer that handles the different versions of HMS.
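
As a rough sketch of the idea (not the design in the RFC), the unified interface and the shim layer could look something like the following. All names here (MetastoreSyncClient, MetastoreShim, ShimLoader, and the in-memory stand-ins) are hypothetical, and the in-memory client only simulates a metastore so that the example runs without Hive on the classpath.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical unified interface: every metastore operation goes through it,
// so callers never assemble SQL strings.
interface MetastoreSyncClient {
    void createDatabase(String database);
    void createTable(String database, String table, Map<String, String> columns);
    boolean doesTableExist(String database, String table);
}

// Hypothetical shim: one implementation per supported Hive major version,
// hiding version-specific differences behind the same client interface.
interface MetastoreShim {
    String majorVersion();
    MetastoreSyncClient newClient();
}

// In-memory stand-in for a real metastore-backed client (assumption for the demo).
final class InMemorySyncClient implements MetastoreSyncClient {
    private final Map<String, Set<String>> tablesByDatabase = new HashMap<>();

    @Override
    public void createDatabase(String database) {
        tablesByDatabase.computeIfAbsent(database, k -> new HashSet<>());
    }

    @Override
    public void createTable(String database, String table, Map<String, String> columns) {
        Set<String> tables = tablesByDatabase.get(database);
        if (tables == null) {
            throw new IllegalStateException("Unknown database: " + database);
        }
        tables.add(table);
    }

    @Override
    public boolean doesTableExist(String database, String table) {
        Set<String> tables = tablesByDatabase.get(database);
        return tables != null && tables.contains(table);
    }
}

// Stand-in shim that would, in a real setup, wrap a version-specific client.
final class InMemoryShim implements MetastoreShim {
    private final String majorVersion;

    InMemoryShim(String majorVersion) { this.majorVersion = majorVersion; }

    @Override public String majorVersion() { return majorVersion; }
    @Override public MetastoreSyncClient newClient() { return new InMemorySyncClient(); }
}

// Picks a shim by the major part of the Hive version string, e.g. "3.1.2" -> "3".
final class ShimLoader {
    private static final Map<String, MetastoreShim> SHIMS = new HashMap<>();

    private ShimLoader() {}

    static void register(MetastoreShim shim) { SHIMS.put(shim.majorVersion(), shim); }

    static MetastoreShim forVersion(String hiveVersion) {
        String major = hiveVersion.split("\\.")[0];
        MetastoreShim shim = SHIMS.get(major);
        if (shim == null) {
            throw new IllegalArgumentException("No shim registered for Hive " + hiveVersion);
        }
        return shim;
    }
}

final class ShimDemo {
    public static void main(String[] args) {
        ShimLoader.register(new InMemoryShim("2"));
        ShimLoader.register(new InMemoryShim("3"));
        MetastoreSyncClient client = ShimLoader.forVersion("3.1.2").newClient();
        client.createDatabase("demo_db");
        client.createTable("demo_db", "trips", new HashMap<>());
        System.out.println(client.doesTableExist("demo_db", "trips"));
    }
}
```

With this shape, the sync code depends only on the interface, and supporting another Hive version means adding one more shim rather than touching the callers.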


I have written up a preliminary design in RFC-31 
(https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment).
I hope everyone can help review it and offer some suggestions. 
Thank you very much.


Looking forward to your reply.


minglei



