Hi Tim,

Thanks so much for the suggestion.
I also think that implementing HudiTable as a variant of HdfsTable would be the 
cleaner way.
I will focus on understanding HdfsTable now; it is really a big file.

Currently, our team only uses the Copy-on-Write mode, so I will try to 
implement Copy-on-Write support first.

Can you explain more about the two catalog implementations? 
My understanding is that one holds the metadata of the table and the other is 
the frontend interface of the table. However, for HdfsTable I only found 
HdfsTable, no FeHdfsTable.

Thanks so much!

Best regards

Yuanbin Cheng
CR/PJ-AI-S1  




-----Original Message-----
From: Tim Armstrong <tarmstr...@cloudera.com> 
Sent: Tuesday, July 16, 2019 12:28 PM
To: dev@impala <dev@impala.apache.org>
Subject: Re: Support Apache Hudi

Hi Cheng,
  I think that is one way you could approach it. I'm not really familiar enough 
with Hudi to know if that's the right way. I took a quick look at 
https://hudi.incubator.apache.org/concepts.html and I'm wondering if it would 
actually be cleaner to implement as a variant of HdfsTable. HdfsTable is used 
for any Hive filesystem-based table, not just HDFS - e.g. S3 or whatever. Hudi 
seems like it's similar to Hive ACID in a lot of ways, and we're currently 
adding support for Hive ACID in that way.

Which Hudi features are you planning to implement? Copy-on-Write seems like it 
would be simpler to implement - it might only require changes in the frontend 
(i.e. java code). Merge-on-read probably requires backend support for merging 
the delta files with the base files. Write support also seems more complex than 
read support.
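[Editor's aside: to make the Copy-on-Write reading discussed above concrete, here is a minimal sketch of why it could be frontend-only. A Hudi Copy-on-Write table keeps whole base files, so a snapshot read only has to pick the newest base file per file group. The file-name layout `<fileId>_<writeToken>_<commitTime>.parquet` and the lexicographic commit-time comparison are assumptions for illustration, not Impala or Hudi code.]

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: given Copy-on-Write base file names of the (assumed)
// form "<fileId>_<writeToken>_<commitTime>.parquet", keep only the newest
// file per file group. This is file selection only - no delta merging, which
// is why Merge-on-Read would need backend support but Copy-on-Write may not.
public class LatestBaseFiles {
  static String commitTimeOf(String fileName) {
    String stem = fileName.substring(0, fileName.length() - ".parquet".length());
    return stem.split("_")[2];  // assumed third underscore-separated field
  }

  static List<String> selectLatest(List<String> files) {
    Map<String, String> latestPerGroup = new HashMap<>();
    for (String f : files) {
      String fileId = f.split("_")[0];
      String current = latestPerGroup.get(fileId);
      // Commit times are assumed fixed-width timestamps, so lexicographic
      // comparison matches chronological order.
      if (current == null || commitTimeOf(f).compareTo(commitTimeOf(current)) > 0) {
        latestPerGroup.put(fileId, f);
      }
    }
    return latestPerGroup.values().stream().sorted().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(selectLatest(Arrays.asList(
        "f1_0_20190701.parquet", "f1_0_20190715.parquet",
        "f2_0_20190710.parquet")));
    // [f1_0_20190715.parquet, f2_0_20190710.parquet]
  }
}
```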

Also another note - currently there are actually two catalog implementations 
that require their own table implementation, e.g. see 
fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java and 
fe/src/main/java/org/apache/impala/catalog/HBaseTable.java
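[Editor's aside: the two-implementation pattern referenced above can be sketched roughly as an interface plus a catalog-side class. All names below are simplified stand-ins for illustration, not Impala's actual Fe*Table API.]

```java
// Frontend-facing interface (the Fe* side): planner code programs against
// this, regardless of which catalog implementation backs it.
interface FeDemoTable {
  String getTableName();
}

// Catalog-side implementation that owns the loaded metadata.
class DemoCatalogTable implements FeDemoTable {
  private final String tableName_;
  DemoCatalogTable(String tableName) { tableName_ = tableName; }
  public String getTableName() { return tableName_; }
}

public class TwoImplPattern {
  public static void main(String[] args) {
    // A new table type would typically need both pieces: the interface
    // (or reuse of an existing one) and a catalog implementation.
    FeDemoTable t = new DemoCatalogTable("functional.alltypes");
    System.out.println(t.getTableName());
  }
}
```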

On Tue, Jul 16, 2019 at 9:55 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) < 
fixed-term.yuanbin.ch...@us.bosch.com> wrote:

> Hi,
>
> Our team is now using Apache Hudi to migrate our data pipeline from 
> batch to incremental processing.
> However, we find that Apache Impala cannot pull the Hudi metadata 
> from Hive.
> Here is the issue: https://github.com/apache/incubator-hudi/issues/179
> Now I am trying to fix this issue.
>
> After reading some code related to Impala's table objects, my current 
> thought is to implement a new HudiTable class and add it to the 
> fromMetastoreTable method in the Table class.
> Maybe adding some support methods to the current Table type could also 
> solve this issue? I am not very familiar with the Impala source code.
> Here is the Jira ticket for this issue:
> https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146
>
> Do you have any idea about how to solve this issue?
>
> I appreciate any help!
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
>
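[Editor's aside: the dispatch proposed in the quoted message above, adding a HudiTable branch to a fromMetastoreTable-style factory, might look roughly like this. The "inputFormat" key and the "Hoodie" substring check are guesses for illustration, not the exact Impala or Hudi API.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical factory: choose a table class from metastore properties.
abstract class Table { }
class HdfsTable extends Table { }
class HudiTable extends HdfsTable { }  // Hudi as a variant of HdfsTable

public class TableDispatchSketch {
  static Table fromMetastoreTable(Map<String, String> msTblParams) {
    // Hudi tables register a Hoodie input format with the metastore;
    // the exact property key used here is an assumption.
    String inputFormat = msTblParams.getOrDefault("inputFormat", "");
    if (inputFormat.contains("Hoodie")) return new HudiTable();
    return new HdfsTable();
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put("inputFormat", "org.apache.hudi.hadoop.HoodieParquetInputFormat");
    System.out.println(fromMetastoreTable(params) instanceof HudiTable);
  }
}
```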
