RE: Support Apache Hudi

FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) Fri, 19 Jul 2019 14:05:01 -0700

Hi Tim,

The Hudi developers said that Hudi partitioning is compatible with Hive 
partitioning.
I got some ideas from the implementation in the Hive ACID support 
tickets, and I am now trying to implement Hudi support.


Could I create a Jira ticket for this task and use your Jenkins server for 
builds? Waiting for the build process takes me a lot of time.

Thanks so much!

Best regards

Yuanbin Cheng
CR/PJ-AI-S1  



-----Original Message-----
From: Tim Armstrong <tarmstr...@cloudera.com> 
Sent: Tuesday, July 16, 2019 3:24 PM
To: dev@impala <dev@impala.apache.org>
Subject: Re: Support Apache Hudi

Sorry, I meant to refer to
./fe/src/main/java/org/apache/impala/catalog/local/LocalHbaseTable.java;
FeHdfsTable is an interface shared by those two classes.

There's a default catalog implementation that is based on all Impala daemons 
holding a cached snapshot of metadata, and a re-implementation where Impala 
daemons fetch metadata on demand from a catalog service. The design doc for the 
re-implementation is here, although I suspect some details have changed:
https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit
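
As a rough illustration of that split, here is a minimal sketch, not Impala's 
actual code. Every name in it (FeTableSketch, SnapshotTable, OnDemandTable, 
describe) is hypothetical; it only shows the pattern of one frontend-facing 
interface with one implementation per catalog mode:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CatalogSketch {
    /** Hypothetical frontend-facing view of a table, shared by both catalog modes. */
    interface FeTableSketch {
        String getName();
        int getNumPartitions();
    }

    /** Mode 1 (assumed shape): the daemon holds a cached snapshot of the metadata. */
    static final class SnapshotTable implements FeTableSketch {
        private final String name;
        private final int numPartitions;

        SnapshotTable(String name, int numPartitions) {
            this.name = name;
            this.numPartitions = numPartitions;
        }

        @Override public String getName() { return name; }
        @Override public int getNumPartitions() { return numPartitions; }
    }

    /** Mode 2 (assumed shape): metadata is fetched on demand from a catalog service. */
    static final class OnDemandTable implements FeTableSketch {
        private final String name;
        // Stand-in for a remote call to the catalog service.
        private final Function<String, Integer> catalogService;

        OnDemandTable(String name, Function<String, Integer> catalogService) {
            this.name = name;
            this.catalogService = catalogService;
        }

        @Override public String getName() { return name; }
        // Every call goes back to the service instead of reading a local copy.
        @Override public int getNumPartitions() { return catalogService.apply(name); }
    }

    /** Planner-style code depends only on the shared interface. */
    static String describe(FeTableSketch table) {
        return table.getName() + ": " + table.getNumPartitions() + " partition(s)";
    }

    public static void main(String[] args) {
        Map<String, Integer> remote = new HashMap<>();
        remote.put("db.t", 3);
        System.out.println(describe(new SnapshotTable("db.t", 3)));
        System.out.println(describe(new OnDemandTable("db.t", remote::get)));
    }
}
```

The point of the sketch is that code written against the interface does not 
care which catalog mode produced the table, which is why a new table type 
would need an implementation on both sides.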

It may be helpful to look at some recent commits that added Hive ACID support 
just to get an idea of how that was implemented:
https://gerrit.cloudera.org/#/q/acid

I guess one detail that may not work so well with HdfsTable is the partitioning 
- it's unclear to me how compatible the Hudi partitioning is with Hive's 
partitioning scheme.
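
To make the question concrete, here is a minimal sketch of what I mean by 
Hive's scheme, assuming Hive's default key=value directory layout. The class 
and method names are hypothetical, it ignores Hive's escaping of special 
characters in values, and it says nothing about how Hudi actually lays out 
its directories:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HivePartitionPathSketch {
    /**
     * Parses a relative partition path such as "year=2019/month=07" into
     * ordered key/value pairs. Rejects components that are not key=value.
     */
    static Map<String, String> parse(String relativePath) {
        Map<String, String> spec = new LinkedHashMap<>();
        for (String component : relativePath.split("/")) {
            if (component.isEmpty()) {
                continue; // tolerate leading or doubled slashes
            }
            int eq = component.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException(
                    "Not a key=value directory: " + component);
            }
            spec.put(component.substring(0, eq), component.substring(eq + 1));
        }
        return spec;
    }

    public static void main(String[] args) {
        System.out.println(parse("year=2019/month=07"));
        try {
            // Hypothetical path that does not follow the key=value layout.
            parse("2019/07/16");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If a table's partition directories cannot be read this way, the partition 
keys would have to come from somewhere other than the path.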

- Tim



On Wed, Jul 17, 2019 at 6:54 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) < 
fixed-term.yuanbin.ch...@us.bosch.com> wrote:

> Hi Tim,
>
> Thanks so much for the suggestion.
> I also think that implementing the Hudi table as a variant of HdfsTable 
> should be the cleaner way.
> I will focus on understanding HdfsTable for now; it is a really big file.
>
> Currently, our team only uses the Copy-on-Write mode, so I will try 
> to implement Copy-on-Write first.
>
> Can you explain more about the two catalog implementations?
> My understanding is that one is more about the metadata of the table and 
> one is the frontend interface of the table. However, for HdfsTable, I 
> only found HdfsTable, no FeHdfsTable.
>
> Thanks so much!
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
>
>
> -----Original Message-----
> From: Tim Armstrong <tarmstr...@cloudera.com>
> Sent: Tuesday, July 16, 2019 12:28 PM
> To: dev@impala <dev@impala.apache.org>
> Subject: Re: Support Apache Hudi
>
> Hi Cheng,
>   I think that is one way you could approach it. I'm not really 
> familiar enough with Hudi to know if that's the right way. I took a 
> quick look at https://hudi.incubator.apache.org/concepts.html and I'm 
> wondering if it would actually be cleaner to implement as a variant of 
> HdfsTable. HdfsTable is used for any Hive filesystem-based table, not 
> just HDFS - e.g. S3 or whatever. Hudi seems like it's similar to Hive 
> ACID in a lot of ways, which we're currently adding support for in that way.
>
> Which Hudi features are you planning to implement? Copy-on-Write seems 
> like it would be simpler to implement - it might only require changes 
> in the frontend (i.e. java code). Merge-on-read probably requires 
> backend support for merging the delta files with the base files. Write 
> support also seems more complex than read support.
>
> Also another note - currently there are actually two catalog 
> implementations that require their own table implementation, e.g. see 
> fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java and 
> fe/src/main/java/org/apache/impala/catalog/HBaseTable.java
>
> On Tue, Jul 16, 2019 at 9:55 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) 
> < fixed-term.yuanbin.ch...@us.bosch.com> wrote:
>
> > Hi,
> >
> > Our team is now using Apache Hudi to migrate our data pipeline from 
> > batch to incremental processing.
> > However, we found that Apache Impala cannot pull the Hudi 
> > metadata from Hive.
> > Here is the issue: 
> > https://github.com/apache/incubator-hudi/issues/179
> > Now I am trying to fix this issue.
> >
> > After reading some of the code related to Impala's table objects, 
> > my current thought is to implement a new HudiTable class and add 
> > it to the fromMetastoreTable method in the Table class.
> > Or maybe adding some support methods to the current Table type would 
> > also solve this issue? I am not very familiar with the Impala source code.
> > Here is the Jira ticket for this issue:
> > https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146
> >
> > Do you have any idea about how to solve this issue?
> >
> > I appreciate any help!
> >
> > Best regards
> >
> > Yuanbin Cheng
> > CR/PJ-AI-S1
> >
> >
> >
>
