I added you to the contributor role on JIRA.

On Fri, Jul 19, 2019 at 3:39 PM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
fixed-term.yuanbin.ch...@us.bosch.com> wrote:
> Hi Tim,
>
> Thanks so much for the information.
> My Jira user name is Yuanbin.
>
> Looking forward to making some contributions.
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
> -----Original Message-----
> From: Tim Armstrong <tarmstr...@cloudera.com>
> Sent: Friday, July 19, 2019 3:23 PM
> To: dev@impala <dev@impala.apache.org>
> Subject: Re: Support Apache Hudi
>
> Please feel free to create a JIRA. We can add you as a contributor on
> Apache JIRA if you give us your username; then you can assign it to
> yourself.
>
> You should be able to use our Jenkins instance to run tests on a draft
> Gerrit patch:
> https://cwiki.apache.org/confluence/display/IMPALA/Using+Gerrit+to+submit+and+review+patches#UsingGerrittosubmitandreviewpatches-Verifyingapatch(opentoallImpalacontributors)
>
> Unfortunately we don't have a way to accelerate the initial local build.
> We have a few tips for making incremental builds significantly faster here:
> https://cloudera.atlassian.net/wiki/spaces/ENG/pages/100832437/Tips+for+Faster+Impala+Builds
> It is a lot quicker to iterate on code changes if you follow some of the
> tips there, e.g. use ccache and only rebuild the components of Impala that
> you modified.
>
> - Tim
>
> On Fri, Jul 19, 2019 at 2:04 PM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
> fixed-term.yuanbin.ch...@us.bosch.com> wrote:
>
> > Hi Tim,
> >
> > The guys from Hudi said that the Hudi partitioning is compatible with
> > Hive partitioning.
> > I think I got some ideas from the implementation of the Hive ACID
> > support tickets, and I am trying to implement the Hudi support now.
> >
> > Could I create a Jira ticket for this task and use your Jenkins server
> > for builds? It takes me so much time waiting for the build process.
> >
> > Thanks so much!
> >
> > Best regards
> >
> > Yuanbin Cheng
> > CR/PJ-AI-S1
> >
> > -----Original Message-----
> > From: Tim Armstrong <tarmstr...@cloudera.com>
> > Sent: Tuesday, July 16, 2019 3:24 PM
> > To: dev@impala <dev@impala.apache.org>
> > Subject: Re: Support Apache Hudi
> >
> > Sorry, I meant to refer to
> > ./fe/src/main/java/org/apache/impala/catalog/local/LocalHbaseTable.java;
> > FeHBaseTable is an interface shared by those two classes.
> >
> > There's a default catalog implementation that is based on all Impala
> > daemons holding a cached snapshot of metadata, and a re-implementation
> > where Impala daemons fetch metadata on demand from a catalog service.
> > The design doc for the reimplementation is here, although I suspect
> > some details have changed:
> > https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit
> >
> > It may be helpful to look at some recent commits that added Hive ACID
> > support, just to get an idea of how that was implemented:
> > https://gerrit.cloudera.org/#/q/acid
> >
> > I guess one detail that may not work so well with HdfsTable is the
> > partitioning - it's unclear to me how compatible the Hudi partitioning
> > is with Hive's partitioning scheme.
> >
> > - Tim
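(For illustration only: the split described above - one frontend interface with a separate table implementation per catalog, as with FeHBaseTable, HBaseTable and LocalHbaseTable - could carry over to Hudi roughly as in the sketch below. Every Hudi-specific name here is hypothetical; this is not existing Impala code.)

import java.util.List;

// Planner/frontend code would be written against an interface, so it does
// not care which catalog implementation produced the table object
// (compare FeHBaseTable).
interface FeHudiTable {
  String getTableName();
  // For Copy-on-Write tables the base files are plain Parquet files that
  // the existing file-based scan path could read.
  List<String> getBaseFilePaths();
}

// Implementation for the default catalog, where every impalad caches a
// full snapshot of table metadata (compare HBaseTable).
class CatalogdHudiTable implements FeHudiTable {
  private final String name_;
  private final List<String> baseFiles_;

  CatalogdHudiTable(String name, List<String> baseFiles) {
    name_ = name;
    baseFiles_ = baseFiles;
  }

  @Override
  public String getTableName() { return name_; }

  @Override
  public List<String> getBaseFilePaths() { return baseFiles_; }
}

// Implementation for the on-demand ("local") catalog, where metadata is
// fetched lazily from the catalog service (compare LocalHbaseTable).
class LocalHudiTable implements FeHudiTable {
  private final String name_;

  LocalHudiTable(String name) { name_ = name; }

  @Override
  public String getTableName() { return name_; }

  @Override
  public List<String> getBaseFilePaths() {
    // A real implementation would fetch file metadata from catalogd here.
    throw new UnsupportedOperationException("not sketched");
  }
}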
> > On Wed, Jul 17, 2019 at 6:54 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
> > fixed-term.yuanbin.ch...@us.bosch.com> wrote:
> >
> > > Hi Tim,
> > >
> > > Thanks so much for the suggestion.
> > > I also think that implementing a Hudi table as a variant of HdfsTable
> > > would be the cleaner way.
> > > I will focus on understanding HdfsTable now; it is really a big file.
> > >
> > > Currently, our team only uses the Copy-on-Write mode, so I will
> > > try to implement Copy-on-Write first.
> > >
> > > Can you explain more about the two catalog implementations?
> > > My understanding is that one is more about the metadata of the table
> > > and one is the frontend interface of the table; however, for
> > > HdfsTable, I only found HdfsTable, no FeHdfsTable.
> > >
> > > Thanks so much!
> > >
> > > Best regards
> > >
> > > Yuanbin Cheng
> > > CR/PJ-AI-S1
> > >
> > > -----Original Message-----
> > > From: Tim Armstrong <tarmstr...@cloudera.com>
> > > Sent: Tuesday, July 16, 2019 12:28 PM
> > > To: dev@impala <dev@impala.apache.org>
> > > Subject: Re: Support Apache Hudi
> > >
> > > Hi Cheng,
> > > I think that is one way you could approach it. I'm not really
> > > familiar enough with Hudi to know if that's the right way. I took a
> > > quick look at https://hudi.incubator.apache.org/concepts.html and
> > > I'm wondering if it would actually be cleaner to implement it as a
> > > variant of HdfsTable. HdfsTable is used for any Hive
> > > filesystem-based table, not just HDFS - e.g. S3 or whatever. Hudi
> > > seems similar to Hive ACID in a lot of ways, which we're
> > > currently adding support for in that way.
> > >
> > > Which Hudi features are you planning to implement? Copy-on-Write
> > > seems like it would be simpler to implement - it might only require
> > > changes in the frontend (i.e. Java code). Merge-on-Read probably
> > > requires backend support for merging the delta files with the base
> > > files. Write support also seems more complex than read support.
> > >
> > > Also another note - currently there are actually two catalog
> > > implementations that require their own table implementation, e.g.
> > > see fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java and
> > > fe/src/main/java/org/apache/impala/catalog/HBaseTable.java
> > >
> > > On Tue, Jul 16, 2019 at 9:55 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
> > > fixed-term.yuanbin.ch...@us.bosch.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Our team is now using Apache Hudi to migrate our data pipeline
> > > > from batch to incremental processing.
> > > > However, we find that Apache Impala cannot pull the Hudi
> > > > metadata from Hive.
> > > > Here is the issue:
> > > > https://github.com/apache/incubator-hudi/issues/179
> > > > Now I am trying to fix this issue.
> > > >
> > > > After reading some code related to Impala's table objects, my
> > > > current thought is to implement a new HudiTable class and add it
> > > > to the fromMetastoreTable method in the Table class.
> > > > Maybe adding some support methods to the current Table type could
> > > > also solve this issue? I am not very familiar with the Impala
> > > > source code.
> > > > Here is the Jira ticket for this issue:
> > > > https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146
> > > >
> > > > Do you have any idea how to solve this issue?
> > > >
> > > > I appreciate any help!
> > > >
> > > > Best regards
> > > >
> > > > Yuanbin Cheng
> > > > CR/PJ-AI-S1
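(For illustration only: a minimal sketch of the dispatch proposed in the thread for Table.fromMetastoreTable, assuming Hudi tables can be recognized from the input format recorded in the Hive metastore. The factory signature is simplified, and isHudiTable plus the *Sketch classes are hypothetical stand-ins, not Impala's real Table/HdfsTable hierarchy.)

import org.apache.hadoop.hive.metastore.api.Table;

public class HudiDispatchSketch {

  // Assumption: Hudi registers a Hudi-specific input format on the tables
  // it syncs into the Hive metastore, so the input format name is one
  // plausible way to recognize a Hudi table.
  private static boolean isHudiTable(Table msTbl) {
    if (msTbl.getSd() == null) return false;
    String inputFormat = msTbl.getSd().getInputFormat();
    if (inputFormat == null) return false;
    String lower = inputFormat.toLowerCase();
    return lower.contains("hudi") || lower.contains("hoodie");
  }

  static FeTableSketch fromMetastoreTable(Table msTbl) {
    if (isHudiTable(msTbl)) {
      // Copy-on-Write: base files are ordinary Parquet, so a thin variant
      // of the filesystem-backed table may be enough on the read path.
      return new HudiHdfsTableSketch(msTbl);
    }
    return new FsTableSketch(msTbl);
  }

  // Minimal stand-ins so the sketch is self-contained; they do not model
  // Impala's real Table/HdfsTable hierarchy.
  interface FeTableSketch { String fullName(); }

  static class FsTableSketch implements FeTableSketch {
    final Table msTbl_;
    FsTableSketch(Table msTbl) { msTbl_ = msTbl; }
    @Override
    public String fullName() {
      return msTbl_.getDbName() + "." + msTbl_.getTableName();
    }
  }

  static class HudiHdfsTableSketch extends FsTableSketch {
    HudiHdfsTableSketch(Table msTbl) { super(msTbl); }
  }
}

Whether detection should key off the input format, a table property, or something else is exactly the kind of question the thread leaves open, along with how Hudi partitioning maps onto Hive's partitioning scheme.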