Stumbled onto this again, in a different context. I think we can actually support a query that returns all versions of a single key, given how Hudi maps a record to a given file group deterministically on any snapshot.
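For illustration, a rough sketch of what such a query could look like with the
Spark datasource (illustrative only: the table path, record key, and commit
instants below are placeholders, and it assumes a time-travel read option like
"as.of.instant" in the Spark reader, which shipped later than this thread):

    // Sketch: collect every historical version of one record key by
    // replaying a snapshot read "as of" each completed commit instant.
    // Because a key maps deterministically to one file group, a real
    // implementation could prune the scan to that file group's slices.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("key-history").getOrCreate()
    val basePath = "/path/to/hudi/table"   // placeholder
    val key      = "some-record-key"       // placeholder

    // Completed commit instants; in practice these come from the timeline
    // under .hoodie/ -- hard-coded here to keep the sketch self-contained.
    val instants = Seq("20201216200000", "20201216210000", "20201216220000")

    val history = instants
      .map { t =>
        spark.read.format("hudi")
          .option("as.of.instant", t)      // time-travel read option
          .load(basePath)
          .filter(s"_hoodie_record_key = '$key'")
      }
      .reduce(_ union _)                   // one row per version of the key

    history.show(false)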
Any interest in reviving this discussion? Thanks, Vinoth

On Thu, Dec 17, 2020 at 5:35 AM heng qian <chaineq...@gmail.com> wrote:

> Hi Gary:
>
> Thank you for your suggestion. The design of the manifest file in RFC-15
> looks great; I think it can work well and efficiently for listing and
> managing files during time travel.
>
> On the second point, forcing a compaction before the snapshot works well
> if there are only a few historical versions we want to access. But if we
> plan to build and query snapshot versions hourly, compaction may waste
> disk space. When commits are frequent, log files avoid generating many
> base files, since one log file can hold several versions of records, and
> compaction can then run at a longer interval purely to accelerate
> queries. So I think the ability to handle log files is necessary for
> time travel on MOR tables.
>
> Great to discuss this with you.
>
> Thanks again,
>
> Chaine
>
> > On Dec 17, 2020, at 6:11 PM, Gary Li <garyli1...@outlook.com> wrote:
> >
> > Hi Heng,
> >
> > Thanks for bringing up this discussion. We have this business
> > requirement as well. IMO the historical snapshot could work nicely
> > with RFC-15. Once we want to capture a snapshot, we could export the
> > Hudi metadata table as a savepoint. This table includes all the file
> > paths of that snapshot.
> >
> > Regarding COW and MOR, I think we can make one function that works for
> > both. If we force a compaction action before the snapshot (or
> > savepoint), we will have parquet files only and won't have to deal
> > with the log files.
> >
> > Happy to discuss more on the RFC.
> >
> > Thanks,
> >
> > Gary
> > ________________________________
> > From: heng qian <chaineq...@gmail.com>
> > Sent: Wednesday, December 16, 2020 9:02 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Subject: Time Travel (querying the historical versions of data)
> > ability for Hudi Table
> >
> > Hi, all:
> > We plan to use Hudi to sync MySQL binlog data. A Flink ETL task will
> > consume binlog records from Kafka and save the data to Hudi every
> > hour. The binlog records are grouped by hour, and all records of one
> > hour are saved in one commit. The data pipeline looks like: binlog ->
> > kafka -> flink -> parquet.
> >
> > After the data is synced to Hudi, we want to query the historical
> > hourly versions of the Hudi table in Hive SQL.
> >
> > Here is a more detailed description of our issue, along with a simple
> > design of Time Travel for Hudi; the design is under development and
> > testing:
> >
> > https://docs.google.com/document/d/1r0iwUsklw9aKSDMzZaiq43dy57cSJSAqT9KCvgjbtUo/edit?usp=sharing
> >
> > I opened an issue here: https://issues.apache.org/jira/browse/HUDI-1460
> >
> > We need to support the Time Travel ability soon for our business
> > needs. We have also seen RFC-07.
> > Glad to receive any suggestions or discussion.
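P.S. For the hourly-versions use case above, an incremental pull between two
commit instants already gets part of the way there with today's Spark
datasource (it returns the records that changed in the window, not the full
snapshot as of that hour -- reconstructing the full snapshot is what the
time-travel work would add). A minimal sketch; the table path and instant
times are placeholders:

    // Sketch: pull exactly the records written between two hourly commits
    // using the Spark datasource incremental query.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hourly-diff").getOrCreate()
    val basePath = "/path/to/hudi/table"   // placeholder

    val hourlyDiff = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20201216200000")
      .option("hoodie.datasource.read.end.instanttime",   "20201216210000")
      .load(basePath)

    hourlyDiff.show(false)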