Hi Gary:

Thank you for your suggestion. The design of the manifest file in RFC-15 looks 
great; I think it can work well and efficiently for listing and managing files 
during the time-travel process. 

For the second point, forcing a compaction before the snapshot can work well if 
there are only a few historical versions we want to access. But if we plan to 
build and query snapshot versions hourly, compaction may lead to a waste of 
disk space, since every compaction rewrites a full new base file for each file 
group. When commits happen frequently, log files can avoid generating plenty of 
base files, because one log file can hold several versions of a record, and 
compaction can then run at a longer interval purely to speed up reads. So I 
think the ability to handle log files is necessary for time travel on MOR 
tables.
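
To make this concrete, here is a rough, untested sketch (using the Spark 
datasource only for brevity, since our real pipeline is Flink; the table name, 
keys, paths, and the 24-commit threshold below are just placeholder 
assumptions) of an hourly upsert into a MOR table where inline compaction runs 
only once every 24 delta commits, so most hourly versions stay in log files:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Illustrative only: hourly MOR upsert with infrequent inline compaction.
val spark = SparkSession.builder().appName("hourly-binlog-sync").getOrCreate()
val hourlyBatch = spark.read.parquet("/tmp/binlog/2020121618") // placeholder input

hourlyBatch.write.format("hudi").
  option("hoodie.table.name", "binlog_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "id").      // placeholder record key
  option("hoodie.datasource.write.precombine.field", "ts").     // placeholder ordering field
  option("hoodie.datasource.write.partitionpath.field", "dt").  // placeholder partition field
  // Compact roughly once a day, so the other hourly versions stay in log files.
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "24").
  mode(SaveMode.Append).
  save("/warehouse/binlog_mor") // placeholder base path

With a setup like this, a time-travel query for most of the hourly instants has 
to merge the base file with the log files written since the last compaction, 
which is why I think log-file handling is needed.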

It’s great to have this discussion with you.

Thanks again

Chaine

> On Dec 17, 2020, at 6:11 PM, Gary Li <garyli1...@outlook.com> wrote:
> 
> Hi Heng,
> 
> Thanks for bringing up this discussion. We have this business requirement as 
> well. IMO the historical snapshot could work nicely with RFC-15. Whenever we 
> want to take a snapshot, we could export the Hudi metadata table as a 
> savepoint. This table includes all the file paths of the snapshot.
> 
> Regarding COW and MOR, I think we can make one function that works for 
> both. If we force a compaction action before the snapshot (or savepoint), we 
> will have parquet files only and won't have to deal with the log files.
> 
> Happy to discuss more on the RFC.
> 
> Thanks
> 
> Gary
> ________________________________
> From: heng qian <chaineq...@gmail.com>
> Sent: Wednesday, December 16, 2020 9:02 PM
> To: dev@hudi.apache.org <dev@hudi.apache.org>
> Subject: Time Travel (querying the historical versions of data) ability for 
> Hudi Table
> 
> Hi, all:
> We plan to use Hudi to sync MySQL binlog data. A Flink ETL task will consume 
> binlog records from Kafka and save the data to Hudi every hour. The binlog 
> records are grouped by hour, and all records from one hour are saved in one 
> commit. The data transmission pipeline looks like: binlog -> Kafka -> Flink 
> -> parquet.
> 
> After the data is synced to Hudi, we want to query the historical hourly 
> versions of the Hudi table in Hive SQL.
> 
> Here is a more detailed description of our issue along with a simple design 
> of Time Travel for Hudi; the design is under development and testing:
> 
> https://docs.google.com/document/d/1r0iwUsklw9aKSDMzZaiq43dy57cSJSAqT9KCvgjbtUo/edit?usp=sharing
> 
> I opened an issue here: https://issues.apache.org/jira/browse/HUDI-1460
> 
> We need to support the Time Travel ability soon for our business needs. We 
> have also seen RFC-07.
> We'd be glad to receive any suggestions or discussion.
