Hi Heng,

Thanks for bringing up this discussion. We have this business requirement as 
well. IMO the historical snapshot could work nicely with RFC-15. Whenever we 
want to take a snapshot, we could export the Hudi metadata table as a 
savepoint, since that table includes all the file paths of the snapshot.

Regarding COW and MOR, I think we can make one mechanism work for both. If we 
force a compaction before the snapshot (or savepoint), we will have parquet 
base files only and won't have to deal with the log files.
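
For illustration, here is a rough sketch of that flow with the Spark write 
client (assuming the HoodieWriteClient APIs around Hudi 0.6.x; exact method 
names and signatures may differ by release): schedule and run a compaction, 
then mark the resulting instant as a savepoint so the snapshot only needs to 
reference parquet base files.

    import org.apache.hudi.client.HoodieWriteClient;
    import org.apache.hudi.common.model.HoodieAvroPayload;
    import org.apache.hudi.common.util.Option;
    import org.apache.hudi.config.HoodieWriteConfig;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SnapshotViaCompactionAndSavepoint {
      public static void createSnapshot(JavaSparkContext jsc, String basePath, String tableName) {
        HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
            .withPath(basePath)
            .forTable(tableName)
            .build();
        HoodieWriteClient<HoodieAvroPayload> client = new HoodieWriteClient<>(jsc, cfg);
        // Schedule a compaction; the returned instant (if any) is the compaction time.
        Option<String> compactionInstant = client.scheduleCompaction(Option.empty());
        if (compactionInstant.isPresent()) {
          // Rewrite the MOR log files into parquet base files.
          client.compact(compactionInstant.get());
          // Pin this instant as a savepoint so the cleaner keeps the snapshot's files.
          client.savepoint(compactionInstant.get(), "snapshot-job",
              "hourly snapshot after forced compaction");
        }
        client.close();
      }
    }

The same ordering could also be driven from the hudi-cli (compaction 
schedule / compaction run, then savepoint create); the sketch above is just 
to show the sequencing.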

Happy to discuss more on the RFC.

Thanks

Gary
________________________________
From: heng qian <chaineq...@gmail.com>
Sent: Wednesday, December 16, 2020 9:02 PM
To: dev@hudi.apache.org <dev@hudi.apache.org>
Subject: Time Travel (querying the historical versions of data) ability for 
Hudi Table

Hi, all:
We plan to use Hudi to sync MySQL binlog data. A Flink ETL task will consume 
binlog records from Kafka and write the data to Hudi every hour. The binlog 
records are grouped by hour, and all records from one hour are saved in a 
single commit. The data pipeline looks like: binlog -> Kafka -> Flink -> 
parquet.

After the data is synced to Hudi, we want to query the historical hourly 
versions of the Hudi table in Hive SQL.
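
For reference, the kind of per-hour view we are after could be approximated 
today with the Spark datasource incremental query (a rough sketch; option keys 
as of Hudi 0.6.x, COW tables only, and assuming the relevant commits have not 
been cleaned or archived), but we would like to express the same thing 
natively in Hive SQL:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HourlyVersionRead {
      // Read the table contents as of a given hourly commit instant.
      // An incremental query from "000" (before any real instant) up to the
      // target instant returns the latest value of every record as of that instant.
      public static Dataset<Row> readAsOf(SparkSession spark, String basePath, String hourCommitTime) {
        return spark.read()
            .format("org.apache.hudi")
            .option("hoodie.datasource.query.type", "incremental")
            .option("hoodie.datasource.read.begin.instanttime", "000")
            .option("hoodie.datasource.read.end.instanttime", hourCommitTime)
            .load(basePath);
      }
    }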

Here is a more detailed description of our issue, along with a simple design 
of Time Travel for Hudi; the design is under development and testing:

https://docs.google.com/document/d/1r0iwUsklw9aKSDMzZaiq43dy57cSJSAqT9KCvgjbtUo/edit?usp=sharing

I opened an issue here: https://issues.apache.org/jira/browse/HUDI-1460

We need to support the Time Travel ability soon for our business needs. We 
have also seen RFC-07.
We would be glad to receive any suggestions or further discussion.
