Stumbled onto this again, in a different context. I think we can actually support a query that returns all versions of a single key, given how Hudi maps a record to a given file group deterministically on any snapshot.
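For illustration, a rough sketch of what such a query could look like with the
Spark datasource (illustrative only: the table path, record key, and commit
instants below are placeholders, and it assumes a time-travel read option like
"as.of.instant" in the Spark reader, which shipped later than this thread):

    // Sketch: collect every historical version of one record key by
    // replaying a snapshot read "as of" each completed commit instant.
    // Because a key maps deterministically to one file group, a real
    // implementation could prune the scan to that file group's slices.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("key-history").getOrCreate()
    val basePath = "/path/to/hudi/table"   // placeholder
    val key      = "some-record-key"       // placeholder

    // Completed commit instants; in practice these come from the timeline
    // under .hoodie/ -- hard-coded here to keep the sketch self-contained.
    val instants = Seq("20201216200000", "20201216210000", "20201216220000")

    val history = instants
      .map { t =>
        spark.read.format("hudi")
          .option("as.of.instant", t)      // time-travel read option
          .load(basePath)
          .filter(s"_hoodie_record_key = '$key'")
      }
      .reduce(_ union _)                   // one row per version of the key

    history.show(false)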
Any interest in reviving this discussion? Thanks, Vinoth

On Thu, Dec 17, 2020 at 5:35 AM heng qian <chaineq...@gmail.com> wrote:

> Hi Gary:
>
> Thank you for your suggestion. The design of the manifest file in RFC-15
> looks great; I think it can work well and efficiently for listing and
> managing files during time travel.
>
> On the second point, forcing a compaction before the snapshot works well
> if there are only a few historical versions we want to access. But if we
> plan to build and query snapshot versions hourly, compaction may waste
> disk space. When commits are frequent, log files avoid generating many
> base files, since one log file can hold several versions of records, and
> compaction can then run at a longer interval purely to accelerate
> queries. So I think the ability to handle log files is necessary for
> time travel on MOR tables.
>
> Great to discuss this with you.
>
> Thanks again,
>
> Chaine
>
> > On Dec 17, 2020, at 6:11 PM, Gary Li <garyli1...@outlook.com> wrote:
> >
> > Hi Heng,
> >
> > Thanks for bringing up this discussion. We have this business
> > requirement as well. IMO the historical snapshot could work nicely
> > with RFC-15. Once we want to capture a snapshot, we could export the
> > Hudi metadata table as a savepoint. This table includes all the file
> > paths of that snapshot.
> >
> > Regarding COW and MOR, I think we can make one function that works for
> > both. If we force a compaction action before the snapshot (or
> > savepoint), we will have parquet files only and won't have to deal
> > with the log files.
> >
> > Happy to discuss more on the RFC.
> >
> > Thanks,
> >
> > Gary
> > ________________________________
> > From: heng qian <chaineq...@gmail.com>
> > Sent: Wednesday, December 16, 2020 9:02 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Subject: Time Travel (querying the historical versions of data)
> > ability for Hudi Table
> >
> > Hi, all:
> > We plan to use Hudi to sync MySQL binlog data. A Flink ETL task will
> > consume binlog records from Kafka and save the data to Hudi every
> > hour. The binlog records are grouped by hour, and all records of one
> > hour are saved in one commit. The data pipeline looks like: binlog ->
> > kafka -> flink -> parquet.
> >
> > After the data is synced to Hudi, we want to query the historical
> > hourly versions of the Hudi table in Hive SQL.
> >
> > Here is a more detailed description of our issue, along with a simple
> > design of Time Travel for Hudi; the design is under development and
> > testing:
> >
> > https://docs.google.com/document/d/1r0iwUsklw9aKSDMzZaiq43dy57cSJSAqT9KCvgjbtUo/edit?usp=sharing
> >
> > I opened an issue here: https://issues.apache.org/jira/browse/HUDI-1460
> >
> > We need to support the Time Travel ability soon for our business
> > needs. We have also seen RFC-07.
> > Glad to receive any suggestions or discussion.
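P.S. For the hourly-versions use case above, an incremental pull between two
commit instants already gets part of the way there with today's Spark
datasource (it returns the records that changed in the window, not the full
snapshot as of that hour -- reconstructing the full snapshot is what the
time-travel work would add). A minimal sketch; the table path and instant
times are placeholders:

    // Sketch: pull exactly the records written between two hourly commits
    // using the Spark datasource incremental query.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hourly-diff").getOrCreate()
    val basePath = "/path/to/hudi/table"   // placeholder

    val hourlyDiff = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20201216200000")
      .option("hoodie.datasource.read.end.instanttime",   "20201216210000")
      .load(basePath)

    hourlyDiff.show(false)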