Thinking about this again, it seems like we can achieve this if we
support infinite retention (i.e., no cleaning whatsoever) and simply
query all file slices for the keys?
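
As a rough illustration only (not a finalized design): with cleaning
disabled, something like the Spark/Scala sketch below could pull every
version of a single key by running one incremental query per commit
window. The table path, record key, and commit instants are made up for
illustration; the incremental read options are existing Hudi options, and
in practice the instants would be read from the table's timeline.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("all-versions-of-a-key").getOrCreate()

val basePath  = "/data/hudi/my_table"                   // hypothetical table path
val recordKey = "id:42"                                 // hypothetical record key
val commits   = Seq("20201216090000", "20201216100000") // would come from the timeline

// Pair each commit with its predecessor ("000" = beginning of time) and run one
// incremental query per window; with no cleaning, every file slice is still
// present, so each window returns the version of the key written in that commit.
val versions = ("000" +: commits).zip(commits).map { case (begin, end) =>
  spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin)
    .option("hoodie.datasource.read.end.instanttime", end)
    .load(basePath)
    .filter(col("_hoodie_record_key") === recordKey)
}.reduce(_ union _)

versions.orderBy("_hoodie_commit_time").show(false)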

Balaji, I think this is similar to the use case you brought up.

On Sun, Jan 31, 2021 at 6:11 PM Vinoth Chandar <vin...@apache.org> wrote:

> Stumbled onto this again, in a different context.
>
> I think we can actually support a query that returns all versions of a
> single key, given how Hudi deterministically maps a record to a file
> group on any snapshot.
>
> Any interest in reviving this discussion again?
>
> Thanks
> Vinoth
>
> On Thu, Dec 17, 2020 at 5:35 AM heng qian <chaineq...@gmail.com> wrote:
>
>> Hi Gary:
>>
>> Thank you for your suggestion. The design of the manifest file in RFC-15
>> looks great; I think it can list and manage files effectively in the
>> time-travel process.
>>
>> For the second point, forcing a compaction before the snapshot works well
>> if there are only a few historical versions we want to access. But if we
>> plan to build and query hourly snapshot versions, compaction may waste
>> disk space. When commits are frequent, log files avoid generating many
>> base files, since one log file can hold several versions of the records,
>> and compaction can then run at a longer interval purely as an
>> acceleration. So I think time travel for MOR tables needs the ability to
>> handle log files.
>>
>> It’s great to hear your thoughts.
>>
>> Thanks again
>>
>> Chaine
>>
>> > On Dec 17, 2020, at 6:11 PM, Gary Li <garyli1...@outlook.com> wrote:
>> >
>> > Hi Heng,
>> >
>> > Thanks for bringing up this discussion. We have this business
>> > requirement as well. IMO the historical snapshot could work nicely with
>> > RFC-15. Whenever we want to take a snapshot, we could export the Hudi
>> > metadata table as a savepoint; that table includes all the file paths of
>> > the snapshot.
>> >
>> > Regarding COW and MOR, I think we can make one function that works for
>> > both. If we force a compaction before the snapshot (or savepoint), we
>> > will have only parquet files and won't have to deal with the log files.
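>> >
>> > As a rough sketch only (Spark/Scala, where df is the batch to upsert and
>> > basePath is the table path; the config keys are Hudi's existing write
>> > configs, while the table name, record key, and precombine fields are
>> > hypothetical), inline compaction could be forced after every delta
>> > commit so a savepoint only ever sees base files:
>> >
>> >   df.write.format("hudi")
>> >     .option("hoodie.table.name", "my_table")                        // hypothetical
>> >     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
>> >     .option("hoodie.datasource.write.recordkey.field", "id")        // hypothetical
>> >     .option("hoodie.datasource.write.precombine.field", "ts")       // hypothetical
>> >     // compact after every delta commit so only parquet base files remain
>> >     // when the snapshot/savepoint is taken
>> >     .option("hoodie.compact.inline", "true")
>> >     .option("hoodie.compact.inline.max.delta.commits", "1")
>> >     .mode("append")
>> >     .save(basePath)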
>> >
>> > Happy to discuss more on the RFC.
>> >
>> > Thanks
>> >
>> > Gary
>> > ________________________________
>> > From: heng qian <chaineq...@gmail.com>
>> > Sent: Wednesday, December 16, 2020 9:02 PM
>> > To: dev@hudi.apache.org <dev@hudi.apache.org>
>> > Subject: Time Travel (querying the historical versions of data) ability
>> for Hudi Table
>> >
>> > Hi, all:
>> > We plan to use Hudi to sync MySQL binlog data. A Flink ETL task will
>> > consume binlog records from Kafka and save the data to Hudi every hour.
>> > The binlog records are grouped by hour, and all records from one hour
>> > are saved in one commit. The data pipeline looks like: binlog -> kafka
>> > -> flink -> parquet.
>> >
>> > After the data is synced to Hudi, we want to query the historical
>> > hourly versions of the Hudi table in Hive SQL.
>> >
>> > Here is a more detailed description of our issue along with a simple
>> > design of Time Travel for Hudi; the design is under development and testing:
>> >
>> >
>> https://docs.google.com/document/d/1r0iwUsklw9aKSDMzZaiq43dy57cSJSAqT9KCvgjbtUo/edit?usp=sharing
>> >
>> > I opened an issue here: https://issues.apache.org/jira/browse/HUDI-1460
>> >
>> > We need to support the Time Travel ability soon for our business needs.
>> > We have also seen RFC-07.
>> > We would be glad to receive any suggestions or discussion.
>>
>>
