Background:

1. The data arrives roughly in event-time order.

2. When some users read a Hudi table, they may not care about the most recent 
data, but rather about the full data before time T (e.g. a daily snapshot).

3. Reading the RT table is more time-consuming than reading the COW table 
(RO table), because the RT view has to merge log files that have not been 
compacted yet.

4. Currently, a compaction strategy selects either ALL log files under a file 
slice or none of them.

5. There is no compaction strategy based on event time. The existing 
DayBasedCompactionStrategy requires the table to be partitioned by day in a 
specific format (yyyy/MM/dd).




Based on this, I plan to propose a compaction strategy based on event time: 
EventTimeBasedCompactionStrategy.




This strategy can:

1. Expand the use cases of the RO table: give the RO table an event-time 
attribute without requiring date partitioning. Given an event time T, the log 
files before T can be compacted, so the resulting RO table contains all data 
before T, which reduces query latency.

2. Support use cases that need data cut exactly by event time: in some cases, 
users want their data to be partitioned by date. This strategy makes that 
scenario possible.







To implement the strategy, we need to:




1. Support merging a subset of the log files in a file slice.

Currently, all compaction strategies can only select all log files or no log 
files in the orderAndFilter method. 

With an event-time-based strategy, some log files will be left over after the 
compaction. This requires Hudi to support a new feature: merging only some of 
the log files in a file slice, not all of them.

The leftover log files have to stay visible to the timeline. This can be 
achieved by creating symlinks for those log files, or by simply copying them to 
the new instant time. 

A simple diagram is shown below.




2. Write a min event time property to the header of the log block.

When appending to a log, we add the min event time property to the log block 
header, so that we can query the min event time without deserializing the log 
data.
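
As a rough illustration of step 2, the sketch below computes the min event time 
of a batch at append time and stores it in a header map, then reads it back 
from the header alone. It is only a sketch under assumptions: the class name, 
the MIN_EVENT_TIME key, and the plain Map standing in for a log block header are 
all hypothetical and are not existing Hudi APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper; a plain Map stands in for the log block header.
final class MinEventTimeHeader {

    // Assumed header key, not an existing Hudi header type.
    static final String MIN_EVENT_TIME_KEY = "MIN_EVENT_TIME";

    // Called when appending a batch: records the smallest event time
    // (epoch millis) of the batch in the header. An empty batch adds nothing.
    static Map<String, String> buildHeader(List<Long> eventTimesMillis) {
        Map<String, String> header = new HashMap<>();
        if (!eventTimesMillis.isEmpty()) {
            long min = Long.MAX_VALUE;
            for (long t : eventTimesMillis) {
                min = Math.min(min, t);
            }
            header.put(MIN_EVENT_TIME_KEY, Long.toString(min));
        }
        return header;
    }

    // Reads the min event time from the header only; the records themselves
    // are never deserialized. Empty if the block carries no such property.
    static OptionalLong readMinEventTime(Map<String, String> header) {
        String value = header.get(MIN_EVENT_TIME_KEY);
        return value == null
                ? OptionalLong.empty()
                : OptionalLong.of(Long.parseLong(value));
    }
}
```

Returning an empty OptionalLong for blocks without the property leaves it to the 
strategy to decide how to treat older log blocks, for example by always 
including them in the compaction.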




3. Design the EventTimeBasedCompactionStrategy.

The strategy selects all the log files that need to be compacted to cover the 
data before event time T.
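
The selection in steps 1 and 3 could look roughly like the sketch below. It 
splits the log files of one file slice into a part to compact and a part to 
carry over to the new instant. This is an assumption-laden sketch, not the 
proposed implementation: LogFileInfo, Selection and select are hypothetical 
names, and cutting at a prefix of the log files (instead of picking individual 
files) is my own assumption, made so that the write order of the leftover log 
files is preserved.

```java
import java.util.ArrayList;
import java.util.List;

final class EventTimeSelectionSketch {

    // Hypothetical stand-in for a log file plus the min event time
    // read from its block headers.
    static final class LogFileInfo {
        final String name;
        final long minEventTimeMillis;

        LogFileInfo(String name, long minEventTimeMillis) {
            this.name = name;
            this.minEventTimeMillis = minEventTimeMillis;
        }
    }

    // Result of splitting the log files of one file slice.
    static final class Selection {
        final List<LogFileInfo> toCompact = new ArrayList<>();
        final List<LogFileInfo> toCarryOver = new ArrayList<>();
    }

    // Compacts every log file up to and including the last one whose min
    // event time is before the threshold T. Any file holding data before T
    // has a min event time before T, so it always falls inside that prefix.
    static Selection select(List<LogFileInfo> logFilesInWriteOrder, long thresholdMillis) {
        int lastBefore = -1;
        for (int i = 0; i < logFilesInWriteOrder.size(); i++) {
            if (logFilesInWriteOrder.get(i).minEventTimeMillis < thresholdMillis) {
                lastBefore = i;
            }
        }
        Selection selection = new Selection();
        for (int i = 0; i < logFilesInWriteOrder.size(); i++) {
            LogFileInfo file = logFilesInWriteOrder.get(i);
            if (i <= lastBefore) {
                selection.toCompact.add(file);
            } else {
                selection.toCarryOver.add(file);
            }
        }
        return selection;
    }
}
```

Because data arrives only roughly in event-time order, the compacted prefix may 
also contain some records at or after T. The guarantee is only that no record 
before T is left in the carried-over files.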




4. Sync the min event time to an RO table property.

With this, users can tell the freshness of the RO table.




In my company this scenario is common. The strategy can reduce read latency 
while still meeting users' requirements on data time.

I would like to ask whether I can propose an RFC for this feature. I think it 
would be useful for the community as well.




