Background:

1. Data arrives roughly in event-time order.
2. Some users who read a Hudi table do not need the most recent full data, only the complete data before some time T (e.g. a daily snapshot).
3. Reading the RT table is slower than reading the RO table, because RT queries have to merge the uncompacted log files at read time.
4. Currently, a compaction strategy selects either ALL log files under a file slice or none of them.
5. There is no compaction strategy based on event time. The only related one, DayBasedCompactionStrategy, requires the table to be partitioned by day in a specific format (yyyy/mm/dd).

Based on this, I plan to introduce a compaction strategy based on event time: EventTimeBasedCompactionStrategy. This strategy can:

1. Expand the use cases of the RO table: give the RO table an event-time attribute without requiring date partitioning. Given event time T, the log files before T can be compacted, so the resulting RO table contains all data before T, which reduces query latency.
2. Support use cases that need exact event-time data: in some cases, users want their data to be partitioned by date. This strategy makes that scenario possible.

To implement the strategy, we need to:

1. Support merging only some of the log files in a file slice. Currently, all compaction strategies can only select all log files or no log files in the orderAndFilter method. With an event-time-based strategy, some log files will be left over after compaction, so Hudi needs a new feature: merging a subset of the log files in a file slice instead of all of them. The leftover log files must remain visible on the timeline, which can be achieved by creating symlinks for them or by simply copying them to the new instant time. A simple diagram is shown below.
2. Write a min event time property to the log block header. When appending a log, we add the min event time property to the log block, so that we can query the min event time without deserializing the log data.
3. Design EventTimeBasedCompactionStrategy. The strategy selects all the log files that need to be compacted to cover the data before event time T.
4. Sync the min event time to an RO table property. With this, users can know the freshness of the RO table.

In my company this scenario is common: the strategy reduces read latency while still meeting users' requirements on data time. I would like to ask whether I can propose an RFC for this feature. I think it would be useful for the community as well.
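
To make the selection rule of the proposed strategy concrete, here is a minimal Python sketch. It is an illustration only, not Hudi code: `LogFile`, its `min_event_time` field, and `split_by_event_time` are hypothetical names, and the sketch assumes that the min event time of each log file has already been read from its log block header. A log file whose min event time is earlier than T may contain data before T, so it is selected for compaction; every other log file is left over.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LogFile:
    """Hypothetical stand-in for a log file in a file slice."""
    path: str
    min_event_time: int  # assumed to come from the log block header (epoch millis)


def split_by_event_time(
    log_files: List[LogFile], cutoff: int
) -> Tuple[List[LogFile], List[LogFile]]:
    """Split the log files of one file slice around event time `cutoff` (T).

    Returns (to_compact, left_over):
      - to_compact: files that may hold data before T (min_event_time < T)
      - left_over:  files that hold only data at or after T
    """
    to_compact = [f for f in log_files if f.min_event_time < cutoff]
    left_over = [f for f in log_files if f.min_event_time >= cutoff]
    return to_compact, left_over


# Usage: three log files, cutoff T = 200.
files = [LogFile("log.1", 100), LogFile("log.2", 200), LogFile("log.3", 300)]
to_compact, left_over = split_by_event_time(files, cutoff=200)
# to_compact holds log.1; left_over holds log.2 and log.3
```

The leftover list corresponds to the log files that would have to stay visible on the timeline after compaction, which is the part that needs the new "merge a subset of log files" support.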