Record level index with not unique keys

nicolas paris Wed, 12 Jul 2023 02:01:58 -0700

hi there,

Just tested the preview of RLI (rfc-08), amazing feature. Soon the
fast COW (FCOW, rfc-68) will build on RLI to get the parquet offsets
and target individual parquet row groups.


RLI is a global index, therefore it assumes a hudi key is present in
at most one parquet file. As a result, in the metadata table (MDT) the
RLI entry is a single struct, and there is a 1:1 mapping between a key
and a given file.

Type:
   |-- recordIndexMetadata: struct (nullable = true)  
   |    |-- partition: string (nullable = false)  
   |    |-- fileIdHighBits: long (nullable = false)  
   |    |-- fileIdLowBits: long (nullable = false)  
   |    |-- fileIndex: integer (nullable = false)  
   |    |-- instantTime: long (nullable = false)

Content:
   |event_id:1        |{part=3, -6811947225812876253, -7812062179961430298, 0, 1689147210233}|
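
For reading the row above: the fileIdHighBits / fileIdLowBits pair looks
like a 128-bit UUID split into two signed 64-bit longs. That is an
assumption based on the field names, not something confirmed in this
thread. Under that assumption, a minimal Python sketch to turn the two
longs back into a UUID string (file_uuid and to_signed_longs are
hypothetical helpers, not hudi APIs):

```python
import uuid

MASK64 = (1 << 64) - 1  # keeps the low 64 bits of a Python int


def file_uuid(high_bits, low_bits):
    """Combine two signed 64-bit longs into one 128-bit UUID (assumed layout)."""
    return uuid.UUID(int=((high_bits & MASK64) << 64) | (low_bits & MASK64))


def to_signed_longs(u):
    """Inverse of file_uuid: split a UUID into signed (high, low) longs."""
    def signed(x):
        return x - (1 << 64) if x >= (1 << 63) else x
    return signed(u.int >> 64), signed(u.int & MASK64)


# Values copied from the MDT row above.
hi, lo = -6811947225812876253, -7812062179961430298
u = file_uuid(hi, lo)
print(u)                   # UUID string built from the two longs
print(to_signed_longs(u))  # round-trips back to (hi, lo)
```

The masking step is only needed because the longs are signed, as in
Java, while Python ints are unbounded.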
   
We would love to use both the RLI and FCOW features, but I'm afraid
our keys are not unique in our kafka archives. The same key might be
present in multiple partitions, and even in multiple slices within a
partition.

I wonder if, in the future, RLI could support multiple parquet files
per key (e.g. by storing an array of structs). This would make it
possible to leverage RLI in more contexts.
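
To make the idea concrete, here is a toy Python sketch contrasting one
location per key with a list of locations per key. It uses plain dicts,
not hudi code, and does not claim to show how hudi itself handles
duplicate keys; the Location type and the partition / file values are
made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    partition: str
    file_id: str


# The same key written to two different files (made-up values).
records = [
    ("event_id:1", Location("part=3", "file-a")),
    ("event_id:1", Location("part=7", "file-b")),
]

# One struct per key: in this toy dict the later location
# simply replaces the earlier one.
single = {}
for key, loc in records:
    single[key] = loc

# An array of structs per key: every location is kept.
multi = defaultdict(list)
for key, loc in records:
    multi[key].append(loc)

print(single["event_id:1"])  # only one location survives
print(multi["event_id:1"])   # both locations are available
```

With the second layout, a lookup by key returns every file that needs
to be visited, at the cost of larger index entries.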

Thx
