Hi Nicolas,

The record index (RI) feature is designed for maximum performance, since it operates at record-count scale. Hence, the schema is kept simple and minimal.
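To make the point concrete, here is a minimal Python sketch of the two shapes under discussion. It is not Hudi code: `RecordLocation`, `tag_record`, `tag_record_multi`, and the file ids are hypothetical names chosen for illustration, and the in-memory dicts only stand in for the index. With one location per key, tagging an incoming record is a single unambiguous lookup; with a list of locations, the lookup returns several candidates and the writer still has to pick one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RecordLocation:
    """Hypothetical stand-in for one record index entry."""
    partition: str
    file_id: str
    instant_time: int


# Unique keys: each record key maps to exactly one file location.
single_index: Dict[str, RecordLocation] = {
    "event_id:1": RecordLocation("part=3", "file-a", 1689147210233),
}

# Non-unique keys (assumed variant): each key maps to a list of locations.
multi_index: Dict[str, List[RecordLocation]] = {
    "event_id:1": [
        RecordLocation("part=3", "file-a", 1689147210233),
        RecordLocation("part=7", "file-b", 1689147210233),
    ],
}


def tag_record(key: str) -> Optional[RecordLocation]:
    """Return the single known location, or None if the key is new (an insert)."""
    return single_index.get(key)


def tag_record_multi(key: str) -> List[RecordLocation]:
    """Return every candidate location; the caller must still choose among them."""
    return multi_index.get(key, [])


if __name__ == "__main__":
    print(tag_record("event_id:1"))        # one location, nothing to decide
    print(tag_record_multi("event_id:1"))  # several candidates for the same key
```

In the second form, an update or delete for `event_id:1` has no obvious target unless extra information (for example the partition) is supplied alongside the key.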
With non-unique keys, how would tagging of records (for updates / deletes) work? How would the record index know which mapping of the array to return for a given record key?

Thanks
Prashant

On Wed, Jul 12, 2023 at 2:02 AM nicolas paris <nicolas.pa...@riseup.net> wrote:

> hi there,
>
> Just tested the preview of RLI (rfc-08), amazing feature. Soon the fast COW
> (rfc-68) will be based on RLI to get the parquet offsets and allow
> targeting parquet row groups.
>
> RLI is a global index, therefore it assumes the hudi key is present in
> at most one parquet file. As a result, in the MDT, the RLI is of type
> struct, and there is a 1:1 mapping w/ a given file.
>
> Type:
>  |-- recordIndexMetadata: struct (nullable = true)
>  |    |-- partition: string (nullable = false)
>  |    |-- fileIdHighBits: long (nullable = false)
>  |    |-- fileIdLowBits: long (nullable = false)
>  |    |-- fileIndex: integer (nullable = false)
>  |    |-- instantTime: long (nullable = false)
>
> Content:
>  |event_id:1 |{part=3, -6811947225812876253, -7812062179961430298, 0, 1689147210233}|
>
> We would love to use both the RLI and FCOW features, but I'm afraid our
> keys are not unique in our kafka archives. The same key might be present
> in multiple partitions, and even in multiple slices within partitions.
>
> I wonder if, in the future, RLI could support multiple parquet files (by
> storing an array of struct, for example). This would make it possible to
> leverage RLI in more contexts.
>
> Thx