Re: Record level index with not unique keys

Prashant Wason Thu, 13 Jul 2023 10:19:53 -0700

Hi Nicolas,

The RI feature is designed for maximum performance because it operates at
record-count scale (one index entry per record). Hence, the schema is kept
simple and minimal.


With non-unique keys, how would tagging of records (for updates / deletes)
work? How would the record index know which mapping in the array to return
for a given record key?
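
To make the question concrete, here is a minimal sketch of the ambiguity.
It is not Hudi code: RecordLocation, tag_unique and tag_multi are
hypothetical names, the file ids "file-a" / "file-b" and the second
partition are made up, and the array-valued index is only the variant
proposed in this thread, not an existing feature.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RecordLocation:
    # Hypothetical, simplified stand-in for the recordIndexMetadata struct.
    partition: str
    file_id: str
    instant_time: int


# Unique keys: each record key maps to exactly one location.
unique_index: Dict[str, RecordLocation] = {
    "event_id:1": RecordLocation("part=3", "file-a", 1689147210233),
}

# Hypothetical array variant: the same key may live in several files.
multi_index: Dict[str, List[RecordLocation]] = {
    "event_id:1": [
        RecordLocation("part=3", "file-a", 1689147210233),
        RecordLocation("part=7", "file-b", 1689147210233),
    ],
}


def tag_unique(key: str) -> Optional[RecordLocation]:
    # An incoming update / delete is tagged with its single known location.
    return unique_index.get(key)


def tag_multi(key: str) -> List[RecordLocation]:
    # The lookup can only return every candidate; the key alone does not
    # say which of them an incoming update / delete is meant for.
    return multi_index.get(key, [])


if __name__ == "__main__":
    print(tag_unique("event_id:1"))
    print(tag_multi("event_id:1"))
```

With the 1:1 mapping the writer gets one target file per key. With the
array it gets a list, and some extra rule (apply to all locations, or
disambiguate on another field) would have to be defined.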

Thanks
Prashant



On Wed, Jul 12, 2023 at 2:02 AM nicolas paris <nicolas.pa...@riseup.net>
wrote:

> hi there,
>
> Just tested preview of RLI (rfc-08), amazing feature. Soon the fast COW
> (rfc-68) will be based on RLI to get the parquet offsets and allow
> targeting parquet row groups.
>
> RLI is a global index, so it assumes the hudi key is present in
> at most one parquet file. As a result, in the MDT the RLI is of type
> struct, and each key maps 1:1 to a given file.
>
> Type:
>    |-- recordIndexMetadata: struct (nullable = true)
>    |    |-- partition: string (nullable = false)
>    |    |-- fileIdHighBits: long (nullable = false)
>    |    |-- fileIdLowBits: long (nullable = false)
>    |    |-- fileIndex: integer (nullable = false)
>    |    |-- instantTime: long (nullable = false)
>
> Content:
>    |event_id:1        |{part=3, -6811947225812876253, -7812062179961430298, 0, 1689147210233}|
>
> We would love to use both the RLI and FCOW features, but I'm afraid our
> keys are not unique in our kafka archives. The same key might be present
> in multiple partitions, and even in multiple slices within a partition.
>
> I wonder if, in the future, RLI could support multiple parquet files (by
> storing an array of structs, for example). This would make it possible to
> leverage RLI in more contexts.
>
> Thx
>
>
>
>
>
