All,

Currently, Hudi supports partitioned and non-partitioned datasets. A
partitioned dataset is one that groups files (data) into buckets called
partitions. A Hudi dataset may be composed of N partitions, each with M
files. This structure helps canonical Hive/Presto/Spark queries limit the
amount of data read by using the partition as a filter. The value of the
partition/bucket is in most cases derived from the incoming data itself.
The requirement is that once a record is mapped to a partition/bucket,
this mapping should be a) known to Hudi and b) constant for the lifecycle
of the dataset, for Hudi to perform upserts on it. In a non-partitioned
dataset, one can likewise think of this problem as the record key <-> file
id mapping that Hudi requires to be able to perform upserts on a record.
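As a rough illustration (the class and method names below are hypothetical, not Hudi's actual API), the upsert-routing requirement reduces to maintaining a stable record-key-to-file-id mapping:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the record key <-> file id mapping that upserts rely on.
// RecordIndex and tagLocation are illustrative names, not Hudi's actual API.
public class RecordIndex {
    private final Map<String, String> keyToFileId = new HashMap<>();

    // Returns the file id a record key is pinned to, assigning the
    // candidate file id on first insert. Once assigned, the mapping must
    // remain constant so later upserts route to the same file.
    public String tagLocation(String recordKey, String candidateFileId) {
        return keyToFileId.computeIfAbsent(recordKey, k -> candidateFileId);
    }
}
```

A later upsert for an already-seen key must resolve to the file it was originally written to, regardless of which file the new batch would otherwise target.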
The current solution is either a) for the client/user to provide the
correct partition value as part of the payload or b) to use a
GlobalBloomIndex implementation to scan all the files under a given path
(say, a non-partitioned table). In both cases, we are limited either by
the user's ability to provide this information or by the performance
overhead of scanning every file's bloom index.
I'm proposing a new design, which I'm calling the global index: a mapping
of (recordKey <-> fileId). This mapping will be stored and maintained by
Hudi as another implementation of HoodieIndex, and it will address the two
limitations mentioned above. I'd like to see if there are other community
members interested in this project. I will send out a HIP shortly
describing the need for and architecture of this in more detail.

Thanks,
Nishith
