All,

Currently, Hudi supports partitioned and non-partitioned datasets. A partitioned dataset is one which groups files (data) into buckets called partitions. A Hudi dataset may be composed of N partitions, each holding M files. This structure helps canonical Hive/Presto/Spark queries limit the amount of data read by using the partition as a filter. The value of the partition/bucket is in most cases derived from the incoming data itself. The requirement is that once a record is mapped to a partition/bucket, this mapping must a) be known to Hudi and b) remain constant for the lifecycle of the dataset, so that Hudi can perform upserts on the record.

Consequently, in a non-partitioned dataset one can think of this problem as a record key <-> file id mapping that Hudi needs in order to perform upserts on a record.

The current solution is either a) for the client/user to provide the correct partition value as part of the payload, or b) to use a GlobalBloomIndex implementation that scans all the files under a given path (say, for a non-partitioned table). In both cases we are limited, either by the user's ability to provide this information or by the performance overhead of scanning the bloom index of every file.

I'm proposing a new design, which I'm calling global index, that is a mapping of (recordKey <-> fileId). This mapping will be stored and maintained by Hudi as another implementation of HoodieIndex and will address the two limitations mentioned above.

I'd like to see if there are other community members interested in this project. I will send out a HIP shortly describing the need and the architecture in more detail.
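
To make the idea concrete, here is a minimal sketch of what a recordKey -> fileId mapping does during an upsert. It is only an illustration: the class and method names are hypothetical, the mapping lives in memory, and it says nothing about how Hudi would actually store or maintain the index.

```python
from typing import Dict, Iterable, Optional, Tuple


class GlobalIndexSketch:
    """Toy in-memory recordKey -> fileId mapping (hypothetical, not Hudi's API)."""

    def __init__(self) -> None:
        self._key_to_file: Dict[str, str] = {}

    def lookup(self, record_key: str) -> Optional[str]:
        # Returns the file id a key is already mapped to, or None if unseen.
        return self._key_to_file.get(record_key)

    def upsert(
        self, record_keys: Iterable[str], new_file_id: str
    ) -> Dict[str, Tuple[str, str]]:
        """Route each key to a file without scanning any data files.

        Known keys keep their existing file id (update); unseen keys are
        assigned to new_file_id and remembered (insert). A key is never
        remapped once it has a file id.
        """
        routed: Dict[str, Tuple[str, str]] = {}
        for key in record_keys:
            existing = self._key_to_file.get(key)
            if existing is not None:
                routed[key] = ("update", existing)
            else:
                self._key_to_file[key] = new_file_id
                routed[key] = ("insert", new_file_id)
        return routed


index = GlobalIndexSketch()
first = index.upsert(["k1", "k2"], new_file_id="file-001")
second = index.upsert(["k2", "k3"], new_file_id="file-002")
print(first)
print(second)
```

In the second batch, k2 is routed back to file-001 as an update because the mapping already knows it, while k3 is treated as an insert. Neither the user nor a scan over every file's bloom index is needed to make that decision.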
Thanks,
Nishith
