Here is the HIP: https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing

@Vinoth Chandar <[email protected]> @balaji I have added you both as approvers; please take a look.
-Nishith

On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]> wrote:

> JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
>
> -Nishith
>
> On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]> wrote:
>
>> All,
>>
>> Currently, Hudi supports partitioned and non-partitioned datasets. A
>> partitioned dataset is one which groups files (data) into buckets
>> called partitions. A Hudi dataset may be composed of N partitions with
>> M files. This structure helps canonical Hive/Presto/Spark queries limit
>> the amount of data read by using the partition as a filter. The value of
>> the partition/bucket in most cases is derived from the incoming data
>> itself. The requirement is that once a record is mapped to a
>> partition/bucket, this mapping should a) be known to Hudi and b) remain
>> constant for the lifecycle of the dataset, so that Hudi can perform
>> upserts on it. Consequently, in a non-partitioned dataset one can think
>> of this problem as a record key <-> file id mapping that is required
>> for Hudi to be able to perform upserts on a record.
>>
>> The current solution is either a) for the client/user to provide the
>> correct partition value as part of the payload or b) to use a
>> GlobalBloomIndex implementation to scan all the files under a given
>> path (say, a non-partitioned table). In both cases, we are limited
>> either by the capability of the user to provide this information or by
>> the performance overhead of scanning the bloom index of every file.
>>
>> I'm proposing a new design, naming it global index, that is a mapping
>> of (recordKey <-> fileId). This mapping will be stored and maintained
>> by Hudi as another implementation of HoodieIndex and will address the
>> two limitations mentioned above. I'd like to see if there are other
>> community members interested in this project. I will send out a HIP
>> shortly describing more details around the need and architecture of
>> this.
>>
>> Thanks,
>> Nishith
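
For readers following the thread, the (recordKey <-> fileId) mapping proposed in the quoted message can be sketched roughly as below. This is only a minimal in-memory illustration under stated assumptions: the class GlobalIndexSketch and its methods lookup and tagLocation are hypothetical names invented here, not Hudi's HoodieIndex API, and a real index would need durable storage rather than a HashMap. The sketch shows the one property the proposal depends on: once a record key is bound to a file id, later upserts for that key resolve to the same file id.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a global recordKey -> fileId index; not Hudi's API. */
final class GlobalIndexSketch {
    // Assumption: an in-memory map stands in for whatever durable store a real index would use.
    private final Map<String, String> keyToFileId = new HashMap<>();

    /** Returns the file id already bound to this record key, if any. */
    Optional<String> lookup(String recordKey) {
        return Optional.ofNullable(keyToFileId.get(recordKey));
    }

    /**
     * Returns the file id the record should be written to.
     * A key seen for the first time is bound to candidateFileId;
     * a known key keeps its original binding, so the mapping stays constant.
     */
    String tagLocation(String recordKey, String candidateFileId) {
        return keyToFileId.computeIfAbsent(recordKey, k -> candidateFileId);
    }

    public static void main(String[] args) {
        GlobalIndexSketch index = new GlobalIndexSketch();
        System.out.println(index.tagLocation("key-1", "file-A")); // first write binds key-1
        System.out.println(index.tagLocation("key-1", "file-B")); // upsert reuses the binding
        System.out.println(index.lookup("key-2").isPresent());    // unknown key
    }
}
```

With a structure like this, an upsert needs neither a user-supplied partition value nor a scan of every file's bloom filter; it is a single key lookup.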
