Re: Global Index for partitioned Hudi datasets

nishith agarwal Wed, 27 Mar 2019 23:26:59 -0700

Here is the HIP:
https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing
@Vinoth Chandar <[email protected]> @balaji, I added you both as approvers;
please take a look.


-Nishith

On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]> wrote:

> JIRA: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
>
> -Nishith
>
>
>
> On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]>
> wrote:
>
>> All,
>>
>> Currently, Hudi supports partitioned and non-partitioned datasets. A
>> partitioned dataset is one that groups files (data) into buckets called
>> partitions. A Hudi dataset may be composed of N partitions with M files.
>> This structure helps canonical Hive/Presto/Spark queries limit the amount
>> of data read by using the partition as a filter. In most cases, the value
>> of the partition/bucket is derived from the incoming data itself. The
>> requirement is that once a record is mapped to a partition/bucket, this
>> mapping a) is known to Hudi and b) remains constant for the lifecycle of
>> the dataset, so that Hudi can perform upserts on those records.
>> Correspondingly, in a non-partitioned dataset one can think of this
>> problem as a record key <-> file id mapping that Hudi requires in order to
>> perform upserts on a record.
>> The current solution is either a) for the client/user to provide the
>> correct partition value as part of the payload or b) to use a
>> GlobalBloomIndex implementation that scans all the files under a given
>> path (say, a non-partitioned table). In both cases, we are limited either
>> by the user's ability to provide this information or by the performance
>> overhead of scanning the bloom index of every file.
>> I'm proposing a new design, named global index, that is a mapping of
>> (recordKey <-> fileId). This mapping will be stored and maintained by Hudi
>> as another implementation of HoodieIndex and will address the two
>> limitations mentioned above. I'd like to see whether other community
>> members are interested in this project. I will send out a HIP shortly
>> describing the need and architecture in more detail.
>>
>> Thanks,
>> Nishith
>>
>
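
For readers new to the thread, the (recordKey <-> fileId) mapping proposed above can be pictured with a small sketch. This is not Hudi code and not the design in the HIP: the class and method names (InMemoryGlobalIndex, RecordLocation, tag_location, update_location) are hypothetical, and a plain in-memory dict stands in for whatever storage the real index would use. It only illustrates the two properties the proposal asks for: a lookup that needs neither a user-supplied partition nor a scan of every file, and a mapping that stays constant once a record has been placed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RecordLocation:
    """Where a record lives (hypothetical type, for illustration only)."""
    partition_path: str
    file_id: str


class InMemoryGlobalIndex:
    """Toy recordKey -> location mapping; a dict stands in for real storage."""

    def __init__(self) -> None:
        self._locations: Dict[str, RecordLocation] = {}

    def tag_location(
        self, record_keys: Iterable[str]
    ) -> List[Tuple[str, Optional[RecordLocation]]]:
        # One lookup per key; None marks a key never seen before (an insert).
        return [(key, self._locations.get(key)) for key in record_keys]

    def update_location(
        self, record_key: str, location: RecordLocation
    ) -> RecordLocation:
        # The first mapping wins, so the key -> file mapping stays constant
        # even if a later batch derives a different partition for the key.
        return self._locations.setdefault(record_key, location)


index = InMemoryGlobalIndex()
index.update_location("uuid-1", RecordLocation("2019/03/26", "file-a"))

# A later batch carries the same key with a different partition value.
kept = index.update_location("uuid-1", RecordLocation("2019/03/27", "file-b"))

# Upsert path: known keys resolve to their original file, new keys to None.
tagged = index.tag_location(["uuid-1", "uuid-2"])
```

In this sketch an upsert for "uuid-1" is routed to its original file regardless of the partition value in the incoming payload, while "uuid-2" is treated as a new insert.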
