Btw, I am not able to comment on the JIRA. I will get this fixed and post the comment on the JIRA as well. Cheers.
On Thu, Mar 28, 2019 at 12:49 AM Prasanna <[email protected]> wrote:

> Hey Nishith,
>
> Glad we have a concrete proposal on this.
>
> My $0.02 on this:
>
> What we are really building is an approximate indexing system that can
> help us reduce the number of files to look in when a key is updated. The
> problem with having something random in the key (like a UUID) is that the
> approximation no longer works, and we then need a complete mapping of
> every id to its file. We could get by with much more efficient indexing
> techniques for cases where the key is approximately ordered by file
> creation order (ingestion order). I am thinking of something like a
> min/max on the row key for each file. There could be cases where a
> monotonically increasing id-generation service is used for new entities,
> with later updates to those ids, and even for UUIDs there are ways of
> storing them so that the ordering is roughly preserved (this could be one
> way: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/).
>
> Not to throw a monkey wrench into the current design, but it would be
> better to think about the right abstractions for the indexing system, so
> that the index we pick is configurable based on the dataset's key.
>
> Additionally, it would eventually be awesome to make this indexing system
> extensible to speed up querying a Hudi dataset as well. We could
> potentially store the min/max and distinct-value stats from the parquet
> footer for every column in this index, and plug in a Hudi implementation
> of ExternalCatalog.listPartitionsByFilter in Spark:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala#L269
>
> I would assume this should give a big performance boost for selective
> filters, i.e. more granular file pruning based on the predicate
> expressions. We could do the same to plug Hoodie stats into the Starburst
> optimizer in Presto as well.
>
> - Prasanna
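To make the min/max idea above concrete, here is a minimal sketch of range-based file pruning, assuming each file carries the smallest and largest record key it contains. Every name below (KeyRangePruner, FileKeyRange, candidateFiles) is hypothetical and is not an existing Hudi API:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of min/max row-key pruning; not actual Hudi code. */
public class KeyRangePruner {

  /** Per-file statistic: the smallest and largest record key in the file. */
  public static class FileKeyRange {
    final String fileId;
    final String minKey;
    final String maxKey;

    FileKeyRange(String fileId, String minKey, String maxKey) {
      this.fileId = fileId;
      this.minKey = minKey;
      this.maxKey = maxKey;
    }

    /** True if recordKey could possibly live in this file. */
    boolean mayContain(String recordKey) {
      return minKey.compareTo(recordKey) <= 0 && recordKey.compareTo(maxKey) <= 0;
    }
  }

  /** Keep only the files whose [min, max] key range covers the incoming key. */
  static List<FileKeyRange> candidateFiles(List<FileKeyRange> allFiles, String recordKey) {
    List<FileKeyRange> candidates = new ArrayList<>();
    for (FileKeyRange range : allFiles) {
      if (range.mayContain(recordKey)) {
        candidates.add(range);
      }
    }
    return candidates;
  }
}

With keys roughly ordered by ingestion time, the per-file ranges barely overlap and most files are pruned; with purely random keys (plain UUIDs), every range tends to span the whole key space and nothing is pruned, which is exactly the limitation called out above. The same kind of per-column footer stats, kept in such an index, is what a Hudi implementation of ExternalCatalog.listPartitionsByFilter could consult on the query side.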
> On Wed, Mar 27, 2019 at 11:26 PM nishith agarwal <[email protected]> wrote:
>
>> Here is the HIP:
>> https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing
>>
>> @Vinoth Chandar <[email protected]> @balaji, added you guys as approvers;
>> please take a look.
>>
>> -Nishith
>>
>> On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]> wrote:
>>
>>> JIRA: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
>>>
>>> -Nishith
>>>
>>> On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]> wrote:
>>>
>>>> All,
>>>>
>>>> Currently, Hudi supports partitioned and non-partitioned datasets. A
>>>> partitioned dataset is one which bucketizes groups of files (data)
>>>> into buckets called partitions. A Hudi dataset may be composed of N
>>>> partitions with M files. This structure helps canonical
>>>> Hive/Presto/Spark queries limit the amount of data read by using the
>>>> partition as a filter. The value of the partition/bucket is in most
>>>> cases derived from the incoming data itself. The requirement is that
>>>> once a record is mapped to a partition/bucket, this mapping should
>>>> a) be known to Hudi and b) remain constant for the lifecycle of the
>>>> dataset, for Hudi to perform upserts on it.
>>>>
>>>> Consequently, in a non-partitioned dataset one can think of this
>>>> problem as a (record key <-> file id) mapping that Hudi needs in
>>>> order to perform upserts on a record.
>>>>
>>>> The current solution is either a) for the client/user to provide the
>>>> correct partition value as part of the payload, or b) to use a
>>>> GlobalBloomIndex implementation to scan all the files under a given
>>>> path (say, a non-partitioned table). In both cases we are limited
>>>> either by the user's ability to provide this information or by the
>>>> performance overhead of scanning every file's bloom index.
>>>>
>>>> I'm proposing a new design, naming it the global index: a mapping of
>>>> (recordKey <-> fileId) that will be stored and maintained by Hudi as
>>>> another implementation of HoodieIndex, addressing the two limitations
>>>> mentioned above. I'd like to see if there are other community members
>>>> interested in this project. I will send out a HIP shortly describing
>>>> the need for and architecture of this in more detail.
>>>>
>>>> Thanks,
>>>> Nishith
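The contract of the global index proposed above reduces to an exact (recordKey <-> fileId) map consulted during upserts. A toy sketch of that contract follows; the class and its in-memory map are purely illustrative (the HIP is what will spell out how the mapping is actually stored and maintained), and none of the names are existing Hudi APIs:

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the proposed global index contract; not actual Hudi code. */
public class GlobalRecordIndexSketch {

  // In practice Hudi would persist and version this mapping alongside the
  // dataset; an in-memory map stands in purely for illustration.
  private final Map<String, String> recordKeyToFileId = new ConcurrentHashMap<>();

  /** Remember which file a newly inserted key was written to. */
  public void register(String recordKey, String fileId) {
    recordKeyToFileId.put(recordKey, fileId);
  }

  /**
   * Tag an incoming record: a known key routes the upsert to its existing
   * file; an unknown key is treated as an insert.
   */
  public Optional<String> lookup(String recordKey) {
    return Optional.ofNullable(recordKeyToFileId.get(recordKey));
  }
}

Such a mapping would sidestep both limitations mentioned in the proposal: no user-supplied partition value is needed, and no scan of every file's bloom index is required.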
