Hey Nishith,

Glad we have a concrete proposal on this. My $0.02: what we are really building is an approximate indexing system that reduces the number of files to look at when a key is updated. The problem with having something random in the key (like a UUID) is that approximate lookup does not really work, so we would need a complete mapping of every id to its file. We could get away with much more efficient indexing techniques when the key is ordered roughly by file creation (ingestion) order. I am thinking of something like a min/max on the record key for each file.

There are cases where a monotonically increasing id generation service assigns ids to new entities and updates still arrive for those ids, and even for UUIDs there are ways of storing them so that rough ordering is preserved (this could be one way: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/).

Not to throw a monkey wrench into the current design, but it would be better to think about the right abstractions for the indexing system so that the index we pick is actually configurable based on the dataset key.

Additionally, it would eventually be awesome to make this indexing system extensible to speed up querying a Hudi dataset as well. We could store the min/max and distinct-value stats kept in the parquet footer for every column in this index, and plug in a Hudi implementation of ExternalCatalog.listPartitionsByFilter in Spark:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala#L269
I would assume this should give a big performance boost for selective filters, i.e. more granular file pruning based on the predicate expressions. We could do the same to plug in Hoodie stats for the Starburst optimizer in Presto as well.
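To make the min/max idea concrete, here is a rough sketch (illustrative Python, not actual Hudi code; the class and function names are made up). Each file carries the min/max of its record keys; a lookup only touches files whose range may contain the key. With ingestion-ordered keys the ranges barely overlap and pruning is effective; with random keys (plain UUIDs) every range overlaps and it degenerates to a full scan:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FileKeyRange:
    """Per-file stats entry: smallest and largest record key in the file."""
    file_id: str
    min_key: str
    max_key: str

def candidate_files(index: List[FileKeyRange], record_key: str) -> List[str]:
    """Return only the files whose [min_key, max_key] range may contain the key."""
    return [f.file_id for f in index if f.min_key <= record_key <= f.max_key]

# Keys roughly follow ingestion order: ranges are disjoint, pruning works.
ordered = [
    FileKeyRange("f1", "000100", "000199"),
    FileKeyRange("f2", "000200", "000299"),
    FileKeyRange("f3", "000300", "000399"),
]
print(candidate_files(ordered, "000250"))  # -> ['f2']

# Random keys (e.g. plain UUIDs): every file's range spans the key space,
# so the "approximate" index prunes nothing.
random_keys = [
    FileKeyRange("f1", "0a", "ff"),
    FileKeyRange("f2", "03", "fe"),
]
print(candidate_files(random_keys, "7b"))  # -> ['f1', 'f2']
```

This is why a byte-reordered UUID layout (as in the Percona post above) matters: it restores enough ordering for the ranges to stay narrow.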
- Prasanna

On Wed, Mar 27, 2019 at 11:26 PM nishith agarwal <[email protected]> wrote:

> Here is the HIP :
>
> https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing
>
> @Vinoth Chandar <[email protected]> @balaji added you guys as approvers,
> please take a look.
>
> -Nishith
>
> On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]> wrote:
>
> > JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
> >
> > -Nishith
> >
> > On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]> wrote:
> >
> >> All,
> >>
> >> Currently, Hudi supports partitioned and non-partitioned datasets. A
> >> partitioned dataset is one which bucketizes groups of files (data) into
> >> buckets called partitions. A hudi dataset may be composed of N number of
> >> partitions with M number of files. This structure helps canonical
> >> hive/presto/spark queries to limit the amount of data read by using the
> >> partition as a filter. The value of the partition/bucket in most cases is
> >> derived from the incoming data itself. The requirement is that once a
> >> record is mapped to a partition/bucket, this mapping should be a) known to
> >> hudi b) should remain constant for the lifecycle of the dataset for hudi to
> >> perform upserts on them. Consequently, in a non-partitioned dataset one can
> >> think of this problem as a record key <-> file id mapping that is required
> >> for hudi to be able to perform upserts on a record.
> >>
> >> Current solution is either a) for the client/user to provide the correct
> >> partition value as part of the payload or b) use a GlobalBloomIndex
> >> implementation to scan all the files under a given path (say
> >> non-partitioned table). In both these cases, we are limited either by the
> >> capability of the user to provide this information or by the performance
> >> overhead of scanning all files' bloom index.
> >>
> >> I'm proposing a new design, naming it global index, that is a mapping of
> >> (recordKey <-> fileId). This mapping will be stored and maintained by Hudi
> >> as another implementation of HoodieIndex and will address the 2 limitations
> >> mentioned above. I'd like to see if there are other community members
> >> interested in this project. I will send out a HIP shortly describing more
> >> details around the need and architecture of this.
> >>
> >> Thanks,
> >> Nishith
