Great thoughts. Let's chat more on the HIP.

>> I am thinking something like a min/max on the row key for each file.
>> There could be cases where a monotonically increasing id generation
>> service is used when there are new entities.

BloomIndex already does this today. In addition to Bloom filters, it also
leverages range information for such keys.
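For concreteness, that kind of key-range check could look something like the sketch below (purely illustrative; the function and data layout are assumptions for this example, not Hudi's actual API): each file carries a (min, max) over the record key, and only files whose range could contain an incoming key need to be probed further.

```python
# Hypothetical sketch of range-based file pruning: keep only the files whose
# [min_key, max_key] range could contain the incoming record key. Names and
# layout are illustrative, not part of Hudi.

def candidate_files(record_key, file_ranges):
    """file_ranges: dict of file_id -> (min_key, max_key)."""
    return [fid for fid, (lo, hi) in file_ranges.items()
            if lo <= record_key <= hi]

ranges = {
    "f1": ("key_000", "key_099"),
    "f2": ("key_100", "key_199"),
    "f3": ("key_150", "key_299"),  # ranges may overlap across files
}
print(candidate_files("key_170", ranges))  # ['f2', 'f3']
```

With roughly ingestion-ordered keys the ranges overlap little, so most files are skipped; with random keys (like plain UUIDs) every range spans nearly the whole key space and the check prunes nothing, which is the problem discussed below.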


Prasanna, on the JIRA side, if you can share your JIRA id, I can help with
the permissions.



On Thu, Mar 28, 2019 at 12:58 AM Prasanna <[email protected]> wrote:

> Btw I am not able to comment on the jira. I will get this fixed and post
> the comment on the jira as well. Cheers.
>
> On Thu, Mar 28, 2019 at 12:49 AM Prasanna <[email protected]> wrote:
>
> > Hey Nishith,
> >
> > Glad we have a concrete proposal on this.
> >
> > My $0.02 on this.
> >
> > What we are really building is an approximate indexing system that can
> > help us reduce the number of files to look at when a key is updated. The
> > problem with having something random in the key (like a UUID) is that the
> > approximation does not really work, and hence we need a complete mapping
> > of every id to its file. We could use much more efficient indexing
> > techniques for cases where the key is ordered approximately by the file
> > creation order (ingestion order). I am thinking something like a min/max
> > on the row key for each file. There could be cases where a monotonically
> > increasing id generation service is used when there are new entities and
> > there could be updates on them, and even for UUIDs there are ways of
> > storing them so that the ordering can be roughly preserved (this could
> > be one way:
> > https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ ).
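The field-swapping trick in that linked post can be sketched roughly as follows (a minimal illustration under stated assumptions, not a Hudi API: it assumes time-based version-1 UUIDs, whose text form is time_low-time_mid-time_hi_and_version-clock-node). Moving the most-significant timestamp bits to the front makes lexicographic order roughly follow generation order.

```python
# Illustrative sketch of storing a version-1 (time-based) UUID with its
# segments rearranged as time_hi + time_mid + time_low, so that string
# ordering roughly tracks creation time. The helper name is hypothetical.

def reorder_uuid(u: str) -> str:
    """Rearrange 'time_low-time_mid-time_hi-clock-node' into hi..low order."""
    time_low, time_mid, time_hi, clock, node = u.split("-")
    return time_hi + time_mid + time_low + clock + node

# Two made-up v1-style UUIDs: 'late' has a larger timestamp than 'early'
# (bigger time_mid), yet its raw text sorts *before* 'early'. After
# reordering, string order matches generation order.
early = "ffffffff-0000-1000-8000-aaaaaaaaaaaa"
late = "00000000-0001-1000-8000-aaaaaaaaaaaa"
print(early < late)                              # False
print(reorder_uuid(early) < reorder_uuid(late))  # True
```

Stored this way, even UUID keys become approximately ingestion-ordered, so a per-file min/max range stays narrow enough to be useful for pruning.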
> >
> > Not to throw a monkey wrench into the current design, but it would be
> > better to think about the right abstractions for the indexing system so
> > that the indexing we pick is really configurable based on the dataset key.
> >
> > Additionally, it would eventually be awesome to make this indexing
> > system extensible to speed up querying a hudi dataset as well.
> > We could potentially store the min/max and distinct-value stats from the
> > parquet footer for every column in this index, and plug in a hudi
> > implementation of ExternalCatalog.listPartitionsByFilter in Spark:
> > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala#L269
> >
> > I would assume this should give a big boost in performance for selective
> > filters, i.e., do more granular file pruning based on the predicate
> > expressions.
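As a rough sketch of how such per-file column stats could drive pruning (purely illustrative; none of these names are Hudi or Spark APIs): a file can be skipped whenever the predicate cannot possibly match the column's stored min/max, the same way Parquet footer statistics are used for row-group skipping.

```python
# Hypothetical sketch of predicate-based file pruning using per-file,
# per-column (min, max) statistics, like those kept in Parquet footers.

def may_match(stats, column, op, value):
    """Return True if a file with these stats could satisfy the predicate."""
    lo, hi = stats[column]
    if op == ">":
        return hi > value
    if op == "<":
        return lo < value
    if op == "==":
        return lo <= value <= hi
    return True  # unknown predicate: conservatively keep the file

files = {
    "part-0": {"ts": (100, 200)},
    "part-1": {"ts": (250, 400)},
}
survivors = [f for f, s in files.items() if may_match(s, "ts", ">", 300)]
print(survivors)  # ['part-1'] -- only part-1 can contain ts > 300
```

A catalog-level hook (like listPartitionsByFilter mentioned above) would let the query planner apply this check before any data files are opened.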
> >
> > We could do the same to plug Hoodie stats into the Starburst optimizer
> > in Presto as well.
> >
> > - Prasanna
> >
> >
> >
> > On Wed, Mar 27, 2019 at 11:26 PM nishith agarwal <[email protected]>
> > wrote:
> >
> >> Here is the HIP :
> >>
> >> https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing
> >> @Vinoth Chandar <[email protected]> @balaji added you guys as approvers,
> >> please take a look.
> >>
> >> -Nishith
> >>
> >> On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]>
> >> wrote:
> >>
> >> > JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
> >> >
> >> > -Nishith
> >> >
> >> >
> >> >
> >> > On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]>
> >> > wrote:
> >> >
> >> >> All,
> >> >>
> >> >> Currently, Hudi supports partitioned and non-partitioned datasets. A
> >> >> partitioned dataset is one which bucketizes groups of files (data)
> >> >> into buckets called partitions. A hudi dataset may be composed of N
> >> >> partitions with M files. This structure helps canonical
> >> >> hive/presto/spark queries limit the amount of data read by using the
> >> >> partition as a filter. The value of the partition/bucket is in most
> >> >> cases derived from the incoming data itself. The requirement is that
> >> >> once a record is mapped to a partition/bucket, this mapping should
> >> >> a) be known to hudi and b) remain constant for the lifecycle of the
> >> >> dataset, for hudi to perform upserts on those records. Consequently,
> >> >> in a non-partitioned dataset one can think of this problem as a
> >> >> record key <-> file id mapping that is required for hudi to be able
> >> >> to perform upserts on a record.
> >> >> The current solution is either a) for the client/user to provide the
> >> >> correct partition value as part of the payload, or b) to use a
> >> >> GlobalBloomIndex implementation to scan all the files under a given
> >> >> path (say, a non-partitioned table). In both cases, we are limited
> >> >> either by the user's ability to provide this information or by the
> >> >> performance overhead of scanning every file's bloom index.
> >> >> I'm proposing a new design, named global index, that is a mapping of
> >> >> (recordKey <-> fileId). This mapping will be stored and maintained
> >> >> by Hudi as another implementation of HoodieIndex and will address
> >> >> the two limitations mentioned above. I'd like to see if there are
> >> >> other community members interested in this project. I will send out
> >> >> a HIP shortly describing more details around the need and the
> >> >> architecture of this.
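The behavior of such a (recordKey <-> fileId) mapping can be pictured with the toy sketch below (a hypothetical illustration of the idea only; the class and method names are made up and are not the HoodieIndex interface): the mapping is recorded once at insert time and then stays constant, so later upserts route straight to the owning file with no bloom-filter scans.

```python
# Toy sketch of a global (recordKey -> fileId) index. All names here are
# hypothetical illustrations, not Hudi code.

class GlobalIndex:
    def __init__(self):
        self._map = {}  # recordKey -> fileId

    def tag(self, record_key, file_id):
        # First write wins: per the requirement above, the mapping must
        # remain constant for the lifecycle of the dataset.
        self._map.setdefault(record_key, file_id)

    def lookup(self, record_key):
        # None => unseen key (treat as insert); else the upsert target file.
        return self._map.get(record_key)

idx = GlobalIndex()
idx.tag("uuid-123", "file-7")
print(idx.lookup("uuid-123"))  # file-7
print(idx.lookup("uuid-999"))  # None -> new insert
```

In practice the mapping would of course be persisted and maintained by Hudi rather than held in memory, but the lookup contract is the same.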
> >> >>
> >> >> Thanks,
> >> >> Nishith
> >> >>
> >> >
> >>
> >
>
