Hey Nishith,

Glad we have a concrete proposal on this. My $0.02: what we are really building is an approximate indexing system that reduces the number of files to look at when a key is updated. The problem with having something random in the key (like a UUID) is that approximate lookup does not really work, so we would need a complete mapping of every id to its file. We could get away with much more efficient indexing techniques when the key is ordered roughly by file creation (ingestion) order. I am thinking of something like a min/max on the record key for each file.

There are cases where a monotonically increasing id generation service assigns ids to new entities and updates still arrive for those ids, and even for UUIDs there are ways of storing them so that rough ordering is preserved (this could be one way: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/).

Not to throw a monkey wrench into the current design, but it would be better to think about the right abstractions for the indexing system so that the index we pick is actually configurable based on the dataset key.

Additionally, it would eventually be awesome to make this indexing system extensible to speed up querying a Hudi dataset as well. We could store the min/max and distinct-value stats kept in the parquet footer for every column in this index, and plug in a Hudi implementation of ExternalCatalog.listPartitionsByFilter in Spark:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala#L269
I would assume this should give a big performance boost for selective filters, i.e. more granular file pruning based on the predicate expressions. We could do the same to plug in Hoodie stats for the Starburst optimizer in Presto as well.
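To make the min/max idea concrete, here is a rough sketch (illustrative Python, not actual Hudi code; the class and function names are made up). Each file carries the min/max of its record keys; a lookup only touches files whose range may contain the key. With ingestion-ordered keys the ranges barely overlap and pruning is effective; with random keys (plain UUIDs) every range overlaps and it degenerates to a full scan:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FileKeyRange:
    """Per-file stats entry: smallest and largest record key in the file."""
    file_id: str
    min_key: str
    max_key: str

def candidate_files(index: List[FileKeyRange], record_key: str) -> List[str]:
    """Return only the files whose [min_key, max_key] range may contain the key."""
    return [f.file_id for f in index if f.min_key <= record_key <= f.max_key]

# Keys roughly follow ingestion order: ranges are disjoint, pruning works.
ordered = [
    FileKeyRange("f1", "000100", "000199"),
    FileKeyRange("f2", "000200", "000299"),
    FileKeyRange("f3", "000300", "000399"),
]
print(candidate_files(ordered, "000250"))  # -> ['f2']

# Random keys (e.g. plain UUIDs): every file's range spans the key space,
# so the "approximate" index prunes nothing.
random_keys = [
    FileKeyRange("f1", "0a", "ff"),
    FileKeyRange("f2", "03", "fe"),
]
print(candidate_files(random_keys, "7b"))  # -> ['f1', 'f2']
```

This is why a byte-reordered UUID layout (as in the Percona post above) matters: it restores enough ordering for the ranges to stay narrow.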
- Prasanna

On Wed, Mar 27, 2019 at 11:26 PM nishith agarwal <[email protected]> wrote:

> Here is the HIP :
>
> https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing
>
> @Vinoth Chandar <[email protected]> @balaji added you guys as approvers,
> please take a look.
>
> -Nishith
>
> On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]> wrote:
>
> > JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
> >
> > -Nishith
> >
> > On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]> wrote:
> >
> >> All,
> >>
> >> Currently, Hudi supports partitioned and non-partitioned datasets. A
> >> partitioned dataset is one which bucketizes groups of files (data) into
> >> buckets called partitions. A hudi dataset may be composed of N number of
> >> partitions with M number of files. This structure helps canonical
> >> hive/presto/spark queries to limit the amount of data read by using the
> >> partition as a filter. The value of the partition/bucket in most cases is
> >> derived from the incoming data itself. The requirement is that once a
> >> record is mapped to a partition/bucket, this mapping should be a) known to
> >> hudi b) should remain constant for the lifecycle of the dataset for hudi to
> >> perform upserts on them. Consequently, in a non-partitioned dataset one can
> >> think of this problem as a record key <-> file id mapping that is required
> >> for hudi to be able to perform upserts on a record.
> >>
> >> Current solution is either a) for the client/user to provide the correct
> >> partition value as part of the payload or b) use a GlobalBloomIndex
> >> implementation to scan all the files under a given path (say
> >> non-partitioned table). In both these cases, we are limited either by the
> >> capability of the user to provide this information or by the performance
> >> overhead of scanning all files' bloom index.
> >>
> >> I'm proposing a new design, naming it global index, that is a mapping of
> >> (recordKey <-> fileId). This mapping will be stored and maintained by Hudi
> >> as another implementation of HoodieIndex and will address the 2 limitations
> >> mentioned above. I'd like to see if there are other community members
> >> interested in this project. I will send out a HIP shortly describing more
> >> details around the need and architecture of this.
> >>
> >> Thanks,
> >> Nishith
