Btw, I am not able to comment on the JIRA. I will get this fixed and post the comment on the JIRA as well. Cheers.
On Thu, Mar 28, 2019 at 12:49 AM Prasanna <[email protected]> wrote:

> Hey Nishith,
>
> Glad we have a concrete proposal on this.
>
> My $0.02 on this:
>
> What we are really building is an approximate indexing system that can
> help us reduce the number of files to look in when a key is updated. The
> problem with having something random in the key (like a UUID) is that the
> approximation no longer works, and we then need a complete mapping of
> every id to its file. We could get by with much more efficient indexing
> techniques for cases where the key is approximately ordered by file
> creation order (ingestion order). I am thinking of something like a
> min/max on the row key for each file. There could be cases where a
> monotonically increasing id-generation service is used for new entities,
> with later updates to those ids, and even for UUIDs there are ways of
> storing them so that the ordering is roughly preserved (this could be one
> way: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/).
>
> Not to throw a monkey wrench into the current design, but it would be
> better to think about the right abstractions for the indexing system, so
> that the index we pick is configurable based on the dataset's key.
>
> Additionally, it would eventually be awesome to make this indexing system
> extensible to speed up querying a Hudi dataset as well. We could
> potentially store the min/max and distinct-value stats from the parquet
> footer for every column in this index, and plug in a Hudi implementation
> of ExternalCatalog.listPartitionsByFilter in Spark:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala#L269
>
> I would assume this should give a big performance boost for selective
> filters, i.e. more granular file pruning based on the predicate
> expressions. We could do the same to plug Hoodie stats into the Starburst
> optimizer in Presto as well.
>
> - Prasanna
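To make the min/max idea above concrete, here is a minimal sketch of range-based file pruning, assuming each file carries the smallest and largest record key it contains. Every name below (KeyRangePruner, FileKeyRange, candidateFiles) is hypothetical and is not an existing Hudi API:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of min/max row-key pruning; not actual Hudi code. */
public class KeyRangePruner {

  /** Per-file statistic: the smallest and largest record key in the file. */
  public static class FileKeyRange {
    final String fileId;
    final String minKey;
    final String maxKey;

    FileKeyRange(String fileId, String minKey, String maxKey) {
      this.fileId = fileId;
      this.minKey = minKey;
      this.maxKey = maxKey;
    }

    /** True if recordKey could possibly live in this file. */
    boolean mayContain(String recordKey) {
      return minKey.compareTo(recordKey) <= 0 && recordKey.compareTo(maxKey) <= 0;
    }
  }

  /** Keep only the files whose [min, max] key range covers the incoming key. */
  static List<FileKeyRange> candidateFiles(List<FileKeyRange> allFiles, String recordKey) {
    List<FileKeyRange> candidates = new ArrayList<>();
    for (FileKeyRange range : allFiles) {
      if (range.mayContain(recordKey)) {
        candidates.add(range);
      }
    }
    return candidates;
  }
}

With keys roughly ordered by ingestion time, the per-file ranges barely overlap and most files are pruned; with purely random keys (plain UUIDs), every range tends to span the whole key space and nothing is pruned, which is exactly the limitation called out above. The same kind of per-column footer stats, kept in such an index, is what a Hudi implementation of ExternalCatalog.listPartitionsByFilter could consult on the query side.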
> On Wed, Mar 27, 2019 at 11:26 PM nishith agarwal <[email protected]> wrote:
>
>> Here is the HIP:
>> https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing
>>
>> @Vinoth Chandar <[email protected]> @balaji, added you guys as approvers;
>> please take a look.
>>
>> -Nishith
>>
>> On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal <[email protected]> wrote:
>>
>>> JIRA: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
>>>
>>> -Nishith
>>>
>>> On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <[email protected]> wrote:
>>>
>>>> All,
>>>>
>>>> Currently, Hudi supports partitioned and non-partitioned datasets. A
>>>> partitioned dataset is one which bucketizes groups of files (data)
>>>> into buckets called partitions. A Hudi dataset may be composed of N
>>>> partitions with M files. This structure helps canonical
>>>> Hive/Presto/Spark queries limit the amount of data read by using the
>>>> partition as a filter. The value of the partition/bucket is in most
>>>> cases derived from the incoming data itself. The requirement is that
>>>> once a record is mapped to a partition/bucket, this mapping should
>>>> a) be known to Hudi and b) remain constant for the lifecycle of the
>>>> dataset, for Hudi to perform upserts on it.
>>>>
>>>> Consequently, in a non-partitioned dataset one can think of this
>>>> problem as a (record key <-> file id) mapping that Hudi needs in
>>>> order to perform upserts on a record.
>>>>
>>>> The current solution is either a) for the client/user to provide the
>>>> correct partition value as part of the payload, or b) to use a
>>>> GlobalBloomIndex implementation to scan all the files under a given
>>>> path (say, a non-partitioned table). In both cases we are limited
>>>> either by the user's ability to provide this information or by the
>>>> performance overhead of scanning every file's bloom index.
>>>>
>>>> I'm proposing a new design, naming it the global index: a mapping of
>>>> (recordKey <-> fileId) that will be stored and maintained by Hudi as
>>>> another implementation of HoodieIndex, addressing the two limitations
>>>> mentioned above. I'd like to see if there are other community members
>>>> interested in this project. I will send out a HIP shortly describing
>>>> the need for and architecture of this in more detail.
>>>>
>>>> Thanks,
>>>> Nishith
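The contract of the global index proposed above reduces to an exact (recordKey <-> fileId) map consulted during upserts. A toy sketch of that contract follows; the class and its in-memory map are purely illustrative (the HIP is what will spell out how the mapping is actually stored and maintained), and none of the names are existing Hudi APIs:

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the proposed global index contract; not actual Hudi code. */
public class GlobalRecordIndexSketch {

  // In practice Hudi would persist and version this mapping alongside the
  // dataset; an in-memory map stands in purely for illustration.
  private final Map<String, String> recordKeyToFileId = new ConcurrentHashMap<>();

  /** Remember which file a newly inserted key was written to. */
  public void register(String recordKey, String fileId) {
    recordKeyToFileId.put(recordKey, fileId);
  }

  /**
   * Tag an incoming record: a known key routes the upsert to its existing
   * file; an unknown key is treated as an insert.
   */
  public Optional<String> lookup(String recordKey) {
    return Optional.ofNullable(recordKeyToFileId.get(recordKey));
  }
}

Such a mapping would sidestep both limitations mentioned in the proposal: no user-supplied partition value is needed, and no scan of every file's bloom index is required.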
