[ https://issues.apache.org/jira/browse/HUDI-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-1265:
-----------------------------
    Due Date: 30/Sep/22

> Improving bootstrap and efficient migration of existing non-Hudi dataset
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1265
>                 URL: https://issues.apache.org/jira/browse/HUDI-1265
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: bootstrap
>            Reporter: Balaji Varadarajan
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.13.0
>
>
> This is an EPIC to revisit the logic of bootstrap for efficient migration of
> existing non-Hudi datasets, bridging any gaps with new features such as the
> metadata table.
> Here are the two modes of bootstrap and migration we are supposed to support:
> # Onboard for new partitions alone: Given an existing non-Hudi partitioned
> dataset (/path/parquet), Hudi manages new partitions under the same table
> path (/path/parquet) while keeping non-Hudi partitions untouched in place.
> The query engine treats non-Hudi partitions differently when reading the data.
> This works well for immutable data, where there are no updates to old
> partitions and new data is only appended to new partitions.
> # Metadata-only and full-record bootstrap: Given an existing parquet dataset
> (/path/parquet), Hudi generates the record-level metadata (Hudi meta columns)
> during the bootstrap process in a new table path (/path/parquet_hudi)
> different from the parquet dataset. There are two sub-modes, which can be
> chosen per partition within a single bootstrap action. This unlocks
> the ability for Hudi to do upserts for all partitions.
> ## Metadata-only: generates only the record-level metadata per parquet file
> and a bootstrap index for mapping, without rewriting the actual data records.
> During query execution, the source data is merged with Hudi metadata to
> return the results. This is the default mode.
> ## Full-record: uses bulk insert to generate record-level metadata and to
> copy over and rewrite the source data. During query execution,
> record-level metadata, i.e., meta columns, and the data columns are read from
> the same parquet file, improving the read performance.
> Phase 1: Testing and verification of status-quo (1~1.5 weeks)
> Writing:
> * Two migration modes above
> * COW and MOR
> * 1 additional commit after bootstrap doing upsert for metadata-only and
> full-record bootstrap
> * Spark datasource, Deltastreamer
> * Partitioned and non-partitioned table
> * Simple/complex key gen
> * Hive-style partition
> * w/ and w/o metadata table enabled
> * Meta sync
> Reading:
> * Hive QL, Spark SQL, Spark datasource, Presto/Trino
> * Snapshot, read-optimized, incremental query
> * Queries in the original query testing plan:
> [https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit#gid=1813901684]
> Need to develop a validation tool for automated validation
> * Metadata, i.e., meta columns and index in metadata table, is properly
> populated
> * Data queried from Hudi table matches the parquet data
> Add tests when needed
> * HUDI-4125 Add integration tests around bootstrapped Hudi table
> Phase 2: Functionality and correctness fixes (2~3 weeks)
> Known and possible issues:
> * Spark cannot see non-Hudi partitions in first onboarding mode
> * Bootstrap Relation does not support MOR; HUDI-2071 Support Reading
> Bootstrap MOR RT Table In Spark DataSource Table
> * HUDI-915 Partition Columns missing in files upserted after Metadata
> Bootstrap
> * HUDI-992 For hive-style partitioned source data, partition columns synced
> with Hive will always have String type
> * HUDI-1369 Bootstrap Datasource jobs from hanging via spark-submit
> * HUDI-3122 Presto query failed for bootstrap tables
> * HUDI-1779 Fail to bootstrap/upsert a table which contains timestamp column
> Phase 3: Performance (1~2 weeks)
> * HUDI-1157 Optimization whether to query Bootstrapped table using
> HoodieBootstrapRelation vs Sparks Parquet datasource
> * HUDI-4453 Support partition pruning for tables Bootstrapped from Source
> Hive Style partitioned tables
> * HUDI-619 Avoid stitching meta columns and only load data columns for
> improving read performance
> * HUDI-1158 Optimizations in parallelized listing behaviour for markers and
> bootstrap source files

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
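
The two validation checks listed under Phase 1 (meta columns are populated, and data queried from the Hudi table matches the parquet data) could be sketched roughly as below. This is only an illustration, not the validation tool the epic calls for: the function name `validate_bootstrap` and the `key_field` parameter are hypothetical, rows are assumed to have already been collected into plain dicts (for example from a Spark DataFrame), record keys are assumed to be unique and sortable, and the index in the metadata table is not checked.

```python
# Hudi's record-level meta columns, which bootstrap is expected to populate.
HOODIE_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def validate_bootstrap(source_rows, hudi_rows, key_field):
    """Compare rows read from the source parquet data with rows read from
    the bootstrapped Hudi table.

    Hypothetical helper: returns a list of problem descriptions, which is
    empty when both checks pass. Assumes key_field values are unique.
    """
    problems = []

    # Check 1: every Hudi row has all meta columns populated.
    for i, row in enumerate(hudi_rows):
        missing = [c for c in HOODIE_META_COLUMNS if row.get(c) in (None, "")]
        if missing:
            problems.append(f"hudi row {i}: unpopulated meta columns {missing}")

    # Check 2: after dropping meta columns, Hudi data equals the source data.
    def strip_meta(row):
        return {k: v for k, v in row.items() if k not in HOODIE_META_COLUMNS}

    source_by_key = {r[key_field]: r for r in source_rows}
    hudi_by_key = {r[key_field]: strip_meta(r) for r in hudi_rows}

    for key in sorted(source_by_key.keys() - hudi_by_key.keys()):
        problems.append(f"key {key!r}: missing from Hudi table")
    for key in sorted(hudi_by_key.keys() - source_by_key.keys()):
        problems.append(f"key {key!r}: not present in source parquet data")
    for key in sorted(source_by_key.keys() & hudi_by_key.keys()):
        if source_by_key[key] != hudi_by_key[key]:
            problems.append(f"key {key!r}: data columns differ")

    return problems
```

In an automated run, an empty result would mean both checks passed for the sampled rows; the same comparison would need to be repeated per query type (snapshot, read-optimized, incremental) and per engine to cover the reading matrix above.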