[ 
https://issues.apache.org/jira/browse/HUDI-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1265:
-----------------------------
    Due Date: 30/Sep/22

> Improving bootstrap and efficient migration of existing non-Hudi dataset
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1265
>                 URL: https://issues.apache.org/jira/browse/HUDI-1265
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: bootstrap
>            Reporter: Balaji Varadarajan
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.13.0
>
>
> This is an EPIC to revisit the bootstrap logic for efficient migration of 
> existing non-Hudi datasets, bridging any gaps with new features such as 
> the metadata table.
> Here are the two modes of bootstrap and migration we are supposed to support:
>  # Onboard for new partitions alone: Given an existing non-Hudi partitioned 
> dataset (/path/parquet), Hudi manages new partitions under the same table 
> path (/path/parquet) while keeping non-Hudi partitions untouched in place.  
> The query engine treats non-Hudi partitions differently when reading the 
> data.  This works well for immutable data where there are no updates to old 
> partitions and new data is only appended to new partitions.
>  # Metadata-only and full-record bootstrap: Given an existing parquet dataset 
> (/path/parquet), Hudi generates the record-level metadata (Hudi meta columns) 
> during the bootstrap process in a new table path (/path/parquet_hudi) 
> different from the parquet dataset.  There are two modes; they can be chosen 
> at the granularity of partition in a single bootstrap action.  This unlocks 
> the ability for Hudi to do upsert for all partitions.
>  ## Metadata-only: generates record-level metadata only per parquet file and 
> a bootstrap index for mapping, without rewriting the actual data records. 
> During query execution, the source data is merged with Hudi metadata to 
> return the results.  This is the default mode.  
>  ## Full-record: uses bulk insert to generate record-level metadata and to 
> copy over and rewrite the source data.  During query execution, 
> record-level metadata, i.e., meta columns, and the data columns are read from 
> the same parquet file, improving read performance.
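The per-partition choice between the two modes can be pictured with a small sketch. This is illustrative only: `BootstrapMode`, `select_modes`, and the partition paths are hypothetical names, not Hudi's API. It assumes a regex decides which partitions get full-record bootstrap, with metadata-only as the default.

```python
import re
from enum import Enum


class BootstrapMode(Enum):
    """The two bootstrap modes described above (names are illustrative)."""
    METADATA_ONLY = "METADATA_ONLY"
    FULL_RECORD = "FULL_RECORD"


def select_modes(partitions, full_record_pattern,
                 default=BootstrapMode.METADATA_ONLY):
    """Map each partition path to a bootstrap mode.

    Partitions whose path fully matches `full_record_pattern` are assigned
    FULL_RECORD; every other partition falls back to `default`.
    """
    regex = re.compile(full_record_pattern)
    return {
        path: BootstrapMode.FULL_RECORD if regex.fullmatch(path) else default
        for path in partitions
    }


# Hypothetical partition layout: rewrite only the September partitions.
partitions = ["2022/08/30", "2022/08/31", "2022/09/01"]
modes = select_modes(partitions, r"2022/09/.*")
for path, mode in modes.items():
    print(path, mode.value)
```

Keeping metadata-only as the fallback mirrors the description above, where it is the default mode and full-record is opted into per partition.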
> Phase 1: Testing and verification of status-quo (1~1.5 weeks)
> Writing:
>  * Two migration modes above
>  * COW and MOR
>  * 1 additional commit after bootstrap doing upsert for metadata-only and 
> full-record bootstrap
>  * Spark datasource, Deltastreamer
>  * Partitioned and non-partitioned table
>  * Simple/complex key gen
>  * Hive-style partition
>  * w/ and w/o metadata table enabled
>  * Meta sync
> Reading:
>  * Hive QL, Spark SQL, Spark datasource, Presto/Trino
>  * Snapshot, read-optimized, incremental query
>  * Queries in the original query testing plan: 
> [https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit#gid=1813901684]
> A validation tool needs to be developed to automate the following checks:
>  * Metadata, i.e., meta columns and index in metadata table, is properly 
> populated
>  * Data queried from Hudi table matches the parquet data
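A minimal sketch of the two checks, assuming rows are available as plain Python dicts rather than through a query engine. `validate_bootstrap` and the sample rows are hypothetical, and the index check against the metadata table is not covered here. The five `_hoodie_*` names are the Hudi meta columns.

```python
from collections import Counter

# Hudi meta columns added to every record.
HOODIE_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def validate_bootstrap(source_rows, hudi_rows):
    """Return a list of error strings; an empty list means both checks pass."""
    errors = []

    # Check 1: every Hudi row has all meta columns populated (non-empty).
    for i, row in enumerate(hudi_rows):
        missing = [c for c in HOODIE_META_COLUMNS if not row.get(c)]
        if missing:
            errors.append(f"row {i}: meta columns not populated: {missing}")

    # Check 2: data columns match the source as a multiset of rows,
    # ignoring row order and the meta columns.
    def data_only(row):
        return tuple(sorted(
            (k, v) for k, v in row.items() if k not in HOODIE_META_COLUMNS
        ))

    if Counter(map(data_only, hudi_rows)) != Counter(map(data_only, source_rows)):
        errors.append("data columns differ from the source parquet data")
    return errors


# Hypothetical sample data standing in for query results.
source_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
meta = {c: "x" for c in HOODIE_META_COLUMNS}
hudi_rows = [{**meta, "id": 2, "name": "b"}, {**meta, "id": 1, "name": "a"}]
result = validate_bootstrap(source_rows, hudi_rows)
print(result)
```

Comparing multisets rather than ordered lists avoids false failures when the two reads return rows in different orders.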
> Add tests when needed
>  * HUDI-4125 Add integration tests around bootstrapped Hudi table
> Phase 2: Functionality and correctness fixes (2~3 weeks)
> Known and possible issues:
>  * Spark cannot see non-Hudi partitions in the first onboarding mode
>  * Bootstrap Relation does not support MOR; HUDI-2071 Support Reading 
> Bootstrap MOR RT Table In Spark DataSource Table
>  * HUDI-915 Partition Columns missing in files upserted after Metadata 
> Bootstrap
>  * HUDI-992 For hive-style partitioned source data, partition columns synced 
> with Hive will always have String type
>  * HUDI-1369 Bootstrap Datasource jobs from hanging via spark-submit
>  * HUDI-3122  Presto query failed for bootstrap tables
>  * HUDI-1779  Fail to bootstrap/upsert a table which contains timestamp column
> Phase 3: Performance (1~2 weeks)
>  * HUDI-1157 Optimization whether to query Bootstrapped table using 
> HoodieBootstrapRelation vs Sparks Parquet datasource
>  * HUDI-4453 Support partition pruning for tables Bootstrapped from Source 
> Hive Style partitioned tables
>  * HUDI-619 Avoid stitching meta columns and only load data columns for 
> improving read performance
>  * HUDI-1158 Optimizations in parallelized listing behaviour for markers and 
> bootstrap source files

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
