[ 
https://issues.apache.org/jira/browse/HUDI-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-1265:
----------------------------
    Description: 
This is an EPIC to revisit the bootstrap logic for efficient migration of 
existing non-Hudi datasets, bridging any gaps with newer features such as the 
metadata table.

Here are the two modes of bootstrap and migration we intend to support:
 # Onboard new partitions alone: Given an existing non-Hudi partitioned 
dataset (/path/parquet), Hudi manages new partitions under the same table path 
(/path/parquet) while keeping non-Hudi partitions untouched in place.  The 
query engine treats non-Hudi partitions differently when reading the data.  
This works perfectly for immutable data where old partitions receive no 
updates and new data is only appended to new partitions.
 # Metadata-only and full-record bootstrap: Given an existing parquet dataset 
(/path/parquet), Hudi generates the record-level metadata (Hudi meta columns) 
during the bootstrap process in a new table path (/path/parquet_hudi), 
separate from the parquet dataset.  There are two modes, and they can be 
chosen at per-partition granularity within a single bootstrap action (see the 
sketch after this list).  This unlocks the ability for Hudi to upsert all 
partitions.
 ## Metadata-only: generates only the record-level metadata per parquet file, 
plus a bootstrap index for the mapping, without rewriting the actual data 
records.  During query execution, the source data is merged with the Hudi 
metadata to return the results.  This is the default mode.
 ## Full-record: uses bulk insert to generate the record-level metadata and to 
copy over and rewrite the source data.  During query execution, the 
record-level metadata (i.e., meta columns) and the data columns are read from 
the same parquet file, improving read performance.
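
As a rough illustration of how a bootstrap is triggered and how the mode is 
chosen per partition, here is a minimal Spark-shell sketch (not part of the 
plan itself; the table name, paths, id/dt fields, and the 2021/.* regex are 
made-up placeholders, and the config keys follow recent Hudi releases, so 
they may differ by version):

{code:scala}
import org.apache.spark.sql.SaveMode

// The write itself carries no rows; everything is driven by the bootstrap configs.
spark.emptyDataFrame.write.format("hudi").
  option("hoodie.table.name", "parquet_hudi").
  // "bootstrap" operation instead of the usual insert/upsert
  option("hoodie.datasource.write.operation", "bootstrap").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.bootstrap.keygen.class", "org.apache.hudi.keygen.SimpleKeyGenerator").
  // Existing non-Hudi parquet dataset to bootstrap from
  option("hoodie.bootstrap.base.path", "/path/parquet").
  // Per-partition mode choice: partitions matching the regex get FULL_RECORD,
  // all others default to METADATA_ONLY
  option("hoodie.bootstrap.mode.selector",
    "org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector").
  option("hoodie.bootstrap.mode.selector.regex", "2021/.*").
  option("hoodie.bootstrap.mode.selector.regex.mode", "FULL_RECORD").
  mode(SaveMode.Overwrite).
  save("/path/parquet_hudi")
{code}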

Important requirements:
 * Query engine integration: Spark, Hive, Presto/Trino
 * COW is more important than MOR
 * Address performance degradation due to treating the entire table as 
bootstrapped
 * Metadata table integration
 * Support source datasets with Hive-style partitioning
 * Support for non-Hudi partitions

Phase 1: Testing and verification of the status quo (1~1.5 weeks)

Writing:
 * Two migration modes above
 * COW and MOR
 * One additional upsert commit after bootstrap, for both metadata-only and 
full-record bootstrap (see the sketch after this list)
 * Spark datasource, Deltastreamer
 * Partitioned and non-partitioned table
 * Simple/complex key gen
 * Hive-style partition
 * w/ and w/o metadata table enabled
 * Meta sync
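
A minimal sketch of the "one additional upsert commit" step, assuming the 
bootstrapped table from the sketch above (the sample row, the ts precombine 
field, and the values are illustrative only):

{code:scala}
import org.apache.spark.sql.SaveMode
import spark.implicits._

// One record whose key already exists in the bootstrapped table,
// so the upsert exercises the update path on a bootstrapped file group.
val updates = Seq((1, "updated-value", 1L, "2021/01/01"))
  .toDF("id", "value", "ts", "dt")

updates.write.format("hudi").
  option("hoodie.table.name", "parquet_hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(SaveMode.Append).
  save("/path/parquet_hudi")
{code}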

Reading:
 * Hive QL, Spark SQL, Spark datasource, Presto/Trino
 * Snapshot, read-optimized, and incremental queries (see the sketch after 
this list)
 * Queries in the original query testing plan: 
[https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit#gid=1813901684]
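
For the three query types above, a hedged Spark datasource sketch (the begin 
instant time is a placeholder, and the config keys follow recent Hudi 
releases):

{code:scala}
// Snapshot (default): latest merged view, including bootstrapped partitions
val snapshotDF = spark.read.format("hudi").load("/path/parquet_hudi")

// Read-optimized: base files only, mainly relevant for MOR tables
val roDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load("/path/parquet_hudi")

// Incremental: only records committed after the given instant time
val incDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20220101000000").
  load("/path/parquet_hudi")
{code}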

Need to develop a validation tool for automated checks (a sketch of the 
checks follows this list):
 * Metadata, i.e., the meta columns and the index in the metadata table, is 
properly populated
 * Data queried from the Hudi table matches the source parquet data
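
A sketch of the two checks such a tool could run, assuming the schemas line 
up; a real tool would also need to handle partition-column type differences 
(cf. HUDI-992 below):

{code:scala}
import org.apache.spark.sql.functions.col

val hudiDF   = spark.read.format("hudi").load("/path/parquet_hudi")
val sourceDF = spark.read.parquet("/path/parquet")

// Check 1: meta columns are populated on every row
val missingMeta = hudiDF.filter(
  col("_hoodie_record_key").isNull || col("_hoodie_partition_path").isNull).count()
assert(missingMeta == 0, s"$missingMeta rows have unpopulated meta columns")

// Check 2: data columns match the source parquet (meta columns dropped)
val dataCols = sourceDF.columns.map(col)
val diff = hudiDF.select(dataCols: _*).exceptAll(sourceDF).count() +
  sourceDF.exceptAll(hudiDF.select(dataCols: _*)).count()
assert(diff == 0, s"$diff rows differ between the Hudi table and the source parquet")
{code}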

Add tests when needed
 * HUDI-4125 Add integration tests around bootstrapped Hudi table

Phase 2: Functionality and correctness fixes (2~3 weeks)

Known and possible issues:
 * Spark cannot see non-Hudi partitions in the first onboarding mode
 * The bootstrap relation does not support MOR; HUDI-2071 Support Reading 
Bootstrap MOR RT Table In Spark DataSource Table
 * HUDI-915 Partition Columns missing in files upserted after Metadata Bootstrap
 * HUDI-992 For hive-style partitioned source data, partition columns synced 
with Hive will always have String type
 * HUDI-1369 Bootstrap datasource jobs hang via spark-submit
 * HUDI-3122 Presto query failed for bootstrap tables
 * HUDI-1779 Fail to bootstrap/upsert a table that contains a timestamp column

Phase 3: Performance (1~2 weeks)
 * HUDI-1157 Optimize whether to query a bootstrapped table using 
HoodieBootstrapRelation vs. Spark's Parquet datasource
 * HUDI-4453 Support partition pruning for tables bootstrapped from Hive-style 
partitioned source tables
 * HUDI-619 Avoid stitching meta columns and only load data columns for 
improving read performance
 * HUDI-1158 Optimizations in parallelized listing behaviour for markers and 
bootstrap source files

 

> Efficient bootstrap and migration of existing non-Hudi dataset
> --------------------------------------------------------------
>
>                 Key: HUDI-1265
>                 URL: https://issues.apache.org/jira/browse/HUDI-1265
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: bootstrap
>            Reporter: Balaji Varadarajan
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.13.0
>