prashant462 opened a new issue, #10626:
URL: https://github.com/apache/hudi/issues/10626

   ### Issue Summary
   
   When using dbt-spark with Hudi to create a Hudi-format table, the Hudi table configuration is inconsistent between the initial insert and subsequent merge operations. The properties provided in the options of the dbt model are correctly fetched and applied during the first run. However, during the second run, when executing the merge operation, Hudi fetches only a subset of the properties from the Hudi catalog table, so defaults are applied for the missing properties and the effective configuration changes.
   
   
   ### Steps to Reproduce
   
   - Execute the dbt model with Hudi options for the initial insert.
   
      Sample model

            {{
              config(
                materialized = 'incremental',
                file_format = 'hudi',
                pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
                location_root = "file:///Users/B0279627/Downloads/Hudi",
                unique_key = "id",
                incremental_strategy = "merge",
                options = {
                  'preCombineField': 'id2',
                  'hoodie.index.type': 'GLOBAL_SIMPLE',
                  'hoodie.simple.index.update.partition.path': 'true',
                  'hoodie.keep.min.commits': '145',
                  'hoodie.keep.max.commits': '288',
                  'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
                  'hoodie.cleaner.hours.retained': '72',
                  'hoodie.cleaner.fileversions.retained': '144',
                  'hoodie.cleaner.commits.retained': '144',
                  'hoodie.upsert.shuffle.parallelism': '200',
                  'hoodie.insert.shuffle.parallelism': '200',
                  'hoodie.bulkinsert.shuffle.parallelism': '200',
                  'hoodie.delete.shuffle.parallelism': '200',
                  'hoodie.parquet.compression.codec': 'zstd',
                  'hoodie.datasource.hive_sync.support_timestamp': 'true',
                  'hoodie.datasource.write.reconcile.schema': 'true',
                  'hoodie.enable.data.skipping': 'true',
                  'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'
                }
              )
            }}
   - Observe that all specified properties are correctly applied during the first run; for example, verify that hoodie.index.type=GLOBAL_SIMPLE is set.
   - Execute the dbt model again so that the subsequent merge operation runs.
   - Observe that some Hudi table properties fall back to defaults, e.g. hoodie.index.type on the target table changes from GLOBAL_SIMPLE to SIMPLE.
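   One way to make the drift visible is to diff the options from the dbt model against the properties actually persisted for the table after each run (e.g. the output of `SHOW TBLPROPERTIES`, or the table's `.hoodie/hoodie.properties` file, noting that not every write-time option is necessarily persisted there). The helpers below are an illustrative sketch, not part of dbt or Hudi:

   ```python
   from pathlib import Path

   def read_hoodie_properties(table_path: str) -> dict:
       """Parse the Java-properties file Hudi keeps under <table>/.hoodie/."""
       props = {}
       text = Path(table_path, ".hoodie", "hoodie.properties").read_text()
       for line in text.splitlines():
           line = line.strip()
           if not line or line.startswith("#"):
               continue
           key, _, value = line.partition("=")
           props[key.strip()] = value.strip()
       return props

   def diff_options(expected: dict, actual: dict) -> dict:
       """Return options whose persisted value differs from the dbt model,
       mapped to an (expected, actual) pair."""
       return {k: (v, actual.get(k)) for k, v in expected.items()
               if actual.get(k) != v}
   ```

   Running `diff_options` with the model's options dict after the first and second dbt run would show an empty diff after the insert and a non-empty one (e.g. for hoodie.index.type) after the merge.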
   
   ### Expected Behavior
   Hudi should apply all specified properties consistently on every run, whether it is the initial insert or a subsequent merge operation. The properties passed in the options of the dbt model should be retained and reapplied across all operations.
   
   ### Environment Description
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3.1
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.1.1
   
   * DBT version: 1.7.1
   
   * Storage (HDFS/S3/GCS..) : Checked with S3, HDFS, and the local file system.
   
   * Running on Docker? (yes/no) : no
   
   
   ### **Additional context**
   
   In the second run, MergeIntoHoodieTableCommand executes InsertIntoHoodieTableCommand.run(). On this path Hudi fetches the props from the Hudi catalog table, where it reads the table configs and catalog properties. These are not the complete set of properties I passed in the first run via the dbt options, so Hudi fills in default values for the properties missing from the catalog props, and as a result many properties change.
   Below I have attached screenshots of the properties fetched during subsequent merge operations.
   
   <img width="1440" alt="MicrosoftTeams-image (21)" src="https://github.com/apache/hudi/assets/31952894/46126281-b95a-47a4-9116-66a093a97506">
   <img width="1120" alt="Screenshot 2024-02-05 at 10 00 20 PM" src="https://github.com/apache/hudi/assets/31952894/80ba4206-77d0-4852-aaf1-fd0e19c91025">
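   The failure mode described above can be sketched abstractly: if the merge path rebuilds the write config from the (incomplete) catalog properties layered over built-in defaults, any option the catalog did not retain silently reverts. This is an illustrative model of the reported behavior with made-up dicts, not Hudi's actual resolution code:

   ```python
   # Hypothetical defaults, standing in for Hudi's built-in config defaults.
   HUDI_DEFAULTS = {"hoodie.index.type": "SIMPLE"}

   # Options passed on the first run via the dbt model.
   first_run_options = {"hoodie.index.type": "GLOBAL_SIMPLE",
                        "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS"}

   # Suppose the catalog table only retains a subset of what was passed.
   catalog_props = {"hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS"}

   # Merge path: start from defaults, then overlay whatever the catalog returned.
   merge_run_config = {**HUDI_DEFAULTS, **catalog_props}

   print(merge_run_config["hoodie.index.type"])  # SIMPLE, not GLOBAL_SIMPLE
   ```

   Under this model, consistent behavior would require the catalog (or the merge path) to retain and replay every option from the first run, which matches the expected behavior stated above.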
   
   
   

