gtwuser opened a new issue, #6869:
URL: https://github.com/apache/hudi/issues/6869

   Hudi merge is not working as described in the Hudi docs. When a record is saved again with a few fields updated, it is stored as a new record instead of being merged into the existing record with the same recordKey. According to the docs, new incoming records with the same recordKey should be merged into the previously existing records. Please correct me if this is not the expected behaviour. The records are read back using Athena or Spark queries.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Read the JSON body below into a DataFrame and save it using the given Hudi configs (a minimal sketch of this step is included after the list).
   ```json
   {
       "abcd": {
           "payment": "upi",
           "delivery": "20000"
       },
       "xyz": {
           "vouchers": {
               "items": [
                   {
                       "manifests": {
                           "items": [
                               {
                                   "type": "online",
                                   "version": "1.0.0"
                               }
                           ]
                       }
                   }
               ]
           },
           "recordedAt": 1661730366620
       }
   }
   ```
   2. Hudi Configs:
   ```python
   commonConfig = {
       'className': 'org.apache.hudi',
       'hoodie.datasource.hive_sync.use_jdbc': 'false',
       'hoodie.datasource.write.precombine.field': 'abcd.recordedAt',
       'hoodie.datasource.write.recordkey.field': 'abcd.delivery,abcd.payment',
       'hoodie.table.name': 'gifts',
       # 'hoodie.consistency.check.enabled': 'true',
       'hoodie.datasource.hive_sync.database': some_db,
       'hoodie.datasource.write.reconcile.schema': 'true',
       'hoodie.datasource.hive_sync.table': f'sse_{"_".join(prefix.split("/")[-7:-5])}'.lower(),
       'hoodie.datasource.hive_sync.enable': 'true',
       'path': 's3://' + 'some_bucket' + '/merged/gifts/' + f'{prefix.split("/")[-7]}'.lower(),
       'hoodie.parquet.small.file.limit': '307200',
       'hoodie.parquet.max.file.size': '128000000'
   }

   unpartitionDataConfig = {
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
   }

   incrementalConfig = {
       'hoodie.upsert.shuffle.parallelism': 68,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
       'hoodie.cleaner.commits.retained': 10
   }

   combinedConf = {**commonConfig, **unpartitionDataConfig, **incrementalConfig}
   ```
   3. Save it using these configs:
   ```python
   inputDf.write \
       .format('org.apache.hudi') \
       .option("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
       .option("parquet.avro.write-old-list-structure", "false") \
       .option("spark.hadoop.parquet.avro.add-list-element-records", "false") \
       .option("parquet.avro.add-list-element-records", "false") \
       .option("hoodie.parquet.avro.write-old-list-structure", "false") \
       .option("hoodie.datasource.write.reconcile.schema", "true") \
       .options(**combinedConf) \
       .mode('append') \
       .save()
   ```
   4. Repeat steps 1 to 3 with only the field `"delivery": "20000"` updated to `"delivery": "30000"`, and save it again.
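   For reference, a minimal sketch of step 1 (creating the DataFrame), assuming the JSON body above is available as a plain string; `payload_json` and the RDD-based read are only placeholders for illustration, not the exact job code:
   ```python
   # Minimal sketch of step 1: build inputDf from the JSON body shown above.
   # Assumptions: `spark` is an active SparkSession; `payload_json` is a placeholder name.
   payload_json = (
       '{"abcd": {"payment": "upi", "delivery": "20000"}, '
       '"xyz": {"vouchers": {"items": [{"manifests": {"items": '
       '[{"type": "online", "version": "1.0.0"}]}}]}, '
       '"recordedAt": 1661730366620}}'
   )

   # Parse the single JSON document into a DataFrame with nested struct columns.
   inputDf = spark.read.json(spark.sparkContext.parallelize([payload_json]))
   inputDf.printSchema()
   ```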
   
   **Expected behavior**
   
   When we read back from the destination (S3 here), we expect to get only one record, with the updated value `"delivery": "30000"` (see the read-back sketch below).
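   A minimal sketch of that read-back check; `basePath` is just a placeholder for the `path` value built in `commonConfig`:
   ```python
   # Hypothetical read-back check; basePath stands in for the 'path' value
   # assembled in commonConfig above.
   basePath = 's3://some_bucket/merged/gifts/<prefix>'

   readDf = spark.read.format('org.apache.hudi').load(basePath)

   # Expected: a single row with delivery == "30000" after the second write.
   readDf.select('abcd.payment', 'abcd.delivery', 'xyz.recordedAt').show(truncate=False)
   print('row count:', readDf.count())
   ```
   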
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   We are running the Hudi ingestion using AWS Glue jobs.
   
   

