gtwuser opened a new issue, #6869: URL: https://github.com/apache/hudi/issues/6869
**Describe the problem you faced**

Hudi merge is not working as described in the Hudi docs. When a record is saved with a few fields updated, it is written as a new record instead of being merged into the previously existing record with the same record key. Per the docs, new incoming records with the same record key should be merged into the existing records. Please correct me if this is not the expected behaviour. The records are read back using Athena or Spark queries.

**To Reproduce**

Steps to reproduce the behavior:

1. Read the JSON body below into a DataFrame and save it using the Hudi configs given in step 2:

```json
{
  "abcd": {
    "payment": "upi",
    "delivery": "20000"
  },
  "xyz": {
    "vouchers": {
      "items": [
        {
          "manifests": {
            "items": [
              {
                "type": "online",
                "version": "1.0.0"
              }
            ]
          }
        }
      ]
    },
    "recordedAt": 1661730366620
  }
}
```

2. Hudi configs:

```python
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'abcd.recordedAt',
    'hoodie.datasource.write.recordkey.field': 'abcd.delivery,abcd.payment',
    'hoodie.table.name': 'gifts',
    # 'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': some_db,
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.table': f'sse_{"_".join(prefix.split("/")[-7:-5])}'.lower(),
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://' + 'some_bucket' + '/merged/gifts/' + f'{prefix.split("/")[-7]}'.lower(),
    'hoodie.parquet.small.file.limit': '307200',
    'hoodie.parquet.max.file.size': '128000000'
}

unpartitionDataConfig = {
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
}

incrementalConfig = {
    'hoodie.upsert.shuffle.parallelism': 68,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 10
}

combinedConf = {**commonConfig, **unpartitionDataConfig, **incrementalConfig}
```

3. Save it using these configs:

```python
inputDf.write \
    .format('org.apache.hudi') \
    .option("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
    .option("parquet.avro.write-old-list-structure", "false") \
    .option("spark.hadoop.parquet.avro.add-list-element-records", "false") \
    .option("parquet.avro.add-list-element-records", "false") \
    .option("hoodie.parquet.avro.write-old-list-structure", "false") \
    .option("hoodie.datasource.write.reconcile.schema", "true") \
    .options(**combinedConf) \
    .mode('append') \
    .save()
```

4. Repeat steps 1 to 3 with just the field `"delivery": "20000"` updated to `"delivery": "30000"` and save it.

**Expected behavior**

When we read back from the destination (S3 here), we expect to get only one record, carrying the update `"delivery": "30000"`.

**Environment Description**

* Hudi version : 0.13.0
* Spark version : 3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

We are running the Hudi ingestion using AWS Glue jobs.
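For reference, this is a minimal sketch of the read-back check we run after the second write; the table path is illustrative (it mirrors `commonConfig['path']` above), and the column names come from the JSON body in step 1:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Hudi bundle on the classpath, as in the Glue job above.
spark = SparkSession.builder.getOrCreate()

# Hypothetical base path, matching the 'path' entry in commonConfig above.
table_path = 's3://some_bucket/merged/gifts/<prefix>'

# Read the Hudi table back from its base path.
df = spark.read.format('org.apache.hudi').load(table_path)

# After the second upsert we expect a single row for the record key,
# but we observe two rows: one with delivery=20000 and one with delivery=30000.
df.select('_hoodie_record_key', 'abcd.payment', 'abcd.delivery').show(truncate=False)
print(df.count())  # expected: 1, observed: 2
```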