[ 
https://issues.apache.org/jira/browse/HUDI-8432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8432:
---------------------------------
    Reviewers: sivabalan narayanan  (was: Y Ethan Guo)

> Data skipping not working with RLI if record key is composite
> -------------------------------------------------------------
>
>                 Key: HUDI-8432
>                 URL: https://issues.apache.org/jira/browse/HUDI-8432
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core
>            Reporter: Aditya Goenka
>            Assignee: Lokesh Jain
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>         Attachments: image-2024-10-24-13-44-06-097.png, 
> image-2024-10-24-13-45-03-581.png
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> Data skipping with RLI works when a single column is used as the primary key, 
> but does not work when the primary key has multiple columns.
>  
> Code to reproduce:
> ```
> from pyspark.sql.functions import *
> columns = ["ts","uuid","rider","driver","fare","city"]
> data = [
>     (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
>     (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-B", "driver-L", 27.70, "san_francisco"),
>     (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-C", "driver-M", 33.90, "san_francisco"),
>     (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-C", "driver-N", 34.15, "sao_paulo"),
> ]
> inserts = spark.createDataFrame(data).toDF(*columns)
> hudi_options = {
> 'hoodie.table.name': TABLE_NAME,
> 'hoodie.datasource.write.recordkey.field' : 'uuid,rider',
> 'hoodie.datasource.write.precombine.field' : 'ts',
> 'hoodie.datasource.write.partitionpath.field': 'city',
> 'hoodie.index.type' : 'RECORD_INDEX',
> 'hoodie.metadata.record.index.enable' : 'true',
> 'hoodie.metadata.enable':'true',
> 'hoodie.enable.data.skipping':'true'
> }
> # Insert data
> inserts.write.format("hudi"). \
> options(**hudi_options). \
> mode("overwrite"). \
> save(PATH)
> inserts.write.format("hudi"). \
> options(**hudi_options). \
> mode("append"). \
> save(PATH)
> spark.read.options(**hudi_options).format("hudi").load(PATH).where("uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c' and rider = 'rider-C'").show()
> ```
> The read query above scanned both files, so no data skipping took place.
> !image-2024-10-24-13-44-06-097.png!
> But if we use the code below, with a single-column record key:
>  
> ```
> hudi_options = {
> 'hoodie.table.name': TABLE_NAME,
> 'hoodie.datasource.write.recordkey.field' : 'uuid',
> 'hoodie.datasource.write.precombine.field' : 'ts',
> 'hoodie.datasource.write.partitionpath.field': 'city',
> 'hoodie.index.type' : 'RECORD_INDEX',
> 'hoodie.metadata.record.index.enable' : 'true',
> 'hoodie.metadata.enable':'true',
> 'hoodie.enable.data.skipping':'true'
> }
> # Insert data
> inserts.write.format("hudi"). \
> options(**hudi_options). \
> mode("overwrite"). \
> save(PATH)
> inserts.write.format("hudi"). \
> options(**hudi_options). \
> mode("append"). \
> save(PATH)
> spark.read.options(**hudi_options).format("hudi").load(PATH).where("uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c'").show()
> ```
> With a single key column, the read works as expected.
> !image-2024-10-24-13-45-03-581.png!
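
For context on why the two cases can differ: assuming the table falls back to Hudi's complex key generator when several record key fields are configured (an assumption; the report does not state which key generator is active), the stored record key is one string of `field:value` pairs joined by commas. A lookup against the record-level index would then need to rebuild that string from the separate equality predicates. A minimal sketch of that encoding, where `composite_record_key` is a hypothetical helper name and not a Hudi API:

```python
def composite_record_key(row, key_fields):
    """Build a composite key string as 'field:value' pairs joined by commas.

    This mirrors the layout assumed above for multi-field record keys;
    it is an illustration only and does not call into Hudi.
    """
    return ",".join(f"{field}:{row[field]}" for field in key_fields)


# Values taken from the filter in the failing read query.
row = {"uuid": "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider": "rider-C"}
key = composite_record_key(row, ["uuid", "rider"])
print(key)
```

Under that assumption, a filter on `uuid` alone or on `rider` alone does not determine the full key, so only a query with equality predicates on every key field could be answered from the index.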



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
