[ 
https://issues.apache.org/jira/browse/HUDI-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6248:
--------------------------------------
    Description: 
For COW tables, col stats processing/evaluation is straightforward, but for MOR 
tables it may not be. 

We store the file name as one of the components of the record key in col stats 
entries, so each file has its own entry in the col stats index in the MDT. For 
MOR tables, however, the expectation is one col stats value for the entire file 
slice (base file + a bunch of log files). 

We should also include merged file slices (last but one file slice's base file 
+ last but one file slice's log files + latest slice's log files). 

We need to go through the flow end to end and verify that these are all intact: 
 * Ensure we are able to evaluate one col stats value per file slice.
 * Ensure merged file slices are incorporated. 
 * Ensure delete log blocks are incorporated as well. 
 * Ensure custom deletes (custom payloads) are also incorporated. 
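
As a rough illustration of the first two checks, here is a minimal sketch in 
Python. It is not Hudi's implementation: the ColStats class, its field names, 
the file names and the helpers merge_slice_stats and may_match are all 
hypothetical. It folds per-file col stats entries into one value for a file 
slice by taking a plain union of the min/max ranges. That union is only a 
conservative bound; it does not account for delete log blocks or custom 
payloads, which is exactly the part that still needs to be validated. 

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ColStats:
    """Hypothetical stats for one column of one file (not Hudi's schema)."""
    file_name: str
    min_value: int
    max_value: int
    null_count: int


def merge_slice_stats(slice_id: str,
                      per_file: Iterable[ColStats]) -> Optional[ColStats]:
    """Fold per-file entries into one conservative entry for a file slice."""
    entries = list(per_file)
    if not entries:
        return None
    return ColStats(
        file_name=slice_id,
        # Union of ranges: never narrower than the true merged range.
        min_value=min(e.min_value for e in entries),
        max_value=max(e.max_value for e in entries),
        # Upper bound only: updates in log files may overwrite nulls.
        null_count=sum(e.null_count for e in entries),
    )


def may_match(stats: ColStats, lo: int, hi: int) -> bool:
    """True if a predicate lo <= col <= hi could match rows in the slice."""
    return stats.min_value <= hi and lo <= stats.max_value


# Assumed example of a merged file slice: the last but one slice's base
# file and log files, plus the latest slice's log files.
merged = merge_slice_stats("merged-slice", [
    ColStats("prev_base.parquet", 10, 50, 0),
    ColStats(".prev.log.1", 5, 20, 1),
    ColStats(".latest.log.1", 40, 90, 0),
])
```

With these made-up inputs the merged entry covers the range 5 to 90, so a 
query on values 60 to 70 cannot skip the slice even though the base file 
alone (10 to 50) would have been pruned. 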

 

  was:
For COW tables, col stats processing/evaluation is straightforward, but for MOR 
tables it may not be. 

We store the file name as one of the components of the record key in col stats 
entries, so each file has its own entry in the col stats index in the MDT. For 
MOR tables, however, the expectation is one col stats value for the entire file 
slice (base file + a bunch of log files). 

We should also include merged file slices (last but one file slice's base file 
+ last but one file slice's log files + latest slice's log files). 

We need to go through the flow end to end and verify that these are all intact. 

 


> Validate col stats for MOR data table
> -------------------------------------
>
>                 Key: HUDI-6248
>                 URL: https://issues.apache.org/jira/browse/HUDI-6248
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core, writer-core
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> For COW tables, col stats processing/evaluation is straightforward, but for 
> MOR tables it may not be. 
> We store the file name as one of the components of the record key in col 
> stats entries, so each file has its own entry in the col stats index in the 
> MDT. For MOR tables, however, the expectation is one col stats value for the 
> entire file slice (base file + a bunch of log files). 
> We should also include merged file slices (last but one file slice's base 
> file + last but one file slice's log files + latest slice's log files). 
> We need to go through the flow end to end and verify that these are all intact: 
>  * Ensure we are able to evaluate one col stats value per file slice.
>  * Ensure merged file slices are incorporated. 
>  * Ensure delete log blocks are incorporated as well. 
>  * Ensure custom deletes (custom payloads) are also incorporated. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
