[
https://issues.apache.org/jira/browse/HUDI-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan closed HUDI-8389.
-------------------------------------
Resolution: Won't Fix
> Optimize re-adding missing files to col stats pruning
> ------------------------------------------------------
>
> Key: HUDI-8389
> URL: https://issues.apache.org/jira/browse/HUDI-8389
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: metadata
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Blocker
> Fix For: 1.0.1
>
> Original Estimate: 2h
> Time Spent: 1h
> Remaining Estimate: 1h
>
> Here is our logic for col stats based pruning:
>
> h3. Pruning Design:
> * step1 : Fetch latest file slices for pruned partitions (from MDT)
> * step2.a : Fetch stats from the col stats index, which outputs in the format
> \{{File1, col1 ➝ stat1}, \{File2, col1 ➝ stat2},...} i.e. one entry per
> file,column combo. Here we read using
> HoodieTableMetadata.{*}getRecordsByKeyPrefixes(){*}, passing in just the
> {*}columns{*}.
> ** step2.b: Apply a filter function to prune entries from step 2.a based on
> the list from step 1. The col stats value contains the file name and we
> filter based on that. The output from this step is the latest files looked up
> from the col stats partition in MDT.
> ** step2.c : Construct a matrix of the format File1 ➝ \{col1_valuecount,
> col1_minvalue, col1_maxvalue, col2_valuecount, .... } i.e. one entry per file.
> ** step2.d: Get the list of files indexed by col stats.
> ** step2.e: Apply the query predicate and get the list of pruned file names
> over step 2.c.
> ** step3: If there are any files missing from the col stats index (step1
> output - step2.d output), add them back to the step 2.e output to get the
> final pruned files list. In other words, pruned files + missingToIndexFiles
> are the final set of candidate files we return from this step.
> *** let's name the output from step3 *candidate files.*
> ** step4: For every file slice from step1 => if every file in this file
> slice is missing from the candidate files, we can ignore the file slice (in
> other words, no file in this file slice matched the predicate from
> col stats, so we are safe to ignore the entire file slice). If even one file is
> present in the candidate files, we need to include the file slice in its entirety.
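
For readers following along, here is a minimal Python sketch of the pruning flow described above. It is not Hudi's implementation: `ColStat`, `prune_file_slices`, `may_match`, and the file names are all hypothetical, and the stats are reduced to a min/max per column. It is only meant to show why a file slice containing an unindexed file (for example a rollback block) always survives pruning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColStat:
    """Hypothetical stand-in for one col stats entry (one per file, column)."""
    file_name: str
    column: str
    min_value: int
    max_value: int


def prune_file_slices(file_slices, col_stats, may_match):
    """file_slices: dict of slice id -> list of file names (step 1 output).
    col_stats: iterable of ColStat entries read from the index (step 2).
    may_match: callable taking {column: ColStat} for one file and returning
    True when the query predicate could match rows in that file."""
    latest_files = {f for files in file_slices.values() for f in files}

    # Keep stats for the latest files only and pivot to one entry per file.
    matrix = {}
    for stat in col_stats:
        if stat.file_name in latest_files:
            matrix.setdefault(stat.file_name, {})[stat.column] = stat

    indexed_files = set(matrix)
    # Files whose stats say the predicate may match.
    pruned_files = {f for f, stats in matrix.items() if may_match(stats)}

    # Step 3: files with no col stats entry are conservatively added back.
    missing_files = latest_files - indexed_files
    candidate_files = pruned_files | missing_files

    # Final step: keep a slice in its entirety if any of its files is a candidate.
    return {
        slice_id: files
        for slice_id, files in file_slices.items()
        if any(f in candidate_files for f in files)
    }


def ts_equals_100(stats):
    """Example predicate: ts = 100, evaluated against min/max only."""
    stat = stats["ts"]
    return stat.min_value <= 100 <= stat.max_value


file_slices = {
    "slice1": ["base1.parquet", "log1a"],
    "slice2": ["base2.parquet", "log2a", "rollback2"],  # rollback2 is not indexed
}
col_stats = [
    ColStat("base1.parquet", "ts", 0, 50),
    ColStat("log1a", "ts", 60, 90),
    ColStat("base2.parquet", "ts", 0, 50),
    ColStat("log2a", "ts", 60, 90),
]

kept = prune_file_slices(file_slices, col_stats, ts_equals_100)
print(sorted(kept))
```

In this toy input no indexed file can contain `ts = 100`, yet `slice2` is still returned because `rollback2` has no stats and is re-added in step 3.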
>
> Why do we need to re-add the files missing from the col stats index (step
> 3)? We know there are 2 cases in which this could legitimately happen: for
> example, log files from a failed commit and rollback blocks. We can ignore
> these files and only do pruning based on the rest of the files in the file
> slice. For example, if we have a base file and 5 log files (out of which one
> is a rollback block) in a file slice, and the base file and 4 log files did
> not match the predicate, we should skip the file slice. But as of now, we
> can't skip it. In summary, as per the current logic, if a file slice has any
> of these (rollback block, delete block, or data blocks from a failed commit),
> it can never be filtered out w/ col stats based pruning. We should definitely
> revisit this and fix it as much as possible. For delete blocks also, I am
> wondering if we can do the same, i.e. on the write path, we can skip adding
> the entries to col stats, and then while pruning only consider files w/ valid
> stats to prune a file slice. For example, we have a base file and 3 log
> files, out of which one is a delete block. We do stats based pruning for the
> base file and 3 log files. If none of them matched, should we filter out the
> entire file slice? Or do we give it the benefit of the doubt and include it
> (which is what we do today)?
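
As a rough illustration of the alternative floated above (prune using only the files that have valid stats), here is a small Python sketch. It is an assumption-laden toy, not a proposal for the actual code: `prune_ignoring_unindexed` and `matched_by_file` are hypothetical names, and the choice to keep a slice that has no indexed files at all is my own conservative assumption. It also does not settle the open question about delete blocks.

```python
def prune_ignoring_unindexed(file_slices, matched_by_file):
    """file_slices: dict of slice id -> list of file names.
    matched_by_file: dict of file name -> bool, present only for files that
    have col stats; True means the predicate may match that file."""
    kept = {}
    for slice_id, files in file_slices.items():
        indexed = [f for f in files if f in matched_by_file]
        # Assumption: with no stats at all we cannot prune, so keep the slice.
        if not indexed or any(matched_by_file[f] for f in indexed):
            kept[slice_id] = files
    return kept


slices = {
    # Base file plus 5 log files, one of which is an unindexed rollback block.
    "slice1": ["base", "log1", "log2", "log3", "log4", "rollback"],
    # Nothing indexed yet for this slice.
    "slice2": ["new_base"],
}
matched = {"base": False, "log1": False, "log2": False, "log3": False, "log4": False}

result = prune_ignoring_unindexed(slices, matched)
print(sorted(result))
```

Here `slice1` is skipped because every indexed file failed the predicate and the rollback block is ignored, while `slice2` is kept because there are no stats to prune with.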
--
This message was sent by Atlassian Jira
(v8.20.10#820010)