GitHub user the-other-tim-brown edited a comment on the discussion: RFC-80: Design proposal discussions
### Reader path:

#### Baseline (No split column groups):

Pros:
- All data is in a base file + log files and can be easily read.
- Can easily prune the files that must be read.

Cons:
- Since the total number of keys per file is potentially smaller, we will have more files to open even if only a subset of columns is used.

#### Proposal A:

Pros:
- If we can maintain consistent ordering between the column groups, we can open multiple iterators, iterate through them in lockstep, and join the values to compute the final row.
- We can more easily prune files that need to be read since they are grouped by the keys.

Cons:
- Potentially small files can lead to performance issues.
- If the ordering of keys is not consistent between the files, we will need to do a join on the rows or buffer some of the files in memory to compute the final rows.
- If event time ordering is used and the ordering field is not in the column group that is read, then we will potentially need to read the value from the other file group to properly determine the final row when merging log files.

#### Proposal B:

Pros:
- Well-sized files lead to better read performance for individual files.

Cons:
- Since row keys are now split amongst various file groups, the rows must be computed by doing a join between the column groups.
- If a filter is specified on a field in a column group, we will not be able to easily prune the candidate files from the other file groups, leading to more IO for a given query.
- For incremental queries, if the commit time is only reflected in the updated column groups, then we may not be able to effectively filter out files, since we can only know when the row was updated after joining all the column groups.

GitHub link: https://github.com/apache/hudi/discussions/14062#discussioncomment-14630183
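To make the reader-side trade-off concrete, here is a rough sketch of the two ways rows could be reassembled from column groups. It is not Hudi code: `zip_join_sorted`, `hash_join`, and the `(record key, columns)` tuple layout are hypothetical names and shapes chosen for illustration only. The first function corresponds to the case where key ordering is consistent across column groups, and the second to the fallback where one side has to be buffered in memory.

```python
from itertools import zip_longest
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, object]
# Hypothetical shape: (record key, column values from one column group).
KeyedRow = Tuple[str, Row]

_MISSING = object()


def zip_join_sorted(left: Iterable[KeyedRow],
                    right: Iterable[KeyedRow]) -> Iterator[KeyedRow]:
    """Join two column groups that list the same keys in the same order.

    Both inputs are consumed in lockstep, so only one row per column
    group is held in memory at a time.
    """
    for l_item, r_item in zip_longest(left, right, fillvalue=_MISSING):
        if l_item is _MISSING or r_item is _MISSING:
            raise ValueError("column groups have different row counts")
        l_key, l_cols = l_item
        r_key, r_cols = r_item
        if l_key != r_key:
            raise ValueError(f"key order differs: {l_key!r} vs {r_key!r}")
        yield l_key, {**l_cols, **r_cols}


def hash_join(left: Iterable[KeyedRow],
              right: Iterable[KeyedRow]) -> Iterator[KeyedRow]:
    """Join two column groups whose key order is not aligned.

    The whole right side is buffered in a dict before any row is
    produced, which is the extra memory cost of inconsistent ordering.
    """
    buffered = dict(right)
    for key, l_cols in left:
        # Raises KeyError if the right side has no row for this key.
        yield key, {**l_cols, **buffered[key]}


if __name__ == "__main__":
    group_a = [("k1", {"id": 1}), ("k2", {"id": 2})]
    group_b = [("k1", {"name": "a"}), ("k2", {"name": "b"})]
    for key, row in zip_join_sorted(group_a, group_b):
        print(key, row)
```

The lockstep version fails fast when the ordering assumption is violated, which is the point at which a reader would have to fall back to something like `hash_join` and pay for buffering one side.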
