GitHub user the-other-tim-brown edited a comment on the discussion: RFC-80: 
Design proposal discussions

### Writer path:

#### Baseline (No split column groups):
Pros:
- Each key is assigned to exactly one file group, so the key lookup is done 
only once.

Cons:
- If the table has unstructured data or columns with large values, each file 
holds fewer keys in order to stay within the target file size.
- An update to only a subset of the columns still rewrites the full file.
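
To make the first con concrete, here is a rough back-of-the-envelope sketch. 
The target file size and row sizes are made-up numbers for illustration only, 
and the estimate ignores compression and columnar encoding:

```python
def keys_per_file(target_file_bytes: int, avg_row_bytes: int) -> int:
    """Approximate number of records that fit in one file of the target size."""
    return target_file_bytes // avg_row_bytes


# Hypothetical numbers, not taken from the RFC.
TARGET_FILE_BYTES = 120 * 1024 * 1024  # assumed 120 MiB target file size
NARROW_ROW_BYTES = 1024                # assumed 1 KiB of scalar columns per row
BLOB_BYTES = 63 * 1024                 # assumed 63 KiB unstructured column per row

narrow = keys_per_file(TARGET_FILE_BYTES, NARROW_ROW_BYTES)
wide = keys_per_file(TARGET_FILE_BYTES, NARROW_ROW_BYTES + BLOB_BYTES)

print(narrow)  # 122880 keys per file without the large column
print(wide)    # 1920 keys per file once the large column is stored alongside
```

With these assumed sizes, adding one large column cuts the keys per file by 
64x, so the same number of keys spreads over many more files.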

#### Proposal A:
Pros:
- Each key is assigned to exactly one file group, so the key lookup is done 
only once.
- An update to a subset of columns only needs to parse and rewrite the column 
groups that contain those columns, not every column.

Cons:
- All files within a file group contain the same keys, so if the amount of 
data per column group is uneven, the file sizes will be uneven too. This 
produces undersized or oversized files, which can later hurt read and update 
performance.
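
A minimal sketch of that imbalance, assuming two hypothetical column groups 
(`scalars` and `embeddings`) with made-up per-key sizes. Because every file in 
the file group holds the same keys, each file's size is just the key count 
times that column group's bytes per key:

```python
def proposal_a_file_sizes(num_keys: int, bytes_per_key_by_group: dict) -> dict:
    """Size of each column-group file when all files share the same key set."""
    return {
        group: num_keys * bytes_per_key
        for group, bytes_per_key in bytes_per_key_by_group.items()
    }


# Hypothetical column groups and per-key sizes, for illustration only.
sizes = proposal_a_file_sizes(
    num_keys=100_000,
    bytes_per_key_by_group={"scalars": 200, "embeddings": 6_000},
)

print(sizes["scalars"])     # 20000000  (about 20 MB, a small file)
print(sizes["embeddings"])  # 600000000 (about 600 MB, a large file)
```

No single key count lands both files near the same target size here: sizing 
for `embeddings` leaves `scalars` small, and sizing for `scalars` makes 
`embeddings` oversized.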


#### Proposal B:
Pros:
- An update to a subset of columns only needs to parse and rewrite the column 
groups that contain those columns, not every column.
- The file sizes grow independently, allowing for well-sized files within each 
column group.

Cons:
- A given key now belongs to N file groups, where N is the number of column 
groups. This increases the cost of an upsert operation.
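
As a rough illustration of the extra upsert cost, the sketch below counts 
key-to-file-group lookups per layout. It assumes the worst case where an 
upsert touches every column group; the layout names and the function are 
hypothetical and not part of any Hudi API:

```python
def lookups_for_upsert(num_keys: int, num_column_groups: int, layout: str) -> int:
    """Count key-to-file-group lookups for an upsert that touches all columns."""
    if layout in ("baseline", "proposal_a"):
        # Each key maps to exactly one file group.
        return num_keys
    if layout == "proposal_b":
        # Each key maps to one file group per column group.
        return num_keys * num_column_groups
    raise ValueError(f"unknown layout: {layout}")


# Hypothetical batch: 1,000,000 keys and 4 column groups.
for layout in ("baseline", "proposal_a", "proposal_b"):
    print(layout, lookups_for_upsert(1_000_000, 4, layout))
```

Under these assumptions Proposal B performs N times as many lookups as the 
baseline or Proposal A for the same batch.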


GitHub link: 
https://github.com/apache/hudi/discussions/14062#discussioncomment-14630104
