hililiwei opened a new pull request, #6043:
URL: https://github.com/apache/iceberg/pull/6043
# Proposal: Partial Updates
## Motivation
Take feature engineering as an example: a table may have thousands or even tens
of thousands of columns, but a task updates only a few of them.
Currently, to update a row we need to fetch all of its columns, which is
very inefficient. If we support partial updates, a write only needs to generate
data files containing the equality columns and the updated columns, which
greatly improves throughput and reduces complexity (you do not need to query
the values of columns that are not being changed). When reading, we combine the
data file with the partial update file, which has some similarities to
merge-on-read (MOR). In addition, to improve read efficiency, a background
asynchronous task can merge the files when the system is idle.
### Partial Update Files
Partial update files identify updated rows in a collection of data files by
one or more column values, and include one or more columns of those rows
that need to be updated.
Partial update files store any subset of a table’s columns and use the
table’s field ids. The *equality columns* are the columns of the file used to
match data rows. The *partial columns* are the columns of the file used to
update the corresponding columns of the matching data row.
The partial columns of a data row are updated to the new values if the row’s
equality column values are equal to all equality columns of any row in a
partial update file that applies to the row’s data file.
For example, a table with the following data:
```text
1: id | 2: category | 3: name
-------|-------------|---------
1 | marsupial | Koala
2 | toy | Teddy
3 | NULL | Grizzly
4 | NULL | Polar
```
The update `name = Lily` for the row with `id = 3` could be written as the
following partial update file:
```text
equality_ids=[1]
partial_ids=[3]
1: id | 3: name
-------|---------
3 | Lily
```
After applying the partial update file, the table will have the following data:
```text
1: id | 2: category | 3: name
-------|-------------|---------
1 | marsupial | Koala
2 | toy | Teddy
3 | NULL | Lily
4 | NULL | Polar
```
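The matching rule above can be sketched in a few lines of Python. This is only an illustration of the read-time merge, not the implementation in this PR: the function name `apply_partial_update` and the in-memory row representation (a dict keyed by field id) are hypothetical, and the choices that the last matching update row wins and that NULL equality values match each other are assumptions.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[int, Any]  # field id -> value (hypothetical in-memory representation)


def apply_partial_update(
    data_rows: List[Row],
    update_rows: List[Row],
    equality_ids: List[int],
    partial_ids: List[int],
) -> List[Row]:
    """Return copies of data_rows with partial columns overwritten on match."""
    # Index the partial update rows by their equality-column values.
    updates: Dict[Tuple[Any, ...], Row] = {}
    for update in update_rows:
        key = tuple(update[field_id] for field_id in equality_ids)
        updates[key] = update  # assumption: the last update row for a key wins

    merged: List[Row] = []
    for row in data_rows:
        key = tuple(row.get(field_id) for field_id in equality_ids)
        new_row = dict(row)  # never mutate the input rows
        match = updates.get(key)
        if match is not None:
            # Only the partial columns change; all other columns are kept.
            for field_id in partial_ids:
                new_row[field_id] = match[field_id]
        merged.append(new_row)
    return merged


# The example table and partial update file from above.
data_rows = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]
update_rows = [{1: 3, 3: "Lily"}]

merged = apply_partial_update(data_rows, update_rows, equality_ids=[1], partial_ids=[3])
```

Only the row with `id = 3` changes, and its `category` column is carried over from the data file without being rewritten.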
Illustration:

In this example, we find the rows with id1 and id3 in a.file, then update
their Col1 to the new value.

In this example, we add a new column to the table and insert a partial
update file that contains only the new column. It might look more like an
insert, except that we are inserting new columns for existing rows rather than
inserting new rows.
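As a rough sketch of the new-column case, the snippet below merges a partial update file that carries only an added column into existing rows. Everything here is illustrative: the function `read_with_new_column`, field id `4`, and the value `"forest"` are hypothetical, and reading NULL for rows that have no matching update row is an assumption.

```python
from typing import Any, Dict, List

Row = Dict[int, Any]  # field id -> value (hypothetical in-memory representation)


def read_with_new_column(
    data_rows: List[Row],
    update_rows: List[Row],
    equality_id: int,
    new_field_id: int,
) -> List[Row]:
    """Attach the new column to each existing row, matched on one equality column."""
    new_values = {u[equality_id]: u[new_field_id] for u in update_rows}
    # Assumption: rows without a matching update row read the new column as None.
    return [
        {**row, new_field_id: new_values.get(row[equality_id])}
        for row in data_rows
    ]


existing_rows = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
]
# Hypothetical partial update file: equality_ids=[1], partial_ids=[4]
new_column_rows = [{1: 1, 4: "forest"}]

result = read_with_new_column(existing_rows, new_column_rows, equality_id=1, new_field_id=4)
```

The existing data files are left untouched; the new column is filled in only when the rows are read.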
### Brief change log
This PR consists of two parts:
* Evolution of the table format specification, mainly to add partial update files
* Writing and reading partial update files
P.S.:
This is an internal feature that is still under development. I wanted to hear
from the community early on, so I raised this PR before it was finished. The
engine integration part is still to come, but I think this PR is the core part,
and we should discuss it here first to try and get on the same page.
This approach can address a large number of business scenarios.
Internally, we have implemented it with Flink and achieved satisfactory results
in validation.
What is the community's view of it? I hope to receive your feedback.