Re: Iceberg 1.10.0 release update - July 1, 2025

Steven Wu Mon, 14 Jul 2025 22:18:03 -0700

> Engines may model operations as deleting/inserting rows or as
> modifications to rows that preserve row ids.


Manu, I agree this sentence probably lacks some context. The first half (as
deleting/inserting rows) is probably about the row lineage handling with
equality deletes, which is described in another place.

"Row lineage does not track lineage for rows updated via Equality Deletes
<https://iceberg.apache.org/spec/#equality-delete-files>, because engines
using equality deletes avoid reading existing data before writing changes
and can't provide the original row ID for the new rows. These updates are
always treated as if the existing row was completely removed and a unique
new row was added."
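
A minimal sketch of the two ways an engine can model an update, assuming a toy in-memory row (the `Row` class and both helper functions are hypothetical names for illustration, not Iceberg APIs). A value of `None` stands for a null lineage field that gets filled in by inheritance on read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    payload: str
    row_id: Optional[int]            # None = assigned by inheritance on read
    last_updated_seq: Optional[int]  # None = inherited from the writing commit


def update_preserving_row_id(existing: Row, new_payload: str) -> Row:
    # The engine read the existing row, so it can copy the row id forward.
    # The row was modified, so the sequence number is left null to be inherited.
    return Row(new_payload, existing.row_id, None)


def update_via_equality_delete(new_payload: str) -> Row:
    # The engine never read the old row, so it has no row id to copy.
    # The result looks exactly like a brand-new row.
    return Row(new_payload, None, None)


old = Row("v1", row_id=7, last_updated_seq=3)
preserved = update_preserving_row_id(old, "v2")
replaced = update_via_equality_delete("v2")
```

Here `preserved` keeps row id 7, while `replaced` carries no link back to the old row, which is the "deleted and inserted" case from the quoted sentence.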

On Mon, Jul 14, 2025 at 5:49 PM Manu Zhang <[email protected]> wrote:

> Thanks Steven, I missed that part but the following sentence is a bit hard
> to understand (maybe just me)
>
> Engines may model operations as deleting/inserting rows or as
> modifications to rows that preserve row ids.
>
> Can you please help to explain?
>
>
> On Tue, Jul 15, 2025 at 4:41 AM Steven Wu <[email protected]> wrote:
>
>> Manu
>>
>> The spec already covers the row lineage carry over (for replace)
>> https://iceberg.apache.org/spec/#row-lineage
>>
>> "When an existing row is moved to a different data file for any reason,
>> writers should write _row_id and _last_updated_sequence_number according
>> to the following rules:"
>>
>> Thanks,
>> Steven
>>
>>
>> On Mon, Jul 14, 2025 at 1:38 PM Steven Wu <[email protected]> wrote:
>>
>>> Another update on the release.
>>>
>>> We have one open PR left for the 1.10.0 milestone
>>> <https://github.com/apache/iceberg/milestone/54> (with 25 closed PRs).
>>> Amogh is actively working on the last blocker PR.
>>> Spark 4.0: Preserve row lineage information on compaction
>>> <https://github.com/apache/iceberg/pull/13555>
>>>
>>> I will publish a release candidate after the above blocker is merged and
>>> backported.
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Mon, Jul 7, 2025 at 11:56 PM Manu Zhang <[email protected]>
>>> wrote:
>>>
>>>> Hi Amogh,
>>>>
>>>> Is it defined in the table spec that a "replace" operation should carry
>>>> over existing lineage info instead of assigning new IDs? If not, we'd
>>>> better define it in the spec first, because all engines and
>>>> implementations need to follow it.
>>>>
>>>> On Tue, Jul 8, 2025 at 11:44 AM Amogh Jahagirdar <[email protected]>
>>>> wrote:
>>>>
>>>>> One other area I think we need to make sure works with row lineage
>>>>> before release is data file compaction. At the moment
>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackFileRewriteRunner.java#L44>,
>>>>> it looks like compaction will read the records from the data files
>>>>> without projecting the lineage fields. What this means is that on write
>>>>> of the new compacted data files we'd be losing the lineage information.
>>>>> There's no data change in a compaction, but we do need to make sure the
>>>>> lineage info from carried-over records is materialized in the newly
>>>>> compacted files so they don't get new IDs or inherit the new file
>>>>> sequence number. I'm working on addressing this, but I'd call this out
>>>>> as a blocker as well.
>>>>>
>>>>
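
To make the compaction concern quoted above concrete, here is a rough sketch under a simplified model of lineage inheritance: a null `_row_id` resolves to the file's first row id plus the row position, and a null `_last_updated_sequence_number` resolves to the file's sequence number. `DataFile`, `read_with_lineage`, and `compact` are hypothetical names for illustration only and do not correspond to the Spark rewrite code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (row_id, last_updated_seq, payload); None means "inherit when read"
StoredRow = Tuple[Optional[int], Optional[int], str]


@dataclass
class DataFile:
    first_row_id: int
    sequence_number: int
    rows: List[StoredRow]


def read_with_lineage(f: DataFile) -> List[Tuple[int, int, str]]:
    # Resolve null lineage fields from the file's metadata and row position.
    out = []
    for pos, (row_id, seq, payload) in enumerate(f.rows):
        out.append((
            f.first_row_id + pos if row_id is None else row_id,
            f.sequence_number if seq is None else seq,
            payload,
        ))
    return out


def compact(files: List[DataFile], new_first_row_id: int, new_seq: int,
            project_lineage: bool) -> DataFile:
    rows: List[StoredRow] = []
    for f in files:
        for row_id, seq, payload in read_with_lineage(f):
            if project_lineage:
                # Materialize the resolved values in the rewritten file.
                rows.append((row_id, seq, payload))
            else:
                # Lineage columns dropped: the values are written as null.
                rows.append((None, None, payload))
    return DataFile(new_first_row_id, new_seq, rows)


a = DataFile(first_row_id=0, sequence_number=1,
             rows=[(None, None, "a"), (None, None, "b")])
b = DataFile(first_row_id=2, sequence_number=2, rows=[(None, None, "c")])

before = read_with_lineage(a) + read_with_lineage(b)
kept = read_with_lineage(compact([a, b], 100, 3, project_lineage=True))
lost = read_with_lineage(compact([a, b], 100, 3, project_lineage=False))
```

With the lineage fields projected, `kept` equals `before`. Without them, every row in `lost` picks up a new id starting at 100 and the new sequence number 3, even though no data changed.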
