Re: [DISCUSS] Offloading Snapshots from Metadata.json

Ryan Blue Thu, 16 Apr 2026 09:08:41 -0700

They do? Where is that?

Definitely something we should remove as soon as we can.


On Thu, Apr 16, 2026 at 8:58 AM Yufei Gu <[email protected]> wrote:

> To add to that, some engines like Spark still assume metadata.json exists
> in storage. The executors load the file directly instead of checking the
> REST catalog for table metadata. We will need to modify that.
>
> Yufei
>
>
> On Thu, Apr 16, 2026 at 8:45 AM Ryan Blue <[email protected]> wrote:
>
>> I think that the problem of large metadata.json files is largely solved
>> by the REST protocol, which does not need to send snapshots to clients. I
>> agree with Anton's suggestion to relax the requirement that the
>> metadata.json file has to be stored somewhere (for v4). As long as catalogs
>> are required to be able to produce the full content of metadata.json when
>> loading the table for a client requesting all snapshots, we don't need to
>> worry about storing the file.
>>
>> There are two things to keep in mind though:
>> 1. I think the current Java REST implementation still requests all
>> snapshots to commit, which we should fix
>> 2. I think it is a bad idea to split up the metadata.json file for
>> non-REST catalogs. This introduces way too much complexity that necessarily
>> leaks out of the catalog implementation. I don't think this is a problem
>> worth solving when we have a perfectly good solution that has significant
>> benefits.
>>
>> Ryan
>>
>> On Thu, Apr 16, 2026 at 12:13 AM Innocent Djiofack <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thank you for the replies. Steven, the change is scoped to offloading only
>>> the snapshot history. Yufei, yes, this is a large change. I agree that
>>> removing the requirement for a metadata.json file per commit in storage
>>> would address most of the concerns. If there is already a design doc
>>> for that direction, please share it with me. If not, I can start something
>>> around that line of reasoning.
>>>
>>> Thanks.
>>>
>>> On Tue, Apr 14, 2026 at 4:09 PM Yufei Gu <[email protected]> wrote:
>>>
>>>> Separating snapshot history from table metadata feels like a large,
>>>> invasive change since it would require updates across all clients and
>>>> engines. If we instead remove the requirement for a metadata.json file per
>>>> commit in storage, many of the current concerns could be addressed. This
>>>> seems like a more practical path forward. There are already
>>>> multiple discussions about it. I'd suggest moving forward in that
>>>> direction.
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Tue, Apr 14, 2026 at 8:44 AM Steven Wu <[email protected]> wrote:
>>>>
>>>>> I understand the problem we are trying to solve here. But the actual
>>>>> proposed solution is unclear to me. The proposal seems to lack some
>>>>> details in the actual design/solution.
>>>>>
>>>>> How do the proposed snapshot read and write APIs differ from the
>>>>> current APIs? I can't tell the difference.
>>>>>
>>>>> > Once defined, this interface could be implemented by various
>>>>> backing stores, such as another file or even a Catalog.
>>>>>
>>>>> To support offloading, we probably have to update the table metadata
>>>>> in the table spec
>>>>> <https://iceberg.apache.org/spec/#table-metadata-fields>. Does this
>>>>> depend on making the metadata.json file optional? Or is this limited to just
>>>>> externalizing the snapshot list?
>>>>>
>>>>> On Tue, Apr 14, 2026 at 2:53 AM Jean-Baptiste Onofré <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Innocent
>>>>>>
>>>>>> Maybe it's kind of redundant with the V4 initiative?
>>>>>> What are your thoughts on this?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Tue, Apr 14, 2026 at 6:44 AM Innocent Djiofack <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> My name is Innocent and I have enjoyed working on the Apache Iceberg
>>>>>>> project so far and have learned a lot from people in the group.
>>>>>>> I wanted to follow up on a concern raised by Anton around the
>>>>>>> growing size of metadata.json and the problems it brings. Before going
>>>>>>> ahead and doing the implementation work, I wanted to share the high-level
>>>>>>> thinking with the community and get feedback. You will find the link to
>>>>>>> the proposal here:
>>>>>>> <https://docs.google.com/document/d/1xpzpsA9BGSkxo58yUhSdDQaSu7_ITQLFmGarEOyM8P0/edit?tab=t.0#heading=h.7g59t9p9o1xi>
>>>>>>> I would appreciate comments and feedback on it.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> *DJIOFACK INNOCENT*
>>>>>>> *"Be better than the day before!" -*
>>>>>>> *+1 404 751 8024*
>>>>>>>
>>>>>>
>>>
>>> --
>>>
>>> *DJIOFACK INNOCENT*
>>> *"Be better than the day before!" -*
>>> *+1 404 751 8024*
>>>
>>
