Re: [DISCUSS] Offloading Snapshots from Metadata.json

Prashant Singh Thu, 16 Apr 2026 11:09:57 -0700

Hey Ryan / Yufei,
Here is my one attempt to get rid of that. It was from a governance point
of view, and the dependency mostly comes from SerializableTable [1].
If we are all on board, I can clean up and revive this effort.


[1] https://github.com/apache/iceberg/pull/14944#issuecomment-3812676977

Best,
Prashant Singh

On Thu, Apr 16, 2026 at 9:08 AM Ryan Blue <[email protected]> wrote:

> They do? Where is that?
>
> Definitely something we should remove as soon as we can.
>
> On Thu, Apr 16, 2026 at 8:58 AM Yufei Gu <[email protected]> wrote:
>
>> To add to that, some engines like Spark still assume metadata.json exists
>> in storage. The executors load the file directly instead of checking the
>> REST catalog for table metadata. We will need to modify that.
>>
>> Yufei
>>
>>
>> On Thu, Apr 16, 2026 at 8:45 AM Ryan Blue <[email protected]> wrote:
>>
>>> I think that the problem of large metadata.json files is largely solved
>>> by the REST protocol, which does not need to send snapshots to clients. I
>>> agree with Anton's suggestion to relax the requirement that the
>>> metadata.json file has to be stored somewhere (for v4). As long as catalogs
>>> are required to be able to produce the full content of metadata.json when
>>> loading the table for a client requesting all snapshots, we don't need to
>>> worry about storing the file.
>>>
>>> There are two things to keep in mind though:
>>> 1. I think the current Java REST implementation still requests all
>>> snapshots to commit, which we should fix.
>>> 2. I think it is a bad idea to split up the metadata.json file for
>>> non-REST catalogs. This introduces way too much complexity that necessarily
>>> leaks out of the catalog implementation. I don't think this is a problem
>>> worth solving when we have a perfectly good solution that has significant
>>> benefits.
>>>
>>> Ryan
>>>
>>> On Thu, Apr 16, 2026 at 12:13 AM Innocent Djiofack <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thank you for the replies. Steven, the change is scoped to offloading
>>>> only the snapshot history. Yufei, yes, this is a large change. I
>>>> agree that removing the requirement for a metadata.json file per commit in
>>>> storage would address most of the concerns. If there is already a design doc
>>>> for that direction, please share it with me. If not, I can start something
>>>> along that line of reasoning.
>>>>
>>>> Thanks.
>>>>
>>>> On Tue, Apr 14, 2026 at 4:09 PM Yufei Gu <[email protected]> wrote:
>>>>
>>>>> Separating snapshot history from table metadata feels like a large,
>>>>> invasive change since it would require updates across all clients and
>>>>> engines. If we instead remove the requirement for a metadata.json file per
>>>>> commit in storage, many of the current concerns could be addressed. This
>>>>> seems like a more practical path forward. There are already
>>>>> multiple discussions on that topic. I'd suggest moving forward in that
>>>>> direction.
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Tue, Apr 14, 2026 at 8:44 AM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I understand the problem we are trying to solve here, but the actual
>>>>>> proposed solution is unclear to me. The proposal seems to lack some
>>>>>> details in the actual design/solution.
>>>>>>
>>>>>> How do the proposed snapshot read and write APIs differ from the
>>>>>> current APIs? I can't tell the difference.
>>>>>>
>>>>>> > Once defined, this interface could be implemented by various
>>>>>> > backing stores, such as another file or even a Catalog.
>>>>>>
>>>>>> To support offloading, we probably have to update the table metadata
>>>>>> in the table spec
>>>>>> <https://iceberg.apache.org/spec/#table-metadata-fields>. Does this
>>>>>> depend on making metadata.json file optional? Or is this limited to just
>>>>>> externalizing the snapshot list?
>>>>>>
>>>>>> On Tue, Apr 14, 2026 at 2:53 AM Jean-Baptiste Onofré <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Innocent
>>>>>>>
>>>>>>> Maybe it's kind of redundant with the V4 initiative?
>>>>>>> What are your thoughts on this?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Tue, Apr 14, 2026 at 6:44 AM Innocent Djiofack <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hello Everyone,
>>>>>>>>
>>>>>>>> My name is Innocent. I have enjoyed working on the Apache Iceberg
>>>>>>>> project so far and have learned a lot from people in the group.
>>>>>>>> I wanted to follow up on a concern raised by Anton around the
>>>>>>>> growing size of metadata.json and the problems it brings. Before going
>>>>>>>> ahead and doing the implementation work, I wanted to share the
>>>>>>>> high-level thinking with the community and get feedback. You will find
>>>>>>>> the link to the proposal here
>>>>>>>> <https://docs.google.com/document/d/1xpzpsA9BGSkxo58yUhSdDQaSu7_ITQLFmGarEOyM8P0/edit?tab=t.0#heading=h.7g59t9p9o1xi>.
>>>>>>>> I would appreciate comments and feedback on it.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> *DJIOFACK INNOCENT*
>>>>>>>> *"Be better than the day before!" -*
>>>>>>>> *+1 404 751 8024*
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>>
>>>> *DJIOFACK INNOCENT*
>>>> *"Be better than the day before!" -*
>>>> *+1 404 751 8024*
>>>>
>>>
