Hi all, regarding the "big metadata" issue: my understanding is that even the
Plan/Preplan APIs, in the task planning use case, will hit the same issue
when the engine is doing a full table scan on a large table. Is my
understanding correct?

Also, given that metadata compute could be heavy, have we considered some kind
of distributed processing in the REST catalog? If we want to avoid making
things complicated, just thinking out loud, maybe another option is to let
the REST catalog offload the metadata compute to a separate compute
framework (e.g. Trino) and retrieve the result.

On Thu, Jul 4, 2024 at 9:08 AM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Thanks Yufei!
>
> I think it's worth thinking through if it makes sense to leverage
> Plan/Preplan APIs like Jack alluded to. I think this makes sense from a
> scale argument, since in the worst case the Plan/Preplan APIs need to be
> able to churn through all the metadata anyways. However, with this approach
> we probably want to think through the API modeling because currently
> Plan/Preplan is designed around data file/delete file scan tasks. It sounds
> theoretically possible but at least to me it's not obvious if it'll make
> for a good API to work with.
>
> Overall though, I like the general direction of this.
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Thu, Jul 4, 2024 at 4:10 AM Robert Stupp <sn...@snazy.de> wrote:
>
>> Hi Yufei,
>>
>> I think the proposal is very interesting! The direction this and other
>> proposals are going is IMO the right one.
>>
>> Since many proposals need access to at least manifest-lists and manifest
>> files, potentially also data/delete files, does it make sense to bundle all
>> proposals that need this ability?
>>
>> Robert
>> On 03.07.24 22:44, Yufei Gu wrote:
>>
>> Hi folks,
>>
>> I'd like to discuss a new proposal to support server-side metadata tables.
>>
>> One of Iceberg's most advantageous features is the ability to inspect a
>> table using metadata tables. For instance, we can query snapshots just like
>> we query data rows using the following command: SELECT * FROM
>> prod.db.table.snapshots;
>>
>> With the REST catalog, we can simplify this process further by providing
>> metadata directly from REST endpoints. Here are several benefits of this
>> approach:
>>
>>    - Engine Independence: The metadata tables do not rely on a specific
>>    implementation of an engine. The REST server returns the results directly.
>>    For example, the Rust Iceberg does not need to implement its own logic to
>>    query the snapshot table if it connects to a server with this capability.
>>    This reduces the complexity and development effort required for different
>>    clients and engines.
>>    - Enables New Use Cases: A catalog UI or Lakehouse UI can present a
>>    table's metadata (e.g., snapshot/partition list) without relying on an
>>    engine like Trino. This opens up possibilities for lightweight UIs and
>>    tools that can directly interact with the REST endpoints to retrieve and
>>    display metadata.
>>    - Enhanced Performance: With server-side caching, the server-side
>>    metadata tables will perform better. Caching reduces the need to
>>    repeatedly compute or retrieve metadata, leading to faster response
>>    times and reduced load on the underlying storage systems.
>>
>> Here is the proposal in a Google Doc:
>> https://docs.google.com/document/d/1MVLwyMQtZ-7jewsQ0PuTvtJbpfl4HCoVdbowMqFTmfc/edit?usp=sharing
>>
>> Estimated read time: 5 mins
>>
>> Would really appreciate any feedback on this topic and proposal!
>>
>>
>> Yufei
>>
>> --
>> Robert Stupp
>> @snazy
>>
>>
