Hi Yufei,

Interesting that we are thinking about similar things. I had this on the
list of roadmap discussion items for the catalog sync meeting, but I
removed it before the meeting because I felt it was too early to discuss.

My main concern with server-side metadata tables is how we solve the
"big metadata" issue. The partitions, manifests, and files tables can each
easily become large tables themselves, at which point the REST server
becomes inefficient at retrieving results. It's the same old "HMS is too
slow in iterating through the partitions" problem. Iceberg largely solves
it by keeping this information in Avro files in storage, where it can be
scanned in a distributed fashion, but with server-side metadata tables we
are effectively reintroducing the problem.

One potential approach is to run those potentially large metadata table
scans through the PreplanTable and PlanTable APIs. This is just a quick
thought for now; I need to think a bit more about it.
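
To make the idea a bit more concrete, here is a minimal in-memory sketch of
a two-phase scan, where the server first hands back small task handles and
the rows are then fetched one bounded slice at a time. Everything in it
(PlanTask, FakeMetadataServer, plan, fetch, max_rows_per_task) is a
hypothetical name made up for illustration; it is not the actual
PreplanTable/PlanTable API and makes no claim about how those behave.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class PlanTask:
    """Hypothetical handle for one bounded slice of a metadata table scan."""
    start: int
    end: int  # exclusive


class FakeMetadataServer:
    """In-memory stand-in for a REST server; NOT a real Iceberg API."""

    def __init__(self, rows: List[dict], max_rows_per_task: int) -> None:
        self._rows = rows
        self._max = max_rows_per_task

    def plan(self) -> List[PlanTask]:
        # Phase 1: return only cheap task handles, never the rows themselves.
        return [
            PlanTask(i, min(i + self._max, len(self._rows)))
            for i in range(0, len(self._rows), self._max)
        ]

    def fetch(self, task: PlanTask) -> List[dict]:
        # Phase 2: each call returns at most max_rows_per_task rows.
        return self._rows[task.start:task.end]


def scan(server: FakeMetadataServer) -> Iterator[dict]:
    # A single client walks the tasks here; the tasks are independent,
    # so separate workers could fetch them in parallel instead.
    for task in server.plan():
        yield from server.fetch(task)


# Made-up rows standing in for a large "partitions" metadata table.
rows = [{"partition": f"day={d}", "file_count": d % 7} for d in range(10)]
server = FakeMetadataServer(rows, max_rows_per_task=4)
tasks = server.plan()
result = list(scan(server))
```

The point of the split is that no single response has to carry the whole
metadata table, so the response size stays bounded no matter how large the
table grows.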

Best,
Jack Ye

On Wed, Jul 3, 2024 at 1:45 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi folks,
>
> I'd like to discuss a new proposal to support server-side metadata tables.
>
> One of Iceberg's most advantageous features is the ability to inspect a
> table using metadata tables. For instance, we can query snapshots just like
> we query data rows, using the following command:
>
>     SELECT * FROM prod.db.table.snapshots;
>
> With the REST catalog, we can simplify this process further by providing
> metadata directly from REST endpoints. Here are several benefits of this
> approach:
>
>    - Engine Independence: The metadata tables do not rely on a specific
>    implementation of an engine. The REST server returns the results directly.
>    For example, the Rust Iceberg does not need to implement its own logic to
>    query the snapshot table if it connects to a server with this capability.
>    This reduces the complexity and development effort required for different
>    clients and engines.
>    - Enables New Use Cases: A catalog UI or Lakehouse UI can present a
>    table's metadata (e.g., snapshot/partition list) without relying on an
>    engine like Trino. This opens up possibilities for lightweight UIs and
>    tools that can directly interact with the REST endpoints to retrieve and
>    display metadata.
>    - Enhanced Performance: With server-side caching, the server-side
>    metadata tables will perform better. Caching reduces the need to repeatedly
>    compute or retrieve metadata, leading to faster response times and reduced
>    load on the underlying storage systems.
>
> Here is the proposal as a Google Doc:
> https://docs.google.com/document/d/1MVLwyMQtZ-7jewsQ0PuTvtJbpfl4HCoVdbowMqFTmfc/edit?usp=sharing
>
> Estimated read time: 5 mins
>
> Would really appreciate any feedback on this topic and proposal!
>
>
> Yufei
>
