Thanks Yufei for the +1.

JB, could you help add a biweekly metrics architecture sync to the Polaris
community calendar? I'm thinking Thursdays, 9-10am PT (60 minutes), on the
off-weeks from the community meeting, starting May 7.

Here's a rough agenda to work through over the first few sessions, grouped
by priority:

*First: foundational direction*

1.  MetricsPersistence: public SPI or internal implementation detail?
   •   It is marked @Beta and the javadoc calls it a "Service Provider
Interface", but it has only one consumer (JdbcBasePersistenceImpl) and lives
on BasePersistence. If it is demoted to a private helper inside a persisting
reporter impl, most downstream design decisions become implementation
details rather than contract questions.

2.  Persistence schema redesign
   •   Current two-table layout (scan_metrics_report,
commit_metrics_report) with ~25 flattened columns each. Every new metric
type requires a new table, SPI method, record class, model, converter, and
schema migration. Direction to explore: single table with metric_type enum,
schema_version, and JSON payload column.
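
To make the single-table direction concrete, here is a minimal Java sketch of what one row could look like. It only illustrates the idea: the names MetricType, MetricRow, and payloadContract are hypothetical, the column list is taken from the earlier message in this thread, and the field types (for example a long entity id) are assumptions, not existing Polaris code.

```java
import java.time.Instant;
import java.util.Objects;
import java.util.OptionalLong;

/** Hypothetical metric kinds; a new kind would be a new constant, not a new table. */
enum MetricType { SCAN, COMMIT }

/**
 * Hypothetical row shape for a single metrics table. The fixed columns are the
 * ones proposed in this thread; everything metric-specific goes into payloadJson.
 */
record MetricRow(
    MetricType metricType,
    long entityId,              // assumed numeric; the real id type may differ
    String tableIdentifier,
    OptionalLong snapshotId,    // models the nullable snapshot_id column
    Instant receivedTs,
    int schemaVersion,
    String payloadJson) {

  MetricRow {
    Objects.requireNonNull(metricType, "metricType");
    Objects.requireNonNull(tableIdentifier, "tableIdentifier");
    Objects.requireNonNull(snapshotId, "snapshotId");
    Objects.requireNonNull(receivedTs, "receivedTs");
    Objects.requireNonNull(payloadJson, "payloadJson");
    if (schemaVersion < 1) {
      throw new IllegalArgumentException("schemaVersion must be >= 1");
    }
  }

  /** Readers would dispatch on this (type, version) pair to pick a payload decoder. */
  String payloadContract() {
    return metricType + "/v" + schemaVersion;
  }
}

final class MetricRowSketch {
  public static void main(String[] args) {
    MetricRow row = new MetricRow(
        MetricType.SCAN, 42L, "ns.tbl", OptionalLong.empty(),
        Instant.parse("2026-05-07T16:00:00Z"), 1, "{\"resultDataFiles\":3}");
    System.out.println(row.payloadContract());
  }
}
```

Under this shape, a new metric type needs an enum constant and a documented payload schema for its (type, version) pair; the table itself stays unchanged.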

*Second: design details once direction is set*

3.  Partition key strategy
   •   Single-table design means scan metrics at scale will have high write
concurrency per table. Schema needs to expose enough structure for backends
to shard by entity or time range.

4.  Read/write path consistency
   •   Writes go through PolarisMetricsManager on MetaStoreManager. Reads
bypass MetaStoreManager and go straight to BasePersistence, excluding
non-JDBC backends from the read API.
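
For item 3, a rough Java sketch of the two sharding options mentioned above (by entity, by time range). The class and method names are hypothetical, and the key format and daily bucket size are assumptions for illustration only; a real backend would pick its own.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helpers showing how a backend might derive partition keys. */
final class MetricPartitionKeys {
  // UTC day bucket; the bucket size is an assumption, not a recommendation.
  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("uuuuMMdd").withZone(ZoneOffset.UTC);

  private MetricPartitionKeys() {}

  /** Composite key: spreads writes across entities and bounds each partition by day. */
  static String byEntityAndDay(long entityId, Instant receivedTs) {
    return entityId + "#" + DAY.format(receivedTs);
  }

  /** Hash-based shard index for backends that partition by a fixed shard count. */
  static int shardOf(long entityId, int shardCount) {
    if (shardCount <= 0) {
      throw new IllegalArgumentException("shardCount must be positive");
    }
    // floorMod keeps the result in [0, shardCount) even for negative hashes.
    return Math.floorMod(Long.hashCode(entityId), shardCount);
  }

  public static void main(String[] args) {
    Instant ts = Instant.parse("2026-05-07T16:00:00Z");
    System.out.println(byEntityAndDay(42L, ts));
    System.out.println(shardOf(42L, 8));
  }
}
```

Either helper only needs entity_id and received_ts, which is why those two fields would have to stay as first-class columns rather than being folded into the JSON payload.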

*Third: cleanup and alignment*

5.  PolarisMetricsReporter naming
   •   Only handles IRC (ScanReport/CommitReport), doesn't cover generic
tables or operational metrics. Name is broader than scope.

6.  PolarisMetricsManager facade passthrough
   •   Entire default method is callCtx.getMetaStore().writeScanReport().
Zero logic: it passes Level 1 (MetaStoreManager) straight through to Level 3
(base persistence). Same anti-pattern as PolarisEventManager.

7.  Iceberg community alignment
   •   Payload-type extension needs discussion on dev@iceberg. obelix74's
Feb thread got zero replies. Needs a committer voice.

Let's confirm prioritization in the first session.

-ej

On Tue, Apr 21, 2026 at 3:18 PM Yufei Gu <[email protected]> wrote:

> Thanks everyone for continuing to drive this forward. I agree that the
> problem is getting complex enough that a more structured discussion would
> help.
>
> +1 on setting up a biweekly sync for the metrics architecture. I’m happy to
> join.
>
> Yufei
>
>
> On Tue, Apr 21, 2026 at 2:34 PM EJ Wang <[email protected]>
> wrote:
>
> > Also, I've been looking more closely at the *persistence schema in the
> > current metrics work*, and I think there's a structural rigidity problem
> > worth raising before the shape gets locked in.
> >
> > Right now we have two separate tables (scan_metrics_report and
> > commit_metrics_report), each with ~25 flattened columns that directly
> > mirror the Iceberg report fields. The SPI follows the same split:
> > writeScanReport and writeCommitReport as separate methods, with per-type
> > record classes, converters, and model objects. *The practical cost:
> > adding a new metric type (operational metrics, for example) requires a
> > new table, a new SPI method, a new record class, a new model class, a new
> > converter branch, and a schema migration*. That's a lot of surface area
> > for what should be "one more kind of metric."
> >
> > *My bias* would be toward a single metrics table with *a typed JSON
> > payload*. Something like: metric_type (enum), entity_id,
> > table_identifier, snapshot_id (nullable), received_ts, schema_version,
> > and a payload column for the metric-specific data. The metric_type +
> > schema_version pair gives us a forward-compatible contract for the
> > payload shape. Adding a new metric type becomes an enum value and a
> > payload schema, not a schema migration.
> >
> > One thing I think we need to be deliberate about is the partition key
> > design. If all metric types land in one table, scan
> > metrics at scale (high concurrency, high frequency across many tables)
> > could easily create hot partitions. We'd want the persistence layer to be
> > able to shard by entity or time range, and that means the logical schema
> > needs to expose enough structure for backends to partition on. I don't
> > think the current flattened layout gives us that.
> >
> > This is getting complex enough that I don't think ad-hoc PR/ML threads
> > will converge well. *Would people be open to a biweekly sync for metrics
> > architecture?* I think 30 minutes every two weeks with interested parties
> > would be enough to work through the schema, SPI shape, and read API
> > design together. Happy to help set that up.
> >
> > -ej
> >
> > On Mon, Apr 20, 2026 at 2:19 PM EJ Wang <[email protected]>
> > wrote:
> >
> >> Reviewed #4115, left a comment on the code organization side.
> >>
> >> One thing stood out: the metrics write path enters through
> >> PolarisMetricsManager on MetaStoreManager, but the new read path
> >> bypasses MetaStoreManager entirely and goes straight to BasePersistence
> >> via callContext.getMetaStore(). That means the read API only works for
> >> backends that implement BasePersistence. NoSQL and remote backends can't
> >> participate.
> >>
> >> Stepping back, I think the metrics subsystem is growing into something
> >> real (write + read + REST API + AuthZ + pagination) *but the persistence
> >> side is split across two layers in a way that's hard to extend*. I put
> >> together two diagrams to show what I mean (my best effort).
> >>
> >> *Current state* (Diagram 1): three interfaces at three different levels.
> >> The engine-facing SPI (PolarisMetricsReporter) is clean. But
> >> PolarisMetricsManager on MetaStoreManager is a passthrough to
> >> MetricsPersistence on BasePersistence. The @Beta annotation and SPI
> >> javadoc are on the BasePersistence layer, while the actual extension
> >> points (PolarisMetricsReporter, PolarisMetricsManager) carry no
> >> stability annotation. The write path goes through the MetaStoreManager
> >> layer, the read path doesn't.
> >>
> >> *What I envision* (Diagram 2): two SPIs at two levels.
> >> PolarisMetricsReporter stays as the engine-facing SPI.
> >> PolarisMetricsManager becomes the backend-facing SPI with both write and
> >> read methods at the MetaStoreManager level, where any backend (JDBC,
> >> NoSQL, remote) can implement them. MetricsPersistence on BasePersistence
> >> goes away. Where metrics actually land is an implementation detail, not
> >> a core interface.
> >>
> >> *Minor naming thing*: PolarisMetricsReporter is broader than what it
> >> actually handles. It only accepts Iceberg REST Catalog metrics
> >> (ScanReport, CommitReport via MetricsReport). Generic table metrics or
> >> operational metrics aren't in scope. Not blocking, but worth noting if
> >> the metrics surface expands.
> >>
> >> *Rough sketch of how to get there*:
> >>  1.  Add read methods to PolarisMetricsManager (listScanReports,
> >>      listCommitReports) with default no-op, same as the existing write
> >>      methods. (Probably make PolarisMetricsManager more explicit about
> >>      being Iceberg-specific, e.g. via the package or class name.)
> >>  2.  Wire MetricsReportsService through MetaStoreManager instead of
> >>      callContext.getMetaStore().
> >>  3.  Extract metrics persistence from JdbcBasePersistenceImpl into its
> >>      own class. That file carries ~7 responsibilities, metrics being one
> >>      of them.
> >>  4.  Remove MetricsPersistence from BasePersistence.
> >>
> >> *None of this needs to happen in #4115. But if the direction makes
> >> sense, it would be good to align before the metrics surface grows
> >> further. Curious what others think.*
> >>
> >> *My mental model note*: Level 1 MetaStoreManager; level 2 transactional
> >> persistence; level 3 base persistence
> >>
> >> Diagram 1
> >> <
> https://www.plantuml.com/plantuml/uml/bLHDR-Cs4BthLmpIYupw0zbkKQ1r3M-S7Bp8xhhM7WCOb3IM65EaGD9EX2RzxHrHb4CxRelwa4YSDu_lpOVcnZ9jzvM8BBS2uGjQpJC3dtHMSekPtMk44IpsMgEqa5XcCOhCZikQQLP1pR8TAp2n3ILhmZDP20m0fcIvUkAoW2qJXd9z1bpToO9BX3WXu0ucy5rpgGPNm0nW5_epUWtm2Ue3pn3kMOFQmKntGZW0BYtgBSi8k5A2QMwybJNMIbFiGSR9QZc4nUqIvikStF0jHprua5C-amge42aNt3R0f5JaaoivdV2Pkqbx4hee4ymOkBh5BTiB-_uIeGeo8zL8rPsPl4DktdEiK1jkB1NdZCRbrSTecDe_mlHbF0wvBmCkaOH5_S8a_TTTKI6-nmCAkEw4LpxsZ-LbYLKQFKMNOgf_wuM7_bV9gOer5SYMMksBSWXFcbi49KNZXNLicwfe3TETC7gPdPqI7uBcHMb1RSzYq34c6PDUM9mn8HRsUTZEiDBve3NjVZumBj0U7SS37mGO7vcwtiK-_pU7U7L_f-digo9YbhSwIfMRwIITKGXbxdIUTCGF1SeCJxloKsU-3k9ddRbX1eDq1q_fx1JbBGT0glVyXimDuP4TQ5qpCAmnGEj2s_6n5mtn1z-97-63itFQZLPO1Ev2tu_WF7Ju-VPc0Skg5bYXxBhkY1xpD7EM_7fyflSpIsqMgVth5xhVr4eQxWQ8enaSAJQSG16yFSDuJ798rrcXr_3n-lfdk7icQjEBmFujL7AodiP_Y4Z7-YxvtZNs4zMgpNTl6tF8sglyPsmqchrjvQ-m-aP94r-TwCA2Ka8upPJZwtvSpoYCXkYMZU2NXvRMBfq9P3i3Le4VAZUAlUZ_oPKsxPgY0Q_BSKLkyr9bhQhQrJjo_x3TPlIB0DPjnMfcIoYP0QaYw1a0fTKDr8fB6ntNuvmoL1ZGkXa69Njh43zf9GiGxHQrA_jDYWRSzF5--WmTVrN97_Sm8LbLUy_lGBmLanJjFkDlGkRqjA_4tm00
> >
> >> :
> >>
> >> [image: image.png]
> >>
> >> Diagram 2
> >> <
> https://www.plantuml.com/plantuml/uml/VLLDR-8m4BtdLupO2sWBLVU8AaGB7AXAbssGzb896SS9RXqxjKqBwkv_tt7iV43fdaZYDpFlpRmnOsE9jhjSH9PRmM31hERKm8scMsuPjJlDe0yheZDc8RR4iYWoBrmMH9CS2a9VICPYUy1OZN0YCy5Q0BCbYNhdCeEK28En8G8wCvbnoQ0R8_05Bc6bkLIz3X03p1zzH7zR-9ZfDquPt9C3qoNCX2yV4G2NbkcKu5jdgGJHt0GbZwnG6i-UP3TUpk5gM6Ldqke350eZUqzoCft3U9xWHvxoa5-7K4nF1J46EbEMafsmdrCBbQ44gVggy18IZrn_ph5asd1ZiIKdQSgueZvjXrQFSFrdC3YN-nXmBacxbGiYyLVxLaBtdhqn0LSzdBDhqQtQoOJeGyad3z0lUqnYgpGB6Ns8oVyta00Dy_WnX0tIOZ8v6SYxHll1TrH6aejAik-mh-AphVFCwSUQqFypElag5QRGFDjQKEd96K1P8QP41c9TzA_IIQyvdAWyv_RSiS3skb0_EzDDkK2v5xWF6MiGFlvhpFLcD2Dq2pml14gaF67eQkmd8gulDoC4kSOu6KVpkvlUJg1RTbWISU40RdBUUS_9XfRZ2dwxm_SW8LYFISgm_MnlDQ6M9P1gbKEc4X-2pH_FvJCkCqm9pbVjD6LrwdLeOrDWfOaqc8Wh9BE85oNKxkNQ6o4yGRy_Eae0G_G8tZv81d3bHDB23WOdisohVr3nh_j6lbSjbNaLRTc8UgtPbAU1J_tygOfZX9DWEJeHDvYx-qmSi5FgNLPZwHrHcUsncGQ5-skhUclpE5fo4ounpFauYrUbkU6ccfnxMvitwag4IyerhTxj8In_Oj1bDO4pQru674loYrGlULHLEGCjwJJ8gDoVZR8MxO4BT3IzRvIcAQKezC6xpziGnTyImrfEGyJI_OcKfgtxIvnTqFEMS17L9Z-jsARN5FmTheP7HtSdtOMT0B4GY2FYHXxgQmMtj2bRqiLFGapiVe1_QVKDrkqXcm83aFEXnMYCZ-xlyHy
> >
> >> :
> >> [image: image.png]
> >>
> >>  -ej
> >>
> >> On Wed, Apr 15, 2026 at 8:22 AM Dmitri Bourlatchkov <[email protected]>
> >> wrote:
> >>
> >>> Hi All,
> >>>
> >>> Heads up: The current state of PR [4115] looks pretty solid to me. I
> >>> believe this PR is approaching a mergeable condition.
> >>>
> >>> Please post your reviews if you have any comments.
> >>>
> >>> [4115] https://github.com/apache/polaris/pull/4115
> >>>
> >>> Thanks,
> >>> Dmitri.
> >>>
> >>> On Tue, Mar 3, 2026 at 3:29 PM Anand Kumar Sankaran via dev <
> >>> [email protected]> wrote:
> >>>
> >>> > Hi Yufei and Dmitri,
> >>> >
> >>> > Here is a proposal for the REST endpoints for metrics and events.
> >>> >
> >>> > https://github.com/apache/polaris/pull/3924/changes
> >>> >
> >>> > I did not see any precedent for raising a PR for proposals, so I am
> >>> > trying this.  Please let me know what you think.
> >>> >
> >>> > -
> >>> > Anand
> >>> >
> >>> > From: Anand Kumar Sankaran <[email protected]>
> >>> > Date: Monday, March 2, 2026 at 10:25 AM
> >>> > To: [email protected] <[email protected]>
> >>> > Subject: Re: Polaris Telemetry and Audit Trail
> >>> >
> >>> > About the REST API, based on my use cases:
> >>> >
> >>> >
> >>> >   1. I want to be able to query commit metrics to track files added /
> >>> >      removed per commit, along with record counts. The ingestion
> >>> >      pipeline that writes this data is owned by us and we are
> >>> >      guaranteed to write this information for each write.
> >>> >   2. I want to be able to query scan metrics for read. I understand
> >>> >      clients do not fulfill this requirement.
> >>> >   3. I want to be able to query the events table (events are
> >>> >      persisted) - this may supersede #2, I am not sure yet.
> >>> >
> >>> > All this information is in the JDBC based persistence model and is
> >>> > persisted in the metastore. I currently don’t have a need to query
> >>> > Prometheus or OpenTelemetry. I do publish some events to Prometheus
> >>> > and they are forwarded to our dashboards elsewhere.
> >>> >
> >>> > About the CLI utilities, I meant the admin user utilities. In one of
> >>> > the earliest drafts of my proposal, Prashant mentioned that the
> >>> > metrics tables can grow indefinitely and that a similar problem exists
> >>> > with the events table as well. We discussed that cleaning up old
> >>> > records from both the metrics and events tables can be done via a CLI
> >>> > utility.
> >>> >
> >>> > I see that Yufei has covered the discussion about datasources.
> >>> >
> >>> > -
> >>> > Anand
> >>> >
> >>> >
> >>> >
> >>> > From: Yufei Gu <[email protected]>
> >>> > Date: Friday, February 27, 2026 at 9:54 PM
> >>> > To: [email protected] <[email protected]>
> >>> > Subject: Re: Polaris Telemetry and Audit Trail
> >>> >
> >>> >
> >>> >
> >>> > As I mentioned in https://github.com/apache/polaris/issues/3890,
> >>> > supporting multiple data sources is not a trivial change. I would
> >>> > strongly recommend starting with a design document to carefully
> >>> > evaluate the architectural implications and long term impact.
> >>> >
> >>> > A REST endpoint to query metrics seems reasonable given the current
> >>> > JDBC based persistence model. That said, we may also consider
> >>> > alternative storage models. For example, if we later adopt a time
> >>> > series system such as Prometheus to store metrics, the query model and
> >>> > access patterns would be fundamentally different. Designing the REST
> >>> > API without considering these potential evolutions may limit
> >>> > flexibility. I'd suggest starting with the use case.
> >>> >
> >>> > Yufei
> >>> >
> >>> >
> >>> > On Fri, Feb 27, 2026 at 3:42 PM Dmitri Bourlatchkov <[email protected]>
> >>> > wrote:
> >>> >
> >>> > > Hi Anand,
> >>> > >
> >>> > > Sharing my view... subject to discussion:
> >>> > >
> >>> > > 1. Adding non-IRC REST API to Polaris is perfectly fine.
> >>> > >
> >>> > > Figuring out specific endpoint URIs and payloads might require a few
> >>> > > roundtrips, so opening a separate thread for that might be best.
> >>> > > Contributors commonly create Google Docs for new API proposals too
> >>> > > (they are fairly easy to update as the email discussion progresses).
> >>> > >
> >>> > > There was a suggestion to try Markdown (with PRs) for proposals [1]
> >>> > > ... feel free to give it a try if you are comfortable with that.
> >>> > >
> >>> > > 2. Could you clarify whether you mean end user utilities or admin
> >>> > > user utilities? In the latter case those might be more suitable for
> >>> > > the Admin CLI (java) not the Python CLI, IMHO.
> >>> > >
> >>> > > Why would these utilities be common with events? IMHO, event use
> >>> > > cases are distinct from scan/commit metrics.
> >>> > >
> >>> > > 3. I'd prefer separating metrics persistence from MetaStore
> >>> > > persistence at the code level, so that they could be mixed and
> >>> > > matched independently. The separate datasource question will become
> >>> > > a non-issue with that approach, I guess.
> >>> > >
> >>> > > The rationale for separating scan metrics and metastore persistence
> >>> > > is that "cascading deletes" between them are hardly ever required.
> >>> > > Furthermore, the data and query patterns are very different so
> >>> > > different technologies might be beneficial in each case.
> >>> > >
> >>> > > [1] https://lists.apache.org/thread/yto2wp982t43h1mqjwnslswhws5z47cy
> >>> > >
> >>> > > Cheers,
> >>> > > Dmitri.
> >>> > >
> >>> > > On Fri, Feb 27, 2026 at 6:19 PM Anand Kumar Sankaran via dev <
> >>> > > [email protected]> wrote:
> >>> > >
> >>> > > > Thanks all. This PR is merged now.
> >>> > > >
> >>> > > > Here are the follow-up features / work needed.  These were all
> >>> > > > part of the merged PR at some point in time and were removed to
> >>> > > > reduce scope.
> >>> > > >
> >>> > > > Please let me know what you think.
> >>> > > >
> >>> > > >
> >>> > > >   1.  A REST API to paginate through table metrics. This will be a
> >>> > > >       non-IRC standard addition.
> >>> > > >   2.  Utilities for managing old records, should be common with
> >>> > > >       events. There was some discussion that it belongs to the CLI.
> >>> > > >   3.  Separate datasource (metrics, events, even other tables?).
> >>> > > >
> >>> > > > Anything else?
> >>> > > >
> >>> > > > -
> >>> > > > Anand
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>> >
> >>>
> >>
>
