Thanks all for the feedback! It seems we have some initial consensus that using the event framework for metrics delivery is a reasonable direction worth exploring. Most of the discussion now appears to be around implementation details and operational considerations.
1. Benchmarking is a great idea, and using the existing tool makes sense. I don't see it as a blocker though. The volume of scan metrics should be similar to, or even lower than, the volume of LoadTable requests. Some clients may not send scan metrics at all. If we're comfortable supporting LoadTable events, I'm not sure why metrics events would require a fundamentally different validation path, though benchmarking would certainly help us tune the event bus and listener configuration.
2. I agree that separating the datasource for event and metrics persistence is an active and worthwhile discussion. I think we should continue that work regardless of the direction we take here.
3. Agreed on evaluating payload sizes. That said, it doesn't seem like a major concern to me given that we already support larger payloads in some existing events.
4. Filtering is a valid use case. My thinking is that custom event listeners can already implement whatever filtering logic they need. I'm not sure we need a generic filtering framework in Polaris itself yet, but I'm open to further discussion if we find common requirements across deployments.
5. Schema migration is a good point and something we should keep in mind if metrics are persisted.
6. I also agree with Dmitri that we can continue improving the RDBMS schema evolution story. That feels largely orthogonal to this proposal, so perhaps it's best discussed in a separate thread.

Thanks,
Yufei

On Wed, Jun 10, 2026 at 12:56 PM Dmitri Bourlatchkov <[email protected]> wrote:

> Hi All,
>
> +1 to all points from Alex's email.
>
> Re: Metrics Persistence I believe we ought to make it as smooth as possible
> from the Polaris code maintenance perspective. Therefore, I propose
> starting the work to isolate the existing metrics schema from the MetaStore
> schema in parallel with the event bus work. I think it will be beneficial
> in its own right, regardless of how the event bus work progresses.
>
> PR [4397] is but the first step in that direction.
>
> Side note: we probably do not need to copy the whole schema SQL file on
> every revision, but I'm contemplating starting a separate thread on that.
>
> Once a separate metrics schema is established, I think it will be natural
> to also allow it to be on a different JDBC DataSource than the MetaStore
> schema.
>
> If the event bus work is successful, JDBC Metrics Persistence can become
> one of possibly many consumers for metrics events.
>
> With this approach, it should also be possible to write metrics to the
> database in batches. IIRC, Venkateshwaran brought this point up in the
> latest Metrics Sync meeting.
>
> Metrics filtering can probably progress in parallel too. I think it is a
> useful feature.
>
> [4397] https://github.com/apache/polaris/pull/4397
>
> Cheers,
> Dmitri.
>
>
>
> On Wed, Jun 10, 2026 at 9:56 AM Alexandre Dutra <[email protected]> wrote:
>
> > Hi Yufei,
> >
> > The proposal to leverage the events subsystem for metrics delivery is
> > quite appealing, though it requires a thorough evaluation regarding
> > potential performance overhead.
> >
> > My primary considerations are as follows:
> >
> > 1) Given that scan reports can trigger a high volume of events, we
> > should conduct rigorous testing, potentially using the Polaris
> > benchmark tool. We need to determine what's the right configuration
> > for the event bus and for the event listener executor.
> >
> > 2) While the events subsystem handles dispatch and delivery natively,
> > it doesn't give persistence for free. My recollection is that we were
> > pursuing the idea of a metrics persistence system with a unique schema
> > and possibly a separate datasource, a process initiated by a recently
> > merged PR [1]. Is that still the case? Furthermore, we'd need to
> > implement data retention and purging, including for the current events
> > table [2].
> >
> > 3) If we consider the events table for metrics storage, we must
> > evaluate average payload sizes. Although a PR [3] was introduced to
> > prune large payloads (such as table metadata), this functionality is
> > still in its early stages and will evolve. Similar pruning would be
> > necessary for metrics reports if they are big.
> >
> > 4) As Yong suggested [4], we may still require more sophisticated
> > metrics filtering. The events subsystem currently only allows
> > filtering by event type or event category, which may not be granular
> > enough for our needs (as of today, it would allow only to distinguish
> > scan vs metrics reports). In that regard, I would welcome the
> > opportunity to implement a generic EventFilter interface with a
> > default implementation based on CEL.
> >
> > Thanks,
> > Alex
> >
> > [1]: https://github.com/apache/polaris/pull/4397
> > [2]: https://lists.apache.org/thread/krmddx8myov926sd0mbh4ogy8sdgrfgq
> > [3]: https://github.com/apache/polaris/pull/4225
> > [4]: https://lists.apache.org/thread/ogskc1szctkg5n0tdj0cm3pfkowcwx4z
> >
> > On Wed, Jun 10, 2026 at 2:04 AM Yufei Gu <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > I've been thinking about how Polaris should support Iceberg scan and
> > > commit metrics. A few challenges have come up in recent discussions:
> > > 1. Sync metrics persistence chokes Polaris persistence due to the high
> > > volume of scan metrics [3].
> > > 2. We spent considerable time figuring out the metrics persistence,
> > > including the schema, SPIs, REST APIs [4].
> > > 3. Metric filtering remains a challenge [1].
> > > 4. We need to figure out how to purge metrics because they keep
> > > growing [2].
> > >
> > > Looking at these challenges, most of them are not really metrics
> > > problems. They are transport, delivery, retention, and lifecycle
> > > problems that the existing event framework already addresses.
> > > I'd like to propose using the event system to facilitate the current
> > > use cases of Iceberg scan and commit metrics rather than introducing a
> > > separate Polaris metrics subsystem. The metrics for current use cases
> > > are fundamentally events with structured telemetry attached. They are
> > > append only, generated by IRC endpoints, typically consumed
> > > asynchronously, and often forwarded to external systems. Since Polaris
> > > already needs to support them as part of IRC, treating them as event
> > > types seems like a natural fit.
> > >
> > > More importantly, I think Polaris should remain a catalog service and
> > > telemetry producer rather than a metrics warehouse. Instead of
> > > introducing a dedicated metrics subsystem along with storage,
> > > retention, query, and scaling concerns, we could build on the existing
> > > event framework:
> > >
> > > - Emit them through the existing event mechanism. We will do that
> > >   anyway given it's an IRC endpoint.
> > > - Let custom event listeners route them to the destination of choice,
> > >   such as Prometheus, Grafana, RDBMSs, or other systems.
> > > - Reuse the existing event lifecycle, retention, and delivery models.
> > >   If temporary persistence is still required, the existing event table
> > >   can serve that purpose. The payload size is manageable given that we
> > >   have put the loadTable/LoadView response in events.
> > >
> > > This approach also gives deployments flexibility to filter, sample, or
> > > redirect high volume scan metrics without Polaris needing backend
> > > specific metric storage behavior. For example, event listeners can
> > > choose which metric events to process. We don't need to implement
> > > metric filtering logic [1].
> > >
> > > In short, my proposal is: Events provide the transport and lifecycle
> > > mechanism, while downstream metrics systems remain responsible for
> > > storage, querying, aggregation, and visualization.
> > >
> > > Curious what others think.
> > >
> > > 1. https://lists.apache.org/thread/ogskc1szctkg5n0tdj0cm3pfkowcwx4z
> > > 2. https://lists.apache.org/thread/5nst0f2ygnl2gj3j910q7m8nk2fvokc7
> > > 3. https://lists.apache.org/thread/zp2rvsdkq3mb46722o0hfl0zh7kdqyr8
> > > 4. https://lists.apache.org/thread/qj1y7cw4dygcnczmymdwkfkp4ysq41ts
> > >
> > >
> > > Yufei
> > >
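
For readers following the thread, here is a rough sketch of the listener-side idea discussed above: a custom listener that filters metrics events, samples high-volume ones, and hands them to a sink in batches. Everything in it is an assumption made for illustration. `MetricsEvent` and `FilteringBatchingListener` are hypothetical names, not Polaris types, and the code does not use or imitate the actual Polaris event listener SPI. The sink stands in for whatever destination a deployment picks (a JDBC batch insert, an exporter, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-in for a metrics event; NOT a Polaris type.
record MetricsEvent(String kind, String table, long payloadBytes) {}

// Hypothetical listener (not the Polaris listener SPI):
// filter first, then keep 1 of every N matching events, then batch.
final class FilteringBatchingListener {
    private final Predicate<MetricsEvent> filter;
    private final int sampleEvery;
    private final int batchSize;
    private final Consumer<List<MetricsEvent>> sink;
    private final List<MetricsEvent> buffer = new ArrayList<>();
    private long matched = 0;

    FilteringBatchingListener(Predicate<MetricsEvent> filter, int sampleEvery,
                              int batchSize, Consumer<List<MetricsEvent>> sink) {
        if (sampleEvery < 1 || batchSize < 1) {
            throw new IllegalArgumentException("sampleEvery and batchSize must be >= 1");
        }
        this.filter = filter;
        this.sampleEvery = sampleEvery;
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Called once per event; drops events that fail the filter or the sampler.
    synchronized void onEvent(MetricsEvent event) {
        if (!filter.test(event)) {
            return;
        }
        if (matched++ % sampleEvery != 0) {
            return; // sampled out
        }
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Hands the buffered events to the sink as one immutable batch.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(List.copyOf(buffer));
        buffer.clear();
    }
}
```

With a filter of `e -> e.kind().equals("scan")`, `sampleEvery = 2`, and `batchSize = 2`, eight scan events would reach the sink as two batches of two, and commit events would be dropped. A real listener would also need a timed or shutdown flush so that a partially filled buffer is not held indefinitely.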
