Re: [PROPOSAL] Scan Planning with Optional Caching Layers

Yufei Gu Mon, 22 Jun 2026 11:15:55 -0700

Hi Tornike,

To clarify, I support Phase 1. That was actually the main point of the
first paragraph in my previous email. Could we focus on Phase 1 first? We
can also discuss the other topics in parallel.


Thanks,
Yufei


On Fri, Jun 19, 2026 at 4:08 PM Dmitri Bourlatchkov <[email protected]>
wrote:

> Hi Tornike,
>
> It's a very interesting proposal. Thanks for submitting it!
>
> The doc LGTM - no particular comments there.
>
> I imagine the actual caching layer might receive some more feedback and
> alternative suggestions later, but I'm sure it will invigorate the project.
>
> Breaking the implementation plan into multiple phases is certainly a
> good idea.
>
> Re: performance concerns, I propose making the implementation modular and
> composable (which is the approach followed by [4115]).
>
> Users of the ASF binaries will be able to switch the feature on/off
> according to their needs and avoid unnecessary overhead if they do not need
> this functionality.
>
> Downstream builds will be able to include/exclude related modules and
> further optimize the server's image this way.
>
> If a suitable external service (such as delegation) becomes available
> later, the modular design of this feature should simplify integrating with
> it.
>
> All in all, I support implementing this proposal in Polaris. Making it
> available in ASF releases will promote user feedback, which will inform
> further development of this feature.
>
> [4115] https://github.com/apache/polaris/pull/4115
>
> Cheers,
> Dmitri.
>
> On Fri, Jun 19, 2026 at 9:20 AM Tornike Gurgenidze <[email protected]>
> wrote:
>
> > Yufei, Adnan, thanks for taking a look at the proposal.
> >
> > I definitely understand the concern and agree that there should be a way
> > to avoid including compute-intensive workloads in the Polaris server
> > and/or metadata db. Still, my preferred approach would be to implement
> > the entire functionality first and make it configurable later on, when we
> > have a better idea of what the Delegation Service will look like (planning
> > will sit behind a feature flag, after all). If that sounds fine, I can
> > adjust the proposal to include eventual integration with the Delegation
> > Service (both for the ScanPlanner SPI and indexing) rather than make the
> > Delegation Service a hard prerequisite.
> >
> > Regarding the SQL pruning index: I agree that it's a big topic and
> > probably valuable even outside the scope of Polaris. Still, since there's
> > no existing spec for anything like that outside of Polaris, I think it
> > makes sense to start laying the foundation for it here for this particular
> > use case, don't you agree? In terms of compute, the actual indexing can
> > happen "externally", maybe orchestrated by the Polaris CLI rather than as
> > a side effect of a snapshot update.
> >
> > In short, while I agree that we should coordinate planning and the
> > Delegation Service, I'd much rather implement the feature first and then
> > build the Delegation Service around it, especially since there are both
> > types of delegation requirements here (invoking an external planner,
> > notifying an external indexer).
> >
> > Thanks,
> > Tornike
> >
> > On Fri, Jun 19, 2026 at 2:12 AM Adnan Hemani via dev
> > <[email protected]> wrote:
> >
> > > I agree with Yufei - I don't think we can implement something as heavy
> > > as server-side planning directly onto Polaris as it stands. I think we
> > > need to revisit the Delegation Service discussion; it would be a great
> > > place to implement this type of functionality.
> > >
> > > Best,
> > > Adnan Hemani
> > >
> > > On Wed, Jun 17, 2026 at 4:11 PM Yufei Gu <[email protected]> wrote:
> > >
> > > > Thanks for putting this together. The first phase sounds good to me.
> > > >
> > > > My main concern is that, without some form of delegation service,
> > > > scan planning could easily become a heavy workload that impacts
> > > > Polaris performance.
> > > >
> > > > The SQL pruning index is also a pretty big topic with a lot of design
> > > > choices around ownership, consistency, updates, and operations. I'm
> > > > not sure Polaris itself should be responsible for managing the index.
> > > >
> > > > One possible direction is to delegate scan planning and indexing to a
> > > > separate service. That would keep Polaris focused on catalog and
> > > > governance responsibilities while still enabling these optimizations.
> > > > In a way, that brings us back to the delegation service discussion.
> > > >
> > > > Curious what others think.
> > > >
> > > > Yufei
> > > >
> > > >
> > > > On Tue, Jun 16, 2026 at 12:44 AM Tornike Gurgenidze
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I drafted a proposal for adding Iceberg REST-compliant scan planning
> > > > > support to Polaris. The proposal doc can be found here:
> > > > >
> > > > > https://docs.google.com/document/d/1agpz4wwXxWfEy9fJLgPRDcrzdR5USM1i9vQhOBcHo3Q/edit?usp=sharing
> > > > >
> > > > > tldr: the doc proposes to first add a straightforward implementation
> > > > > of scan planning in the initial phase and integrate the new endpoints
> > > > > with Polaris authz. Subsequently, we can enhance scan planning
> > > > > performance with 2 independent caching layers:
> > > > >
> > > > >    - *CachingFileIO* - a FileIO wrapper that wraps existing FileIO
> > > > >    implementations and introduces a configurable Caffeine-powered
> > > > >    in-memory cache to speed up access to manifest files.
> > > > >    - *SQL Pruning Index* - an additional index stored in an RDBMS and
> > > > >    asynchronously updated by Polaris when a new table snapshot is
> > > > >    registered. The goal is to store all relevant per-file stats in a
> > > > >    db table that will allow applying a pruning predicate in a single
> > > > >    SQL query. This is essentially a DuckLake-style index, but used
> > > > >    only as a file pruning index rather than the source of truth. The
> > > > >    index is allowed to lag behind the latest snapshot, in which case
> > > > >    ScanPlanner will use both the index and the underlying files for
> > > > >    the relevant parts of the table metadata.
> > > > >
> > > > > I have a POC for the caching layers in a private repo, which you can
> > > > > take a look at as well: https://github.com/tokoko/iceberg-cache/.
> > > > >
> > > > > thanks,
> > > > > Tornike
> > > > >
> > > >
> > >
> >
>
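
For readers following the quoted thread, the SQL Pruning Index idea from the
original proposal can be illustrated roughly as follows. This is only a
minimal sketch under stated assumptions, not the design from the proposal doc
or the POC: the `file_stats` table, its columns, and the `build_index` and
`plan_files` helpers are all hypothetical, SQLite stands in for the RDBMS, and
stats are limited to integer min/max bounds. Files from snapshots the index
has not caught up with are passed in separately and kept unpruned, which is a
simplification of the "index is allowed to lag" behavior.

```python
import sqlite3

# Hypothetical schema: one row per (data file, column) with min/max stats.
SCHEMA = """
CREATE TABLE file_stats (
    snapshot_id INTEGER NOT NULL,
    file_path   TEXT    NOT NULL,
    column_name TEXT    NOT NULL,
    lower_bound INTEGER NOT NULL,
    upper_bound INTEGER NOT NULL
)
"""


def build_index(rows):
    """Create an in-memory index and load per-file stats into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO file_stats VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def plan_files(conn, column, lo, hi, unindexed_files=()):
    """Return files that may hold rows with lo <= column <= hi.

    A file is kept when its [lower_bound, upper_bound] range overlaps
    [lo, hi]; the pruning predicate runs as a single SQL query.
    `unindexed_files` stands in for files added by snapshots newer than
    the index; this sketch keeps them all instead of pruning them.
    """
    cur = conn.execute(
        "SELECT file_path FROM file_stats "
        "WHERE column_name = ? AND lower_bound <= ? AND upper_bound >= ? "
        "ORDER BY file_path",
        (column, hi, lo),
    )
    return [row[0] for row in cur] + list(unindexed_files)


# Made-up example data: three indexed files with disjoint id ranges.
INDEX = build_index([
    (1, "data/a.parquet", "id", 0, 99),
    (1, "data/b.parquet", "id", 100, 199),
    (2, "data/c.parquet", "id", 200, 299),
])

print(plan_files(INDEX, "id", 150, 250))
print(plan_files(INDEX, "id", 0, 50, unindexed_files=["data/d.parquet"]))
```

With this made-up data, the first call keeps only the files whose `id` range
overlaps 150..250, and the second also returns the un-indexed file because
nothing is known about its stats yet.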
