Re: [PROPOSAL] Polaris Directories proposal

Jean-Baptiste Onofré Tue, 09 Jun 2026 12:43:43 -0700

Hi Robert,

Thanks for your feedback!


From a user perspective, I personally prefer having the Directory and Table
share the same name, as I find it less confusing to see the association at
first glance. However, I'm open to including the inventory table name as
part of the Directory configuration instead.
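
To make the two naming options concrete, here is a minimal sketch. It is
not taken from the PR: the class `DirectoryConfig`, its fields, and the
example bucket are hypothetical names used only for illustration. It shows
a Directory whose inventory table defaults to the Directory name, with an
optional explicit table name in the configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DirectoryConfig:
    """Hypothetical directory configuration; field names are illustrative."""
    namespace: str
    name: str
    location: str
    # Optional explicit inventory table name; None means "reuse the directory name".
    inventory_table_name: Optional[str] = None

    def inventory_table(self) -> str:
        """Return the namespace-qualified inventory table identifier."""
        table = self.inventory_table_name or self.name
        return f"{self.namespace}.{table}"


# Option 1: Directory and inventory table share the same name.
implicit = DirectoryConfig("media", "raw_images", "s3://example-bucket/raw/")
# Option 2: the inventory table name is part of the Directory configuration.
explicit = DirectoryConfig(
    "media", "raw_images", "s3://example-bucket/raw/",
    inventory_table_name="raw_images_inventory",
)
print(implicit.inventory_table())  # media.raw_images
print(explicit.inventory_table())  # media.raw_images_inventory
```

With a default like this, both options can coexist: the shared name stays
the simple case, and an explicit name avoids the ambiguity when needed.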

As mentioned in my initial proposal, the current PR is intended to
illustrate a potential implementation. It is certainly not the final
version, and I am happy to update it based on community input. I fully
agree with the high-level model you outlined, and I believe the PR is
well-aligned with that direction.

I still believe the inventory table is essential, as it represents the core
value of the Directory and scanner; without it, users could simply create
an Iceberg table manually to list objects.
I'm fine with adding an endpoint to the Directory API to create an
inventory table without scanning (using the static schema), as well as
other endpoints to manage entries in an inventory (if you think it's helpful).
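
As a rough illustration of that idea, the sketch below models "create an
inventory without scanning" and "add an entry" as plain functions over an
in-memory table. Everything here is an assumption for discussion: the
column names follow the metadata listed in the proposal (URI, size,
content type, checksum), while the type names, `InventoryTable`,
`create_inventory`, and `add_entry` are hypothetical and not part of the
PR or of any Polaris API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed static schema. The columns mirror the metadata named in the
# proposal (URI, size, content type, checksum); the type names are guesses.
INVENTORY_SCHEMA: Dict[str, str] = {
    "uri": "string",
    "size": "long",
    "content_type": "string",
    "checksum": "string",
}


@dataclass
class InventoryTable:
    """In-memory stand-in for an Iceberg inventory table (illustration only)."""
    name: str
    schema: Dict[str, str]
    rows: List[Dict[str, object]] = field(default_factory=list)


def create_inventory(name: str) -> InventoryTable:
    """Create an empty inventory from the static schema, without any scan."""
    return InventoryTable(name=name, schema=dict(INVENTORY_SCHEMA))


def add_entry(table: InventoryTable, entry: Dict[str, object]) -> None:
    """Append one object entry, rejecting rows that do not match the schema."""
    if set(entry) != set(table.schema):
        raise ValueError(f"entry columns must be exactly {sorted(table.schema)}")
    table.rows.append(dict(entry))


inventory = create_inventory("media.raw_images")
add_entry(inventory, {
    "uri": "s3://example-bucket/raw/cat.png",
    "size": 48213,
    "content_type": "image/png",
    "checksum": "example-checksum",
})
print(len(inventory.rows))  # 1
```

The point of the static schema is that the table can exist (and be
queried) before any scanner has run, and that integrator-provided
scanners write rows of the same shape.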

Regards,
JB


On Mon, Jun 8, 2026 at 1:27 PM Robert Stupp <[email protected]> wrote:

> Hi,
>
> I support the general direction.
> Modeling a directory/prefix as a first-class catalog concept in Polaris,
> complete with an inventory table for discovered objects, seems very useful.
>
> I think we should separate agreement on that direction from locking in the
> exact object model too early, though.
> One design point I would like to keep open is the relationship between the
> directory configuration and the inventory table.
>
> For example, if the directory configuration and the inventory table share
> the same name in the same namespace and are distinguished only by object
> type, that may be workable, but it can create ambiguity for APIs, UI,
> events, authorization/audit, and lifecycle operations like rename/drop.
> I don’t think we need to settle that in the first discussion, but I also
> would not want the current PR shape to imply that this part is already
> fixed.
>
> My preference would be to first agree on the higher-level model:
>
> - Polaris has a first-class Directory abstraction.
> - A Directory has a configured object-store location and scan/inventory
> settings.
> - A Directory is associated with an Iceberg inventory table.
> - Scanner execution can be discussed separately: Polaris-provided,
> disabled, or integrator-provided.
>
> Then we can discuss whether the inventory table is implicitly named,
> explicitly referenced, hidden/internal, user-visible, or modeled some other
> way.
>
> Thoughts?
>
> Robert
>
> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi
> >
> > I wanted to have two steps in the proposal: the configuration and
> > high-level architecture (that’s the current proposal), then the scanning
> > service.
> >
> > I think the scanning should be part of Polaris but not mandatory: if
> > integrators want to have their own scanning, they should be able to do
> > so. Users should be able to disable the Polaris scanners. Integrators
> > would probably like to have scanning performed by a distributed engine
> > or within cloud provider infrastructure.
> >
> > So my proposal here is:
> > 1. Have a scanner in Polaris
> > 2. Be able to disable the Polaris scanner
> > 3. Allow users/integrators to provide their own scanners
> >
> > The first step is to get consensus on the Polaris Directories proposal
> > approach.
> >
> > I will create a follow-up PR with a scanner.
> >
> > Regards
> > JB
> >
> > On Fri, Jun 5, 2026 at 11:25 PM Yufei Gu <[email protected]> wrote:
> >
> > > I think one thing we should clarify is where the scanner lives.
> > >
> > > If the scanner is completely outside Polaris, the UX becomes a bit
> > > confusing to me. In that model, Polaris only stores a directory
> > > configuration, while users still need to bring their own service to
> > > scan object storage and write an Iceberg table. In that case, I’m not
> > > sure what value Polaris Directories add over *manually creating an
> > > Iceberg table to track unstructured data files*. Users can already do
> > > that today, and it is arguably more flexible because they can define
> > > any schema they want and use any engine or workflow to populate it.
> > >
> > > To me, the more compelling direction is for Polaris to own the scanner
> > > or at least provide it as part of the project, likely through a
> > > push-mode delegation service[1]. Polaris would still not need to do all
> > > the heavy scanning work itself, but it should provide a clear,
> > > first-class workflow for turning a directory configuration into an
> > > updated directory table, via a delegated service.
> > >
> > > That also seems related to Romain’s questions. If the metadata
> > > extraction and scanning model are fully external, then extensibility
> > > and streaming support become entirely out of scope. But if Polaris
> > > provides the scanner framework, we can define clear extension points
> > > for custom metadata and think about supporting both batch and
> > > event-driven scanning.
> > >
> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
> > >
> > > Yufei
> > >
> > >
> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
> > > wrote:
> > >
> > > > Hi JB,
> > > >
> > > > I have two questions on this scope:
> > > >
> > > > 1. Any hope it is extensible, so that a user can plug in their own
> > > > metadata?
> > > > 2. Will scanning be made streaming-friendly (I assume phase 0 is
> > > > batch)? The idea would be to use a Kappa-like architecture to get
> > > > real-time capabilities.
> > > >
> > > > Thanks,
> > > > Romain Manni-Bucau
> > > > @rmannibucau <https://x.com/rmannibucau> | .NET Blog
> > > > <https://dotnetbirdie.github.io/> | Blog
> > > > <https://rmannibucau.github.io/> | Old Blog
> > > > <http://rmannibucau.wordpress.com> | Github
> > > > <https://github.com/rmannibucau> | LinkedIn
> > > > <https://www.linkedin.com/in/rmannibucau> | Book
> > > > <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
> > > > Javaccino founder (Java/.NET service - contact via linkedin)
> > > >
> > > >
> > > > On Fri, Jun 5, 2026 at 2:20 AM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Great to see the progress here. Thanks a lot JB! I will take a look
> > > > > at the PR.
> > > > >
> > > > > Yufei
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > After several months of discussion (involving Directories, Table
> > > > > > Sources, etc.), I would like to propose Polaris Directories.
> > > > > >
> > > > > > I drafted a PR:
> > > > > > https://github.com/apache/polaris/pull/4613
> > > > > >
> > > > > > The proposal is documented as part of the PR:
> > > > > >
> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
> > > > > >
> > > > > > In a nutshell, Polaris Directories make objects (including
> > > > > > unstructured data like images, videos, and documents) discoverable
> > > > > > alongside structured Iceberg tables within a Polaris catalog. A
> > > > > > directory points to a base location/prefix on an object store and
> > > > > > automatically tracks the objects it contains by maintaining an
> > > > > > Iceberg table with object-level metadata such as URI, size, content
> > > > > > type, checksum, ...
> > > > > >
> > > > > > This means query engines and tools that already know how to read
> > > > > > Iceberg tables can discover and access unstructured data with
> > > > > > little or no extra work (accessing the object itself).
> > > > > >
> > > > > > A directory has two main parts:
> > > > > > - Directory configuration, stored by the Polaris server. It
> > > > > > describes where the data lives, how to authenticate, which objects
> > > > > > to include, and how often to re-scan. The configuration "lives" in
> > > > > > a namespace.
> > > > > > - Directory table, an Iceberg table serving as the inventory of all
> > > > > > objects contained in the directory, with one row per object
> > > > > > discovered during a scan. The directory table uses the
> > > > > > configuration name.
> > > > > >
> > > > > > The Polaris server itself does not perform scans. Instead, external
> > > > > > services (e.g. directory table scanning service) read the directory
> > > > > > configuration through the REST API, walk the object store, and
> > > > > > write the results into the directory table.
> > > > > >
> > > > > > I propose we discuss this both on the mailing list (this thread)
> > > > > > and on the PR. If needed, I'm happy to schedule a dedicated meeting.
> > > > > >
> > > > > > I'm looking forward to your thoughts!
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > >
> > > >
> > >
> >
>
