It makes sense to introduce volume support without the inventory table in
phase 1. My original design also positioned the directory table as optional:
https://docs.google.com/document/d/1ofljkrtiXRWc-v6hfkg_laKlYltepTPX7zsg44Tb-BY/edit?tab=t.0

Yufei


On Mon, Jun 8, 2026 at 4:26 AM Robert Stupp <[email protected]> wrote:

> Hi,
>
> I support the general direction.
> Modeling a directory/prefix as a first-class catalog concept in Polaris,
> complete with an inventory table for discovered objects, seems very useful.
>
> I think we should separate agreement on that direction from locking in the
> exact object model too early, though.
> One design point I would like to keep open is the relationship between the
> directory configuration and the inventory table.
>
> For example, if the directory configuration and the inventory table share
> the same name in the same namespace and are distinguished only by object
> type, that may be workable, but it can create ambiguity for APIs, UI,
> events, authorization/audit, and lifecycle operations like rename/drop.
> I don’t think we need to settle that in the first discussion, but I also
> would not want the current PR shape to imply that this part is already
> fixed.
>
> My preference would be to first agree on the higher-level model:
>
> - Polaris has a first-class Directory abstraction.
> - A Directory has a configured object-store location and scan/inventory
> settings.
> - A Directory is associated with an Iceberg inventory table.
> - Scanner execution can be discussed separately: Polaris-provided,
> disabled, or integrator-provided.
>
> Then we can discuss whether the inventory table is implicitly named,
> explicitly referenced, hidden/internal, user-visible, or modeled some other
> way.
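
To make the open question concrete, here is a minimal sketch of the higher-level model with both naming options kept available. Everything in it is an assumption for illustration only: the `Directory` and `ScannerMode` names, the field names, and the "implicit name equals directory name" fallback are hypothetical and are not taken from the PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScannerMode(Enum):
    """Hypothetical scanner execution modes, mirroring the list above."""
    POLARIS = "polaris"        # scanner provided by Polaris
    DISABLED = "disabled"      # no scanning
    INTEGRATOR = "integrator"  # scanner provided by the integrator


@dataclass(frozen=True)
class Directory:
    """Hypothetical first-class Directory entity (not the PR's object model)."""
    namespace: str
    name: str
    location: str                           # object-store base location/prefix
    scanner_mode: ScannerMode = ScannerMode.POLARIS
    inventory_table: Optional[str] = None   # explicit reference, if configured

    def inventory_table_name(self) -> str:
        """An explicit reference wins; otherwise fall back to implicit naming."""
        if self.inventory_table is not None:
            return self.inventory_table
        # Assumed implicit rule: same namespace and name as the directory.
        return f"{self.namespace}.{self.name}"
```

With this shape, switching from implicit to explicit naming only changes how `inventory_table_name()` resolves, so the decision could be deferred without reshaping the rest of the model.
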
>
> Thoughts?
>
> Robert
>
> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi
> >
> > I wanted to have two steps in the proposal: the configuration and high
> > level architecture (that’s the current proposal), then the scanning
> > service.
> >
> > I think the scanning should be part of Polaris but not mandatory: if
> > integrators want to have their own scanning they should be able to do so.
> > Users should be able to disable the Polaris scanners. Integrators would
> > probably like to have scanning performed by a distributed engine or
> > within cloud provider infrastructure.
> >
> > So my proposal here is:
> > 1. Have a scanner in Polaris
> > 2. Be able to disable the Polaris scanner
> > 3. Allow users/integrators to provide their own scanners
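
The three points above could be resolved per directory roughly as follows. This is only a sketch under stated assumptions: the `scanner.enabled` configuration key, the `Scanner` callable shape, and the `builtin_scanner` stand-in are all hypothetical and do not come from the proposal.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical scanner shape: takes a base location, returns object URIs.
Scanner = Callable[[str], Iterable[str]]


def builtin_scanner(location: str) -> Iterable[str]:
    """Stand-in for a Polaris-provided scanner; it discovers nothing here."""
    return []


def resolve_scanner(config: Dict[str, object],
                    custom: Optional[Scanner] = None) -> Optional[Scanner]:
    """Pick the scanner for a directory, or None when no scanning should run."""
    if custom is not None:
        return custom              # 3. integrator-provided scanner
    if not config.get("scanner.enabled", True):
        return None                # 2. Polaris scanner disabled
    return builtin_scanner         # 1. default: scanner shipped with Polaris
```

In this sketch a custom scanner takes precedence even when the built-in one is disabled, which matches the case where an integrator runs scanning in its own infrastructure.
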
> >
> > The first step is to get consensus on the Polaris Directories proposal
> > approach.
> >
> > I will create a follow up PR with a scanner.
> >
> > Regards
> > JB
> >
> > On Fri, Jun 5, 2026 at 11:25 PM Yufei Gu <[email protected]> wrote:
> >
> > > I think one thing we should clarify is where the scanner lives.
> > >
> > > If the scanner is completely outside Polaris, the UX becomes a bit
> > > confusing to me. In that model, Polaris only stores a directory
> > > configuration, while users still need to bring their own service to scan
> > > object storage and write an Iceberg table. In that case, I’m not sure what
> > > value Polaris Directories add over *manually creating an Iceberg table to
> > > track unstructured data files*. Users can already do that today, and it is
> > > arguably more flexible because they can define any schema they want and use
> > > any engine or workflow to populate it.
> > >
> > > To me, the more compelling direction is for Polaris to own the scanner or
> > > at least provide it as part of the project, likely through a push mode
> > > delegation service[1]. Polaris would still not need to do all the heavy
> > > scanning work itself, but it should provide a clear, first class workflow
> > > for turning a directory configuration into an updated directory table, via
> > > a delegated service.
> > >
> > > That also seems related to Romain’s questions. If the metadata extraction
> > > and scanning model are fully external, then extensibility and streaming
> > > support become entirely out of scope. But if Polaris provides the scanner
> > > framework, we can define clear extension points for custom metadata and
> > > think about supporting both batch and event-driven scanning.
> > >
> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
> > >
> > > Yufei
> > >
> > >
> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]> wrote:
> > >
> > > > Hi JB,
> > > >
> > > > I have two questions on this scope:
> > > >
> > > > 1. Is there any hope it will be extensible, so that a user can plug in
> > > > their own metadata?
> > > > 2. Will scanning be made streaming-friendly (I assume phase 0 is batch)?
> > > > The idea would be to use a Kappa-like architecture to get real-time
> > > > capabilities.
> > > >
> > > > Thanks,
> > > > Romain Manni-Bucau
> > > >
> > > >
> > > > On Fri, Jun 5, 2026 at 2:20 AM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Great to see the progress here. Thanks a lot JB! I will take a look at
> > > > > the PR.
> > > > >
> > > > > Yufei
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > After several months of discussion (involving Directories, Table Sources,
> > > > > > etc.), I would like to propose Polaris Directories.
> > > > > >
> > > > > > I drafted a PR:
> > > > > > https://github.com/apache/polaris/pull/4613
> > > > > >
> > > > > > The proposal is documented as part of the PR:
> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
> > > > > >
> > > > > > In a nutshell, Polaris Directories make objects (including unstructured
> > > > > > data like images, videos, and documents) discoverable alongside structured
> > > > > > Iceberg tables within a Polaris catalog. A directory points to a base
> > > > > > location/prefix on an object store and automatically tracks the objects it
> > > > > > contains by maintaining an Iceberg table with object-level metadata such as
> > > > > > URI, size, content type, checksum, ...
> > > > > >
> > > > > > This means query engines and tools that already know how to read Iceberg
> > > > > > tables can discover and access unstructured data with little or no extra
> > > > > > work (accessing the object itself).
> > > > > >
> > > > > > A directory has two main parts:
> > > > > > - Directory configuration, stored by the Polaris server. It describes where
> > > > > > the data lives, how to authenticate, which objects to include, and how
> > > > > > often to re-scan. The configuration "lives" in a namespace.
> > > > > > - Directory table, an Iceberg table serving as the inventory of all objects
> > > > > > contained in the directory, with one row per object discovered during a
> > > > > > scan. The directory table uses the configuration name.
> > > > > >
> > > > > > The Polaris server itself does not perform scans. Instead, external
> > > > > > services (e.g. directory table scanning service) read the directory
> > > > > > configuration through the REST API, walk the object store, and write the
> > > > > > results into the directory table.
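
As a rough illustration of the "walk the object store, write one row per object" step described above, here is a small sketch that uses the local filesystem as a stand-in for an object store. The column names follow the metadata listed earlier (URI, size, content type, checksum); the `scan_directory` function, the `file://` URIs, and the choice of SHA-256 for the checksum are assumptions for this example, and writing the rows into an Iceberg table is deliberately left out.

```python
import hashlib
import mimetypes
import os
from typing import Dict, Iterator


def scan_directory(base: str) -> Iterator[Dict[str, object]]:
    """Yield one inventory row per file found under `base` (local stand-in)."""
    for root, _dirs, files in os.walk(base):
        for fname in sorted(files):
            path = os.path.join(root, fname)
            # Read the whole object to compute its checksum (SHA-256 assumed).
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            # Guess the content type from the file extension only.
            content_type, _encoding = mimetypes.guess_type(path)
            yield {
                "uri": "file://" + os.path.abspath(path),
                "size": os.path.getsize(path),
                "content_type": content_type or "application/octet-stream",
                "checksum": digest,
            }
```

A real scanner would list objects through the object store's API rather than `os.walk`, and would likely take size and checksum from the listing metadata where available instead of re-reading every object.
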
> > > > > >
> > > > > > I propose we discuss this both on the mailing list (this thread) and on the
> > > > > > PR. If needed, I'm happy to schedule a dedicated meeting.
> > > > > >
> > > > > > I'm looking forward to your thoughts!
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > >
> > > >
> > >
> >
>
