It makes sense to introduce volume support without the inventory table in phase 1. My original design also positioned the directory table as optional: https://docs.google.com/document/d/1ofljkrtiXRWc-v6hfkg_laKlYltepTPX7zsg44Tb-BY/edit?tab=t.0
Yufei

On Mon, Jun 8, 2026 at 4:26 AM Robert Stupp <[email protected]> wrote:

> Hi,
>
> I support the general direction.
> Modeling a directory/prefix as a first-class catalog concept in Polaris,
> complete with an inventory table for discovered objects, seems very useful.
>
> I think we should separate agreement on that direction from locking in the
> exact object model too early, though.
> One design point I would like to keep open is the relationship between the
> directory configuration and the inventory table.
>
> For example, if the directory configuration and the inventory table share
> the same name in the same namespace and are distinguished only by object
> type, that may be workable, but it can create ambiguity for APIs, UI,
> events, authorization/audit, and lifecycle operations like rename/drop.
> I don’t think we need to settle that in the first discussion, but I also
> would not want the current PR shape to imply that this part is already
> fixed.
>
> My preference would be to first agree on the higher-level model:
>
> - Polaris has a first-class Directory abstraction.
> - A Directory has a configured object-store location and scan/inventory
>   settings.
> - A Directory is associated with an Iceberg inventory table.
> - Scanner execution can be discussed separately: Polaris-provided,
>   disabled, or integrator-provided.
>
> Then we can discuss whether the inventory table is implicitly named,
> explicitly referenced, hidden/internal, user-visible, or modeled some other
> way.
>
> Thoughts?
>
> Robert
>
> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi
> >
> > I wanted to have two steps in the proposal: the configuration and high
> > level architecture (that’s the current proposal), then the scanning
> > service.
> >
> > I think the scanning should be part of Polaris but not mandatory: if
> > integrators want to have their own scanning they should be able to do so.
> > Users should be able to disable the Polaris scanners. Integrators would
> > probably like to have scanning performed by a distributed engine or
> > within cloud provider infra.
> >
> > So my proposal here is:
> > 1. To have a scanner in Polaris
> > 2. Be able to disable the Polaris scanner
> > 3. Allow users/integrators to provide their own scanners
> >
> > The first step is to get consensus on the Polaris Directories proposal
> > approach.
> >
> > I will create a follow up PR with a scanner.
> >
> > Regards
> > JB
> >
> > On Fri, Jun 5, 2026 at 11:25 PM Yufei Gu <[email protected]> wrote:
> >
> > > I think one thing we should clarify is where the scanner lives.
> > >
> > > If the scanner is completely outside Polaris, the UX becomes a bit
> > > confusing to me. In that model, Polaris only stores a directory
> > > configuration, while users still need to bring their own service to scan
> > > object storage and write an Iceberg table. In that case, I’m not sure what
> > > value Polaris Directories add over *manually creating an Iceberg table to
> > > track unstructured data files*. Users can already do that today, and it is
> > > arguably more flexible because they can define any schema they want and use
> > > any engine or workflow to populate it.
> > >
> > > To me, the more compelling direction is for Polaris to own the scanner or
> > > at least provide it as part of the project, likely through a push mode
> > > delegation service[1]. Polaris would still not need to do all the heavy
> > > scanning work itself, but it should provide a clear, first class workflow
> > > for turning a directory configuration into an updated directory table, via
> > > a delegated service.
> > >
> > > That also seems related to Romain’s questions. If the metadata extraction
> > > and scanning model are fully external, then extensibility and streaming
> > > support become entirely out of scope.
> > > But if Polaris provides the scanner
> > > framework, we can define clear extension points for custom metadata and
> > > think about supporting both batch and event driven scanning.
> > >
> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
> > >
> > > Yufei
> > >
> > >
> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
> > > wrote:
> > >
> > > > Hi JB,
> > > >
> > > > I have two questions on this scope:
> > > >
> > > > 1. any hope it is extensible so a user can plug in their own metadata?
> > > > 2. will scanning be made streaming friendly (I assume phase 0 is a batch),
> > > > idea would be to be able to use a Kappa-like architecture to have real-time
> > > > capabilities
> > > >
> > > > Thanks,
> > > > Romain Manni-Bucau
> > > > @rmannibucau <https://x.com/rmannibucau> | .NET Blog
> > > > <https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/>
> > > > | Old Blog <http://rmannibucau.wordpress.com> | Github
> > > > <https://github.com/rmannibucau> | LinkedIn
> > > > <https://www.linkedin.com/in/rmannibucau> | Book
> > > > <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
> > > > Javaccino founder (Java/.NET service - contact via linkedin)
> > > >
> > > >
> > > > On Fri, Jun 5, 2026 at 2:20 AM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Great to see the progress here. Thanks a lot JB! I will take a look at
> > > > > the PR.
> > > > >
> > > > > Yufei
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > After several months of discussion (involving Directories, Table Sources,
> > > > > > etc), I would like to propose Polaris Directories.
> > > > > >
> > > > > > I drafted a PR:
> > > > > > https://github.com/apache/polaris/pull/4613
> > > > > >
> > > > > > The proposal is documented as part of the PR:
> > > > > >
> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
> > > > > >
> > > > > > In a nutshell, Polaris Directories make objects (including unstructured
> > > > > > data like images, videos, and documents) discoverable alongside structured
> > > > > > Iceberg tables within a Polaris catalog. A directory points to a base
> > > > > > location/prefix on an object store and automatically tracks the objects it
> > > > > > contains by maintaining an Iceberg table with object-level metadata such as
> > > > > > URI, size, content type, checksum, ...
> > > > > >
> > > > > > This means query engines and tools that already know how to read Iceberg
> > > > > > tables can discover and access unstructured data with little or no extra
> > > > > > work (accessing the object itself).
> > > > > >
> > > > > > A directory has two main parts:
> > > > > > - Directory configuration, stored by the Polaris server. It describes where
> > > > > > the data lives, how to authenticate, which objects to include, and how
> > > > > > often to re-scan. The configuration "lives" in a namespace.
> > > > > > - Directory table, an Iceberg table serving as the inventory of all objects
> > > > > > contained in the directory, with one row per object discovered during a
> > > > > > scan. The directory table uses the configuration name.
> > > > > > The Polaris server itself does not perform scans. Instead, external
> > > > > > services (e.g.
> > > > > > directory table scanning service) read the directory
> > > > > > configuration through the REST API, walk the object store, and write the
> > > > > > results into the directory table.
> > > > > >
> > > > > > I propose we discuss this both on the mailing list (this thread) and on the
> > > > > > PR. If needed, I'm happy to schedule a dedicated meeting.
> > > > > >
> > > > > > I'm looking forward to your thoughts!
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Regards
> > > > > > JB
