I think one thing we should clarify is where the scanner lives. If the scanner is completely outside Polaris, the UX becomes a bit confusing to me. In that model, Polaris only stores a directory configuration, while users still need to bring their own service to scan object storage and write an Iceberg table. In that case, I’m not sure what value Polaris Directories add over *manually creating an Iceberg table to track unstructured data files*. Users can already do that today, and it is arguably more flexible because they can define any schema they want and use any engine or workflow to populate it.
To me, the more compelling direction is for Polaris to own the scanner, or at least provide it as part of the project, likely through a push-mode delegation service [1]. Polaris would still not need to do all the heavy scanning work itself, but it should provide a clear, first-class workflow for turning a directory configuration into an updated directory table via a delegated service.

That also seems related to Romain’s questions. If the metadata extraction and scanning model are fully external, then extensibility and streaming support become entirely out of scope. But if Polaris provides the scanner framework, we can define clear extension points for custom metadata and think about supporting both batch and event-driven scanning.

1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696

Yufei

On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]> wrote:

> Hi JB,
>
> I have two questions on this scope:
>
> 1. any hope it is extensible so an user can plug its own metadata?
> 2. will scanning be made streaming friendly (I assume phase 0 is a batch),
> idea would be to be able to use Kappa like architecture to have real time
> capabilities
>
> Thanks,
> Romain Manni-Bucau
> @rmannibucau <https://x.com/rmannibucau> | .NET Blog <https://dotnetbirdie.github.io/>
> | Blog <https://rmannibucau.github.io/> | Old Blog <http://rmannibucau.wordpress.com>
> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
> | Book <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
> Javaccino founder (Java/.NET service - contact via linkedin)
>
>
> On Fri, Jun 5, 2026 at 02:20, Yufei Gu <[email protected]> wrote:
>
> > Great to see the progress here. Thanks a lot JB! I will take a look at the
> > PR.
> >
> > Yufei
> >
> >
> > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > After several months of discussion (involving Directories, Table Sources,
> > > etc), I would like to propose Polaris Directories.
> > >
> > > I drafted a PR:
> > > https://github.com/apache/polaris/pull/4613
> > >
> > > The proposal is documented as part of the PR:
> > >
> > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
> > >
> > > In a nutshell, Polaris Directories make objects (including unstructured
> > > data like images, videos, and documents) discoverable alongside structured
> > > Iceberg tables within a Polaris catalog. A directory points to a base
> > > location/prefix on an object store and automatically tracks the objects it
> > > contains by maintaining an Iceberg table with object-level metadata such as
> > > URI, size, content type, checksum, ...
> > >
> > > This means query engines and tools that already know how to read Iceberg
> > > tables can discover and access unstructured data with little or no extra
> > > work (accessing the object itself).
> > >
> > > A directory has two main parts:
> > > - Directory configuration, stored by the Polaris server. It describes where
> > > the data lives, how to authenticate, which objects to include, and how
> > > often to re-scan. The configuration "lives" in a namespace.
> > > - Directory table, an Iceberg table serving as the inventory of all objects
> > > contained in the directory, with one row per object discovered during a
> > > scan. The directory table uses the configuration name.
> > > The Polaris server itself does not perform scans. Instead, external
> > > services (e.g. directory table scanning service) read the directory
> > > configuration through the REST API, walk the object store, and write the
> > > results into the directory table.
> > >
> > > I propose we discuss this both on the mailing list (this thread) and on the
> > > PR. If needed, I'm happy to schedule a dedicated meeting.
> > >
> > > I'm looking forward to your thoughts!
> > >
> > > Thanks!
> > >
> > > Regards
> > > JB
> > >
> >
>
