Re: [PROPOSAL] Polaris Directories proposal

Jean-Baptiste Onofré Sat, 20 Jun 2026 22:37:29 -0700

Hi everyone

Thanks to your feedback, I will update the proposal/PR to include a default
object store scan service in Polaris (that can be disabled and replaced by
a custom one).
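
To make the "default, can be disabled, can be replaced" idea concrete, here is a minimal sketch in Python. It is only an illustration and is not taken from the PR: `InventoryRow`, `make_default_scanner`, `resolve_scanner`, and the in-memory object store are all hypothetical names, and the row fields simply mirror the metadata mentioned in the proposal (URI, size, content type, checksum).

```python
from dataclasses import dataclass
from hashlib import sha256
from mimetypes import guess_type
from typing import Callable, Iterable, Mapping, Optional


@dataclass(frozen=True)
class InventoryRow:
    """One row per discovered object (hypothetical shape, not the PR schema)."""
    uri: str
    size: int
    content_type: str
    checksum: str


# Assumption: a scanner is any callable mapping a base location to rows.
Scanner = Callable[[str], Iterable[InventoryRow]]


def make_default_scanner(store: Mapping[str, bytes]) -> Scanner:
    """Built-in scanner over a fake in-memory object store (uri -> bytes)."""
    def scan(base_location: str) -> Iterable[InventoryRow]:
        for uri in sorted(store):
            if uri.startswith(base_location):
                data = store[uri]
                content_type = guess_type(uri)[0] or "application/octet-stream"
                yield InventoryRow(uri, len(data), content_type,
                                   sha256(data).hexdigest())
    return scan


def resolve_scanner(default: Scanner,
                    enabled: bool = True,
                    custom: Optional[Scanner] = None) -> Optional[Scanner]:
    """A custom scanner wins; otherwise use the default unless it is disabled."""
    if custom is not None:
        return custom
    return default if enabled else None


if __name__ == "__main__":
    store = {"s3://bucket/media/cat.png": b"abc",
             "s3://bucket/other/notes.txt": b"zz"}
    scanner = resolve_scanner(make_default_scanner(store))
    for row in scanner("s3://bucket/media/"):
        print(row.uri, row.size, row.content_type)
```

When `resolve_scanner` returns `None`, the default scanner is disabled and no replacement is registered, which would correspond to an integrator populating the inventory table entirely outside Polaris.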


I will keep you posted when the PR is updated.

Thanks,

Regards
JB

On Tue, Jun 9, 2026 at 21:42, Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Robert,
>
> Thanks for your feedback!
>
> From a user perspective, I personally prefer having the Directory and
> Table share the same name, as I find it less confusing to see the
> association at first glance. However, I'm open to including the inventory
> table name as part of the Directory configuration instead.
>
> As mentioned in my initial proposal, the current PR is intended to
> illustrate a potential implementation. It is certainly not the final
> version, and I am happy to update it based on community input. I fully
> agree with the high-level model you outlined, and I believe the PR is
> well-aligned with that direction.
>
> I still believe the inventory table is essential, as it represents the
> core value of the Directory and scanner; without it, users could simply
> create an Iceberg table manually to list objects.
> I'm fine with adding an endpoint to the Directory API that creates an
> inventory table without scanning (using the static schema), as well as
> other endpoints for managing inventory entries, if you think that is helpful.
>
> Regards,
> JB
>
>
> On Mon, Jun 8, 2026 at 1:27 PM Robert Stupp <[email protected]> wrote:
>
>> Hi,
>>
>> I support the general direction.
>> Modeling a directory/prefix as a first-class catalog concept in Polaris,
>> complete with an inventory table for discovered objects, seems very
>> useful.
>>
>> I think we should separate agreement on that direction from locking in the
>> exact object model too early, though.
>> One design point I would like to keep open is the relationship between the
>> directory configuration and the inventory table.
>>
>> For example, if the directory configuration and the inventory table share
>> the same name in the same namespace and are distinguished only by object
>> type, that may be workable, but it can create ambiguity for APIs, UI,
>> events, authorization/audit, and lifecycle operations like rename/drop.
>> I don’t think we need to settle that in the first discussion, but I also
>> would not want the current PR shape to imply that this part is already
>> fixed.
>>
>> My preference would be to first agree on the higher-level model:
>>
>> - Polaris has a first-class Directory abstraction.
>> - A Directory has a configured object-store location and scan/inventory
>> settings.
>> - A Directory is associated with an Iceberg inventory table.
>> - Scanner execution can be discussed separately: Polaris-provided,
>> disabled, or integrator-provided.
>>
>> Then we can discuss whether the inventory table is implicitly named,
>> explicitly referenced, hidden/internal, user-visible, or modeled some
>> other way.
>>
>> Thoughts?
>>
>> Robert
>>
>> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>> > Hi
>> >
>> > I wanted to have two steps in the proposal: first the configuration and
>> > high-level architecture (that’s the current proposal), then the scanning
>> > service.
>> >
>> > I think scanning should be part of Polaris but not mandatory: if
>> > integrators want to use their own scanning, they should be able to do so.
>> > Users should be able to disable the Polaris scanners. Integrators would
>> > probably like to have scanning performed by a distributed engine or
>> > within cloud provider infrastructure.
>> >
>> > So my proposal here is:
>> > 1. Provide a scanner in Polaris
>> > 2. Be able to disable the Polaris scanner
>> > 3. Allow users/integrators to provide their own scanners
>> >
>> > The first step is to get consensus on the Polaris Directories proposal
>> > approach.
>> >
>> > I will create a follow-up PR with a scanner.
>> >
>> > Regards
>> > JB
>> >
>> > On Fri, Jun 5, 2026 at 23:25, Yufei Gu <[email protected]> wrote:
>> >
>> > > I think one thing we should clarify is where the scanner lives.
>> > >
>> > > If the scanner is completely outside Polaris, the UX becomes a bit
>> > > confusing to me. In that model, Polaris only stores a directory
>> > > configuration, while users still need to bring their own service to scan
>> > > object storage and write an Iceberg table. In that case, I’m not sure what
>> > > value Polaris Directories add over *manually creating an Iceberg table to
>> > > track unstructured data files*. Users can already do that today, and it is
>> > > arguably more flexible because they can define any schema they want and use
>> > > any engine or workflow to populate it.
>> > >
>> > > To me, the more compelling direction is for Polaris to own the scanner or
>> > > at least provide it as part of the project, likely through a push-mode
>> > > delegation service [1]. Polaris would still not need to do all the heavy
>> > > scanning work itself, but it should provide a clear, first-class workflow
>> > > for turning a directory configuration into an updated directory table via
>> > > a delegated service.
>> > >
>> > > That also seems related to Romain’s questions. If the metadata extraction
>> > > and scanning model are fully external, then extensibility and streaming
>> > > support become entirely out of scope. But if Polaris provides the scanner
>> > > framework, we can define clear extension points for custom metadata and
>> > > think about supporting both batch and event-driven scanning.
>> > >
>> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
>> > >
>> > > Yufei
>> > >
>> > >
>> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi JB,
>> > > >
>> > > > I have two questions on this scope:
>> > > >
>> > > > 1. Is there any hope that it will be extensible, so a user can plug in
>> > > > their own metadata?
>> > > > 2. Will scanning be made streaming-friendly (I assume phase 0 is batch)?
>> > > > The idea would be to use a Kappa-like architecture to get real-time
>> > > > capabilities.
>> > > >
>> > > > Thanks,
>> > > > Romain Manni-Bucau
>> > > > @rmannibucau <https://x.com/rmannibucau> | .NET Blog
>> > > > <https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/>
>> > > > | Old Blog <http://rmannibucau.wordpress.com> | Github
>> > > > <https://github.com/rmannibucau> | LinkedIn
>> > > > <https://www.linkedin.com/in/rmannibucau> | Book
>> > > > <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
>> > > > Javaccino founder (Java/.NET service - contact via linkedin)
>> > > >
>> > > >
>> > > > On Fri, Jun 5, 2026 at 02:20, Yufei Gu <[email protected]> wrote:
>> > > >
>> > > > > Great to see the progress here. Thanks a lot JB! I will take a look at
>> > > > > the PR.
>> > > > >
>> > > > > Yufei
>> > > > >
>> > > > >
>> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi everyone,
>> > > > > >
>> > > > > > After several months of discussion (involving Directories, Table Sources,
>> > > > > > etc.), I would like to propose Polaris Directories.
>> > > > > >
>> > > > > > I drafted a PR:
>> > > > > > https://github.com/apache/polaris/pull/4613
>> > > > > >
>> > > > > > The proposal is documented as part of the PR:
>> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
>> > > > > >
>> > > > > > In a nutshell, Polaris Directories make objects (including unstructured
>> > > > > > data like images, videos, and documents) discoverable alongside structured
>> > > > > > Iceberg tables within a Polaris catalog. A directory points to a base
>> > > > > > location/prefix on an object store and automatically tracks the objects it
>> > > > > > contains by maintaining an Iceberg table with object-level metadata such as
>> > > > > > URI, size, content type, checksum, ...
>> > > > > >
>> > > > > > This means query engines and tools that already know how to read Iceberg
>> > > > > > tables can discover and access unstructured data with little or no extra
>> > > > > > work (accessing the object itself).
>> > > > > >
>> > > > > > A directory has two main parts:
>> > > > > > - Directory configuration, stored by the Polaris server. It describes where
>> > > > > > the data lives, how to authenticate, which objects to include, and how
>> > > > > > often to re-scan. The configuration "lives" in a namespace.
>> > > > > > - Directory table, an Iceberg table serving as the inventory of all objects
>> > > > > > contained in the directory, with one row per object discovered during a
>> > > > > > scan. The directory table uses the configuration name.
>> > > > > >
>> > > > > > The Polaris server itself does not perform scans. Instead, external
>> > > > > > services (e.g. a directory table scanning service) read the directory
>> > > > > > configuration through the REST API, walk the object store, and write the
>> > > > > > results into the directory table.
>> > > > > >
>> > > > > > I propose we discuss this both on the mailing list (this thread) and on the
>> > > > > > PR. If needed, I'm happy to schedule a dedicated meeting.
>> > > > > >
>> > > > > > I'm looking forward to your thoughts!
>> > > > > >
>> > > > > > Thanks!
>> > > > > >
>> > > > > > Regards
>> > > > > > JB
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
