Re: [PROPOSAL] Polaris Directories proposal

Jean-Baptiste Onofré Mon, 22 Jun 2026 06:42:16 -0700

Let me clarify: the scan service code lives in Polaris (as the default
implementation) but runs outside of the Polaris server.


Regards
JB

On Sun, Jun 21, 2026 at 8:30 PM Yufei Gu <[email protected]> wrote:

> Thanks JB! I think that's the right direction.
>
> That said, I don't think the default scan service should run inside the
> Polaris service itself. Scanning can be very I/O and network intensive and
> could easily saturate a Polaris instance. We'll likely need a delegation
> service for that.
>
> I think the most practical path forward is to work on the delegation
> service to unblock it. In parallel, we can continue working on volume
> support without the inventory table.
>
> Yufei
>
>
> On Sat, Jun 20, 2026 at 10:37 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi everyone
>>
>> Thanks to your feedback, I will update the proposal/PR to include a
>> default object store scan service in Polaris (that can be disabled and
>> replaced by a custom one).
>>
>> I will keep you posted when the PR is updated.
>>
>> Thanks,
>>
>> Regards
>> JB
>>
>> On Tue, Jun 9, 2026 at 9:42 PM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>> > Hi Robert,
>> >
>> > Thanks for your feedback!
>> >
>> > From a user perspective, I personally prefer having the Directory and
>> > Table share the same name, as I find it less confusing to see the
>> > association at first glance. However, I'm open to including the
>> > inventory table name as part of the Directory configuration instead.
>> >
>> > As mentioned in my initial proposal, the current PR is intended to
>> > illustrate a potential implementation. It is certainly not the final
>> > version, and I am happy to update it based on community input. I fully
>> > agree with the high-level model you outlined, and I believe the PR is
>> > well-aligned with that direction.
>> >
>> > I still believe the inventory table is essential, as it represents the
>> > core value of the Directory and scanner; without it, users could simply
>> > create an Iceberg table manually to list objects.
>> > I'm fine with adding an endpoint to the Directory API that creates an
>> > inventory table without scanning (using the static schema), as well as
>> > other endpoints to manage entries in an inventory (if you think that's
>> > helpful).
>> >
>> > Regards,
>> > JB
>> >
>> >
>> > On Mon, Jun 8, 2026 at 1:27 PM Robert Stupp <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I support the general direction.
>> >> Modeling a directory/prefix as a first-class catalog concept in Polaris,
>> >> complete with an inventory table for discovered objects, seems very
>> >> useful.
>> >>
>> >> I think we should separate agreement on that direction from locking in
>> >> the exact object model too early, though.
>> >> One design point I would like to keep open is the relationship between
>> >> the directory configuration and the inventory table.
>> >>
>> >> For example, if the directory configuration and the inventory table
>> >> share the same name in the same namespace and are distinguished only by
>> >> object type, that may be workable, but it can create ambiguity for APIs,
>> >> UI, events, authorization/audit, and lifecycle operations like
>> >> rename/drop.
>> >> I don’t think we need to settle that in the first discussion, but I also
>> >> would not want the current PR shape to imply that this part is already
>> >> fixed.
>> >>
>> >> My preference would be to first agree on the higher-level model:
>> >>
>> >> - Polaris has a first-class Directory abstraction.
>> >> - A Directory has a configured object-store location and scan/inventory
>> >> settings.
>> >> - A Directory is associated with an Iceberg inventory table.
>> >> - Scanner execution can be discussed separately: Polaris-provided,
>> >> disabled, or integrator-provided.
>> >>
>> >> Then we can discuss whether the inventory table is implicitly named,
>> >> explicitly referenced, hidden/internal, user-visible, or modeled some
>> >> other way.
>> >>
>> >> Thoughts?
>> >>
>> >> Robert
>> >>
>> >> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
>> >> wrote:
>> >>
>> >> > Hi
>> >> >
>> >> > I wanted to have two steps in the proposal: the configuration and
>> >> > high-level architecture (that’s the current proposal), then the
>> >> > scanning service.
>> >> >
>> >> > I think the scanning should be part of Polaris but not mandatory: if
>> >> > integrators want to use their own scanning, they should be able to do
>> >> > so. Users should be able to disable the Polaris scanners. Integrators
>> >> > would probably like to have scanning performed by a distributed engine
>> >> > or within cloud provider infrastructure.
>> >> >
>> >> > So my proposal here is:
>> >> > 1. Have a scanner in Polaris
>> >> > 2. Be able to disable the Polaris scanner
>> >> > 3. Allow users/integrators to provide their own scanners
>> >> >
>> >> > The first step is to get consensus on the Polaris Directories proposal
>> >> > approach.
>> >> >
>> >> > I will create a follow up PR with a scanner.
>> >> >
>> >> > Regards
>> >> > JB
>> >> >
>> >> > On Fri, Jun 5, 2026 at 11:25 PM Yufei Gu <[email protected]> wrote:
>> >> >
>> >> > > I think one thing we should clarify is where the scanner lives.
>> >> > >
>> >> > > If the scanner is completely outside Polaris, the UX becomes a bit
>> >> > > confusing to me. In that model, Polaris only stores a directory
>> >> > > configuration, while users still need to bring their own service to
>> >> > > scan object storage and write an Iceberg table. In that case, I’m not
>> >> > > sure what value Polaris Directories add over *manually creating an
>> >> > > Iceberg table to track unstructured data files*. Users can already do
>> >> > > that today, and it is arguably more flexible because they can define
>> >> > > any schema they want and use any engine or workflow to populate it.
>> >> > >
>> >> > > To me, the more compelling direction is for Polaris to own the
>> >> > > scanner or at least provide it as part of the project, likely through
>> >> > > a push mode delegation service[1]. Polaris would still not need to do
>> >> > > all the heavy scanning work itself, but it should provide a clear,
>> >> > > first class workflow for turning a directory configuration into an
>> >> > > updated directory table, via a delegated service.
>> >> > >
>> >> > > That also seems related to Romain’s questions. If the metadata
>> >> > > extraction and scanning model are fully external, then extensibility
>> >> > > and streaming support become entirely out of scope. But if Polaris
>> >> > > provides the scanner framework, we can define clear extension points
>> >> > > for custom metadata and think about supporting both batch and
>> >> > > event-driven scanning.
>> >> > >
>> >> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
>> >> > >
>> >> > > Yufei
>> >> > >
>> >> > >
>> >> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
>> >> > > wrote:
>> >> > >
>> >> > > > Hi JB,
>> >> > > >
>> >> > > > I have two questions on this scope:
>> >> > > >
>> >> > > > 1. Any hope it is extensible, so a user can plug in their own
>> >> > > > metadata?
>> >> > > > 2. Will scanning be made streaming-friendly (I assume phase 0 is
>> >> > > > batch)? The idea would be to use a Kappa-like architecture to get
>> >> > > > real-time capabilities.
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Romain Manni-Bucau
>> >> > > > @rmannibucau <https://x.com/rmannibucau> | .NET Blog
>> >> > > > <https://dotnetbirdie.github.io/> | Blog
>> >> > > > <https://rmannibucau.github.io/> | Old Blog
>> >> > > > <http://rmannibucau.wordpress.com> | Github
>> >> > > > <https://github.com/rmannibucau> | LinkedIn
>> >> > > > <https://www.linkedin.com/in/rmannibucau> | Book
>> >> > > > <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
>> >> > > > Javaccino founder (Java/.NET service - contact via linkedin)
>> >> > > >
>> >> > > >
>> >> > > > On Fri, Jun 5, 2026 at 2:20 AM Yufei Gu <[email protected]> wrote:
>> >> > > >
>> >> > > > > Great to see the progress here. Thanks a lot JB! I will take a
>> >> > > > > look at the PR.
>> >> > > > >
>> >> > > > > Yufei
>> >> > > > >
>> >> > > > >
>> >> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré
>> >> > > > > <[email protected]> wrote:
>> >> > > > >
>> >> > > > > > Hi everyone,
>> >> > > > > >
>> >> > > > > > After several months of discussion (involving Directories,
>> >> > > > > > Table Sources, etc), I would like to propose Polaris
>> >> > > > > > Directories.
>> >> > > > > >
>> >> > > > > > I drafted a PR:
>> >> > > > > > https://github.com/apache/polaris/pull/4613
>> >> > > > > >
>> >> > > > > > The proposal is documented as part of the PR:
>> >> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
>> >> > > > > >
>> >> > > > > > In a nutshell, Polaris Directories make objects (including
>> >> > > > > > unstructured data like images, videos, and documents)
>> >> > > > > > discoverable alongside structured Iceberg tables within a
>> >> > > > > > Polaris catalog. A directory points to a base location/prefix
>> >> > > > > > on an object store and automatically tracks the objects it
>> >> > > > > > contains by maintaining an Iceberg table with object-level
>> >> > > > > > metadata such as URI, size, content type, checksum, ...
>> >> > > > > >
>> >> > > > > > This means query engines and tools that already know how to
>> >> > > > > > read Iceberg tables can discover and access unstructured data
>> >> > > > > > with little or no extra work (accessing the object itself).
>> >> > > > > >
>> >> > > > > > A directory has two main parts:
>> >> > > > > > - Directory configuration, stored by the Polaris server. It
>> >> > > > > > describes where the data lives, how to authenticate, which
>> >> > > > > > objects to include, and how often to re-scan. The
>> >> > > > > > configuration "lives" in a namespace.
>> >> > > > > > - Directory table, an Iceberg table serving as the inventory
>> >> > > > > > of all objects contained in the directory, with one row per
>> >> > > > > > object discovered during a scan. The directory table uses the
>> >> > > > > > configuration name.
>> >> > > > > >
>> >> > > > > > The Polaris server itself does not perform scans. Instead,
>> >> > > > > > external services (e.g. directory table scanning service) read
>> >> > > > > > the directory configuration through the REST API, walk the
>> >> > > > > > object store, and write the results into the directory table.
>> >> > > > > >
>> >> > > > > > I propose we discuss this both on the mailing list (this
>> >> > > > > > thread) and on the PR. If needed, I'm happy to schedule a
>> >> > > > > > dedicated meeting.
>> >> > > > > >
>> >> > > > > > I'm looking forward to your thoughts!
>> >> > > > > >
>> >> > > > > > Thanks!
>> >> > > > > >
>> >> > > > > > Regards
>> >> > > > > > JB
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>>
>
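
For readers following the thread, the scanner model discussed above (a
default scanner that can be disabled or replaced, producing one inventory row
per object with URI, size, content type, and checksum) might be sketched
roughly as follows. This is only an illustration under stated assumptions:
DirectoryConfig, InventoryRow, default_scanner, resolve_scanner, and the
"scanner" setting are hypothetical names that do not come from the PR, and
the object store is simulated with an in-memory dict rather than a real
storage client or the Polaris REST API.

```python
import fnmatch
import hashlib
import mimetypes
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class DirectoryConfig:
    """Hypothetical stand-in for a directory configuration."""
    name: str
    base_location: str        # prefix on the object store, e.g. "s3://bucket/media/"
    include: str = "*"        # glob applied to the path below base_location
    scanner: str = "default"  # "default", "disabled", or the name of a custom scanner


@dataclass(frozen=True)
class InventoryRow:
    """One row per discovered object (fields taken from the proposal text)."""
    uri: str
    size: int
    content_type: str
    checksum: str


# A scanner maps (config, object listing) to inventory rows.
Scanner = Callable[[DirectoryConfig, Dict[str, bytes]], List[InventoryRow]]


def default_scanner(config: DirectoryConfig, objects: Dict[str, bytes]) -> List[InventoryRow]:
    """Walk the (simulated) object store and emit one row per matching object."""
    rows = []
    for uri in sorted(objects):
        if not uri.startswith(config.base_location):
            continue  # outside the directory's prefix
        relative = uri[len(config.base_location):]
        if not fnmatch.fnmatchcase(relative, config.include):
            continue  # filtered out by the include pattern
        body = objects[uri]
        content_type = mimetypes.guess_type(uri)[0] or "application/octet-stream"
        rows.append(InventoryRow(uri, len(body), content_type,
                                 hashlib.sha256(body).hexdigest()))
    return rows


def resolve_scanner(config: DirectoryConfig,
                    custom: Dict[str, Scanner]) -> Optional[Scanner]:
    """Pick the scanner for a directory: none, the default, or a custom one."""
    if config.scanner == "disabled":
        return None
    if config.scanner == "default":
        return default_scanner
    return custom[config.scanner]  # raises KeyError if the name is not registered
```

In this sketch a deployment would call resolve_scanner for each directory and
run the returned scanner outside the Polaris server process, writing the rows
into the inventory table. How rows are actually committed to Iceberg is left
out on purpose.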
