Hi

I wanted to have two steps in the proposal: first the configuration and
high-level architecture (that’s the current proposal), then the scanning
service.

I think the scanning should be part of Polaris but not mandatory: if
integrators want to have their own scanning they should be able to do so.
Users should be able to disable the Polaris scanners. Integrators would
probably like to have scanning performed by a distributed engine or within
cloud provider infra.

So my proposal here is:
1. Have a scanner in Polaris
2. Be able to disable the Polaris scanner
3. Allow users/integrators to provide their own scanners
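
To make these three points concrete, here is a minimal sketch in Python. It
is not Polaris code: every name in it (DirectoryScanner, BuiltInScanner,
DirectoryConfig, builtin_scanner_enabled, custom_scanner, resolve_scanner)
is hypothetical, the object store is faked with an in-memory dict, and the
rule that a custom scanner takes precedence over the built-in one is only an
assumption for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Protocol


@dataclass(frozen=True)
class ObjectEntry:
    """One row of the directory table (hypothetical subset of columns)."""
    uri: str
    size: int


class DirectoryScanner(Protocol):
    """Hypothetical extension point: lists the objects under a base location."""

    def scan(self, base_location: str) -> Iterable[ObjectEntry]: ...


class BuiltInScanner:
    """Stand-in for a scanner shipped with Polaris (point 1)."""

    def __init__(self, listing: Dict[str, int]):
        # In-memory fake of an object store listing: URI -> size in bytes.
        self._listing = listing

    def scan(self, base_location: str) -> Iterable[ObjectEntry]:
        return [
            ObjectEntry(uri, size)
            for uri, size in sorted(self._listing.items())
            if uri.startswith(base_location)
        ]


@dataclass
class DirectoryConfig:
    base_location: str
    builtin_scanner_enabled: bool = True                # point 2
    custom_scanner: Optional[DirectoryScanner] = None   # point 3


def resolve_scanner(
    config: DirectoryConfig, builtin: DirectoryScanner
) -> Optional[DirectoryScanner]:
    """Pick the scanner for a directory.

    Assumed precedence: custom scanner, then built-in (if enabled), else None,
    meaning the scan is performed entirely outside Polaris.
    """
    if config.custom_scanner is not None:
        return config.custom_scanner
    if config.builtin_scanner_enabled:
        return builtin
    return None
```

With this shape, disabling the built-in scanner and not providing a custom
one simply yields no scanner, which matches the case where an integrator
runs the scan in its own infrastructure.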

The first step is to get consensus on the Polaris Directories proposal
approach.

I will create a follow-up PR with a scanner.

Regards
JB

On Fri, Jun 5, 2026 at 23:25, Yufei Gu <[email protected]> wrote:

> I think one thing we should clarify is where the scanner lives.
>
> If the scanner is completely outside Polaris, the UX becomes a bit
> confusing to me. In that model, Polaris only stores a directory
> configuration, while users still need to bring their own service to scan
> object storage and write an Iceberg table. In that case, I’m not sure what
> value Polaris Directories add over *manually creating an Iceberg table to
> track unstructured data files*. Users can already do that today, and it is
> arguably more flexible because they can define any schema they want and use
> any engine or workflow to populate it.
>
> To me, the more compelling direction is for Polaris to own the scanner or
> at least provide it as part of the project, likely through a push mode
> delegation service[1]. Polaris would still not need to do all the heavy
> scanning work itself, but it should provide a clear, first class workflow
> for turning a directory configuration into an updated directory table, via
> a delegated service.
>
> That also seems related to Romain’s questions. If the metadata extraction
> and scanning model are fully external, then extensibility and streaming
> support become entirely out of scope. But if Polaris provides the scanner
> framework, we can define clear extension points for custom metadata and
> think about supporting both batch and event driven scanning.
>
> 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
>
> Yufei
>
>
> On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
> wrote:
>
> > Hi JB,
> >
> > I have two questions on this scope:
> >
> > 1. any hope it is extensible so a user can plug in their own metadata?
> > 2. will scanning be made streaming friendly (I assume phase 0 is
> > batch)? The idea would be to be able to use a Kappa-like architecture
> > to have real-time capabilities
> >
> > Thanks,
> > Romain Manni-Bucau
> > @rmannibucau <https://x.com/rmannibucau> | .NET Blog
> > <https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/
> >
> > | Old
> > Blog <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <
> >
> https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064
> > >
> > Javaccino founder (Java/.NET service - contact via linkedin)
> >
> >
> > On Fri, Jun 5, 2026 at 02:20, Yufei Gu <[email protected]> wrote:
> >
> > > Great to see the progress here. Thanks a lot JB! I will take a look at
> > the
> > > PR.
> > >
> > > Yufei
> > >
> > >
> > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > After several months of discussion (involving Directories, Table
> > Sources,
> > > > etc), I would like to propose Polaris Directories.
> > > >
> > > > I drafted a PR:
> > > > https://github.com/apache/polaris/pull/4613
> > > >
> > > > The proposal is documented as part of the PR:
> > > >
> > > >
> > >
> >
> https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
> > > >
> > > > In a nutshell, Polaris Directories make objects (including
> unstructured
> > > > data like images, videos, and documents) discoverable alongside
> > > structured
> > > > Iceberg tables within a Polaris catalog. A directory points to a base
> > > > location/prefix on an object store and automatically tracks the
> objects
> > > it
> > > > contains by maintaining an Iceberg table with object-level metadata
> > such
> > > as
> > > > URI, size, content type, checksum, ...
> > > >
> > > > This means query engines and tools that already know how to read
> > Iceberg
> > > > tables can discover and access unstructured data with little or no
> > extra
> > > > work (accessing the object itself).
> > > >
> > > > A directory has two main parts:
> > > > - Directory configuration, stored by the Polaris server. It describes
> > > where
> > > > the data lives, how to authenticate, which objects to include, and
> how
> > > > often to re-scan. The configuration "lives" in a namespace.
> > > > - Directory table, an Iceberg table serving as the inventory of all
> > > objects
> > > > contained in the directory, with one row per object discovered
> during a
> > > > scan. The directory table uses the configuration name.
> > > > The Polaris server itself does not perform scans. Instead, external
> > > > services (e.g. directory table scanning service) read the directory
> > > > configuration through the REST API, walk the object store, and write
> > the
> > > > results into the directory table.
> > > >
> > > > I propose we discuss this both on the mailing list (this thread) and
> on
> > > the
> > > > PR. If needed, I'm happy to schedule a dedicated meeting.
> > > >
> > > > I'm looking forward to your thoughts!
> > > >
> > > > Thanks!
> > > >
> > > > Regards
> > > > JB
> > > >
> > >
> >
>
