Hi everyone,

After several months of discussion (involving Directories, Table Sources,
etc.), I would like to propose Polaris Directories.

I drafted a PR:
https://github.com/apache/polaris/pull/4613

The proposal is documented as part of the PR:
https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md

In a nutshell, Polaris Directories make objects (including unstructured
data like images, videos, and documents) discoverable alongside structured
Iceberg tables within a Polaris catalog. A directory points to a base
location/prefix on an object store and automatically tracks the objects it
contains by maintaining an Iceberg table with object-level metadata such as
URI, size, content type, and checksum.

This means query engines and tools that already know how to read Iceberg
tables can discover unstructured data with little or no extra work; the
only additional step is fetching the object itself.
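
To illustrate the consumer side, here is a minimal Python sketch of how a
tool might use rows read from a directory table to find the objects it cares
about. The ObjectRow type, its field names, the select_uris helper, and the
sample values are all hypothetical and only mirror the metadata listed above;
the actual schema is whatever the proposal document defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRow:
    """One inventory row. Field names are assumptions, not the proposed schema."""
    uri: str
    size: int
    content_type: str
    checksum: str


def select_uris(rows, content_type_prefix):
    """Return the URIs of all rows whose content type starts with the prefix."""
    return [r.uri for r in rows if r.content_type.startswith(content_type_prefix)]


# Made-up sample rows standing in for the result of reading the Iceberg table.
rows = [
    ObjectRow("s3://bucket/media/cat.png", 2048, "image/png", "aa11"),
    ObjectRow("s3://bucket/media/report.pdf", 4096, "application/pdf", "bb22"),
    ObjectRow("s3://bucket/media/dog.jpg", 1024, "image/jpeg", "cc33"),
]

image_uris = select_uris(rows, "image/")
```

After this step the tool only has to fetch each URI from the object store.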

A directory has two main parts:
- Directory configuration, stored by the Polaris server. It describes where
the data lives, how to authenticate, which objects to include, and how
often to re-scan. The configuration "lives" in a namespace.
- Directory table, an Iceberg table serving as the inventory of all objects
contained in the directory, with one row per object discovered during a
scan. The directory table uses the configuration name.

The Polaris server itself does not perform scans. Instead, external
services (e.g. directory table scanning service) read the directory
configuration through the REST API, walk the object store, and write the
results into the directory table.
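
As a rough sketch of what such a scanning service does, the Python below
walks a location and produces one row per matching object. Everything in it
is an assumption made for illustration: the DirectoryConfig fields, the scan
function, and the row keys are hypothetical, a local filesystem stands in for
the object store, and the steps of reading the configuration from the REST
API and appending the rows to the Iceberg table are left out.

```python
import fnmatch
import hashlib
import mimetypes
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DirectoryConfig:
    """Hypothetical stand-in for a directory configuration."""
    name: str                 # would also be the directory table name
    base_location: str        # local path here; an object store prefix in practice
    include: tuple = ("*",)   # glob patterns for objects to include
    rescan_seconds: int = 3600  # for a scheduler; not used by scan() itself


def scan(config):
    """Walk the base location and return one metadata row per included file."""
    rows = []
    for root, _dirs, files in os.walk(config.base_location):
        for filename in sorted(files):
            if not any(fnmatch.fnmatch(filename, p) for p in config.include):
                continue
            path = Path(root) / filename
            data = path.read_bytes()
            guessed_type = mimetypes.guess_type(filename)[0]
            rows.append({
                "uri": path.resolve().as_uri(),
                "size": len(data),
                "content_type": guessed_type or "application/octet-stream",
                "checksum": hashlib.sha256(data).hexdigest(),
            })
    return rows
```

A real service would read whole objects only if it has to compute the
checksum itself; many object stores already expose size and a checksum or
ETag in their listing responses.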

I propose we discuss this both on the mailing list (this thread) and on the
PR. If needed, I'm happy to schedule a dedicated meeting.

I'm looking forward to your thoughts!

Thanks!

Regards
JB
