Hi everyone,

After several months of discussion (involving Directories, Table Sources, etc.), I would like to propose Polaris Directories.

I drafted a PR:
https://github.com/apache/polaris/pull/4613

The proposal is documented as part of the PR:
https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md

In a nutshell, Polaris Directories make objects (including unstructured data like images, videos, and documents) discoverable alongside structured Iceberg tables within a Polaris catalog. A directory points to a base location/prefix on an object store and automatically tracks the objects it contains by maintaining an Iceberg table with object-level metadata such as URI, size, content type, checksum, etc. This means query engines and tools that already know how to read Iceberg tables can discover unstructured data with little or no extra work beyond accessing the object itself.

A directory has two main parts:
- Directory configuration, stored by the Polaris server. It describes where the data lives, how to authenticate, which objects to include, and how often to re-scan. The configuration "lives" in a namespace.
- Directory table, an Iceberg table serving as the inventory of all objects contained in the directory, with one row per object discovered during a scan. The directory table uses the configuration name.

The Polaris server itself does not perform scans. Instead, external services (e.g. a directory table scanning service) read the directory configuration through the REST API, walk the object store, and write the results into the directory table.

I propose we discuss this both on the mailing list (this thread) and on the PR. If needed, I'm happy to schedule a dedicated meeting.

I'm looking forward to your thoughts!

Thanks!
Regards
JB
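
P.S. To make the scanning flow a bit more concrete, here is a rough Python sketch of what an external scanner might do with a directory configuration. It is only an illustration under stated assumptions: DirectoryConfig, ObjectRow, and scan are hypothetical names, not part of the proposal or of any Polaris API. A local filesystem stands in for the object store, the include filter is assumed to be a simple glob, and the rows are returned as a list instead of being written to an Iceberg table.

```python
import fnmatch
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DirectoryConfig:
    # Hypothetical shape; the actual configuration is defined in the proposal.
    name: str                   # also assumed to be the directory table name
    base_location: str          # local path standing in for an object store prefix
    include_pattern: str = "*"  # assumed glob filter on object names


@dataclass(frozen=True)
class ObjectRow:
    # One row per discovered object, mirroring the metadata listed above.
    uri: str
    size: int
    content_type: str
    checksum: str


def scan(config: DirectoryConfig) -> list[ObjectRow]:
    """Walk the base location and build one inventory row per matching file."""
    rows = []
    base = Path(config.base_location)
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        if not fnmatch.fnmatch(path.name, config.include_pattern):
            continue
        data = path.read_bytes()
        content_type, _ = mimetypes.guess_type(path.name)
        rows.append(
            ObjectRow(
                uri=path.resolve().as_uri(),
                size=len(data),
                content_type=content_type or "application/octet-stream",
                checksum=hashlib.sha256(data).hexdigest(),
            )
        )
    return rows
```

A real scanning service would presumably fetch the configuration through the REST API and append these rows to the directory table; both steps are left out here because their exact shape is what the proposal is meant to settle.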
