adutra opened a new issue, #3621:
URL: https://github.com/apache/polaris/issues/3621

   ### Is your feature request related to a problem? Please describe.
   
   ### Summary
   
   The `DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED` feature configuration could benefit from
   improved documentation to clarify its purpose, limitations, and relationship with Iceberg's
   `write.object-storage.enabled` feature.
   
   ### Background
   
   Polaris has a feature config called `DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED`, introduced in
   commit bd8325208675c2b6505888cdd12d2c5abaa8dd2a.
   
   The feature name and current description may lead users to believe it provides similar functionality
   to Iceberg's [Object Store File Layout](https://iceberg.apache.org/docs/1.10.0/aws/#object-store-file-layout),
   but the two features work at different levels and are designed to be complementary.
   
   ### How the features differ
   
   **Iceberg's `write.object-storage.enabled`** applies entropy (hash-based prefix) on a *per-file*
   basis. Each file gets a unique hash prefix.
   
   **Polaris's `DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED`** applies entropy *once per table*,
   based on the table identifier. All files in the same table share the same hash.
   
   ### Example
   
   Consider two data files in a table `newdb.newtable`:
   
   **Standard layout (no entropy):**
   ```
   s3://bucket/warehouse/newdb/newtable/data/file1.parquet
   s3://bucket/warehouse/newdb/newtable/data/file2.parquet
   ```
   
   **With Iceberg's object store layout only** (per-file entropy):
   ```
   s3://bucket/warehouse/newdb/newtable/data/0011/0100/1011/11101010/file1.parquet
   s3://bucket/warehouse/newdb/newtable/data/0011/0001/0001/00000001/file2.parquet
   ```
   
   **With Polaris's object storage prefix only** (per-table entropy):
   ```
   s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/file1.parquet
   s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/file2.parquet
   ```
   
   **With both features combined**:
   ```
   s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/0011/0100/1011/11101010/file1.parquet
   s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/0011/0001/0001/00000001/file2.parquet
   ```
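
The four layouts above can be reproduced with a small sketch. This is an illustration only, not the code used by Polaris or Iceberg: `entropy_prefix` and `data_file_path` are hypothetical helpers, and SHA-256 is a stand-in for whatever hash each project actually applies. The only thing it is meant to show is *what* gets hashed: the table identifier (once per table) versus the file name (once per file).

```python
import hashlib


def entropy_prefix(key: str) -> str:
    """Return a 20-bit binary prefix shaped like 0011/0100/1011/11101010.

    Assumption: SHA-256 is used here purely as a stand-in hash.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 24 bits of the digest and drop the last 4 to keep 20 bits.
    bits = format(int.from_bytes(digest[:3], "big") >> 4, "020b")
    return "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])


def data_file_path(warehouse, namespace, table, file_name,
                   table_prefix=False, file_prefix=False):
    """Build a data file path under the chosen layout (hypothetical helper)."""
    parts = [warehouse]
    if table_prefix:
        # Per-table entropy: hash the table identifier, shared by all its files.
        parts.append(entropy_prefix(f"{namespace}.{table}"))
    parts += [namespace, table, "data"]
    if file_prefix:
        # Per-file entropy: hash something unique to each file.
        parts.append(entropy_prefix(file_name))
    parts.append(file_name)
    return "/".join(parts)


if __name__ == "__main__":
    for table_prefix, file_prefix in [(False, False), (False, True),
                                      (True, False), (True, True)]:
        for name in ("file1.parquet", "file2.parquet"):
            print(data_file_path("s3://bucket/warehouse", "newdb", "newtable",
                                 name, table_prefix, file_prefix))
```

With only `table_prefix=True`, both files land in the same directory, which is why the per-table prefix alone does not spread a single table's files across the key space.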
   
   ### Describe the solution you'd like
   
   
   The documentation should clarify:
   
   1. Purpose: Polaris's layout distributes *different tables* across the key space, preventing
      hotspots when multiple tables in the same namespace are accessed concurrently. It does *not*
      distribute files within a single table.
   2. Limitations: since all files in a table share the same prefix, this layout alone does not
      prevent hotspots when a single table receives heavy write traffic.
   3. Complementary usage: as stated in the original commit, "The two features can and should be
      combined to achieve the best distribution of data files throughout the key space."
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
