adutra commented on issue #3621:
URL: https://github.com/apache/polaris/issues/3621#issuecomment-3834628470

   After more investigation, I think there is a much better option to 
improve entropy and reduce hotspots across tables: set `write.data.path` 
on every table to a common, short prefix, e.g. the warehouse location.
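
   As a rough sketch of what that would mean in terms of table properties 
(the helper name and the warehouse URI are hypothetical; only the two 
property keys are Iceberg's own), something like this could be applied to 
each table:

```python
def object_store_properties(warehouse: str) -> dict[str, str]:
    """Build table properties that point every table at one shared data prefix.

    `warehouse` is a hypothetical location such as "s3://bucket/warehouse".
    """
    return {
        # Turns on Iceberg's Object Store layout for data files.
        "write.object-storage.enabled": "true",
        # Common, short prefix shared by all tables (trailing slash dropped).
        "write.data.path": warehouse.rstrip("/"),
    }


props = object_store_properties("s3://bucket/warehouse/")
```

   How these properties get applied (at table creation, through a catalog 
default, or by altering existing tables) is left open here.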
   
   Indeed, Iceberg's Object Store layout behaves differently when the table 
has `write.data.path` defined.
   
   Instead of creating file paths like this:
   
   ```
   <table-base>/data/<file-hash>/[<partition>/]file1.parquet
   ```
   
   The layout creates paths like this:
   
   ```
   <write.data.path>/<file-hash>/<ns>/<table>/[<partition>/]file1.parquet
   ```
   
   This setup effectively scatters table files and creates high entropy:
   
   ```
   s3://bucket/warehouse/1111/1111/0100/01010000/ns1/table1/data/file1.parquet
   s3://bucket/warehouse/0011/0110/0101/10101100/ns1/table1/data/file2.parquet
   s3://bucket/warehouse/1110/0011/0100/11010000/ns2/table2/data/file1.parquet
   s3://bucket/warehouse/0111/0100/0101/01101101/ns2/table2/data/file2.parquet
   ```
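
   To make the two path templates above concrete, here is a minimal sketch 
that builds both shapes. It is not Iceberg's implementation: SHA-256 is a 
stand-in for whatever hash Iceberg applies, the 4/4/4/8 grouping simply 
mirrors the example paths, and all function names are hypothetical.

```python
import hashlib


def entropy_dirs(key: str) -> str:
    """Render 20 hash bits of `key` as 4/4/4/8 binary directories.

    Assumption: SHA-256 is only a stand-in hash for illustration.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 24 bits and drop the lowest 4 to keep exactly 20 bits.
    bits = format(int.from_bytes(digest[:3], "big") >> 4, "020b")
    return "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])


def default_layout(table_base: str, file_name: str, partition: str = "") -> str:
    """<table-base>/data/<file-hash>/[<partition>/]<file>"""
    parts = [table_base, "data", entropy_dirs(file_name)]
    if partition:
        parts.append(partition)
    parts.append(file_name)
    return "/".join(parts)


def data_path_layout(data_path: str, ns: str, table: str,
                     file_name: str, partition: str = "") -> str:
    """<write.data.path>/<file-hash>/<ns>/<table>/[<partition>/]<file>"""
    parts = [data_path, entropy_dirs(file_name), ns, table]
    if partition:
        parts.append(partition)
    parts.append(file_name)
    return "/".join(parts)


with_prefix = data_path_layout("s3://bucket/warehouse", "ns1", "table1", "file1.parquet")
without_prefix = default_layout("s3://bucket/warehouse/ns1/table1", "file1.parquet")
```

   In the second shape the hash sits directly under the shared prefix, ahead 
of the namespace and table name, which is why files from every table end up 
spread across the same key space.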
   
   By contrast, the Polaris layout achieves lower entropy, since the hash is 
computed per table rather than per file:
   
   ```
   s3://bucket/warehouse/1111/1111/0100/01010000/ns1/table1/data/file1.parquet
   s3://bucket/warehouse/1111/1111/0100/01010000/ns1/table1/data/file2.parquet
   s3://bucket/warehouse/1001/0011/1101/11010111/ns2/table2/data/file1.parquet
   s3://bucket/warehouse/1001/0011/1101/11010111/ns2/table2/data/file2.parquet
   ```
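
   The difference between per-file and per-table hashing can be illustrated 
by counting distinct hash prefixes. This is only a sketch under stated 
assumptions: SHA-256 stands in for the real hash functions, the file names 
are made up, and `entropy_dirs` is a hypothetical helper.

```python
import hashlib


def entropy_dirs(key: str) -> str:
    """Render 20 hash bits of `key` as 4/4/4/8 binary directories (stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bits = format(int.from_bytes(digest[:3], "big") >> 4, "020b")
    return "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])


# Two tables with 100 made-up data files each.
tables = [("ns1", "table1"), ("ns2", "table2")]
files = [(ns, tbl, f"{tbl}-{i:05d}.parquet") for ns, tbl in tables for i in range(100)]

# Per-file hashing: every file name contributes its own prefix.
per_file_prefixes = {entropy_dirs(name) for _, _, name in files}

# Per-table hashing: all files of a table share one prefix.
per_table_prefixes = {entropy_dirs(f"{ns}/{tbl}") for ns, tbl, _ in files}

print(len(per_file_prefixes), "distinct prefixes with per-file hashing")
print(len(per_table_prefixes), "distinct prefixes with per-table hashing")
```

   With per-table hashing the number of distinct prefixes can never exceed 
the number of tables, whereas per-file hashing grows with the number of 
files written.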


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
