cobookman commented on a change in pull request #2963:
URL: https://github.com/apache/iceberg/pull/2963#discussion_r688968140
##########
File path: site/docs/aws.md
##########
@@ -358,6 +361,16 @@ OPTIONS (
PARTITIONED BY (category);
```
+We can then insert a single row into this new table:
+```SQL
+INSERT INTO my_catalog.my_ns.my_table VALUES (1, "Pizza", "orders");
+```
+
+This will cause an object like the following to be written to S3:
+```
+s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
Review comment:
added
##########
File path: site/docs/aws.md
##########
@@ -339,13 +339,16 @@ For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com
### Object Store File Layout
-S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
-This means data stored in a traditional Hive storage layout has bad read and write throughput since data files of the same partition are placed under the same prefix.
-Iceberg by default uses the Hive storage layout, but can be switched to use a different `ObjectStoreLocationProvider`.
-In this mode, a hash string is added to the beginning of each file path, so that files are equally distributed across all prefixes in an S3 bucket.
+S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Data stored on S3 in a traditional Hive storage layout can face S3 API request throttling, since objects of the same partition are stored under the same file path prefix.
+
+Iceberg uses the Hive storage layout by default, but it can be switched to the `ObjectStoreLocationProvider`.
+With the `ObjectStoreLocationProvider`, a deterministic hash is generated for each stored file and appended directly after the `write.object-storage.path`. This ensures that files written to S3 are equally distributed across multiple prefixes in the S3 bucket.
This results in minimized throttling and maximized throughput for S3-related IO operations.
-Here is an example Spark SQL command to create a table with this feature enabled:
+To use the `ObjectStoreLocationProvider`, add `'write.object-storage.enabled'=true` to the table's `OPTIONS`.
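For reference, putting the two properties discussed above together, a table with this layout enabled might be created roughly as follows. This is a minimal sketch based on the option names in the diff; the catalog, namespace, table, columns, and bucket path are placeholders taken from the surrounding examples, not a definitive statement:

```sql
CREATE TABLE my_catalog.my_ns.my_table (
    id bigint,
    data string,
    category string)
USING iceberg
OPTIONS (
    -- enables ObjectStoreLocationProvider (hash-prefixed file paths)
    'write.object-storage.enabled'=true,
    -- hypothetical bucket path; the hash is appended after this prefix
    'write.object-storage.path'='s3://my-table-data-bucket')
PARTITIONED BY (category);
```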
Review comment:
fixed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]