cobookman commented on a change in pull request #2963:
URL: https://github.com/apache/iceberg/pull/2963#discussion_r688968140
##########
File path: site/docs/aws.md
##########
@@ -358,6 +361,16 @@ OPTIONS (
PARTITIONED BY (category);
```
+We can then insert a single row into this new table:
+```SQL
+INSERT INTO my_catalog.my_ns.my_table VALUES (1, "Pizza", "orders");
+```
+
+This will cause an object like the following to be written to S3:
+```
+s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
Review comment:
added
##########
File path: site/docs/aws.md
##########
@@ -339,13 +339,16 @@ For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com
### Object Store File Layout
-S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
-This means data stored in a traditional Hive storage layout has bad read and write throughput since data files of the same partition are placed under the same prefix.
-Iceberg by default uses the Hive storage layout, but can be switched to use a different `ObjectStoreLocationProvider`.
-In this mode, a hash string is added to the beginning of each file path, so that files are equally distributed across all prefixes in an S3 bucket.
+S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Data stored on S3 in a traditional Hive storage layout can face S3 API request throttling, since objects of the same partition are stored under the same file path prefix.
+
+Iceberg uses the Hive storage layout by default, but it can be switched to the `ObjectStoreLocationProvider`.
+With the `ObjectStoreLocationProvider`, a deterministic hash is generated for each stored file and appended directly after the `write.object-storage.path`. This ensures that files written to S3 are equally distributed across multiple prefixes in the S3 bucket.
This results in minimized throttling and maximized throughput for S3-related IO operations.
-Here is an example Spark SQL command to create a table with this feature enabled:
+To use the `ObjectStoreLocationProvider`, add `'write.object-storage.enabled'=true` to the table's `OPTIONS`.
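For reference, putting the two properties discussed above together, a table with this layout enabled might be created roughly as follows. This is a minimal sketch based on the option names in the diff; the catalog, namespace, table, columns, and bucket path are placeholders taken from the surrounding examples, not a definitive statement:

```sql
CREATE TABLE my_catalog.my_ns.my_table (
    id bigint,
    data string,
    category string)
USING iceberg
OPTIONS (
    -- enables ObjectStoreLocationProvider (hash-prefixed file paths)
    'write.object-storage.enabled'=true,
    -- hypothetical bucket path; the hash is appended after this prefix
    'write.object-storage.path'='s3://my-table-data-bucket')
PARTITIONED BY (category);
```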
Review comment:
fixed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]