jackye1995 commented on a change in pull request #2963: URL: https://github.com/apache/iceberg/pull/2963#discussion_r688691483
##########
File path: site/docs/aws.md
##########
@@ -339,13 +339,16 @@ For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com
 ### Object Store File Layout

-S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
-This means data stored in a traditional Hive storage layout has bad read and write throughput since data files of the same partition are placed under the same prefix.
-Iceberg by default uses the Hive storage layout, but can be switched to use a different `ObjectStoreLocationProvider`.
-In this mode, a hash string is added to the beginning of each file path, so that files are equally distributed across all prefixes in an S3 bucket.
+S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Data stored on S3 in a traditional Hive storage layout can face S3 API request throttling as objects are stored under the same filepath prefix.
+
+Iceberg by default uses the Hive storage layout, but can be switched to use the `ObjectStoreLocationProvider`.
+With `ObjectStoreLocationProvider`, a determenistic hash is generated for each stored file, with the hash appended
+directly after the `write.object-storage.path`. This ensures files written to s3 are equally distributed across multiple prefixes in the S3 bucket.
 This results in minimized throttling and maximized throughput for S3-related IO operations.

-Here is an example Spark SQL command to create a table with this feature enabled:
+To use the `ObjectStorageLocationProvider` you just need to add `'write.object-storage.enabled'=true` in the table's `OPTIONS`.

Review comment:
In the table's properties (because `OPTIONS` and `TBLPROPERTIES` are used interchangeably).
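For readers of this thread, the path layout that the changed paragraph describes can be illustrated with a small sketch. This is not Iceberg's implementation: the function name `hashed_location`, the bucket path, and the use of SHA-1 are assumptions made only for illustration. It shows a deterministic hash of the file path being placed directly after a base path (standing in for `write.object-storage.path`), so that files from the same partition land under different prefixes.

```python
import hashlib
from collections import Counter


def hashed_location(base_path: str, file_path: str, hash_len: int = 8) -> str:
    """Return base_path/<hash>/<file_path>.

    The hash is derived only from file_path, so the same file always maps
    to the same location. SHA-1 is a stand-in chosen for this sketch, not
    the hash function Iceberg uses.
    """
    digest = hashlib.sha1(file_path.encode("utf-8")).hexdigest()[:hash_len]
    return f"{base_path.rstrip('/')}/{digest}/{file_path}"


if __name__ == "__main__":
    # Hypothetical value standing in for write.object-storage.path.
    base = "s3://my-bucket/my-table-data"

    # All files belong to one partition, so a Hive-style layout would put
    # them under a single shared prefix.
    files = [f"category=orders/part-{i:05d}.parquet" for i in range(1000)]
    locations = [hashed_location(base, f) for f in files]

    # Count how many files start with each leading hex character of the hash.
    first_chars = Counter(loc[len(base) + 1] for loc in locations)
    print(locations[0])
    print(sorted(first_chars.items()))
```

Running the script prints one example location and a per-prefix count, which should be spread across the 16 possible leading hex characters instead of concentrated under one partition prefix.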
