[GitHub] [iceberg] jackye1995 commented on a change in pull request #4216: Docs: Add Glue optimistic locking

GitBox Wed, 23 Feb 2022 21:37:48 -0800


jackye1995 commented on a change in pull request #4216:
URL: https://github.com/apache/iceberg/pull/4216#discussion_r813559413




##########
File path: docs/versioned/integrations/aws.md
##########
@@ -188,12 +188,30 @@ However, if you are streaming data to Iceberg, this will 
easily create a lot of
 Therefore, it is recommended to turn off the archive feature in Glue by 
setting `glue.skip-archive` to `true`.
 For more details, please read [Glue 
Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the 
[UpdateTable 
API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
 
+#### Optimistic Locking
+
+Glue supports optimistic locking over concurrent updates to a table.
+With optimistic locking, each table has a version id. 
+If you retrieve the table metadata, Iceberg records the version id of that 
table. 
+You can update the table, but only if the version id on the server side has 
not changed. 
+If there is a version mismatch, it means that someone else has modified the 
table before you did. 
+The update attempt fails, because you have a stale version of the table. 
+If this happens, Iceberg simply tries again by retrieving the table metadata 
and then tries to update it. 
+Optimistic locking prevents you from accidentally overwriting changes that 
were made by others. 
+It also prevents others from accidentally overwriting your changes.
+
+To use Glue's Optimistic Locking, you can start the Spark SQL shell with:
+```
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:{{% icebergVersion %}},software.amazon.awssdk:bundle:2.17.131 \

Review comment:
       Can you update all AWS SDK versions in this doc to 2.17.131?

##########
File path: docs/versioned/integrations/aws.md
##########
@@ -188,12 +188,30 @@ However, if you are streaming data to Iceberg, this will 
easily create a lot of
 Therefore, it is recommended to turn off the archive feature in Glue by 
setting `glue.skip-archive` to `true`.
 For more details, please read [Glue 
Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the 
[UpdateTable 
API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
 
+#### Optimistic Locking
+
+Glue supports optimistic locking over concurrent updates to a table.
+With optimistic locking, each table has a version id. 
+If you retrieve the table metadata, Iceberg records the version id of that 
table. 
+You can update the table, but only if the version id on the server side has 
not changed. 
+If there is a version mismatch, it means that someone else has modified the 
table before you did. 
+The update attempt fails, because you have a stale version of the table. 
+If this happens, Iceberg simply tries again by retrieving the table metadata 
and then tries to update it. 
+Optimistic locking prevents you from accidentally overwriting changes that 
were made by others. 
+It also prevents others from accidentally overwriting your changes.
+
+To use Glue's Optimistic Locking, you can start the Spark SQL shell with:

Review comment:
       I am wondering whether we really need an example. We should just say that 
this is the default behavior, and remove any reference in the example script 
snippets to the use of `org.apache.iceberg.glue.DynamoLockManager`.

##########
File path: docs/versioned/integrations/aws.md
##########
@@ -188,12 +188,30 @@ However, if you are streaming data to Iceberg, this will 
easily create a lot of
 Therefore, it is recommended to turn off the archive feature in Glue by 
setting `glue.skip-archive` to `true`.
 For more details, please read [Glue 
Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the 
[UpdateTable 
API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
 
+#### Optimistic Locking
+
+Glue supports optimistic locking over concurrent updates to a table.
+With optimistic locking, each table has a version id. 
+If you retrieve the table metadata, Iceberg records the version id of that 
table. 
+You can update the table, but only if the version id on the server side has 
not changed. 
+If there is a version mismatch, it means that someone else has modified the 
table before you did. 
+The update attempt fails, because you have a stale version of the table. 
+If this happens, Iceberg simply tries again by retrieving the table metadata 
and then tries to update it. 
+Optimistic locking prevents you from accidentally overwriting changes that 
were made by others. 
+It also prevents others from accidentally overwriting your changes.
+
+To use Glue's Optimistic Locking, you can start the Spark SQL shell with:
+```
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:{{% icebergVersion %}},software.amazon.awssdk:bundle:2.17.131 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
+```
+
 #### DynamoDB for Commit Locking
 
-Glue does not have a strong guarantee over concurrent updates to a table. 
-Although it throws `ConcurrentModificationException` when detecting two 
processes updating a table at the same time,
-there is no guarantee that one update would not clobber the other update.
-Therefore, [DynamoDB](https://aws.amazon.com/dynamodb) can be used for Glue, 
so that for every commit, 
+[DynamoDB](https://aws.amazon.com/dynamodb) can be used for Glue, so that for 
every commit, 

Review comment:
       I think we should not even say "can be used". I don't think people would 
really use an additional external lock table if all they need is 
compare-and-swap at commit time. This should just be a warning section under 
**Optimistic Locking**, saying that before SDK version 2.17.131 this is not 
possible, so people have to use this lock table. We can then just link to the 
lock manager section (see the comment below).
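
For readers following the thread, the compare-and-swap behavior referred to here (a version id checked at commit time, with a retry on mismatch) can be illustrated roughly as below. This is only a toy in-memory sketch, not Iceberg's or Glue's actual implementation: `ToyCatalog`, `VersionMismatch`, and `commit_with_retry` are hypothetical names invented for illustration, and the in-process lock merely stands in for the server-side atomic version check.

```python
import threading


class VersionMismatch(Exception):
    """Raised when the caller's version id is stale (hypothetical name)."""


class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog with one version id per table."""

    def __init__(self):
        self._lock = threading.Lock()  # makes the check-and-write step atomic
        self._version = 0
        self._metadata = {}

    def get_table(self):
        """Return the current version id together with a copy of the metadata."""
        with self._lock:
            return self._version, dict(self._metadata)

    def update_table(self, expected_version, new_metadata):
        """Write only if the stored version id still equals expected_version."""
        with self._lock:
            if expected_version != self._version:
                raise VersionMismatch(
                    f"expected {expected_version}, found {self._version}"
                )
            self._metadata = dict(new_metadata)
            self._version += 1
            return self._version


def commit_with_retry(catalog, change, max_attempts=5):
    """Read, apply `change`, and conditionally write; retry on a stale version."""
    for _ in range(max_attempts):
        version, metadata = catalog.get_table()
        try:
            return catalog.update_table(version, change(metadata))
        except VersionMismatch:
            continue  # someone else committed first: re-read and try again
    raise RuntimeError("gave up after repeated version mismatches")
```

With a conditional update like this, a writer holding a stale version id can never overwrite a newer commit, which is why no separate lock table is needed for this case.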

##########
File path: docs/versioned/integrations/aws.md
##########
@@ -188,12 +188,30 @@ However, if you are streaming data to Iceberg, this will 
easily create a lot of
 Therefore, it is recommended to turn off the archive feature in Glue by 
setting `glue.skip-archive` to `true`.
 For more details, please read [Glue 
Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the 
[UpdateTable 
API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
 
+#### Optimistic Locking
+
+Glue supports optimistic locking over concurrent updates to a table.
+With optimistic locking, each table has a version id. 
+If you retrieve the table metadata, Iceberg records the version id of that 
table. 
+You can update the table, but only if the version id on the server side has 
not changed. 
+If there is a version mismatch, it means that someone else has modified the 
table before you did. 
+The update attempt fails, because you have a stale version of the table. 
+If this happens, Iceberg simply tries again by retrieving the table metadata 
and then tries to update it. 
+Optimistic locking prevents you from accidentally overwriting changes that 
were made by others. 
+It also prevents others from accidentally overwriting your changes.
+
+To use Glue's Optimistic Locking, you can start the Spark SQL shell with:
+```
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:{{% icebergVersion %}},software.amazon.awssdk:bundle:2.17.131 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
+```
+
 #### DynamoDB for Commit Locking
 
-Glue does not have a strong guarantee over concurrent updates to a table. 
-Although it throws `ConcurrentModificationException` when detecting two 
processes updating a table at the same time,
-there is no guarantee that one update would not clobber the other update.
-Therefore, [DynamoDB](https://aws.amazon.com/dynamodb) can be used for Glue, 
so that for every commit, 
+[DynamoDB](https://aws.amazon.com/dynamodb) can be used for Glue, so that for 
every commit, 
 `GlueCatalog` first obtains a lock using a helper DynamoDB table and then try 
to safely modify the Glue table.
 
 This feature requires the following lock related catalog properties:

Review comment:
       This section can be moved into a new, separate section that covers the 
public `DynamoDbLockManager`, which can be used by `HadoopCatalog` or 
`HadoopTables`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


