VitoMakarevich opened a new pull request, #10460: URL: https://github.com/apache/hudi/pull/10460
### Change Logs

Currently, the sync works as follows (simplified to explain this particular problem):

1. Read the partitions that were upserted (create and update are indistinguishable) or deleted on the commit timeline since the previous sync.
2. Read the existing partitions from Glue.
3. Iterate over the changed partitions to decide which AWS call each one needs (create/update/delete).

There are two clients for this: `AWSGlueCatalogSyncClient` and `HoodieHiveSyncClient`. In the Spark-EMR case, the second relies on AWS's Hive-Glue interface implementation. The code of the current version is not public, but the code of an earlier version is; the interesting part is [this](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/d885a996137b19df6a128cb950ebc43711374e83/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/metastore/DefaultAWSGlueMetastore.java#L301).

In short, AWS Glue has a mechanism of [segments](https://docs.aws.amazon.com/glue/latest/webapi/API_Segment.html) that allows long lists to be queried in parallel. I added code similar to what exists in the AWS Hive-Glue implementation.

Our real use case: we had been using `HiveSyncTool` on a table with ~200k partitions, and it was slow when many partitions changed or when a resync was needed. We therefore tried `AWSGlueCatalogSyncClient`, which gave about a 2x boost in write speed, but we then noticed a significant degradation in listing speed (roughly 40 sec vs 210 sec on about the same number of partitions, 200k). This improvement should bring the same listing speed to `AWSGlueCatalogSyncClient`.

### Impact

A new config is added, with its default taken from AWS Glue-Hive (a parallelism of 5). Listing was sequential before, so I don't expect this to cause failures, but the parallelism can be set to 1 to stay backward-compatible if needed for any reason.
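For reviewers, the segmented listing described under Change Logs can be sketched roughly as below. This is a minimal illustration and not the code in this PR: `SegmentedListing`, `SegmentFetcher`, and `listAll` are hypothetical names, and the fetcher stands in for a paginated Glue `GetPartitions` call that passes a `Segment` (segment number plus total segments) and follows `NextToken` until the segment is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SegmentedListing {

    /**
     * Hypothetical stand-in for "list every partition in segment i of n".
     * A real implementation would call Glue and loop over pagination tokens.
     */
    public interface SegmentFetcher {
        List<String> fetch(int segmentNumber, int totalSegments) throws Exception;
    }

    /**
     * Lists all partitions by fetching each segment on its own thread and
     * concatenating the results. totalSegments == 1 degenerates to a single
     * sequential listing, i.e. the old behaviour.
     */
    public static List<String> listAll(int totalSegments, SegmentFetcher fetcher)
            throws InterruptedException, ExecutionException {
        if (totalSegments < 1) {
            throw new IllegalArgumentException("totalSegments must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(totalSegments);
        try {
            // Submit one task per segment; segment numbers are zero-based.
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < totalSegments; i++) {
                final int segment = i;
                Callable<List<String>> task = () -> fetcher.fetch(segment, totalSegments);
                futures.add(pool.submit(task));
            }
            // Collect results in segment order; get() rethrows any fetch failure.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake catalog for demonstration: partition i belongs to segment i % total.
        List<String> catalog = new ArrayList<>();
        for (int i = 0; i < 23; i++) {
            catalog.add("dt=2024-01-" + i);
        }
        List<String> listed = listAll(5, (segment, total) -> {
            List<String> out = new ArrayList<>();
            for (int i = segment; i < catalog.size(); i += total) {
                out.add(catalog.get(i));
            }
            return out;
        });
        System.out.println("Listed " + listed.size() + " partitions");
    }
}
```

The segments are disjoint, so the merged result needs no de-duplication; the order of partitions across segments is not meaningful and callers should not rely on it.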
### Risk level (write none, low medium or high below)

Low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

**The website needs to be adjusted, but please let me know if you are OK with the changes first; I'll proceed then.**

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org