VitoMakarevich opened a new pull request, #10460:
URL: https://github.com/apache/hudi/pull/10460

   ### Change Logs
   
   Currently, the sync works as follows (simplified to explain this particular 
problem):
   1. Read the partitions that were upserted (create and update are 
indistinguishable) or deleted on the commit timeline since the previous sync.
   2. Read the existing partitions from Glue.
   3. Iterate over the changed partitions to decide which AWS call each one 
needs (create/update/delete).
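
   The three steps above can be sketched roughly as follows. This is an 
illustration only, not the Hudi code: `plan_partition_calls` and 
`PartitionPlan` are hypothetical names, and partitions are reduced to plain 
path strings. It shows why the Glue listing in step 2 is needed: the timeline 
alone cannot tell a create from an update.

```python
from typing import Iterable, List, NamedTuple


class PartitionPlan(NamedTuple):
    """Hypothetical container for the calls the sync would issue."""
    to_create: List[str]
    to_update: List[str]
    to_delete: List[str]


def plan_partition_calls(
    written: Iterable[str],   # step 1: partitions upserted since the last sync
    dropped: Iterable[str],   # step 1: partitions deleted since the last sync
    existing: Iterable[str],  # step 2: partitions currently registered in Glue
) -> PartitionPlan:
    written, dropped, existing = set(written), set(dropped), set(existing)
    return PartitionPlan(
        # Written but unknown to the catalog: needs a create call.
        to_create=sorted(written - existing),
        # Written and already in the catalog: candidate for an update call.
        to_update=sorted(written & existing),
        # Dropped and still in the catalog: needs a delete call.
        to_delete=sorted(dropped & existing),
    )
```

   A real implementation would likely also skip updates for partitions whose 
catalog entry has not actually changed; that check is omitted here.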
   
   Two clients can do this: `AWSGlueCatalogSyncClient` and 
`HoodieHiveSyncClient`. In the Spark-EMR case, the second relies on AWS's 
Hive-Glue interface implementation. The code of the current version is not 
public, but the code of an earlier version is; the interesting part is 
[this](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/d885a996137b19df6a128cb950ebc43711374e83/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/metastore/DefaultAWSGlueMetastore.java#L301).
   In short, AWS Glue has a 
[segments](https://docs.aws.amazon.com/glue/latest/webapi/API_Segment.html) 
mechanism that allows long lists to be queried in parallel. I added code 
similar to what exists in the AWS Hive-Glue implementation.
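
   As a rough illustration of the segment mechanism (not the code in this PR, 
which is Java), the sketch below lists partitions with one worker per segment 
and follows `NextToken` pagination inside each segment. The function names are 
hypothetical. The `glue` argument is assumed to be any object exposing a 
boto3-style `get_partitions` call, and the `Segment` fields (`SegmentNumber`, 
zero-based, and `TotalSegments`, 1 to 10) follow the Glue API page linked 
above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def list_segment(glue: Any, database: str, table: str,
                 segment_number: int, total_segments: int) -> List[Dict]:
    """Read every page of one segment (hypothetical helper)."""
    partitions: List[Dict] = []
    next_token = None
    while True:
        kwargs = {
            "DatabaseName": database,
            "TableName": table,
            "Segment": {
                "SegmentNumber": segment_number,  # zero-based index
                "TotalSegments": total_segments,
            },
        }
        if next_token:
            kwargs["NextToken"] = next_token
        response = glue.get_partitions(**kwargs)
        partitions.extend(response.get("Partitions", []))
        next_token = response.get("NextToken")
        if not next_token:
            return partitions


def list_all_partitions(glue: Any, database: str, table: str,
                        total_segments: int = 5) -> List[Dict]:
    """List all partitions, one thread per segment; 1 means sequential."""
    if not 1 <= total_segments <= 10:
        raise ValueError("Glue allows between 1 and 10 segments")
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(list_segment, glue, database, table, i, total_segments)
            for i in range(total_segments)
        ]
        result: List[Dict] = []
        for future in futures:
            result.extend(future.result())  # re-raises any worker error
    return result
```

   With boto3 this could be called as 
`list_all_partitions(boto3.client("glue"), "my_db", "my_table")`, where the 
database and table names are placeholders. Setting `total_segments=1` gives 
the previous sequential behavior.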
   
   Our real use case: we had been using `HiveSyncTool` (table with ~200k 
partitions), and it was slow when many partitions changed or on a resync. We 
therefore tried `AWSGlueCatalogSyncClient`, which gave about a 2x boost in 
write speed, but we then noticed a significant degradation in listing speed 
(roughly 40 sec with `HiveSyncTool` vs 210 sec with 
`AWSGlueCatalogSyncClient` on about the same number of partitions, 200k).
   This improvement should bring the same listing speed to 
`AWSGlueCatalogSyncClient`.
   
   ### Impact
   
   A new config is added, with its default value taken from the AWS Glue-Hive 
implementation (parallelism of 5). Listing was sequential before, so I don't 
expect the new default to cause failures, but we may set the parallelism to 1 
to stay backward-compatible if there is any reason to.
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   **The website needs to be adjusted; please let me know if you are OK with 
the changes and I'll proceed.**
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
     ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

