[PR] Add parallel listing of existing partitions [hudi]

via GitHub Mon, 08 Jan 2024 09:17:23 -0800


VitoMakarevich opened a new pull request, #10460:
URL: https://github.com/apache/hudi/pull/10460

### Change Logs

Currently, the sync works as follows (simplified to explain this particular
problem):
1. Read the partitions that were upserted (create and update are
indistinguishable) or deleted from the commit timeline since the previous sync.
2. Read the existing partitions from Glue.
3. Iterate over the changed partitions to determine which AWS call
(create/update/delete) each one needs.
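
As a rough illustration of step 3, the sketch below classifies changed partitions against the set already present in Glue. It is a simplification under stated assumptions, not the actual Hudi code: `PartitionSyncPlanner`, `Action`, and `plan` are hypothetical names, and partitions are represented as plain strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Hudi.
final class PartitionSyncPlanner {
    enum Action { CREATE, UPDATE, DELETE }

    /**
     * Decides which call each changed partition needs.
     * The timeline cannot tell create from update, so the set of
     * partitions already present in the catalog is used to distinguish them.
     */
    static Map<String, Action> plan(List<String> upserted,
                                    List<String> deleted,
                                    Set<String> existingInCatalog) {
        Map<String, Action> calls = new LinkedHashMap<>();
        for (String partition : upserted) {
            // Already in the catalog -> update; otherwise -> create.
            calls.put(partition,
                existingInCatalog.contains(partition) ? Action.UPDATE : Action.CREATE);
        }
        for (String partition : deleted) {
            // Only partitions the catalog knows about need a delete call.
            if (existingInCatalog.contains(partition)) {
                calls.put(partition, Action.DELETE);
            }
        }
        return calls;
    }
}
```

The lookup against `existingInCatalog` is why step 2 has to list every existing partition, and why the listing time matters for large tables.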

To facilitate this, there are two clients: `AWSGlueCatalogSyncClient` and
`HoodieHiveSyncClient`. In the Spark-EMR case, the latter relies on AWS's
Hive-Glue interface implementation. The code of the current version is not
public, but that of an earlier version is; the interesting part is
[this](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/d885a996137b19df6a128cb950ebc43711374e83/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/metastore/DefaultAWSGlueMetastore.java#L301).
In short, AWS Glue has a mechanism called
[segments](https://docs.aws.amazon.com/glue/latest/webapi/API_Segment.html)
that allows long lists to be queried in parallel. I added code similar to what
exists in the AWS Hive-Glue implementation.
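
The segment idea can be sketched without the AWS SDK: the listing is split into `totalSegments` non-overlapping slices, one task fetches each slice, and the results are merged. The code below is only a sketch under assumptions. `SegmentedPartitionLister` and `SegmentFetcher` are hypothetical names; in a real client the fetcher would issue a Glue `GetPartitions` request carrying a `Segment` (`SegmentNumber`, `TotalSegments`) and follow pagination tokens, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, for illustration only; not the code added in this PR.
final class SegmentedPartitionLister {

    /** Stand-in for one segmented listing call (segment numbers are 0-based). */
    @FunctionalInterface
    interface SegmentFetcher {
        List<String> fetch(int segmentNumber, int totalSegments);
    }

    /** Fetches every segment on its own thread and concatenates the results. */
    static List<String> listAll(SegmentFetcher fetcher, int totalSegments) {
        if (totalSegments < 1) {
            throw new IllegalArgumentException("totalSegments must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(totalSegments);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < totalSegments; i++) {
                final int segment = i;
                Callable<List<String>> task = () -> fetcher.fetch(segment, totalSegments);
                futures.add(pool.submit(task));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get()); // blocks until that segment is done
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while listing partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Listing a segment failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With `totalSegments` set to 1 this degenerates to a single sequential listing, which matches the previous behavior described in the Impact section.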

The real use case for us: we had been using `HiveSyncTool` (table with ~200k
partitions), and it was slow when many partitions changed or in case of a
resync. So we decided to try `AWSGlueCatalogSyncClient`, which gave about a 2x
boost in write speed, but we then noticed a significant degradation in listing
speed (roughly 40 sec with `HiveSyncTool` vs 210 sec with
`AWSGlueCatalogSyncClient` on about the same number of partitions, 200k).
This improvement should bring `AWSGlueCatalogSyncClient` to the same listing
speed.

### Impact

A new config is added, with its default value taken from the AWS Glue-Hive
implementation (parallelism of 5). Listing was sequential before, so I don't
expect this to cause failures, but we could set the default parallelism to 1
to stay backward-compatible if needed.
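
A minimal sketch of how such a setting could be resolved, assuming a default of 5 and treating 1 as the old sequential behavior. The property key below is hypothetical (the real config name is not stated here), and the upper bound of 10 reflects the documented limit on Glue's `TotalSegments`.

```java
import java.util.Properties;

// Hypothetical helper, for illustration only; the key below is NOT the real config name.
final class ListingParallelism {
    static final String KEY = "example.glue.partition.listing.parallelism"; // hypothetical key
    static final int DEFAULT = 5;       // default mentioned above
    static final int MAX_SEGMENTS = 10; // Glue allows TotalSegments in the range 1..10

    /** Returns the configured parallelism, or the default when the key is unset. */
    static int resolve(Properties props) {
        String raw = props.getProperty(KEY);
        int value = (raw == null) ? DEFAULT : Integer.parseInt(raw.trim());
        if (value < 1 || value > MAX_SEGMENTS) {
            throw new IllegalArgumentException(
                KEY + " must be between 1 and " + MAX_SEGMENTS + ", got " + value);
        }
        return value;
    }
}
```

Setting the value to 1 yields a single segment, that is, the sequential listing used before this change.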

### Risk level (write none, low medium or high below)

Low

### Documentation Update

_Describe any necessary documentation update if there is any new feature,
config, or user-facing change_

**The website needs to be adjusted, but please let me know whether you are OK
with the changes first; I'll proceed then.**
- _The config description must be updated if new configs are added or the
default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website.
Please create a Jira ticket, attach the
ticket number here and follow the
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to
make
changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
