[ https://issues.apache.org/jira/browse/HUDI-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-7466:
---------------------------------
    Labels: pull-request-available  (was: )

> AWS Glue sync
> -------------
>
>                 Key: HUDI-7466
>                 URL: https://issues.apache.org/jira/browse/HUDI-7466
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Vitali Makarevich
>            Priority: Major
>              Labels: pull-request-available
>
> Ticket to track [https://github.com/apache/hudi/pull/10460]
> Currently, AWS Glue sync works and provides two interfaces: {{HoodieHiveSyncClient}}, which goes through Hive and then the Glue -> Hive implementation (hidden by AWS), and {{AWSGlueCatalogSyncClient}}.
> Both of them have limitations. Although the Hive path has improved somewhat, pushdown still fails at large scale, and the fallback may not work for certain partitioning schemes.
> Syncing to Glue through {{HoodieHiveSyncClient}} has a set of limitations:
> # Create/update is not parallelized under the hood, so large partition sets sync very slowly: empirically about 40 partitions/sec at most, which translates to minutes at bigger scale.
> # The pushdown filter is not really effective. In the first case (specifying exact partitions) it works unpredictably: the longer the partition values, the fewer partitions you can specify. In our case we cannot specify more than 100 partitions, so it falls back to the min-max predicate.
> # The min-max predicate does not work when the number of partitions grows with nesting, e.g. 10 on level 1, 100 on level 2, 1000 on level 3. In this case min-max will cut down the high-level ones but still load everything on the levels below, so it is not a real optimization.
> # When there is e.g. a schema change, Hive-Glue calls {{cascade}}, and for big tables it is impossible to sync in meaningful time, even though Hudi does not specify the schema at the partition level in Glue, so this is wasted effort.
> This is why {{AWSGlueCatalogSyncClient}} is preferable.
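A rough back-of-the-envelope sketch of the min-max problem from point 3 above (the layout follows the description; the number of matching level-1 values is a hypothetical assumption, not from the ticket):

```python
# Nested layout from the description: 10 partitions on level 1,
# 100 on level 2, 1000 on level 3 -> 1110 partition paths in total.
level_counts = [10, 100, 1000]

# Hypothetical: only 2 of the 10 level-1 values fall inside the min-max range.
kept_fraction = 2 / 10

# Min-max prunes 80% of the top level, but every surviving level-1 branch
# still pulls in all of its nested partitions on the levels below:
loaded = sum(int(c * kept_fraction) for c in level_counts)

print(loaded)  # 2 level-1 + 20 level-2 + 200 level-3 partitions still loaded
```

So even with effective top-level pruning, the lower levels dominate the amount of partition metadata that still has to be fetched.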
But there are other problems with it. The particular list of problems:
> # Create/Update/Delete were not optimized before. They are now async, but without a meaningful upper bound they simply reach the request limit and stay there. *This change adds a parameter for that parallelism and implements the parallelization logic.*
> # {{AWSGlueCatalogSyncClient}} always lists all partitions. This is far from optimal, since the goal is only to distinguish which of the partitions changed since the last sync were created and which were deleted, so a more targeted API can be used: [{{BatchGetPartition}}|https://docs.aws.amazon.com/glue/latest/webapi/API_BatchGetPartition.html]. It can also be parallelized easily. *I added a new method to the sync client classes, moved the Hive pushdown into a Hive-specific class, implemented this method for the AWS Glue client class, and added a parameter controlling parallelism.*
> # Listing all partitions is still needed for the initial sync/resync, but it is done in a straightforward, suboptimal way: it uses the basic {{nextToken}}, which makes it sequential and slow on heavily partitioned tables. AWS has an improvement for this particular [method|https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html] called {{segment}}. This basically lets us create 1 to 10 start positions and use the standard ({{nextToken}}) API to list partitions from each. Also, the [last public version of the Hive -> Glue interface implementation uses it|https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html]. When we switched from the Hive sync class to the AWS Glue-specific one, the first thing we faced was performance degradation in listing. *I added usage of the {{segment}} API parameter and a parameter controlling parallelism.*
> All of this has been tested on a partitioned table with >200k partitions.
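A minimal sketch of the two parallelization ideas above: fanning the listing out over Glue {{segment}}s instead of a single {{nextToken}} chain, and chunking writes into request-sized batches. {{fetch_segment}} is a hypothetical stand-in for a real paginated {{GetPartitions}} call, and the parallelism knobs here are illustrative, not the parameter names from the PR; the limits (TotalSegments between 1 and 10, up to 100 PartitionInput entries per BatchCreatePartition request) are from the AWS Glue API documentation:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SEGMENTS = 10          # Glue GetPartitions allows TotalSegments in [1, 10]
BATCH_CREATE_LIMIT = 100   # BatchCreatePartition takes at most 100 PartitionInput entries

def list_partitions_parallel(fetch_segment, total_segments=MAX_SEGMENTS):
    """Fan GetPartitions out over Glue 'segments' instead of one nextToken chain.

    fetch_segment(segment_number, total_segments) is expected to page through
    one segment with the usual nextToken loop and return its partitions.
    """
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [pool.submit(fetch_segment, i, total_segments)
                   for i in range(total_segments)]
        # Flatten the per-segment results into one partition list.
        return [p for f in futures for p in f.result()]

def chunk(items, size):
    """Split a partition list into request-sized batches for bulk write calls."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With boto3, for example, {{fetch_segment}} would loop on `glue.get_partitions(DatabaseName=..., TableName=..., Segment={'SegmentNumber': i, 'TotalSegments': n})`, passing `NextToken` until it is absent; the batches from {{chunk}} would each feed one `batch_create_partition` (or `batch_get_partition`) request, submitted through a similarly bounded executor so concurrency stays below the service's request limit.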
> I managed to improve the sync time from 2-3 minutes to 3 seconds. Let me know if you are interested in the numbers.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)