[ https://issues.apache.org/jira/browse/HUDI-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7466:
---------------------------------
    Labels: pull-request-available  (was: )

> AWS Glue sync
> -------------
>
>                 Key: HUDI-7466
>                 URL: https://issues.apache.org/jira/browse/HUDI-7466
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Vitali Makarevich
>            Priority: Major
>              Labels: pull-request-available
>
> Ticket to track
> [https://github.com/apache/hudi/pull/10460]
> Currently, AWS Glue sync provides two interfaces - one is 
> {{HoodieHiveSyncClient}}, which goes through Hive and then AWS's hidden 
> Glue -> Hive implementation, and the other is {{AWSGlueCatalogSyncClient}}.
> Both have limitations - although the Hive path has improved a bit and can use 
> filter pushdown, it still fails at large scale, and its fallback may not work 
> for certain partitioning schemes.
> For syncing to Glue using {{HoodieHiveSyncClient}}, there is a set of 
> limitations:
>  # Create/update is not parallelized under the hood, so for big partition 
> sets it is very slow - empirically about 40 partitions/sec at most, which 
> translates to minutes at larger scale.
>  # The pushdown filter is not really effective: when specifying exact 
> partitions it works unpredictably, since the longer the partition values, 
> the fewer partitions you can specify - in our case we cannot specify more 
> than 100 partitions - so it falls back to the min-max predicate (see the 
> sketch right after this list).
>  # The min-max predicate does not help when the number of partitions grows 
> with nesting, e.g. 10 partitions on level 1, 100 on level 2, 1000 on level 3. 
> In this case min-max cuts down the high-level values but still loads 
> everything below them, so it does not really optimize anything.
>  # When there is e.g. a schema change, Hive-Glue calls {{cascade}}, and for 
> big tables it is impossible to sync in a meaningful time - even though Hudi 
> does not specify a schema on the partition level in Glue, so this is wasted 
> effort.
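> A minimal sketch of this fallback behaviour (illustration only, not Hudi's 
> actual sync code; the expression-length limit and all names here are 
> assumptions for the example):
> {code:java}
> import java.util.List;
> import java.util.stream.Collectors;
> 
> public class PushdownFilterSketch {
>   // Assumed limit for illustration; the real cap depends on the metastore/Glue.
>   private static final int MAX_EXPRESSION_LENGTH = 2048;
> 
>   // changedValues is assumed non-empty; a single partition column keeps it simple.
>   static String buildFilter(String partitionColumn, List<String> changedValues) {
>     // Exact-match pushdown: (col='v1') OR (col='v2') OR ...
>     String exact = changedValues.stream()
>         .map(v -> "(" + partitionColumn + "='" + v + "')")
>         .collect(Collectors.joining(" OR "));
>     if (exact.length() <= MAX_EXPRESSION_LENGTH) {
>       return exact;
>     }
>     // Longer values or more partitions blow past the limit, so the sync falls
>     // back to a min-max range predicate, which prunes poorly when most of the
>     // partition count sits in the deeper, nested levels.
>     String min = changedValues.stream().min(String::compareTo).get();
>     String max = changedValues.stream().max(String::compareTo).get();
>     return partitionColumn + ">='" + min + "' AND " + partitionColumn + "<='" + max + "'";
>   }
> }
> {code}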
> This is why {{AWSGlueCatalogSyncClient}} is preferable. But it has its own 
> problems, in particular:
>  # Create/Update/Delete calls were not optimized before - they are now async, 
> but without a meaningful upper bound they simply reach the request limit and 
> stay there. *This solution adds a parameter for this parallelism and the 
> parallelization logic.*
>  # Listing all partitions is always used for {{AWSGlueCatalogSyncClient}} - 
> this is very suboptimal, since the goal is only to distinguish which of the 
> partitions changed since the last sync were created and which were deleted, 
> so a more targeted API can be used - 
> [{{BatchGetPartition}}|https://docs.aws.amazon.com/glue/latest/webapi/API_BatchGetPartition.html].
>  It can also be parallelized easily. *I added a new method to the sync client 
> classes, moved the Hive pushdown into a Hive-specific class, and implemented 
> this method for the AWS Glue client class. A parameter controlling 
> parallelism is also added.* (See the first sketch after this list.)
>  # Listing all partitions is itself suboptimal - it is still needed for the 
> initial sync/resync, but it is done in a straightforward, sequential way. In 
> particular, it uses the basic {{nextToken}} paging, which makes it sequential 
> and slow on heavily partitioned tables. AWS has an improvement for this 
> particular 
> [method|https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html],
>  called {{segment}}. This basically lets us create 1 to 10 start positions 
> and use the standard ({{nextToken}}) API to list partitions in parallel. 
> Also, the [last public version of the Hive -> Glue interface implementation 
> uses it|https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html].
> When we switched from the Hive sync class to the AWS Glue specific one, the 
> first thing we faced was performance degradation on listing. *I added usage 
> of the {{segment}} API parameter and a parameter controlling parallelism.* 
> (See the second sketch after this list.)
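> First sketch - a rough illustration (not the PR code; the AWS SDK v2 sync 
> {{GlueClient}} is assumed, and the class name, method name and batch-size 
> constant are made up) of using {{BatchGetPartition}} to look up only the 
> changed-since-last-sync partitions instead of listing the whole table:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.stream.Collectors;
> 
> import software.amazon.awssdk.services.glue.GlueClient;
> import software.amazon.awssdk.services.glue.model.BatchGetPartitionRequest;
> import software.amazon.awssdk.services.glue.model.BatchGetPartitionResponse;
> import software.amazon.awssdk.services.glue.model.Partition;
> import software.amazon.awssdk.services.glue.model.PartitionValueList;
> 
> public class ChangedPartitionLookup {
>   // Assumed batch size; Glue caps the number of PartitionsToGet per request.
>   private static final int BATCH_SIZE = 1000;
> 
>   public static List<Partition> getExisting(GlueClient glue, String database,
>       String table, List<List<String>> changedPartitionValues) {
>     List<Partition> existing = new ArrayList<>();
>     for (int i = 0; i < changedPartitionValues.size(); i += BATCH_SIZE) {
>       List<PartitionValueList> batch = changedPartitionValues
>           .subList(i, Math.min(i + BATCH_SIZE, changedPartitionValues.size()))
>           .stream()
>           .map(values -> PartitionValueList.builder().values(values).build())
>           .collect(Collectors.toList());
>       BatchGetPartitionResponse resp = glue.batchGetPartition(
>           BatchGetPartitionRequest.builder()
>               .databaseName(database)
>               .tableName(table)
>               .partitionsToGet(batch)
>               .build());
>       // Partitions missing from the response need to be created; the returned
>       // ones may only need an update. Independent batches could also be sent
>       // in parallel, bounded by the new parallelism parameter.
>       existing.addAll(resp.partitions());
>     }
>     return existing;
>   }
> }
> {code}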
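> Second sketch - a rough illustration (again not the PR code; the class name 
> and the parallelism argument are made up) of the {{segment}}-based parallel 
> listing, where each of up to 10 segments is still paged with {{nextToken}} 
> but the segments run concurrently:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> 
> import software.amazon.awssdk.services.glue.GlueClient;
> import software.amazon.awssdk.services.glue.model.GetPartitionsRequest;
> import software.amazon.awssdk.services.glue.model.GetPartitionsResponse;
> import software.amazon.awssdk.services.glue.model.Partition;
> import software.amazon.awssdk.services.glue.model.Segment;
> 
> public class SegmentedPartitionLister {
>   // Glue allows splitting a listing into at most 10 segments.
>   private static final int TOTAL_SEGMENTS = 10;
> 
>   public static List<Partition> listAllPartitions(GlueClient glue, String database,
>       String table, int parallelism) throws Exception {
>     ExecutorService pool =
>         Executors.newFixedThreadPool(Math.min(parallelism, TOTAL_SEGMENTS));
>     try {
>       List<Future<List<Partition>>> futures = new ArrayList<>();
>       for (int i = 0; i < TOTAL_SEGMENTS; i++) {
>         final int segmentNumber = i;
>         futures.add(pool.submit(() -> listSegment(glue, database, table, segmentNumber)));
>       }
>       List<Partition> all = new ArrayList<>();
>       for (Future<List<Partition>> f : futures) {
>         all.addAll(f.get());
>       }
>       return all;
>     } finally {
>       pool.shutdown();
>     }
>   }
> 
>   // One segment: standard nextToken paging, restricted to a zero-based segment index.
>   private static List<Partition> listSegment(GlueClient glue, String database,
>       String table, int segmentNumber) {
>     Segment segment = Segment.builder()
>         .segmentNumber(segmentNumber)
>         .totalSegments(TOTAL_SEGMENTS)
>         .build();
>     List<Partition> result = new ArrayList<>();
>     String nextToken = null;
>     do {
>       GetPartitionsResponse resp = glue.getPartitions(GetPartitionsRequest.builder()
>           .databaseName(database)
>           .tableName(table)
>           .segment(segment)
>           .nextToken(nextToken)
>           .build());
>       result.addAll(resp.partitions());
>       nextToken = resp.nextToken();
>     } while (nextToken != null);
>     return result;
>   }
> }
> {code}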
> All of this has been tested on a partitioned table with >200k partitions.
> I managed to bring the sync time down from 2-3 minutes to 3 seconds. Let me 
> know if you are interested in the numbers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
