[
https://issues.apache.org/jira/browse/HUDI-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-5327:
----------------------------
Component/s: table-service
> ClusteringWithRecordsAsRow generates too many spark jobs
> --------------------------------------------------------
>
> Key: HUDI-5327
> URL: https://issues.apache.org/jira/browse/HUDI-5327
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering, table-service
> Reporter: Xinyu Zou
> Assignee: Xinyu Zou
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-12-03-19-24-27-106.png
>
>
> By setting hoodie.datasource.write.row.writer.enable=true,
> the application generates too many Spark jobs when executing clustering:
> !image-2022-12-03-19-24-27-106.png|width=1031,height=486!
> Besides not looking concise, this introduces a hidden performance
> bottleneck.
> {code:java}
> Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
>     clusteringPlan.getInputGroups().stream()
>         .map(inputGroup -> {
>           if (getWriteConfig().getBooleanOrDefault(
>               "hoodie.datasource.write.row.writer.enable", false)) {
>             return runClusteringForGroupAsyncAsRow(inputGroup,
>                 clusteringPlan.getStrategy().getStrategyParams(),
>                 shouldPreserveMetadata,
>                 instantTime);
>           }
>           return runClusteringForGroupAsync(inputGroup,
>               clusteringPlan.getStrategy().getStrategyParams(),
>               shouldPreserveMetadata,
>               instantTime);
>         })
>         .collect(Collectors.toList()))
>     .join()
>     .stream(); {code}
> `runClusteringForGroupAsyncAsRow` generates a Spark job through its internal
> use of collect, and the default concurrency of `FutureUtils.allOf()` is `the
> number of CPU cores - 1` when no executor is configured. That means the
> maximum concurrency of Spark tasks = `the number of CPU cores - 1` * `the
> number of files per clustering group`.
> On our cluster the driver has 32 CPU cores, so the maximum concurrency of
> Spark tasks is 31 * 2, even though we configured 20 executors, each with
> 4 cores.
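> (For reference, assuming the futures fall back to CompletableFuture's common
> ForkJoinPool when no executor is supplied, the default parallelism is
> max(1, CPU cores - 1); a minimal check, independent of Hudi:)
> {code:java}
> import java.util.concurrent.ForkJoinPool;
>
> public class CommonPoolCheck {
>   public static void main(String[] args) {
>     int cores = Runtime.getRuntime().availableProcessors();
>     // Unless overridden via -Djava.util.concurrent.ForkJoinPool.common.parallelism,
>     // the common pool uses max(1, cores - 1) worker threads.
>     int parallelism = ForkJoinPool.commonPool().getParallelism();
>     System.out.println(cores + " cores, common-pool parallelism = " + parallelism);
>   }
> }
> {code}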
>
> In addition, when hoodie.datasource.write.row.writer.enable=false is set,
> this problem does not occur, because no Spark job is generated inside
> runClusteringForGroupAsync.
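> (One possible mitigation, sketched below with plain JDK classes rather than
> Hudi's actual API: pass a bounded executor to the async calls so driver-side
> concurrency is capped by an explicit config value instead of the common
> pool's cores - 1 default. The cap of 4 and the sleeping task body are
> hypothetical stand-ins for the per-group clustering work.)
> {code:java}
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.atomic.AtomicInteger;
> import java.util.stream.Collectors;
> import java.util.stream.IntStream;
>
> public class BoundedClusteringDemo {
>   public static void main(String[] args) {
>     int maxConcurrency = 4; // hypothetical clustering-concurrency config
>     ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
>     AtomicInteger running = new AtomicInteger();
>     AtomicInteger peak = new AtomicInteger();
>
>     // 32 "input groups"; each task records how many run at once.
>     List<CompletableFuture<Integer>> futures = IntStream.range(0, 32)
>         .mapToObj(group -> CompletableFuture.supplyAsync(() -> {
>           peak.accumulateAndGet(running.incrementAndGet(), Math::max);
>           try { Thread.sleep(20); } catch (InterruptedException ignored) { }
>           running.decrementAndGet();
>           return group; // stands in for runClusteringForGroupAsync(...)
>         }, pool))
>         .collect(Collectors.toList());
>
>     CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
>     pool.shutdown();
>     // Peak concurrency never exceeds the configured cap.
>     System.out.println("peak concurrent groups = " + peak.get());
>   }
> }
> {code}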
--
This message was sent by Atlassian Jira
(v8.20.10#820010)