[ https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kwang updated HUDI-6990:
------------------------
    Summary: Configurable clustering task parallelism  (was: Spark clustering job reads records support control the parallelism)

> Configurable clustering task parallelism
> ----------------------------------------
>
>                 Key: HUDI-6990
>                 URL: https://issues.apache.org/jira/browse/HUDI-6990
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: clustering
>            Reporter: kwang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.14.1
>
>         Attachments: after-subtasks.png, before-subtasks.png
>
>
> A Spark clustering job reads the clustering plan, which contains multiple groups; each group processes many base files or log files. When the param `hoodie.clustering.plan.strategy.sort.columns` is configured, those files are read through Spark's `parallelize` method, and every file read generates one sub-task. This is unreasonable: the read parallelism should be configurable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
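The idea behind the improvement can be sketched as follows. This is not Hudi's actual code; the class and method names are hypothetical, and the config name `clusteringReadParallelism` is an assumption used only to illustrate capping the number of Spark read partitions instead of creating one sub-task per file.

```java
// Hypothetical sketch: derive the number of Spark partitions (sub-tasks)
// for reading a clustering group's files, capped by a user config.
public class ClusteringReadParallelism {

    /**
     * @param numFiles   number of base/log files in the clustering group
     * @param configured user-configured cap; <= 0 means "unset"
     * @return partition count to pass to sparkContext.parallelize(files, n)
     */
    static int readParallelism(int numFiles, int configured) {
        if (configured <= 0) {
            // Unset: fall back to the old behavior, one sub-task per file.
            return numFiles;
        }
        // Never create more partitions than there are files to read.
        return Math.min(numFiles, configured);
    }

    public static void main(String[] args) {
        // 1000 files, no cap configured -> 1000 sub-tasks (old behavior)
        System.out.println(readParallelism(1000, 0));
        // 1000 files, cap of 100 -> 100 sub-tasks
        System.out.println(readParallelism(1000, 100));
        // cap larger than file count -> bounded by the file count
        System.out.println(readParallelism(50, 100));
    }
}
```

With such a cap, `sparkContext.parallelize(files, readParallelism(...))` groups multiple file reads into each task rather than spawning one sub-task per file, which is what the before/after attachments illustrate.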