[jira] [Updated] (HUDI-6990) Spark clustering job reads records support control the parallelism

2023-11-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6990:
-
Fix Version/s: 1.0.0, 0.14.1

> Spark clustering job reads records support control the parallelism
> --
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering
> Reporter: kwang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: after-subtasks.png, before-subtasks.png
>
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups. Each group processes many base files or log files.
> When the param `hoodie.clustering.plan.strategy.sort.columns` is configured,
> those files are read through Spark's parallelize method, and every file read
> generates one sub-task. This is unreasonable because the number of sub-tasks
> cannot be controlled.
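
The effect described above can be illustrated with a small sketch. It does not use Spark or Hudi; it only imitates how a list handed to a parallelize-style call is cut into contiguous slices, with one task per slice. The names `slice_into_tasks`, `read_task_count` and `read_parallelism` are hypothetical and are not taken from the Hudi patch; capping the slice count is just one possible way to bound the number of sub-tasks.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_into_tasks(items: Sequence[T], num_slices: int) -> List[List[T]]:
    """Cut items into num_slices contiguous chunks, one chunk per task.

    Uses integer boundaries i*n/num_slices, the same idea Spark applies
    when a local collection is parallelized into a given number of slices.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be positive")
    n = len(items)
    return [
        list(items[(i * n) // num_slices:((i + 1) * n) // num_slices])
        for i in range(num_slices)
    ]


def read_task_count(num_files: int, read_parallelism: int) -> int:
    """Hypothetical cap: never more tasks than files, never fewer than one."""
    return max(1, min(num_files, read_parallelism))


# A clustering group with many files (file names are made up).
files = [f"file-{i}.parquet" for i in range(1000)]

# Slice count equal to the file count: every file becomes its own sub-task.
per_file = slice_into_tasks(files, len(files))

# Slice count capped by an assumed setting: each sub-task reads several files.
capped = slice_into_tasks(files, read_task_count(len(files), read_parallelism=20))

print(len(per_file), "sub-tasks without a cap")
print(len(capped), "sub-tasks with a cap of 20")
```

With a cap in place, the number of sub-tasks no longer grows with the number of files in the group, at the cost of each sub-task reading more files.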



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6990) Spark clustering job reads records support control the parallelism

2023-10-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6990:
-
Labels: pull-request-available  (was: )

> Spark clustering job reads records support control the parallelism
> --
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering
> Reporter: kwang
> Priority: Major
> Labels: pull-request-available
> Attachments: after-subtasks.png, before-subtasks.png
>
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups. Each group processes many base files or log files.
> When the param `hoodie.clustering.plan.strategy.sort.columns` is configured,
> those files are read through Spark's parallelize method, and every file read
> generates one sub-task. This is unreasonable because the number of sub-tasks
> cannot be controlled.





[jira] [Updated] (HUDI-6990) Spark clustering job reads records support control the parallelism

2023-10-26 Thread kwang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kwang updated HUDI-6990:

Component/s: clustering

> Spark clustering job reads records support control the parallelism
> --
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering
> Reporter: kwang
> Priority: Major
> Attachments: after-subtasks.png, before-subtasks.png
>
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups. Each group processes many base files or log files.
> When the param `hoodie.clustering.plan.strategy.sort.columns` is configured,
> those files are read through Spark's parallelize method, and every file read
> generates one sub-task. This is unreasonable because the number of sub-tasks
> cannot be controlled.





[jira] [Updated] (HUDI-6990) Spark clustering job reads records support control the parallelism

2023-10-26 Thread kwang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kwang updated HUDI-6990:

Attachment: after-subtasks.png, before-subtasks.png

> Spark clustering job reads records support control the parallelism
> --
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: kwang
> Priority: Major
> Attachments: after-subtasks.png, before-subtasks.png
>
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups. Each group processes many base files or log files.
> When the param `hoodie.clustering.plan.strategy.sort.columns` is configured,
> those files are read through Spark's parallelize method, and every file read
> generates one sub-task. This is unreasonable because the number of sub-tasks
> cannot be controlled.





[jira] [Updated] (HUDI-6990) Spark clustering job reads records support control the parallelism

2023-10-26 Thread kwang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kwang updated HUDI-6990:

Description: 
When Spark executes a clustering job, it reads the clustering plan, which
contains multiple groups. Each group processes many base files or log files.
When the param `hoodie.clustering.plan.strategy.sort.columns` is configured,
those files are read through Spark's parallelize method, and every file read
generates one sub-task. This is unreasonable because the number of sub-tasks
cannot be controlled.

  was: When Spark executes a clustering job, it reads the clustering plan, which
contains multiple groups. Each group processes many base files or log files.
When those files are read through Spark's parallelize method, every file
generates one sub-task. This is unreasonable because the number of sub-tasks
cannot be controlled.


> Spark clustering job reads records support control the parallelism
> --
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: kwang
> Priority: Major
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups. Each group processes many base files or log files.
> When the param `hoodie.clustering.plan.strategy.sort.columns` is configured,
> those files are read through Spark's parallelize method, and every file read
> generates one sub-task. This is unreasonable because the number of sub-tasks
> cannot be controlled.





[jira] [Updated] (HUDI-6990) Spark clustering job reads records support control the parallelism

2023-10-26 Thread kwang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kwang updated HUDI-6990:

Description: When Spark executes a clustering job, it reads the clustering
plan, which contains multiple groups. Each group processes many base files or
log files. When those files are read through Spark's parallelize method, every
file generates one sub-task. This is unreasonable because the number of
sub-tasks cannot be controlled.

> Spark clustering job reads records support control the parallelism
> --
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: kwang
> Priority: Major
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups. Each group processes many base files or log files.
> When those files are read through Spark's parallelize method, every file
> generates one sub-task. This is unreasonable because the number of sub-tasks
> cannot be controlled.


