[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghao Lu updated SPARK-35332:
--------------------------------
    Description: 
*How to reproduce the problem*

_linux shell command to prepare data:_
 for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text

_sql to reproduce the problem:_
 * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id,str FROM data_table
 )group by str;

Finally, you will see a stage with 200 tasks: the shuffle partitions are not 
coalesced, which wastes resources when the data size is small.
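
The data-preparation step above can be sanity-checked before loading the file into the table. The following is a minimal sketch, not part of the original report: the helper name gen_rows is hypothetical, and it only restates the one-liner as a function so the row format (id, str, num) can be inspected.

```shell
#!/usr/bin/env bash
# Sketch only: gen_rows is an illustrative helper, not part of the report.
# It emits the same rows as the one-liner: "<i+100000>,name<i>,<i*10>".
gen_rows() {
  local n="$1" i
  for i in $(seq "$n"); do
    echo "$((i + 100000)),name${i},$((i * 10))"
  done
}

# Write the 200000-row file used by the reproduction steps.
gen_rows 200000 > data.text

# Inspect the result: row count, then the first row.
wc -l < data.text
head -n 1 data.text
```

With this input the first row is 100001,name1,10, which matches the (id int, str string, num int) schema of data_table.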

  was:
*How to reproduce the problem*

_linux shell command to prepare data_
 for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text

_sql to reproduce the problem_
 * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id,str FROM data_table
 )group by str;

Finally, you will see a stage with 200 tasks: the shuffle partitions are not 
coalesced, which wastes resources when the data size is small.


> Not Coalesce shuffle partitions when cache table
> ------------------------------------------------
>
>                 Key: SPARK-35332
>                 URL: https://issues.apache.org/jira/browse/SPARK-35332
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.0.1, 3.1.0, 3.1.1
>         Environment: latest spark version
>            Reporter: Xianghao Lu
>            Priority: Major
>         Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
>  for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text
> _sql to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id,str FROM data_table
>  )group by str;
> Finally, you will see a stage with 200 tasks: the shuffle partitions are not 
> coalesced, which wastes resources when the data size is small.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
