[jira] [Created] (SPARK-35332) Not Coalesce shuffle partitions when cache table

Xianghao Lu (Jira) Thu, 06 May 2021 21:54:08 -0700

Xianghao Lu created SPARK-35332:
-----------------------------------

             Summary: Not Coalesce shuffle partitions when cache table
                 Key: SPARK-35332
                 URL: https://issues.apache.org/jira/browse/SPARK-35332
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 3.1.1, 3.1.0, 3.0.1
         Environment: latest spark version
            Reporter: Xianghao Lu



How to reproduce the problem

Prepare the data:
for i in $(seq 200000); do echo "$(($i+100000)),name$i,$(($i*10))"; done > data.text
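
As a sanity check on the generated file, the following is a minimal sketch that rebuilds it and inspects its shape. The variable names (n, out, lines, first, last) are illustrative and not part of the original report; the loop body is the same arithmetic as the one-liner above.

```shell
#!/bin/sh
# Regenerate the sample data and inspect its shape.
n=200000          # number of rows, as in the report
out=data.text     # output file name, as in the report

for i in $(seq "$n"); do
  # id = i + 100000, str = "name<i>", num = i * 10
  echo "$((i + 100000)),name$i,$((i * 10))"
done > "$out"

lines=$(wc -l < "$out" | tr -d ' ')   # row count
first=$(head -n 1 "$out")             # first row
last=$(tail -n 1 "$out")              # last row
echo "lines=$lines first=$first last=$last"
```

Every str value is distinct, so the later GROUP BY str yields one group per input row.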

SQL to reproduce the problem:
* create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
* load data local inpath '/path/to/data.text' into table data_table;
* CACHE TABLE test_cache_table AS
  SELECT str
  FROM (SELECT id, str FROM data_table)
  GROUP BY str;
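
For convenience, the statements can be collected into one script. The sketch below rests on assumptions the report does not state: the file name repro.sql is hypothetical, and the two SET lines assume the report concerns adaptive query execution (AQE) partition coalescing, which has to be enabled explicitly on the affected 3.0.x and 3.1.x versions. The /path/to/data.text placeholder is kept as in the report.

```shell
#!/bin/sh
# Write the reproduction statements to a file (repro.sql is a hypothetical name).
cat > repro.sql <<'EOF'
-- Assumption: AQE and its partition coalescing are switched on explicitly.
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.coalescePartitions.enabled=true;

CREATE TABLE data_table(id INT, str STRING, num INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/path/to/data.text' INTO TABLE data_table;

CACHE TABLE test_cache_table AS
SELECT str
FROM (SELECT id, str FROM data_table)
GROUP BY str;
EOF

# With a Spark installation on the PATH, the file could then be run as:
#   spark-sql -f repro.sql
```

The stage and task counts for the CACHE TABLE job can then be inspected in the Spark UI.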

The resulting job contains a stage with 200 tasks, which means the shuffle 
partitions were not coalesced. This wastes resources when the data size is small.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
