Xianghao Lu created SPARK-35332:
-----------------------------------

             Summary: Not Coalesce shuffle partitions when cache table
                 Key: SPARK-35332
                 URL: https://issues.apache.org/jira/browse/SPARK-35332
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 3.1.1, 3.1.0, 3.0.1
         Environment: latest spark version
            Reporter: Xianghao Lu

How to reproduce the problem

Prepare the data:

for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text

SQL to reproduce the problem:
 * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS SELECT str FROM (SELECT id,str FROM data_table )group by str;

Finally, you will see a stage with 200 tasks in which the shuffle partitions are not coalesced. This wastes resources when the data size is small.
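
Before loading the full file, it can help to confirm that the generator produces rows matching the three columns of data_table (id int, str string, num int). The following is a minimal sketch that wraps the same loop in a function and writes a five-row sample; the function name gen_rows and the file name sample.text are hypothetical and do not appear in the original report.

```shell
# Hypothetical helper: emits N rows in the same "id,str,num" format
# as the data-preparation loop in the report.
gen_rows() {
  for i in $(seq "$1"); do
    echo "$(($i+100000)),name$i,$(($i*10))"
  done
}

# Write a small sample (assumed file name) and inspect it.
gen_rows 5 > sample.text
head -n 1 sample.text   # prints 100001,name1,10
wc -l < sample.text     # number of rows written
```

Calling gen_rows 200000 > data.text should produce the same file as the original one-line loop.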