[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341313#comment-17341313 ]
Takeshi Yamamuro commented on SPARK-35332:
------------------------------------------

okay, sgtm. cc: [~cloud_fan] could you make a PR for that? Let's keep discussing it by referring to an implementation.

> Not Coalesce shuffle partitions when cache table
> ------------------------------------------------
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
> Reporter: Xianghao Lu
> Priority: Major
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
> for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text
> _sql to reproduce the problem:_
> * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
> * load data local inpath '/path/to/data.text' into table data_table;
> * CACHE TABLE test_cache_table AS
> SELECT str
> FROM
> (SELECT id,str FROM data_table
> )group by str;
> Finally, you will see a stage with 200 tasks whose shuffle partitions are not coalesced; this wastes resources when the data size is small.
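
To illustrate why 200 tasks are wasteful for data this small, here is a minimal sketch of size-based partition coalescing. It is a simplified greedy model, not Spark's actual implementation. The function name `coalesce_partitions`, the even spread of data across partitions, and the 4 MB shuffle size are assumptions made for illustration. The 200 partitions correspond to the default of `spark.sql.shuffle.partitions`, and the 64 MB target mirrors the default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily merge adjacent partitions while the merged size stays within target_bytes.

    Returns groups of original partition indices. Simplified model for illustration only.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        # Close the current group when adding this partition would exceed the target.
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Hypothetical numbers: about 4 MB of shuffle data spread evenly over 200 partitions.
shuffle_partitions = 200
partition_sizes = [4 * 1024 * 1024 // shuffle_partitions] * shuffle_partitions
target = 64 * 1024 * 1024  # assumed advisory partition size

groups = coalesce_partitions(partition_sizes, target)
print(f"{shuffle_partitions} partitions -> {len(groups)} coalesced partition(s)")
```

Under these assumed sizes, all 200 partitions fit into a single group, which suggests that a coalesced plan would need far fewer than 200 tasks for input of this size.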