[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell reopened SPARK-2773:
------------------------------------

> Shuffle: use growth rate to predict if need to spill
> ----------------------------------------------------
>
>                 Key: SPARK-2773
>                 URL: https://issues.apache.org/jira/browse/SPARK-2773
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 0.9.0, 1.0.0
>            Reporter: uncleGen
>            Priority: Minor
>
> Right now, Spark uses the total "shuffle" memory usage of each thread to
> predict whether it needs to spill. I think that is not very reasonable. For
> example, suppose there are two threads pulling "shuffle" data and the total
> memory available for buffering is 21 GB. Spilling is first triggered when one
> thread has used 7 GB to buffer "shuffle" data; here I assume the other thread
> has used the same amount. Unfortunately, that still leaves 7 GB unused. So I
> think the current prediction mode is too conservative and cannot maximize the
> usage of "shuffle" memory. In my solution, I use the growth rate of "shuffle"
> memory instead. Since each growth step is bounded, say 10K * 1024 (my
> assumption), spilling is first triggered when the remaining "shuffle" memory
> is less than threads * growth * 2, i.e. 2 * 10M * 2. I think this maximizes
> the usage of "shuffle" memory. My solution also makes a conservative
> assumption, namely that all threads in one executor are pulling shuffle data.
> However, this does not have much effect, since the growth per step is bounded
> after all. Any suggestions?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
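
The proposed heuristic can be sketched roughly as follows. This is a minimal illustrative sketch of the growth-rate check described above, not Spark's actual shuffle code; all names (`shouldSpill`, `growthPerStepBytes`) and the 10 MB growth step are assumptions taken from the reporter's example.

```java
public class SpillPredictor {
    // Proposed check: spill once the remaining shuffle memory could be
    // exhausted within roughly two more growth steps across all threads,
    // i.e. remaining < threads * growthPerStep * 2.
    static boolean shouldSpill(long remainingBytes, int numThreads,
                               long growthPerStepBytes) {
        return remainingBytes < (long) numThreads * growthPerStepBytes * 2;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long gb = 1L << 30;
        long growth = 10L * mb; // assumed growth per step: 10K * 1024 = 10 MB

        // Scenario from the issue: 21 GB total, two threads have buffered
        // 7 GB each, so 7 GB remains -- far above 2 * 10 MB * 2 = 40 MB,
        // so the growth-rate heuristic would not spill yet.
        System.out.println(shouldSpill(7 * gb, 2, growth));  // false

        // Only once the remaining memory drops below 40 MB does it spill.
        System.out.println(shouldSpill(30 * mb, 2, growth)); // true
    }
}
```

Under this sketch, the current policy would have spilled at 14 GB used (7 GB per thread), while the growth-rate check keeps buffering until only `threads * growth * 2` bytes remain.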