Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19184

Hi @mridulm, sorry for the late response. I agree with you that the scenario here differs from shuffle, but the underlying structure and the way data is spilled are the same, so the problem is the same. On the shuffle side we can enlarge the in-memory buffer to hold more data before spilling, which avoids producing too many spill files, but as you mentioned we cannot do that here. You're right that it is not strictly necessary to open all the files beforehand; however, since we use a priority queue to do the merge sort, it is very likely that all the file handles end up open anyway, so this fix only reduces the chance of hitting the too-many-open-files issue. Maybe we can call it an interim fix. What do you think?
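To illustrate the point about the priority queue, here is a minimal, hypothetical sketch (not Spark's actual code) of a k-way merge over sorted sources. In-memory iterators stand in for spill-file streams; the heap must be seeded with the head of every source up front, which is why every spill file ends up open at once:

```scala
import scala.collection.mutable

object MergeSketch {
  // Hypothetical k-way merge: each Iterator stands in for one sorted spill file.
  // Seeding the min-heap requires reading the first element from EVERY source,
  // so all underlying file handles would be open simultaneously.
  def mergeSorted(sources: Seq[Iterator[Int]]): Iterator[Int] = new Iterator[Int] {
    // mutable.PriorityQueue is a max-heap; reverse the ordering to pop the
    // smallest head value first.
    private val heap = mutable.PriorityQueue.empty[(Int, Iterator[Int])](
      Ordering.by[(Int, Iterator[Int]), Int](_._1).reverse)

    // "Open" every source up front to seed the heap.
    sources.foreach { it => if (it.hasNext) heap.enqueue((it.next(), it)) }

    def hasNext: Boolean = heap.nonEmpty

    def next(): Int = {
      val (value, it) = heap.dequeue()
      // Refill the heap from the same source, keeping its handle live
      // until that source is fully drained.
      if (it.hasNext) heap.enqueue((it.next(), it))
      value
    }
  }
}
```

Raising the spill threshold reduces the number of sources feeding this merge, but as long as the merge itself is heap-based, every remaining spill file still has to be open concurrently.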
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org