I recently submitted an issue about "Too many open files" in GroupBy v2 (https://github.com/apache/druid/issues/11558) and have been investigating a solution. The problem appears to be that the code preemptively opens all the spill files for reading. When there is a huge number of spill files (in our case, a single query generates 110k of them), this causes the "too many open files" error even when the open-files ulimit is set to an otherwise reasonable number. We can work around this for now by setting "ulimit -n" to a huge value (like 1 million), but I was hoping for a better solution.
In https://github.com/apache/druid/pull/11559, I attempted to fix this by opening each file lazily, only when it is ready to be read, and closing it as soon as it has been fully read. While this appears to fix the issue in some small edge cases, it isn't a general solution: many queries end up calling CloseableIterators.mergeSorted() to merge all the spill files together, and a sorted merge has to read from all of the files at once, which causes the "too many open files" error again. mergeSorted() seems to be called because the grouping code frequently assumes the results should be sorted and calls ConcurrentGrouper.parallelSortAndGetGroupersIterator().

My question is: can anyone think of a way to avoid sorting at this level, so that we don't need to open all the spill files at once? Given how sketches work in Druid right now, I don't see an easy way to reduce the number of spill files we are seeing, so I was hoping to address this on the grouper side, but so far I can't see a solution that makes this any better. We aren't blocked, because we can raise the maximum number of open files to a much larger value, but that is an unpalatable long-term solution.

Will

Will Lauer
Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering
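To make the problem concrete, here is a minimal sketch of why lazy opening does not help once a sorted merge is involved. This is not Druid's code: SpillSource, OpenCounter, makeSources and this mergeSorted are hypothetical stand-ins, and "opening a file" is simulated with a counter. The point it illustrates is that a k-way sorted merge must read the first row of every input before it can emit anything, so every input ends up open at the same time, whereas an unsorted pass can read the inputs one after another.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedMergeOpenFiles {

    /** Tracks how many simulated files are open now, and the peak so far. */
    static final class OpenCounter {
        int openNow = 0;
        int peak = 0;
        void opened() { openNow++; peak = Math.max(peak, openNow); }
        void closed() { openNow--; }
    }

    /** Hypothetical stand-in for one sorted spill file. */
    static final class SpillSource {
        private final List<Integer> rows;
        private Iterator<Integer> it;   // null until the first read
        private boolean closed = false;

        SpillSource(List<Integer> rows) { this.rows = rows; }

        /** Opens lazily on first read, closes on exhaustion; returns null when done. */
        Integer poll(OpenCounter counter) {
            if (closed) return null;
            if (it == null) { it = rows.iterator(); counter.opened(); }
            if (it.hasNext()) return it.next();
            closed = true;
            counter.closed();
            return null;
        }
    }

    /** Builds n sorted sources; source i holds the rows i, i + n, i + 2n. */
    static List<SpillSource> makeSources(int n) {
        List<SpillSource> sources = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            sources.add(new SpillSource(List.of(i, i + n, i + 2 * n)));
        }
        return sources;
    }

    /** K-way sorted merge: the heap needs one head row from every source up front. */
    static List<Integer> mergeSorted(List<SpillSource> sources, OpenCounter counter) {
        // Heap entries are {value, sourceIndex}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < sources.size(); i++) {
            Integer head = sources.get(i).poll(counter);   // opens source i
            if (head != null) heap.add(new int[] {head, i});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(e[0]);
            Integer next = sources.get(e[1]).poll(counter);
            if (next != null) heap.add(new int[] {next, e[1]});
        }
        return out;
    }

    /** Unsorted pass: each source is drained and closed before the next is opened. */
    static List<Integer> concatenate(List<SpillSource> sources, OpenCounter counter) {
        List<Integer> out = new ArrayList<>();
        for (SpillSource s : sources) {
            Integer row;
            while ((row = s.poll(counter)) != null) out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        OpenCounter merged = new OpenCounter();
        mergeSorted(makeSources(1000), merged);
        OpenCounter sequential = new OpenCounter();
        concatenate(makeSources(1000), sequential);
        System.out.println("peak open, sorted merge: " + merged.peak);
        System.out.println("peak open, sequential:   " + sequential.peak);
    }
}
```

In this toy setup the sorted merge peaks at one open source per input, while the sequential pass never has more than one open, which is why the lazy-open change only helps on code paths that do not require sorted output.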
