Hello,

I have seen unexpected behavior when setting a very high limit in a search.

I index log files in my system, creating a new index each week. By the end of the week the index is around 35 GB. When a search has no date restriction, I create a MultiReader from the readers of the weekly indexes (around 15 indexes). I sort with Sort.INDEXORDER by default.

When running a search with a high limit (e.g. 1 million), I sometimes ran out of memory because of the empty data structures that were initialized. I looked at the objects: in the CollectorManager.reduce(Collection<TopFieldCollector> collectors) of IndexSearcher.searchAfter(FieldDoc after, Query query, int numHits, Sort sort, boolean doDocScores, boolean doMaxScore), I ended up with 137 collectors, each defining a HitQueue containing:
- oneComparator.docIDs[1 million]
- heap[1 million]

So that is 8 bytes * 1 million = around 8 MB per collector. Since all 137 collectors had been initialized the same way (with arrays of 1 million elements), I ended up with about 1 GB of RAM used for that search.

What is strange to me is that no collector had 1 million hits, because the 5 million log events that matched were spread across the different weekly indexes. It seemed quite a waste of space to initialize all of these arrays with slots that would never be used.

The only thing I could think of was to get rid of the MultiReader, manage the search myself on the different subreaders, and set the limit to the minimum of the query's hit count on each index and the limit passed by the user (similar to the way cappedNumHits gets calculated). That way, a user passing a very big limit would not be able to consume so much memory.

I guess it makes sense to preallocate data structures, to reduce garbage collection by avoiding growing arrays and lists, but I must admit that I did not expect the limit parameter to have such an impact on memory. In this situation, I would rather have those arrays start small and grow as needed.

As a side effect, I would have to get rid of the MultiReader, which is a nice abstraction and simplifies my code. I would rather not, but I want to be very careful about memory consumption, and it always looks bad when a user can cause an OOM on a server with just a query, even if he is passing an abnormally high limit.

What are your recommendations?

I am using Lucene 6.2.1.

thanks,
Vince
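
PS: To make the numbers above concrete, here is a minimal sketch of the arithmetic and of the capping idea. It does not call Lucene at all. The 8 bytes per slot is the figure from above (one docIDs entry plus one heap entry), and the class name, method names, and per-index hit counts are hypothetical, chosen only to roughly match 5 million matches spread over 15 weekly indexes. It also assumes one collector per index for simplicity, which is not what I observed (137 collectors).

```java
final class SearchMemorySketch {

    // Assumption taken from the figures above: each preallocated slot
    // costs about 8 bytes (one docIDs entry plus one heap entry).
    static final long BYTES_PER_SLOT = 8L;

    // Bytes preallocated when every collector sizes its arrays to numHits.
    static long preallocatedBytes(int collectors, int numHits) {
        return (long) collectors * numHits * BYTES_PER_SLOT;
    }

    // Per-index limit: never ask for more hits than the index can return.
    // An index with zero matches yields 0 and would simply be skipped.
    static int cappedLimit(long matchCount, int userLimit) {
        return (int) Math.min(matchCount, (long) userLimit);
    }

    public static void main(String[] args) {
        int userLimit = 1_000_000;

        // What I observed: 137 collectors, each sized for 1 million hits.
        System.out.println("uncapped bytes: " + preallocatedBytes(137, userLimit));

        // Hypothetical hit counts for 15 weekly indexes (5 million in total).
        long[] hitsPerIndex = {
            350_000, 350_000, 350_000, 350_000, 350_000,
            350_000, 350_000, 350_000, 350_000, 350_000,
            300_000, 300_000, 300_000, 300_000, 300_000
        };

        long cappedSlots = 0;
        for (long hits : hitsPerIndex) {
            cappedSlots += cappedLimit(hits, userLimit);
        }
        long uncappedSlots = (long) hitsPerIndex.length * userLimit;

        System.out.println("slots with cap:    " + cappedSlots);
        System.out.println("slots without cap: " + uncappedSlots);
    }
}
```

With the cap, the total number of preallocated slots can never exceed the number of matching documents, whatever limit the user passes.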