Hello,

I have seen unexpected behavior when setting too high a limit in a search.
I index log files in my system. Each week I create a new index; by the end of
the week the index is around 35 GB.
When I do a search with no date restriction, I create a MultiReader built from
the readers of the weekly indexes (around 15 indexes).
I sort with Sort.INDEXORDER by default.
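
For context, the setup looks roughly like this (a simplified sketch; the index
paths, the query and the limit are just placeholders for whatever the caller
passes):

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Open one reader per weekly index and search them all through a MultiReader.
TopDocs searchAllWeeks(List<String> weeklyIndexPaths, Query query, int userLimit)
    throws Exception {
  List<IndexReader> readers = new ArrayList<>();
  for (String path : weeklyIndexPaths) {
    readers.add(DirectoryReader.open(FSDirectory.open(Paths.get(path))));
  }
  IndexReader multiReader = new MultiReader(readers.toArray(new IndexReader[0]));
  IndexSearcher searcher = new IndexSearcher(multiReader);
  // Index order by default; userLimit is whatever limit the caller asked for.
  return searcher.search(query, userLimit, Sort.INDEXORDER);
}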
When running a search with a high limit (e.g. 1 million), I sometimes ended up
running out of memory because of the empty data structures that were preallocated.
I looked at the objects, and in the
CollectorManager.reduce(Collection<TopFieldCollector> collectors) of
IndexSearcher.searchAfter(FieldDoc after, Query query, int numHits, Sort sort,
boolean doDocScores, boolean doMaxScore), I ended up with 137 collectors, each
defining a HitQueue containing:

- oneComparator.docIDs[1 million]
- heap[1 million]

So that is 8 bytes * 1 million = around 8 MB per collector.
Since all 137 collectors had been initialized the same way (with arrays of 1
million elements), I ended up with around 1 GB of RAM used for that single search.
What is strange to me is that no single collector had 1 million hits, because
the 5 million log events that matched were spread across the different weekly
indexes. So it seemed like quite a waste of space to initialize all of these
arrays with slots that would never be used.
The only thing I could think of was to get rid of the MultiReader, manage the
search myself across the different sub-readers, and adjust the limit per index
to the minimum of the query's hit count on that index and the limit passed by
the user (similar to the way cappedNumHits gets calculated). That way, a user
passing a very big limit would not be able to consume so much memory.
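
Concretely, the workaround I have in mind looks roughly like this (only a
sketch: I cap numHits per index using IndexSearcher.count() and then merge the
per-index results with TopDocs.merge()):

import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldDocs;

// Search each weekly index separately, capping numHits per index so collectors
// never preallocate more slots than that index can actually fill.
TopDocs searchPerIndex(List<IndexReader> weeklyReaders, Query query, int userLimit)
    throws Exception {
  TopFieldDocs[] perIndexHits = new TopFieldDocs[weeklyReaders.size()];
  for (int i = 0; i < weeklyReaders.size(); i++) {
    IndexSearcher searcher = new IndexSearcher(weeklyReaders.get(i));
    // Cap the limit by the number of matches in this index, like cappedNumHits.
    int cappedNumHits = Math.min(Math.max(1, searcher.count(query)), userLimit);
    perIndexHits[i] = searcher.search(query, cappedNumHits, Sort.INDEXORDER);
  }
  // Merge the per-index top hits back into a single top-N list. The returned
  // doc ids are relative to each sub-index (shardIndex tells which one), so I
  // lose the MultiReader's global doc id space here.
  return TopDocs.merge(Sort.INDEXORDER, userLimit, perIndexHits);
}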
I guess it makes sense to preallocate data structures to be easier on the
garbage collector by avoiding growing arrays and lists, but I must admit that
I did not expect the limit parameter to have such an impact on memory. In this
situation, I would rather have those arrays start small and grow as needed.
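
To be clear, I just mean the usual grow-on-demand pattern, something along
these lines (a sketch using Lucene's ArrayUtil, not what TopFieldCollector
actually does today):

import org.apache.lucene.util.ArrayUtil;

// Collect doc ids into an array that starts small and grows only when hits
// actually arrive, instead of preallocating numHits slots up front.
class GrowingDocIdBuffer {
  private int[] docIDs = new int[32];
  private int size = 0;

  void add(int doc) {
    if (size == docIDs.length) {
      // ArrayUtil.grow over-allocates a bit so repeated growth stays cheap.
      docIDs = ArrayUtil.grow(docIDs, size + 1);
    }
    docIDs[size++] = doc;
  }
}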
As a side effect I would have to get rid of the MultiReader, which is a nice
abstraction and simplifies my code. I would rather not, but I want to be very
careful about memory consumption, and it always looks bad when a user can
cause an OOM on a server with just a query, even if they are passing an
abnormally high limit.

What are your recommendations?
I am using Lucene 6.2.1.

Thanks,
Vince