On 28-Oct-08, at 5:36 AM, Jérôme Etévé wrote:
Hi all,
In my code, I'd like to keep a subset of my 14M docs that is around
100k documents in size.
What is, in your opinion, the best option in terms of speed and
memory usage?
Some basic reasoning tells me the BitDocSet should be the fastest for
lookup, but it takes ~14M bits (one bit per possible doc id) in memory,
whereas the HashDocSet takes only ~100k * sizeof(int), but has a
slightly slower lookup.
The HashDocSet documentation says "It can be a better choice if there
are few docs in the set." What does 'few' mean in this context?
Solr, by default, ships with a configuration that builds filters as
HashDocSets when the set size is < 3000, and as BitDocSets otherwise.
That threshold is tunable in solrconfig.xml. With 14M docs you might
find it helps to raise it slightly, say to 5000-6000. In my testing,
anything higher than that is a net loss.
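
If memory serves, the knob in solrconfig.xml looks like this (a sketch
based on the stock example config; double-check the element and the
loadFactor default against the example config shipped with your
version):

    <!-- filter sets smaller than maxSize are built as HashDocSets,
         larger ones as BitDocSets -->
    <HashDocSet maxSize="6000" loadFactor="0.75"/>

Here maxSize is raised to 6000 per the suggestion above.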
-Mike