On Mon, 2009-10-12 at 20:02 +0200, Jake Mannix wrote:
> This killer is the "TermQuery for each term" part - this is huge. You need
> to invert this process, and use your query as is, but while walking in the 
> HitCollector, on each doc which matches your query, increment counters for
> each of the terms in that document (which means you need an in-memory 
> forward lookup for your documents, like a multivalued FieldCache - and if
> you've got roughly the same number of terms as documents, this cache is 
> likely to be as large as your entire index - a pretty hefty RAM cost).

We use something a lot like this, with some tricks to reduce memory
usage, primarily by storing the terms themselves in a file:

A docID-array (#docs * 4 bytes) has an entry for each document that
specifies where in the field/term list to start reading. The number of
field/term entries to read is inferred by looking at the offset for the
next docID.

The field/term array (#terms * 4 bytes) consists of packed pairs, where
the field is specified by 5 bits (giving a max of 32 fields) and the
term by 27 bits (giving a maximum of 134 million unique terms/field).
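
A minimal sketch of the packing, assuming the 5/27 split above (the
constant and method names are just illustrative):

// 5 bits for the field-ID, 27 bits for the term-ID within that field
private static final int TERM_BITS = 27;
private static final int TERM_MASK = (1 << TERM_BITS) - 1; // 0x07FFFFFF

static int pack(int fieldID, int termID) {
  return (fieldID << TERM_BITS) | termID;
}

// Unsigned shift, so field-IDs >= 16 do not sign-extend to a negative index
static int fieldOf(int packed) { return packed >>> TERM_BITS; }
static int termOf(int packed)  { return packed & TERM_MASK; }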

For each field, a file is stored with the concatenated terms. An
in-memory array (#terms * 8 bytes) of offsets in the file for the terms
provides fast lookup. It should be possible to store this array as a
file, which would significantly reduce memory usage.

Total memory usage: (#docs * 4 bytes) + (#terms * 12 bytes).
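
Put together, the in-memory part is just primitive arrays. A sketch,
using the names from the counting code further down:

// One entry per document, pointing into fieldTerms (plus, in this
// sketch, a final sentinel so the [id + 1] lookup also works for the
// last document)
int[] docIDArray;      // ~ #docs * 4 bytes

// One packed (field << 27 | term) pair per entry
int[] fieldTerms;      // #terms * 4 bytes

// Per field: byte offsets into that field's concatenated-terms file
long[][] termOffsets;  // #terms * 8 bytes in total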

[...]

> For each document you look at (which matches your query), you need to look
> at all of the terms in that document, and increment a counter for that term.

This can be done surprisingly fast as you do not need to look at the
terms (Strings) themselves. The pseudo-code for our counting is
something like

for (int id : docIDs) {
  for (int pos = docIDArray[id]; pos < docIDArray[id + 1]; pos++) {
    final int packed = fieldTerms[pos];            // field (5 bits) + term (27 bits)
    counters[packed >>> 27][packed & TERM_MASK]++; // >>> avoids sign-extension
  }
}

To produce the result, we iterate the counters and extract the top 10
(or however many terms we want). Only then do we retrieve the actual
Strings: the position in the counters is the term-ID, which gives us
the offset of the term in the file.
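
A sketch of the top-X extraction for a single field, using a bounded
min-heap (illustrative, not necessarily our exact code):

import java.util.PriorityQueue;

// Returns the term-IDs with the highest counts for one field.
// Only the IDs are needed here; the Strings are resolved afterwards.
static int[] topTerms(int[] counters, int topX) {
  // Min-heap on count: the smallest of the current top-X gets evicted first
  PriorityQueue<int[]> heap =
      new PriorityQueue<>(topX, (a, b) -> Integer.compare(a[1], b[1]));
  for (int termID = 0; termID < counters.length; termID++) {
    final int count = counters[termID];
    if (count == 0) continue;
    if (heap.size() < topX) {
      heap.add(new int[]{termID, count});
    } else if (count > heap.peek()[1]) {
      heap.poll();
      heap.add(new int[]{termID, count});
    }
  }
  int[] result = new int[heap.size()];
  for (int i = result.length - 1; i >= 0; i--) { // lowest counts come out first
    result[i] = heap.poll()[0];
  }
  return result;
}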

Total memory-usage thus jumped to
(#docs * 4 bytes) + (#terms * 12 bytes) + 
((#terms * 4 bytes) * #concurrent searches)

To recap, our faceted search is thus

 - The search itself, providing a bitset with all matching docIDs
 - Counting (very tight loop with no conditionals)
 - Extracting top-x term-IDs for each field (heapsort or such)
 - Retrieving only the relevant terms from file
 - Clearing the counters for next usage (see the sketch below)
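
The clearing itself is cheap (a sketch):

import java.util.Arrays;

// Reset all per-field counters so the arrays can be reused by the next search
static void clearCounters(int[][] counters) {
  for (int[] fieldCounters : counters) {
    Arrays.fill(fieldCounters, 0);
  }
}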

The retrieval step is obviously somewhat critical, as requesting a facet
model with 10 fields of 20 terms each means a worst case of 200 seeks.
In real-world usage, caching helps a lot and SSDs help even more.
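
Resolving a term String is essentially one seek and one read. A sketch
with RandomAccessFile (our actual I/O layer may differ):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Read term number termID from a field's concatenated-terms file.
// termOffsets holds the byte offset of each term; the next offset
// (or the file length for the last term) marks where the term ends.
static String readTerm(RandomAccessFile termFile, long[] termOffsets, int termID)
    throws IOException {
  final long start = termOffsets[termID];
  final long end = (termID + 1 < termOffsets.length)
      ? termOffsets[termID + 1] : termFile.length();
  final byte[] bytes = new byte[(int) (end - start)];
  termFile.seek(start);      // the potentially expensive disk seek
  termFile.readFully(bytes);
  return new String(bytes, StandardCharsets.UTF_8);
}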

We did some tests on this and got the following, where #refs is the
total number of facet-search-relevant term references in the documents
(in the first example this means that each document had an average of
10 facet-relevant terms).

* 10M docs, 10M unique facet-relevant terms, 100M refs, 1 machine
     A few 1000 hits < 100 ms
     A few 100,000 hits < 200 ms
     10M hits ~3 sec

* 100M docs, 1G unique facet-relevant terms, 1G refs, 3 machines
     A few 1000 hits ~1 sec (ouch)
     A few 100,000 hits ~1 sec
     10M hits < 3 sec
     100M hits ~15 sec

There is lots of room for improvement for small numbers of hits and, as
always, YMMV, but faceted search with a large number of terms in a
large number of documents is feasible.

Regards,
Toke Eskildsen

