Re: Document-at-a-time or Term-at-a-time searching

Yonik Seeley Mon, 10 Oct 2005 07:06:21 -0700

Lucene does "docment-at-a-time", as you describe. One could make a query to
do term-at-a-time in lucene if desired, as the traversal logic is
encapsulated in the particular scorers for queries.


> I'm looking into how easy it would be to implement a streaming search in
Lucene
> (where a document is returned as it is found, instead of returning a list
of
> doc ids and iterating over them)

It's already done for you :-) Look at the Searcher method that takes a
HitCollector as a parameter. It will give you a docid and a score as they
are generated (no big internal list of all the matches is generated).

-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 10/10/05, Shane O'Sullivan <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> Does Lucene use Document and a time (DAAT), or Term at a time (TAAT)
> searching?
>
> To explain what I'm asking, let's look at a three-term query T1 T2 T3
> In term-at-a-time (TAAT) evaluation, the engine would typically start by
> scanning the entire postings list of T1, and would keep some score
> accumulator for all documents that contain T1. The accumulator keeps the
> score component contributed to that document by the occurrences of T1.
> Then, the engine would scan the postings list for T2. Any document
> containing T2 that doesn't have an accumulator from the pass over T1 can
> be
> ignored. The accumulators for documents that also contain T2 are updated
> to
> reflect the additional score that T2 contributes to them. Accumulators for
> documents that do not contain T2 can be discarded. The same process
> repeats
> for the scanning of T3's postings list, and at the end the set of
> accumulators contains the scores of all result documents. The accumulators
> are sorted, and the results are returned. Note that prior to scanning T3's
> postings list, the engine could not be sure of any document that qualifies
> as a candidate.
>
> In document-at-a-time (DAAT) evaluation, the postings lists of all three
> terms are traversed in parallel. Typically, the engine would advance the
> iterator over T1 to some document containing T1. Then, it would move the
> iterator over T2 to the first document containing T2 whose docID equals to
> or is greater than the docID pointed to by T1. If the T2 iterator passes
> the
> T1 iterator, T1 is then advanced to at least the position pointed to by
> T2,
> and so forth.
> Assume that eventually one finds a document containing both T1 and T2.
> Then,
> the iterator over T3 is advanced to at least that position. If it lands on
> the same document containing the first two terms, that document can be
> fully
> scored and entered into a heap (it surely qualifies as a candidate). T1 is
> then advanced again. However, if T3 jumped over the document containing T1
> and T2, the engine tries to find a new alignment by advancing T1 to a
> position at or beyond that of T3. At the end of the evaluation, the heap
> is
> emptied (either partially or fully)..
>
> So, long story short, does anyone know which of these Lucene uses? I'm
> looking into how easy it would be to implement a streaming search in
> Lucene
> (where a document is returned as it is found, instead of returning a list
> of
> doc ids and iterating over them), and Document-At-A-Time processing would
> be
> required for this. Does anyone know if this has already been implemented,
> or
> is planned for an upcoming release?
>
> Thanks
>
> Shane
>
>

Re: Document-at-a-time or Term-at-a-time searching

Reply via email to