Lucene does "docment-at-a-time", as you describe. One could make a query to do term-at-a-time in lucene if desired, as the traversal logic is encapsulated in the particular scorers for queries.
> I'm looking into how easy it would be to implement a streaming search in Lucene > (where a document is returned as it is found, instead of returning a list of > doc ids and iterating over them) It's already done for you :-) Look at the Searcher method that takes a HitCollector as a parameter. It will give you a docid and a score as they are generated (no big internal list of all the matches is generated). -Yonik Now hiring -- http://tinyurl.com/7m67g On 10/10/05, Shane O'Sullivan <[EMAIL PROTECTED]> wrote: > > Hi all, > > Does Lucene use Document and a time (DAAT), or Term at a time (TAAT) > searching? > > To explain what I'm asking, let's look at a three-term query T1 T2 T3 > In term-at-a-time (TAAT) evaluation, the engine would typically start by > scanning the entire postings list of T1, and would keep some score > accumulator for all documents that contain T1. The accumulator keeps the > score component contributed to that document by the occurrences of T1. > Then, the engine would scan the postings list for T2. Any document > containing T2 that doesn't have an accumulator from the pass over T1 can > be > ignored. The accumulators for documents that also contain T2 are updated > to > reflect the additional score that T2 contributes to them. Accumulators for > documents that do not contain T2 can be discarded. The same process > repeats > for the scanning of T3's postings list, and at the end the set of > accumulators contains the scores of all result documents. The accumulators > are sorted, and the results are returned. Note that prior to scanning T3's > postings list, the engine could not be sure of any document that qualifies > as a candidate. > > In document-at-a-time (DAAT) evaluation, the postings lists of all three > terms are traversed in parallel. Typically, the engine would advance the > iterator over T1 to some document containing T1. Then, it would move the > iterator over T2 to the first document containing T2 whose docID equals to > or is greater than the docID pointed to by T1. If the T2 iterator passes > the > T1 iterator, T1 is then advanced to at least the position pointed to by > T2, > and so forth. > Assume that eventually one finds a document containing both T1 and T2. > Then, > the iterator over T3 is advanced to at least that position. If it lands on > the same document containing the first two terms, that document can be > fully > scored and entered into a heap (it surely qualifies as a candidate). T1 is > then advanced again. However, if T3 jumped over the document containing T1 > and T2, the engine tries to find a new alignment by advancing T1 to a > position at or beyond that of T3. At the end of the evaluation, the heap > is > emptied (either partially or fully).. > > So, long story short, does anyone know which of these Lucene uses? I'm > looking into how easy it would be to implement a streaming search in > Lucene > (where a document is returned as it is found, instead of returning a list > of > doc ids and iterating over them), and Document-At-A-Time processing would > be > required for this. Does anyone know if this has already been implemented, > or > is planned for an upcoming release? > > Thanks > > Shane > >
