Scoring today is done doc-at-a-time - all scorers and term posting readers
advance together, so once a new doc is processed, the scores of all previous
docs are known and final. This allows maintaining a bounded-size queue for
collecting the best hits. However, for huge collections, having to
exhaustively scan all posting lists becomes a scalability issue.
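Just to illustrate the current pattern (class and method names below are made
up for the sketch, not the actual Lucene API):

    import java.util.PriorityQueue;

    // Doc-at-a-time collection: docs are visited in increasing docid order,
    // so a doc's score is final once we move past it, and a bounded
    // min-heap of size K is enough to keep the best hits.
    class DocAtATimeSketch {
        // stand-in for summing the per-term contributions of one doc
        static float score(int doc) { return 0f; }

        static PriorityQueue<float[]> topK(int maxDoc, int k) {
            // min-heap ordered by score: the root is the weakest of the top K
            PriorityQueue<float[]> heap =
                new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));
            for (int doc = 0; doc < maxDoc; doc++) {  // exhaustive scan - the scalability issue
                float s = score(doc);
                if (heap.size() < k) {
                    heap.offer(new float[] {s, doc});
                } else if (s > heap.peek()[0]) {
                    heap.poll();                       // drop the current weakest hit
                    heap.offer(new float[] {s, doc});
                }
            }
            return heap;
        }
    }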

If I understand the proposal here correctly, it suggests ordering posting
lists by possible score contribution (frequency/impact) - in particular,
two different terms A and B, both appearing in two docs X and Y, may list
the docs in different orders - X, Y for A and Y, X for B. This
flexibility/granularity, while enabling early termination and pruning,
would require a term-at-a-time scoring algorithm, which would be quite a
change to Lucene scoring.
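Roughly, term-at-a-time scoring over such impact-ordered lists would look
something like this (the ImpactBlock layout and all names are assumptions
made for the sketch, not an existing Lucene structure):

    import java.util.HashMap;
    import java.util.Map;

    // Term-at-a-time scoring: since each term lists its docs in its own
    // (impact-descending) order, scores are gathered in per-doc accumulators
    // rather than finalized doc by doc.
    class TermAtATimeSketch {
        // one decoded block: all docs sharing the same impact for a term
        record ImpactBlock(int impact, int[] docs) {}

        static Map<Integer, Float> score(Map<String, ImpactBlock[]> postings,
                                         Map<String, Float> queryWeights,
                                         int minImpact) {
            Map<Integer, Float> acc = new HashMap<>();
            for (var term : queryWeights.entrySet()) {
                for (ImpactBlock block : postings.get(term.getKey())) {  // impacts descend
                    if (block.impact() < minImpact) break;  // early termination / pruning
                    for (int doc : block.docs()) {
                        acc.merge(doc, block.impact() * term.getValue(), Float::sum);
                    }
                }
            }
            return acc;  // top hits are selected from the accumulators afterwards
        }
    }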

The work cited seems less flexible - only a single value is maintained per
document - e.g. PageRank - and document IDs are then reassigned so that in
the resulting index the posting lists are still ordered, as today, by
docid, but satisfy rank(i) >= rank(i+1).
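So the reordering there is really just a renumbering step, something along
these lines (purely illustrative; Nutch's IndexSorter does the analogous
thing with document boosts):

    import java.util.Arrays;
    import java.util.Comparator;

    // Assign new docids in descending rank order, so that in the rewritten
    // index rank(i) >= rank(i+1) while posting lists stay docid-sorted.
    class RankRenumberSketch {
        static int[] oldToNew(float[] rank) {
            Integer[] byRank = new Integer[rank.length];
            for (int i = 0; i < rank.length; i++) byRank[i] = i;
            Arrays.sort(byRank, Comparator.comparingDouble(d -> -rank[d]));
            int[] map = new int[rank.length];
            for (int newId = 0; newId < byRank.length; newId++) {
                map[byRank[newId]] = newId;   // old docid -> new docid
            }
            return map;   // every posting is then rewritten through this map
        }
    }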

If the latter (single-value-per-doc) approach is taken, we would still need
to decide when to sort - externally, when exporting the (large) index for
search, or internally, as part of merging. If done as part of merging, each
resulting segment would always be ordered (by rank); merging becomes
trickier, since you don't want to exhaust memory while doing it, but I
think it is doable.
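For example, the merge itself could stay streaming with a k-way merge by
rank instead of by docid - a sketch, assuming each input segment is already
rank-ordered (all names here are invented for illustration):

    import java.util.PriorityQueue;

    // K-way merge of segments that are each already ordered by descending rank.
    // Only one cursor per input segment is held in memory, so whole segments
    // never need to be buffered.
    class RankMergeSketch {
        // toy cursor over one segment's ranks, in descending order
        record Cursor(float[] ranks, int pos) {
            boolean exhausted() { return pos >= ranks.length; }
            float current() { return ranks[pos]; }
        }

        static void merge(Cursor[] segments) {
            PriorityQueue<Cursor> pq =
                new PriorityQueue<>((a, b) -> Float.compare(b.current(), a.current()));
            for (Cursor c : segments) if (!c.exhausted()) pq.offer(c);
            int newDocId = 0;
            while (!pq.isEmpty()) {
                Cursor top = pq.poll();
                // emit top's current doc as newDocId in the merged, rank-ordered segment
                newDocId++;
                Cursor advanced = new Cursor(top.ranks(), top.pos() + 1);
                if (!advanced.exhausted()) pq.offer(advanced);
            }
        }
    }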

"Dalton, Jeffery" <[EMAIL PROTECTED]> wrote on 09/01/2007 06:25:33:

> Hi,
>
> I wanted to start some discussion about possible future Lucene file /
> index formats.  This is an extension to the discussion on Flexible
> Lucene Indexing discussed on the wiki:
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
>
> Note: Related sources are listed at the end.
>
> I would like to have the ability to create term frequency [Persin, et
> al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq
> data).  A posting list sorted by term frequency rather than document id
> is straightforward (posting design below).  An Impact sorted list is
> relatively new (and perhaps unfamiliar).  An Impact is a single integer
> value for a term in a document that is stored in the posting list and is
> computed from the combination of the term frequency, document boost,
> field boost, length norms, and other arbitrary scoring features (word
> position, font, etc...) -- all local information.
>
> The driving motivation for this change is to avoid reading the entire
> posting list from disk for very long posting lists (it also leads to
> simplified query-time scoring because the tf, norms, and boosts are
> built into the impact).  This would address scalability issues with
> large collections that have been seen in the past; back in December 2005
> there were two threads: "Lucene Performance Bottlenecks" (Lucene User)
> and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev)
> where Doug and Andrzej addressed some speed concerns by sorting the
> Nutch index based on Document Boost (IndexSorter and a TopDocsCollector)
> [inspired by Long, Suel]. The goal is that an impact sorted posting list
> would address these and other concerns in a generic manner.
>
> Allowing a frequency or impact sorted posting list format would lead to
> a posting list with the following structure:
> (Frequency or impact could be used interchangeably in the structure.
> Lettering continued from Wiki)
>
> e. <impact, num_docs, (doc1,...docN)>
> f. <impact, num_docs, ([doc1, freq, <positions>],...[docN, freq,
> <positions>])>
>
> The positions are delta encoded for compression.  Similarly, the
> document numbers for a given frequency/impact are delta encoded.  If you
> read Moffat and Persin, the papers show that this achieves compression
> comparable to, or even better than, a standard delta encoded docId
> sorted index.  The structure lends itself well to early termination,
> pruning, etc... where the entire posting list is not read from disk.
>
> This type of Impact sorted structure (or similar concept) seems to be
> becoming a "standard" solution described in a lot of new research / text
> books on IR for large scale indexes.  It would be great if Lucene
> supported something like this someday ;-).
>
> Thanks,
>
> Jeff Dalton
>
> References:
> Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
> 2001.
> Anh, Moffat. Impact Transformation: Effective and Efficient Web
> Retrieval. 2006.
> Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
> Long, Suel. Optimized Query Execution in Large Search Engines with Global
> Page Ordering. 2003.
> Manning, Raghavan, Schutze. Introduction to Information Retrieval,
> Chapters 2,7.
> http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html
>
> Persin, et al. Filtered Document Retrieval with Frequency-Sorted
> Indexes. 1996.
>

