Scoring today goes doc-at-a-time - all scorers and term-posting-readers advance together; once a new doc is processed, scoring of previous docs is known and final. This allows maintaining a finite size queue for collecting best hits. Then, for huge collections, having to exhaustively scan all posting lists becomes a scalability issue.
If I understand the proposal here correctly, it is suggested to order posting lists by possible score contribution (frequency/impact) - in particular, two different terms A and B both appearing in two docs X and Y may have the docs in a different order - X Y for A and Y X for B. This flexibility/granularity, while enabling early-terminataion and pruning, would require a term-at-a-time scoring algorithm, which would be quite a change in Lucene scoring. The work sited seems less flexible - only a single value is maintained per document - e.g. Page Rank - and then documents IDs are shuffled so that in the resulted index the posting lists are still ordered - as today - by docids, but sattify: rank(i) >= rank(i+1). If the latter (single-value-per-doc) approach is taken, still needs to decide when to sort - externally, when exporting the (large) index for search, or internally, as part of merge. If done as part of merge, each (result) segment is always ordered (by rank); merging would become more tricky, since you don't want to exhaust memory while merging, but I think it is doable. "Dalton, Jeffery" <[EMAIL PROTECTED]> wrote on 09/01/2007 06:25:33: > Hi, > > I wanted to start some discussion about possible future Lucene file / > index formats. This is an extension to the discussion on Flexible > Lucene Indexing discussed on the wiki: > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing > > Note: Related sources are listed at the end. > > I would like to have the ability to create term frequency [Persin, et > al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq > data) . A posting list sorted by Term frequency rather than document id > is straightforward (posting design below). An Impact sorted list is > relatively new (and perhaps unfamiliar). An Impact is a single integer > value for a term in a document that is stored in the posting list and is > computed from the combination of the term frequency, document boost, > field boost, length norms, and other arbitrary scoring features (word > position, font, etc...) -- all local information. > > The driving motivation for this change is to avoid reading the entire > posting list from disk for very long posting lists (it also leads to > simplified query-time scoring because the tf, norms, and boosts are > built into the impact). This would address scalability issues with > large collections that have been seen in the past; back in December 2005 > there were two threads: "Lucene Performance Bottlenecks" (Lucene User) > and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev) > where Doug and Andrzej addressed some speed concerns by sorting the > Nutch index based on Document Boost (IndexSorter and a TopDocsCollector) > [inpsired by Long, Suel]. The goal is that an impact sorted posting list > would address these and other concerns in a generic manner. > > Allowing a frequency or impact sorted posting list format would lead to > a posting list with the following structure: > (Frequency or impact could be used interchangeably in the structure. > Lettering continued from Wiki) > > e. <impact, num_docs, (doc1,...docN)> > f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq > ,<positions>]) > > The positions are delta encoded for compression. Similarly, the > document numbers for a given frequency/impact are delta encoded. If you > read Moffat and Persin, the papers show that this achieves compression > comparable to, or even better than, a standard delta encoded docId > sorted index. The structure lends itself well to early termination, > pruning, etc... where the entire posting list is not read from disk. > > This type of Impact sorted structure (or similar concept) seems to be > becoming a "standard" solution described in a lot of new research / text > books on IR for large scale indexes. It would be great if Lucene > supported something like this someday ;-). > > Thanks, > > Jeff Dalton > > References: > Anh, Moffat. Vector-Space Ranking with Effective Early Termination. > 2001. > Anh, Moffat. Impact Transformation: Effective and Efficient Web > Retrieval. 2006. > Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006. > Long, Suel. Optimized Query Execution in Large Search Engine with Global > Page Ordering. > Manning, Raghavan, Schutze. Introduction to Information Retrieval, > Chapters 2,7. > http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html > > Persin, et al. Filtered Document Retrieval with Frequency-Sorted > Indexes. 1996. > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]