Beyond Lucene 2.0 Index Design

Dalton, Jeffery Tue, 09 Jan 2007 06:26:02 -0800

Hi,

I wanted to start some discussion about possible future Lucene file /
index formats.  This is an extension to the discussion on Flexible
Lucene Indexing discussed on the wiki:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing


Note: Related sources are listed at the end.    

I would like to have the ability to create term frequency [Persin, et
al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq
data) . A posting list sorted by Term frequency rather than document id
is straightforward (posting design below).  An Impact sorted list is
relatively new (and perhaps unfamiliar).  An Impact is a single integer
value for a term in a document that is stored in the posting list and is
computed from the combination of the term frequency, document boost,
field boost, length norms, and other arbitrary scoring features (word
position, font, etc...) -- all local information. 

The driving motivation for this change is to avoid reading the entire
posting list from disk for very long posting lists (it also leads to
simplified query-time scoring because the tf, norms, and boosts are
built into the impact).  This would address scalability issues with
large collections that have been seen in the past; back in December 2005
there were two threads: "Lucene Performance Bottlenecks" (Lucene User)
and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev)
where Doug and Andrzej addressed some speed concerns by sorting the
Nutch index based on Document Boost (IndexSorter and a TopDocsCollector)
[inpsired by Long, Suel]. The goal is that an impact sorted posting list
would address these and other concerns in a generic manner.    

Allowing a frequency or impact sorted posting list format would lead to
a posting list with the following structure:  
(Frequency or impact could be used interchangeably in the structure.
Lettering continued from Wiki)

e. <impact, num_docs, (doc1,...docN)> 
f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
,<positions>]) 

The positions are delta encoded for compression.  Similarly, the
document numbers for a given frequency/impact are delta encoded.  If you
read Moffat and Persin, the papers show that this achieves compression
comparable to, or even better than, a standard delta encoded docId
sorted index.  The structure lends itself well to early termination,
pruning, etc... where the entire posting list is not read from disk.  

This type of Impact sorted structure (or similar concept) seems to be
becoming a "standard" solution described in a lot of new research / text
books on IR for large scale indexes.  It would be great if Lucene
supported something like this someday ;-).

Thanks,

Jeff Dalton

References:
Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
2001.
Anh, Moffat. Impact Transformation: Effective and Efficient Web
Retrieval. 2006.  
Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
Long, Suel. Optimized Query Execution in Large Search Engine with Global
Page Ordering.
Manning, Raghavan, Schutze. Introduction to Information Retrieval,
Chapters 2,7.
http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html

Persin, et al. Filtered Document Retrieval with Frequency-Sorted
Indexes. 1996.






---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Beyond Lucene 2.0 Index Design

Reply via email to