RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery Tue, 09 Jan 2007 15:15:34 -0800

Doron -- you have the idea.  

And yes, it would be a substantial change to Lucene scoring.  Ideally,
Lucene / doc format would be changed in such a way to support both docId
sorted indexes (and doc-at-a-time processing) and frequency/impact
sorted indexes with term-at-a-time or even score-at-a-time processing.
I think we could do a lot to generalize Lucene to support both modes (at
least at index level).


You do a good job describing doc and term at a time processing.
Score-at-a-time processing is where the posting list with the highest
IDF x current Impact score is read -- the next chunk of index that can
contribute the most to overall document score.  

Doron, the less flexible work I assume you mention is Suel -- Global
Page Ordering.  Indeed, I agree, this is inflexible, requires all
documents to be renumbered, and only sorts by Page Rank.  It's too high
level (only applies to web docs) for Lucene to support.  Anh/Moffat's
Impact sorting is quite flexible and generally has wide application
because it provides one score per term in a document - any local
document information can be factored into the score at index creation
time (such as multiple field or positional boosts).

As you say Doron, I think the best approach is that the sort be
maintained at segment merge time, the ImpactSortedSegmentMerger (or a
class similarly named) would merge sort the postings by score as it
writes them to disk. It's definitely doable -- in fact I have created
classes to do it, but doing so required significant changes to other
parts of Lucene to calculate the score and get it down to the
DocumentWriter / SegmentMerger level.     

On the search side there needs to be ScoreSortedTermDocs and
ScoreSortedTermPositions objects to read the new format, along with
Scorers to perform scoring and intersection.  

The end product would be a very scalable and flexible solution.  

- Jeff

> -----Original Message-----
> From: Doron Cohen [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, January 09, 2007 5:27 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
> 
> Scoring today goes doc-at-a-time - all scorers and 
> term-posting-readers advance together; once a new doc is 
> processed, scoring of previous docs is known and final. This 
> allows maintaining a finite size queue for collecting best 
> hits. Then, for huge collections, having to exhaustively scan 
> all posting lists becomes a scalability issue.
> 
> If I understand the proposal here correctly, it is suggested 
> to order posting lists by possible score contribution 
> (frequency/impact) - in particular, two different terms A and 
> B both appearing in two docs X and Y may have the docs in a 
> different order - X Y for A and Y X for B. This 
> flexibility/granularity, while enabling early-terminataion 
> and pruning, would require a term-at-a-time scoring 
> algorithm, which would be quite a change in Lucene scoring.
> 
> The work sited seems less flexible - only a single value is 
> maintained per document - e.g. Page Rank - and then documents 
> IDs are shuffled so that in the resulted index the posting 
> lists are still ordered - as today - by docids, but sattify: 
> rank(i) >= rank(i+1).
> 
> If the latter (single-value-per-doc) approach is taken, still 
> needs to decide when to sort - externally, when exporting the 
> (large) index for search, or internally, as part of merge. If 
> done as part of merge, each
> (result) segment is always ordered (by rank); merging would 
> become more tricky, since you don't want to exhaust memory 
> while merging, but I think it is doable.
> 
> "Dalton, Jeffery" <[EMAIL PROTECTED]> wrote on 
> 09/01/2007 06:25:33:
> 
> > Hi,
> >
> > I wanted to start some discussion about possible future 
> Lucene file / 
> > index formats.  This is an extension to the discussion on Flexible 
> > Lucene Indexing discussed on the wiki:
> > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
> >
> > Note: Related sources are listed at the end.
> >
> > I would like to have the ability to create term frequency 
> [Persin, et 
> > al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists 
> > (freq
> > data) . A posting list sorted by Term frequency rather than 
> document 
> > id is straightforward (posting design below).  An Impact 
> sorted list 
> > is relatively new (and perhaps unfamiliar).  An Impact is a single 
> > integer value for a term in a document that is stored in 
> the posting 
> > list and is computed from the combination of the term frequency, 
> > document boost, field boost, length norms, and other 
> arbitrary scoring 
> > features (word position, font, etc...) -- all local information.
> >
> > The driving motivation for this change is to avoid reading 
> the entire 
> > posting list from disk for very long posting lists (it also 
> leads to 
> > simplified query-time scoring because the tf, norms, and boosts are 
> > built into the impact).  This would address scalability issues with 
> > large collections that have been seen in the past; back in December 
> > 2005 there were two threads: "Lucene Performance 
> Bottlenecks" (Lucene 
> > User) and "IndexOptimizer Re: Lucene Performance 
> Bottlenecks" (Nutch 
> > Dev) where Doug and Andrzej addressed some speed concerns 
> by sorting 
> > the Nutch index based on Document Boost (IndexSorter and a 
> > TopDocsCollector) [inpsired by Long, Suel]. The goal is 
> that an impact 
> > sorted posting list would address these and other concerns 
> in a generic manner.
> >
> > Allowing a frequency or impact sorted posting list format 
> would lead 
> > to a posting list with the following structure:
> > (Frequency or impact could be used interchangeably in the structure.
> > Lettering continued from Wiki)
> >
> > e. <impact, num_docs, (doc1,...docN)>
> > f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
> > ,<positions>])
> >
> > The positions are delta encoded for compression.  Similarly, the 
> > document numbers for a given frequency/impact are delta 
> encoded.  If 
> > you read Moffat and Persin, the papers show that this achieves 
> > compression comparable to, or even better than, a standard delta 
> > encoded docId sorted index.  The structure lends itself 
> well to early 
> > termination, pruning, etc... where the entire posting list 
> is not read from disk.
> >
> > This type of Impact sorted structure (or similar concept) 
> seems to be 
> > becoming a "standard" solution described in a lot of new research / 
> > text books on IR for large scale indexes.  It would be 
> great if Lucene 
> > supported something like this someday ;-).
> >
> > Thanks,
> >
> > Jeff Dalton
> >
> > References:
> > Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
> > 2001.
> > Anh, Moffat. Impact Transformation: Effective and Efficient Web 
> > Retrieval. 2006.
> > Anh, Moffat. Pruned Query Evaluation Using Pre-Computed 
> Impacts. 2006.
> > Long, Suel. Optimized Query Execution in Large Search Engine with 
> > Global Page Ordering.
> > Manning, Raghavan, Schutze. Introduction to Information Retrieval, 
> > Chapters 2,7.
> > 
> http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.ht
> > ml
> >
> > Persin, et al. Filtered Document Retrieval with Frequency-Sorted 
> > Indexes. 1996.
> >
> >
> >
> >
> >
> >
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Beyond Lucene 2.0 Index Design

Reply via email to