Thanks Grant, I will take a look at this.   

> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, January 11, 2007 8:12 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
> 
> Hi Jeff,
> 
> Wondering if you (and/or others) would be interested in 
> taking a look at 
> https://issues.apache.org/jira/browse/LUCENE-662 and vetting 
> the new interfaces, etc. to see if you could come up w/ a 
> prototype implementation.  This would help move along 662 as 
> it would sort out some of the issues that come up with the 
> new interface based approach (I often find it takes at least 
> two implementations to fully flesh out an abstraction like this)
> 
> -Grant
> 
> On Jan 9, 2007, at 9:25 AM, Dalton, Jeffery wrote:
> 
> > Hi,
> >
> > I wanted to start some discussion about possible future 
> Lucene file / 
> > index formats.  This is an extension to the discussion on Flexible 
> > Lucene Indexing discussed on the wiki:
> > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
> >
> > Note: Related sources are listed at the end.
> >
> > I would like to have the ability to create term frequency 
> [Persin, et 
> > al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists 
> > (freq
> > data) . A posting list sorted by Term frequency rather than 
> document 
> > id is straightforward (posting design below).  An Impact 
> sorted list 
> > is relatively new (and perhaps unfamiliar).  An Impact is a single 
> > integer value for a term in a document that is stored in 
> the posting 
> > list and is computed from the combination of the term frequency, 
> > document boost, field boost, length norms, and other 
> arbitrary scoring 
> > features (word position, font, etc...) -- all local information.
> >
> > The driving motivation for this change is to avoid reading 
> the entire 
> > posting list from disk for very long posting lists (it also 
> leads to 
> > simplified query-time scoring because the tf, norms, and boosts are 
> > built into the impact).  This would address scalability issues with 
> > large collections that have been seen in the past; back in December
> > 2005
> > there were two threads: "Lucene Performance Bottlenecks" 
> (Lucene User) 
> > and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev) 
> > where Doug and Andrzej addressed some speed concerns by sorting the 
> > Nutch index based on Document Boost (IndexSorter and a
> > TopDocsCollector)
> > [inpsired by Long, Suel]. The goal is that an impact sorted posting 
> > list would address these and other concerns in a generic manner.
> >
> > Allowing a frequency or impact sorted posting list format 
> would lead 
> > to a posting list with the following structure:
> > (Frequency or impact could be used interchangeably in the structure.
> > Lettering continued from Wiki)
> >
> > e. <impact, num_docs, (doc1,...docN)>
> > f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
> > ,<positions>])
> >
> > The positions are delta encoded for compression.  Similarly, the
> > document numbers for a given frequency/impact are delta encoded.   
> > If you
> > read Moffat and Persin, the papers show that this achieves 
> compression 
> > comparable to, or even better than, a standard delta encoded docId 
> > sorted index.  The structure lends itself well to early 
> termination, 
> > pruning, etc... where the entire posting list is not read from disk.
> >
> > This type of Impact sorted structure (or similar concept) 
> seems to be 
> > becoming a "standard" solution described in a lot of new research / 
> > text books on IR for large scale indexes.  It would be 
> great if Lucene 
> > supported something like this someday ;-).
> >
> > Thanks,
> >
> > Jeff Dalton
> >
> > References:
> > Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
> > 2001.
> > Anh, Moffat. Impact Transformation: Effective and Efficient Web 
> > Retrieval. 2006.
> > Anh, Moffat. Pruned Query Evaluation Using Pre-Computed 
> Impacts. 2006.
> > Long, Suel. Optimized Query Execution in Large Search Engine with 
> > Global Page Ordering.
> > Manning, Raghavan, Schutze. Introduction to Information Retrieval, 
> > Chapters 2,7.
> > http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-
> > book.html
> >
> > Persin, et al. Filtered Document Retrieval with Frequency-Sorted 
> > Indexes. 1996.
> >
> >
> >
> >
> >
> >
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to