Thanks Grant, I will take a look at this.

> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 11, 2007 8:12 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
>
> Hi Jeff,
>
> Wondering if you (and/or others) would be interested in taking a look at
> https://issues.apache.org/jira/browse/LUCENE-662 and vetting the new
> interfaces, etc. to see if you could come up w/ a prototype
> implementation. This would help move along 662 as it would sort out some
> of the issues that come up with the new interface based approach (I often
> find it takes at least two implementations to fully flesh out an
> abstraction like this)
>
> -Grant
>
> On Jan 9, 2007, at 9:25 AM, Dalton, Jeffery wrote:
>
> > Hi,
> >
> > I wanted to start some discussion about possible future Lucene file /
> > index formats. This is an extension to the discussion on Flexible
> > Lucene Indexing discussed on the wiki:
> > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
> >
> > Note: Related sources are listed at the end.
> >
> > I would like to have the ability to create term frequency [Persin, et
> > al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists
> > (freq data). A posting list sorted by Term frequency rather than
> > document id is straightforward (posting design below). An Impact
> > sorted list is relatively new (and perhaps unfamiliar). An Impact is a
> > single integer value for a term in a document that is stored in the
> > posting list and is computed from the combination of the term
> > frequency, document boost, field boost, length norms, and other
> > arbitrary scoring features (word position, font, etc...) -- all local
> > information.
> >
> > The driving motivation for this change is to avoid reading the entire
> > posting list from disk for very long posting lists (it also leads to
> > simplified query-time scoring because the tf, norms, and boosts are
> > built into the impact). This would address scalability issues with
> > large collections that have been seen in the past; back in December
> > 2005 there were two threads: "Lucene Performance Bottlenecks" (Lucene
> > User) and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch
> > Dev) where Doug and Andrzej addressed some speed concerns by sorting
> > the Nutch index based on Document Boost (IndexSorter and a
> > TopDocsCollector) [inspired by Long, Suel]. The goal is that an impact
> > sorted posting list would address these and other concerns in a
> > generic manner.
> >
> > Allowing a frequency or impact sorted posting list format would lead
> > to a posting list with the following structure:
> > (Frequency or impact could be used interchangeably in the structure.
> > Lettering continued from Wiki)
> >
> > e. <impact, num_docs, (doc1,...docN)>
> > f. <impact, num_docs, ([doc1, freq, <positions>],...[docN, freq, <positions>])>
> >
> > The positions are delta encoded for compression. Similarly, the
> > document numbers for a given frequency/impact are delta encoded. If
> > you read Moffat and Persin, the papers show that this achieves
> > compression comparable to, or even better than, a standard delta
> > encoded docId sorted index. The structure lends itself well to early
> > termination, pruning, etc... where the entire posting list is not read
> > from disk.
> >
> > This type of Impact sorted structure (or similar concept) seems to be
> > becoming a "standard" solution described in a lot of new research /
> > text books on IR for large scale indexes. It would be great if Lucene
> > supported something like this someday ;-).
> >
> > Thanks,
> >
> > Jeff Dalton
> >
> > References:
> > Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
> > 2001.
> > Anh, Moffat. Impact Transformation: Effective and Efficient Web
> > Retrieval. 2006.
> > Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
> > Long, Suel. Optimized Query Execution in Large Search Engine with
> > Global Page Ordering.
> > Manning, Raghavan, Schutze. Introduction to Information Retrieval,
> > Chapters 2,7.
> > http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html
> > Persin, et al. Filtered Document Retrieval with Frequency-Sorted
> > Indexes. 1996.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
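
For anyone following the thread who wants something concrete to poke at,
structure (e) from Jeff's message, <impact, num_docs, (doc1,...docN)>, can
be sketched in a few lines of Java. This is only an illustration under
stated assumptions, not Lucene code and not the LUCENE-662 interfaces: the
names ImpactPostings, Block, addBlock and readAtLeast are made up here, the
data lives in memory rather than on disk, and the "stop once the impact
drops below a threshold" rule is just one simple way to show early
termination.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory sketch of an impact-sorted posting list (structure e). */
final class ImpactPostings {

    /** One block: every document sharing a single impact, doc ids stored as deltas. */
    static final class Block {
        final int impact;
        final int[] docDeltas; // first entry is the doc id itself, the rest are gaps

        Block(int impact, int[] sortedDocIds) {
            this.impact = impact;
            this.docDeltas = new int[sortedDocIds.length];
            int previous = 0;
            for (int i = 0; i < sortedDocIds.length; i++) {
                docDeltas[i] = sortedDocIds[i] - previous; // delta encode within the block
                previous = sortedDocIds[i];
            }
        }

        int numDocs() {
            return docDeltas.length;
        }
    }

    // Blocks are kept in strictly descending impact order.
    private final List<Block> blocks = new ArrayList<>();

    /** Appends a block; callers must add blocks from highest impact to lowest. */
    void addBlock(int impact, int... sortedDocIds) {
        if (!blocks.isEmpty() && blocks.get(blocks.size() - 1).impact <= impact) {
            throw new IllegalArgumentException("blocks must be added in descending impact order");
        }
        blocks.add(new Block(impact, sortedDocIds));
    }

    /** Decodes {docId, impact} pairs and stops at the first block below minImpact. */
    List<int[]> readAtLeast(int minImpact) {
        List<int[]> hits = new ArrayList<>();
        for (Block block : blocks) {
            if (block.impact < minImpact) {
                break; // early termination: the remaining blocks are never decoded
            }
            int docId = 0;
            for (int delta : block.docDeltas) {
                docId += delta; // undo the delta encoding
                hits.add(new int[] {docId, block.impact});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ImpactPostings postings = new ImpactPostings();
        postings.addBlock(9, 4, 17);
        postings.addBlock(5, 2, 8, 30);
        postings.addBlock(1, 1, 3, 5, 7);
        for (int[] hit : postings.readAtLeast(5)) {
            System.out.println("doc=" + hit[0] + " impact=" + hit[1]);
        }
    }
}
```

Because doc ids are only sorted within a block, the deltas restart at each
new impact value; that is the price paid for being able to stop reading
after the high-impact blocks.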