Hi, I wanted to start some discussion about possible future Lucene file / index formats. This is an extension to the discussion on Flexible Lucene Indexing discussed on the wiki: http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
Note: Related sources are listed at the end. I would like to have the ability to create term frequency [Persin, et al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq data) . A posting list sorted by Term frequency rather than document id is straightforward (posting design below). An Impact sorted list is relatively new (and perhaps unfamiliar). An Impact is a single integer value for a term in a document that is stored in the posting list and is computed from the combination of the term frequency, document boost, field boost, length norms, and other arbitrary scoring features (word position, font, etc...) -- all local information. The driving motivation for this change is to avoid reading the entire posting list from disk for very long posting lists (it also leads to simplified query-time scoring because the tf, norms, and boosts are built into the impact). This would address scalability issues with large collections that have been seen in the past; back in December 2005 there were two threads: "Lucene Performance Bottlenecks" (Lucene User) and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev) where Doug and Andrzej addressed some speed concerns by sorting the Nutch index based on Document Boost (IndexSorter and a TopDocsCollector) [inpsired by Long, Suel]. The goal is that an impact sorted posting list would address these and other concerns in a generic manner. Allowing a frequency or impact sorted posting list format would lead to a posting list with the following structure: (Frequency or impact could be used interchangeably in the structure. Lettering continued from Wiki) e. <impact, num_docs, (doc1,...docN)> f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq ,<positions>]) The positions are delta encoded for compression. Similarly, the document numbers for a given frequency/impact are delta encoded. If you read Moffat and Persin, the papers show that this achieves compression comparable to, or even better than, a standard delta encoded docId sorted index. The structure lends itself well to early termination, pruning, etc... where the entire posting list is not read from disk. This type of Impact sorted structure (or similar concept) seems to be becoming a "standard" solution described in a lot of new research / text books on IR for large scale indexes. It would be great if Lucene supported something like this someday ;-). Thanks, Jeff Dalton References: Anh, Moffat. Vector-Space Ranking with Effective Early Termination. 2001. Anh, Moffat. Impact Transformation: Effective and Efficient Web Retrieval. 2006. Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006. Long, Suel. Optimized Query Execution in Large Search Engine with Global Page Ordering. Manning, Raghavan, Schutze. Introduction to Information Retrieval, Chapters 2,7. http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html Persin, et al. Filtered Document Retrieval with Frequency-Sorted Indexes. 1996. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]