Hi Lucene geeks,

We have a whole new year in front of us and we don't want to get bored, do we... so I thought I'd share some interesting ideas that I encountered over the past few months while reading, now and then, a bunch of papers on IR. No working code yet, sorry - just a few rough sketches below, and some wondering about what it would be like if Lucene supported this or that functionality. Feel free to say "nuts" or "useless" or "brilliant" or anything in between. Or come up with your own ideas!

The following concepts are mainly about maintaining additional index data for improved performance or functionality. Experimentation in this area has now become practical in trunk with the completion of the Codec API, but there may still be some things missing in the APIs - for example the ability to discover, select and process sub-lists of postings, or ways to customize query evaluation algorithms, etc.

Some of these ideas got implemented as part of the original research - I'm sorry to say that nearly none of them used Lucene; usually it was either Zettair or Terrier. I'd blame the pre-flex APIs for this, so hopefully the situation will improve in the coming years.

So, here we go.

1. Block-Max indexes
====================
The idea is presented fully here: http://cis.poly.edu/suel/papers/bmw.pdf . Basically, it's about skipping parts of posting lists that are unlikely to contribute to the top-N documents. Parts of each list are marked with, well, tombstones that carry a value: the maximum score a term query can produce over a given range of doc-ids (under some metric). For some types of queries it's then possible to predict, based on that maximum term score (or any other useful metric for that matter), whether any possible match in a given portion of the posting list could produce a candidate that fits in the top-N docids. You can read the gory details of query evaluation in the paper. This is part of the broader topic of dynamic pruning of query evaluation, and I have a dozen or so other references on this.

In Lucene, we could handle such tombstones using a specialized codec. However, I think the query evaluation mechanism wouldn't be able to use this information to skip certain ranges of docs... or maybe it could be implemented as filters initialized from tombstone values?
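
To make this a bit more concrete, here's a rough sketch of the skipping loop. None of these interfaces exist in Lucene - the names are completely made up, just to illustrate how a codec-provided per-block maximum score could let a scorer jump over whole blocks:

  // Hypothetical sketch only - BlockMaxPostingsEnum, maxScoreOfCurrentBlock()
  // and skipToNextBlock() are invented names, not real Lucene APIs.
  interface BlockMaxPostingsEnum {
    int docID();
    int nextDoc();                  // advance to the next doc in the current block
    float maxScoreOfCurrentBlock(); // the "tombstone" value for this block of doc-ids
    int skipToNextBlock();          // jump over the remaining docs of the block
  }

  interface TopNCollector {
    float minCompetitiveScore();    // score of the current N-th best hit
    void collect(int doc, float score);
  }

  final class BlockMaxScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    void score(BlockMaxPostingsEnum postings, TopNCollector topN) {
      int doc = postings.nextDoc();
      while (doc != NO_MORE_DOCS) {
        // If even the best possible score in this block is not competitive,
        // none of its documents can enter the top-N - skip the whole block.
        if (postings.maxScoreOfCurrentBlock() <= topN.minCompetitiveScore()) {
          doc = postings.skipToNextBlock();
          continue;
        }
        topN.collect(doc, scoreCurrentDoc(postings));
        doc = postings.nextDoc();
      }
    }

    private float scoreCurrentDoc(BlockMaxPostingsEnum postings) {
      return 0f; // placeholder - a real scorer would consult the similarity here
    }
  }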

2. Time-machine indexes
=======================
This is really a variant of the above, only the tombstones record timestamps (and of course the index is allowed to hold duplicates of documents).

We can already do an approximation of this by limiting query evaluation only to the latest segments (if we can guarantee that segment creation / merging follows monotonically increasing timestamps). But using tombstones we could merge segments from different periods of time, as long as we guarantee that we don't split&shuffle blocks of postings that belong to the same timestamp.

Query evaluation that concerns a time range would then be able to skip directly to the right tombstones based on timestamps (plus some additional filtering if tombstones are too coarse-grained). No idea how to implement this with the current API - maybe with filters, as above?

Note that the current flex API always assumes that postings need to be fully decoded for evaluation, because the evaluation algorithms are codec-independent. Perhaps we could come up with an API that allows us to customize the evaluation algorithms based on the codec implementation?
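
A similar (equally hypothetical) sketch for the time-range case - again, TimestampedPostingsEnum is just an invented name for what a time-aware codec might expose:

  // Hypothetical sketch only - nothing like TimestampedPostingsEnum exists
  // in the current flex API.
  interface TimestampedPostingsEnum {
    int docID();
    int nextDoc();
    long blockTimestamp();   // tombstone: the timestamp shared by the current block
    int skipToNextBlock();   // jump past the rest of the current block
  }

  final class TimeRangeMatcher {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Collect doc-ids from blocks whose tombstone falls inside [fromTs, toTs].
    // If tombstones are coarser than the query range, the caller would still
    // need per-document filtering on top of this.
    java.util.List<Integer> matching(TimestampedPostingsEnum postings,
                                     long fromTs, long toTs) {
      java.util.List<Integer> hits = new java.util.ArrayList<Integer>();
      int doc = postings.nextDoc();
      while (doc != NO_MORE_DOCS) {
        long ts = postings.blockTimestamp();
        if (ts < fromTs || ts > toTs) {
          doc = postings.skipToNextBlock();  // the whole block is outside the range
        } else {
          hits.add(doc);
          doc = postings.nextDoc();
        }
      }
      return hits;
    }
  }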

3. Caching results as an in-memory inverted index
=================================================
I can't find the paper right now ... perhaps it was by Torsten Suel, who did a lot of research on the topic of caching. In Solr we use caches for docsets from past queries, and we can do some limited intersections for simple boolean queries. The idea here is really simple. We already pull in results and doc fields, and we know (from the rewritten queries) which terms contributed to those results, so we could provide these too. With this information we could build a memory-constrained inverted index that answers not only simple boolean queries via intersections of bitsets, but possibly also other queries that require full query evaluation - and under some metric we could decide that the results are either exact, good enough, or need to be evaluated against the full index. We could then periodically prune this index based on LFU, LRU or some such strategy. Hmm, part of this idea is here, I think: http://www2008.org/papers/pdf/p387-zhangA.pdf or here: http://www2005.org/cdrom/docs/p257.pdf
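
Just to show the shape of the thing, a toy sketch (plain Java, not Lucene code): a tiny term-to-bitset index over a cache-local doc-id space, with LRU eviction of whole terms:

  // Toy sketch, not Lucene code: an in-memory inverted index over cached
  // results, with LRU eviction of terms when the cache grows too large.
  import java.util.BitSet;
  import java.util.LinkedHashMap;
  import java.util.Map;

  final class ResultCacheIndex {
    private final LinkedHashMap<String, BitSet> postings;

    ResultCacheIndex(final int maxTerms) {
      // an access-ordered LinkedHashMap gives us LRU eviction for free
      this.postings = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
          return size() > maxTerms;
        }
      };
    }

    // Record that the cached doc (local id) matched this query term.
    void add(String term, int cachedDocId) {
      BitSet bits = postings.get(term);
      if (bits == null) {
        bits = new BitSet();
        postings.put(term, bits);
      }
      bits.set(cachedDocId);
    }

    // Conjunctive (AND) query answered purely from the cache, or null if some
    // term is unknown and the query has to fall through to the full index.
    BitSet intersect(String... terms) {
      BitSet result = null;
      for (String term : terms) {
        BitSet bits = postings.get(term);
        if (bits == null) {
          return null;               // cache miss - evaluate on the full index
        }
        if (result == null) {
          result = (BitSet) bits.clone();
        } else {
          result.and(bits);
        }
      }
      return result;
    }
  }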

BTW, there are dozens of papers on caching in search engines, for example this one: http://www.hugo-zaragoza.net/academic/pdf/blanco_SIGIR2010b.pdf - here the author argues against throwing away all cached lists after an index update (which is what we do in Solr), and instead suggests keeping those lists that are likely to give the same results as before the update.

4. Phrase indexing
==================
Again, I lost the reference to the paper that describes this ... I'll find it later. Phrase indexing is of course well known, and has well-known benefits and costs (mostly prohibitive, except for a very limited number of phrases). The idea here is to index phrases in such a way that the term dictionary (and postings) consists only of relatively long phrases, and the postings for all shorter phrases subsumed by the long phrases go into the same posting lists. The dictionary then also needs to store pointers from the leading term of each shorter phrase to the corresponding longer phrase entry, so that we can find the right postings given a shorter phrase. And the postings are augmented with bitmasks that record which terms of the phrase occur in which document on the list.

(Hmm, maybe it was this paper? http://www.mpi-inf.mpg.de/~bast/papers/autocompletion-sigir.pdf)
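
A purely structural sketch of the layout I have in mind (invented names, and the bitmask verification for the shorter phrase is glossed over):

  // Structural sketch only - these are not real Lucene classes.
  import java.util.List;
  import java.util.Map;

  final class PhrasePosting {
    int docId;
    // Bit i is set when the i-th term of the long phrase occurs in this doc;
    // this is what lets the same posting list answer subsumed shorter phrases.
    long termMask;
  }

  final class PhraseDictionary {
    // One entry per indexed "long" phrase.
    Map<String, List<PhrasePosting>> longPhrasePostings;
    // Leading term of each subsumed shorter phrase points to the long phrase
    // whose posting list covers it.
    Map<String, String> leadingTermToLongPhrase;

    // Find the candidate postings for a (possibly shorter) phrase; a real
    // matcher would still verify the termMask bits for the phrase's terms.
    List<PhrasePosting> lookup(String phrase) {
      String leadingTerm = phrase.split(" ")[0];
      String longPhrase = leadingTermToLongPhrase.get(leadingTerm);
      return longPhrase == null ? null : longPhrasePostings.get(longPhrase);
    }
  }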

5. Chunk-level indexing
=======================
It's basically a regular index, only we add terms with coarse-grained position information - instead of storing positions for every posting (or none at all) we store only "chunk" numbers, where a "chunk" could be interpreted as a sentence (or a page, or a paragraph, or a chunk ;) ). From the point of view of the API this would translate to several postings with position increment 0, i.e. several terms would end up at the same position. Obviously, this lossy encoding of term proximity would save a lot of space and would speed up proximity query evaluation, at the cost of matching with a coarse "slop" - but even then we would know that the slop is limited to the chunk size, which is often good enough. Phrase/span scorers would have to understand that they are looking for terms that have the same (equal) "chunk" number and score them accordingly (whereas the regular phrase scorer works from actual position differences, e.g. requiring consecutive positions for exact phrases).

The following paper discusses this concept in detail: http://www.isi.edu/~metzler/papers/elsayed-cikm11.pdf and this one (paywall): http://www.springerlink.com/index/T5355418276V7115.pdf
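
The analysis side is actually easy to sketch with the existing API - something like the filter below, which (just to keep the example short) uses fixed-size chunks of N tokens instead of sentences, so that the indexed "position" effectively becomes the chunk number:

  // A minimal sketch of the analysis side: all tokens within a chunk share the
  // same position (increment 0), so positions effectively become chunk numbers.
  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

  public final class ChunkPositionFilter extends TokenFilter {
    private static final int CHUNK_SIZE = 16;  // tokens per chunk (arbitrary)
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);
    private int tokenInChunk = 0;

    public ChunkPositionFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      // The first token of each chunk advances the position by 1, the rest
      // stay put - so every token in the chunk ends up at the same position.
      posIncrAtt.setPositionIncrement(tokenInChunk == 0 ? 1 : 0);
      tokenInChunk = (tokenInChunk + 1) % CHUNK_SIZE;
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      tokenInChunk = 0;
    }
  }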

6. Stored fields compaction for efficient snippet generation
============================================================
This time I have the links to the papers: http://www.springerlink.com/index/j2774187g532603t.pdf and http://www.edbt.org/Proceedings/2011-Uppsala/papers/edbt/a10-ceccarelli.pdf . The idea again is quite simple: instead of using the full text for snippet generation and highlighting, why not choose the best candidate snippets in advance and store/cache/highlight only those?
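
The indexing side is trivial to sketch (give or take the exact Field API - the class and field names below are made up; how to pick the best snippets is of course the interesting part, and it's left out here):

  // Minimal sketch: index the full body but store only the pre-selected
  // snippets, so highlighting never needs to touch the full text.
  import java.util.List;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StoredField;
  import org.apache.lucene.document.TextField;

  final class SnippetStoredFields {
    Document build(String body, List<String> bestSnippets) {
      Document doc = new Document();
      // the full body is indexed for search, but not stored
      doc.add(new TextField("body", body, Field.Store.NO));
      // only the chosen snippets are stored (multi-valued "snippet" field)
      for (String snippet : bestSnippets) {
        doc.add(new StoredField("snippet", snippet));
      }
      return doc;
    }
  }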


And finally some other odd-ball links to cool papers:

* Hauff, C. (2010). Predicting the effectiveness of queries and retrieval systems. http://eprints.eemcs.utwente.nl/17338/01/final_version_LR.pdf - concerning the evaluation of query complexity and the routing of queries to the indexes (or nodes) that can best answer them. See also http://www.morganclaypool.com/doi/pdf/10.2200/S00235ED1V01Y201004ICR015 (behind a paywall). Now that we can efficiently construct subsets of indexes on the fly, I'm really tempted to implement the tiered search model that I presented at Lucene Revolution, unless someone beats me to it.

* F. Claude, A. Fariña (2009). Re-Pair Compression of Inverted Lists. http://arxiv.org/pdf/0911.3318 - on the surface it's a wild idea, but it apparently works... it's an LZ-like compression method for postings, plus a set of algorithms for intersecting such lists without decompression.


I think that's it for now... Enjoy!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

