Hi Lucene geeks,
We have a whole new year in front of us and we don't want to get
bored, do we... so I thought I'd share some interesting ideas that I
encountered over the past few months while reading, now and then, a
bunch of papers on IR. No code yet, sorry - I'm just wondering what it
would be like if Lucene supported this or that functionality. Feel free
to say "nuts" or "useless" or "brilliant" or anything in between. Or
come up with your own ideas!
The following concepts are mainly about maintaining additional index
data for improved performance or functionality. Experimentation in this
area has now become practical in trunk with the completion of the Codec
API, but there may still be some things missing in the APIs, for example
the ability to discover, select and process sub-lists of postings, or to
customize query evaluation algorithms.
Some of these ideas were implemented as part of the original research -
I'm sorry to say that nearly none of them used Lucene; usually it was
either Zettair or Terrier. I'd blame the pre-flex APIs for this, so
hopefully the situation will improve in the coming years.
So, here we go.
1. Block-Max indexes
====================
The idea is presented fully here:
http://cis.poly.edu/suel/papers/bmw.pdf . Basically, it's about skipping
parts of posting lists that are unlikely to contribute to the top-N
documents. The parts of the lists are marked with, well, tombstones that
carry a value - the maximum score of a term query for a given range of
doc-ids (under some metric). For some types of queries it's possible to
predict whether any possible matches in a given portion of the posting
list will produce a candidate that fits in the top-N docids, based on
the maximum value of a term score (or any other useful metric for that
matter). You can read the gory details of query evaluation in the paper.
This is part of the broader topic of dynamic pruning of query
evaluation, and I have a dozen or so other references on it.
In Lucene, we could handle such tombstones using a specialized codec.
However, I think the query evaluation mechanism wouldn't be able to use
this information to skip certain ranges of docs... or maybe it could be
implemented as filters initialized from tombstone values?
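Just to make the skipping idea a bit more concrete, here is a minimal,
self-contained Java sketch (not real Lucene code - PostingsBlock and its
maxScore()/score() methods are hypothetical stand-ins for whatever a
block-max-aware codec would expose) of a scorer loop that consults the
per-block maximum scores and skips whole blocks that cannot compete with
the current top-N threshold:

import java.util.List;
import java.util.PriorityQueue;

/**
 * Minimal sketch of block-max skipping (not Lucene code). A posting list
 * is modelled as a sequence of blocks, each carrying the maximum per-term
 * score of any document inside it (the "tombstone" value above).
 */
public class BlockMaxSketch {

  /** One block of a posting list plus its precomputed score upper bound. */
  public interface PostingsBlock {
    float maxScore();          // upper bound for any doc in this block
    int[] docs();              // doc ids contained in the block
    float score(int doc);      // actual score of one doc
  }

  /** Collect the top-N scores, skipping blocks that cannot compete. */
  public static PriorityQueue<Float> topN(List<PostingsBlock> blocks, int n) {
    PriorityQueue<Float> top = new PriorityQueue<>(); // min-heap of the best n scores
    for (PostingsBlock block : blocks) {
      float threshold = top.size() < n ? Float.NEGATIVE_INFINITY : top.peek();
      if (block.maxScore() <= threshold) {
        continue; // whole block skipped: no doc in it can enter the top-N
      }
      for (int doc : block.docs()) {
        float s = block.score(doc);
        if (s > threshold) {
          top.offer(s);
          if (top.size() > n) {
            top.poll();
          }
          threshold = top.size() < n ? Float.NEGATIVE_INFINITY : top.peek();
        }
      }
    }
    return top;
  }
}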
2. Time-machine indexes
=======================
This is really a variant of the above, only the tombstones record
timestamps (and of course the index is allowed to hold duplicates of
documents).
We can already do an approximation of this by limiting query evaluation
only to the latest segments (if we can guarantee that segment creation /
merging follows monotonically increasing timestamps). But using
tombstones we could merge segments from different periods of time, as
long as we guarantee that we don't split&shuffle blocks of postings that
belong to the same timestamp.
Query evaluation that concerns a time range would then be able to skip
directly to the right tombstones based on timestamps (plus some
additional filtering if the tombstones are too coarse-grained). No idea
how to implement this with the current API - maybe with filters, as
above? Note that the current flex API always assumes that postings need
to be fully decoded for evaluation, because the evaluation algorithms
are codec-independent. Perhaps we could come up with an API that allows
us to customize the evaluation algorithms based on the codec
implementation?
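For illustration, here's a tiny sketch of what timestamp-based block
skipping could look like; BlockMeta and its min/max timestamps are
hypothetical per-block metadata, not anything the current codecs
actually record:

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of time-range skipping over timestamp "tombstones" (not Lucene
 * code). Each block of postings is assumed to carry the min/max timestamp
 * of the documents it covers; BlockMeta is a stand-in for that metadata.
 */
public class TimeMachineSketch {

  public static final class BlockMeta {
    final long minTimestamp, maxTimestamp;
    final long filePointer; // where the block's postings start on disk

    BlockMeta(long minTimestamp, long maxTimestamp, long filePointer) {
      this.minTimestamp = minTimestamp;
      this.maxTimestamp = maxTimestamp;
      this.filePointer = filePointer;
    }
  }

  /**
   * Return only the blocks that may contain docs in [from, to]; everything
   * else is skipped without decoding. Blocks that merely overlap the range
   * still need per-doc filtering on the exact timestamps afterwards.
   */
  public static List<BlockMeta> blocksToDecode(List<BlockMeta> blocks,
                                               long from, long to) {
    List<BlockMeta> result = new ArrayList<>();
    for (BlockMeta b : blocks) {
      if (b.maxTimestamp >= from && b.minTimestamp <= to) {
        result.add(b);
      }
    }
    return result;
  }
}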
3. Caching results as an in-memory inverted index
=================================================
I can't find the paper right now ... perhaps it was by Torsten Suel, who
did a lot of research on the topic of caching. In Solr we use caches to
keep docsets from past queries, and we can do some limited intersections
for simple boolean queries. The idea here is really simple: since we
already pull in results and doc fields (and we know, from rewritten
queries, which terms contribute to these results, so we could provide
those too), we could use this information to build a memory-constrained
inverted index that answers not only simple boolean queries via
intersections of bitsets, but possibly also other queries that require
full query evaluation - and under some metric we could decide whether
the results are exact, good enough, or need to be re-evaluated against
the full index. We could then periodically prune this index based on
LFU, LRU or some such strategy.
Hmm, part of this idea is here, I think:
http://www2008.org/papers/pdf/p387-zhangA.pdf or here:
http://www2005.org/cdrom/docs/p257.pdf
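A minimal sketch of such a cache, assuming we simply map terms from
rewritten queries to bitsets of cached doc ids and evict with plain LRU
- none of this reflects Solr's actual cache implementation, and the
class name and size bound are purely illustrative:

import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of a memory-bounded results cache organized as a tiny inverted
 * index. Terms from rewritten queries map to bitsets of doc ids seen in
 * past results; simple boolean queries are answered by bitset operations.
 * Eviction is plain LRU via LinkedHashMap's access order; LFU or anything
 * smarter would slot into the same place.
 */
public class ResultCacheIndexSketch {

  private final int maxEntries;
  private final Map<String, BitSet> postings;

  public ResultCacheIndexSketch(int maxEntries) {
    this.maxEntries = maxEntries;
    this.postings = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
        return size() > ResultCacheIndexSketch.this.maxEntries; // LRU pruning
      }
    };
  }

  /** Record that the given term matched the given (cached) doc id. */
  public void add(String term, int docId) {
    postings.computeIfAbsent(term, t -> new BitSet()).set(docId);
  }

  /** AND-intersection of two cached terms; an empty result may still mean
   *  "not cached", so callers must fall back to the full index. */
  public BitSet and(String term1, String term2) {
    BitSet a = postings.get(term1);
    BitSet b = postings.get(term2);
    if (a == null || b == null) {
      return new BitSet(); // cache miss: evaluate against the full index
    }
    BitSet result = (BitSet) a.clone();
    result.and(b);
    return result;
  }
}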
BTW, there are dozens of papers on caching in search engines, for
example this one:
http://www.hugo-zaragoza.net/academic/pdf/blanco_SIGIR2010b.pdf - here
the author argues against throwing away all cached lists after an index
update (which is what we do in Solr), and instead for keeping those
lists that are likely to give the same results as before the update.
4. Phrase indexing
==================
Again, I lost the reference to the paper that describes this ... I'll
find it later. Phrase indexing is of course well known, and has
well-known benefits and costs (mostly prohibitive, except for a very
limited number of phrases). The idea here is to index phrases in such a
way that the term dictionary (and postings) consists only of relatively
long phrases, and postings for all shorter phrases subsumed by the long
phrases are put in the same posting lists. Now, the dictionary also
needs to store pointers from each leading term of the shorter phrases to
the corresponding longer-phrase entry, so that we can find the right
postings given a shorter phrase. The postings are also augmented with
bitmasks that indicate which terms of the phrase match in which document
on the list.
(Hmm, maybe it was this paper?
http://www.mpi-inf.mpg.de/~bast/papers/autocompletion-sigir.pdf)
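To illustrate the dictionary layout, here's a rough sketch, assuming
short phrases are just aliases into the long-phrase entries and each
posting carries a bitmask over the long phrase's term slots. All names
are hypothetical, and the sketch ignores the positional check needed to
verify an actual phrase occurrence inside a matching document:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the phrase-dictionary layout described above (not Lucene code). */
public class PhraseIndexSketch {

  /** A posting: doc id plus a bitmask over the long phrase's term slots. */
  public static final class PhrasePosting {
    final int docId;
    final int termMask; // bit i set => i-th term of the long phrase occurs in docId

    PhrasePosting(int docId, int termMask) {
      this.docId = docId;
      this.termMask = termMask;
    }
  }

  /** Dictionary entry for a long phrase, which owns the posting list. */
  public static final class LongPhraseEntry {
    final String[] terms;
    final List<PhrasePosting> postings;

    LongPhraseEntry(String[] terms, List<PhrasePosting> postings) {
      this.terms = terms;
      this.postings = postings;
    }
  }

  /** Alias from a shorter phrase into its containing long phrase. */
  public static final class SubPhraseRef {
    final LongPhraseEntry parent;
    final int requiredMask; // bits of the parent's terms that form this sub-phrase

    SubPhraseRef(LongPhraseEntry parent, int requiredMask) {
      this.parent = parent;
      this.requiredMask = requiredMask;
    }
  }

  // leading term of a shorter phrase -> pointer into the long-phrase entry
  private final Map<String, SubPhraseRef> subPhraseDict = new HashMap<>();

  /** Register a shorter phrase as an alias into its containing long phrase. */
  public void addSubPhrase(String shortPhraseKey, LongPhraseEntry parent, int requiredMask) {
    subPhraseDict.put(shortPhraseKey, new SubPhraseRef(parent, requiredMask));
  }

  /** Look up a short phrase and collect docs whose bitmask covers all its terms. */
  public void collectMatches(String shortPhraseKey, List<Integer> out) {
    SubPhraseRef ref = subPhraseDict.get(shortPhraseKey);
    if (ref == null) {
      return; // not subsumed by any indexed long phrase
    }
    for (PhrasePosting p : ref.parent.postings) {
      if ((p.termMask & ref.requiredMask) == ref.requiredMask) {
        out.add(p.docId);
      }
    }
  }
}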
5. Chunk-level indexing
=======================
It's basically a regular index, only we add terms with coarse-grained
position information - instead of storing positions for every posting
(or none) we store only "chunk" numbers, where "chunk" could be
interpreted as a sentence (or a page, or a paragraph, or a chunk ;) ).
From the point of view of the API this would translate to several
postings with position increment 0, i.e. several terms would end up at
the same positions. Obviously, this lossy encoding of term proximity
would save a lot of space and would speed up proximity query evaluation,
at the cost of matching with coarse "slop" - but even then we would know
that the slop is limited to the chunk size, which is often good enough.
Phrase/span scorers would have to understand that they are looking for
terms that have the same (equal) "chunk" number, and score them
accordingly (whereas the regular phrase scorer looks for postings with
posIncr != 0, or posIncr == 1 for exact phrases).
The following paper discusses this concept in detail:
http://www.isi.edu/~metzler/papers/elsayed-cikm11.pdf and this one
(paywall): http://www.springerlink.com/index/T5355418276V7115.pdf
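On the analysis side this could be as simple as a TokenFilter that
collapses positions within a chunk - something like the sketch below,
which assumes the upstream tokenizer emits a hypothetical "<chunk>"
boundary token at sentence/paragraph breaks (that marker is my
assumption, not an existing convention):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Sketch of a filter that replaces exact positions with coarse "chunk"
 * positions: the first token after a chunk boundary advances the position
 * by 1, every other token in the chunk gets a position increment of 0, so
 * all terms of a chunk end up at the same position.
 */
public final class ChunkPositionFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int nextIncrement = 1; // the first token of every chunk advances by 1

  public ChunkPositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if ("<chunk>".contentEquals(termAtt)) { // hypothetical boundary marker
        nextIncrement = 1;                    // next real token starts a new chunk
        continue;                             // the marker itself is dropped
      }
      posIncrAtt.setPositionIncrement(nextIncrement);
      nextIncrement = 0;                      // rest of the chunk stays at the same position
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    nextIncrement = 1;
  }
}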
6. Stored fields compaction for efficient snippet generation
============================================================
This time I have the links to the papers:
http://www.springerlink.com/index/j2774187g532603t.pdf and
http://www.edbt.org/Proceedings/2011-Uppsala/papers/edbt/a10-ceccarelli.pdf
. The idea again is quite simple: instead of using the full text for
snippet generation and highlighting, why not choose the best candidate
snippets in advance, and store/cache/highlight only those?
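A toy sketch of the indexing-time selection step (not the method from
the papers - the per-term weights stand in for whatever importance
measure is used, e.g. something derived from query logs or collection
statistics): score the candidate sentences, then keep only the k best
ones as the stored/highlightable representation of the document.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch of choosing the best candidate snippets ahead of query time. */
public class SnippetSelectionSketch {

  /** Score a sentence as the sum of the weights of its known terms. */
  static double score(String sentence, Map<String, Double> termWeights) {
    double s = 0;
    for (String token : sentence.toLowerCase().split("\\W+")) {
      s += termWeights.getOrDefault(token, 0.0);
    }
    return s;
  }

  /** Pick the k highest-scoring sentences to store instead of the full text. */
  public static List<String> selectSnippets(List<String> sentences,
                                            Map<String, Double> termWeights,
                                            int k) {
    List<String> sorted = new ArrayList<>(sentences);
    sorted.sort(Comparator.comparingDouble(
        (String sent) -> score(sent, termWeights)).reversed());
    return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
  }
}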
And finally some other odd-ball links to cool papers:
* Hauff, C. (2010). Predicting the effectiveness of queries and
retrieval systems.
http://eprints.eemcs.utwente.nl/17338/01/final_version_LR.pdf -
concerning the evaluation of query complexity and the routing of queries
to the indexes (or nodes) best able to answer them. See also
http://www.morganclaypool.com/doi/pdf/10.2200/S00235ED1V01Y201004ICR015
(behind a paywall). Now that we can efficiently construct subsets of
indexes on the fly, I'm really tempted to implement the tiered search
model that I presented at Lucene Revolution, unless someone beats me to it.
* F. Claude, A. Farina (2009). Re-Pair Compression of Inverted Lists.
http://arxiv.org/pdf/0911.3318 - on the surface it's a wild idea that
apparently works... it's an LZ-like compression method for postings,
plus a set of algorithms for intersecting these lists without
decompression.
I think that's it for now... Enjoy!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com