Hi Lucene geeks,
We have a whole new year in front of us and we don't want to get
bored, do we... so I thought I'd share some interesting ideas that I
encountered over the past few months while reading, now and then, a
bunch of papers on IR. No code yet, sorry - I'm just wondering what it
would be like if Lucene supported this or that functionality. Feel free
to say "nuts" or "useless" or "brilliant" or anything in between. Or
come up with your own ideas!
The following concepts are mainly about maintaining additional index
data for improved performance or functionality. Experimentation in this
area has now become practical in trunk with the completion of the Codec
API, but there may still be some things missing in the APIs, for example
the ability to discover, select and process sub-lists of postings, or to
customize query evaluation algorithms.
Some of these ideas were implemented as part of the original research -
I'm sorry to say that nearly none of them used Lucene; usually it was
either Zettair or Terrier. I'd blame the pre-flex APIs for this, so
hopefully the situation will improve in the coming years.
So, here we go.
1. Block-Max indexes
====================
The idea is presented fully here:
http://cis.poly.edu/suel/papers/bmw.pdf . Basically, it's about skipping
parts of posting lists that are unlikely to contribute to the top-N
documents. The parts of the lists are marked with, well, tombstones that
carry a value - the maximum score of a term query for a given range of
doc-ids (under some metric). For some types of queries it's possible to
predict whether any possible matches in a given portion of the posting
list will produce a candidate that fits in the top-N docids, based on
the maximum value of a term score (or any other useful metric for that
matter). You can read the gory details of query evaluation in the paper.
This is part of the broader topic of dynamic pruning of query
evaluation, and I have a dozen or so other references on it.
In Lucene, we could handle such tombstones using a specialized codec.
However, I think the query evaluation mechanism wouldn't be able to use
this information to skip certain ranges of docs... or maybe it could be
implemented as filters initialized from tombstone values?
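Just to make the skipping idea a bit more concrete, here is a minimal,
self-contained Java sketch (not real Lucene code - PostingsBlock and its
maxScore()/score() methods are hypothetical stand-ins for whatever a
block-max-aware codec would expose) of a scorer loop that consults the
per-block maximum scores and skips whole blocks that cannot compete with
the current top-N threshold:

import java.util.List;
import java.util.PriorityQueue;

/**
 * Minimal sketch of block-max skipping (not Lucene code). A posting list
 * is modelled as a sequence of blocks, each carrying the maximum per-term
 * score of any document inside it (the "tombstone" value above).
 */
public class BlockMaxSketch {

  /** One block of a posting list plus its precomputed score upper bound. */
  public interface PostingsBlock {
    float maxScore();          // upper bound for any doc in this block
    int[] docs();              // doc ids contained in the block
    float score(int doc);      // actual score of one doc
  }

  /** Collect the top-N scores, skipping blocks that cannot compete. */
  public static PriorityQueue<Float> topN(List<PostingsBlock> blocks, int n) {
    PriorityQueue<Float> top = new PriorityQueue<>(); // min-heap of the best n scores
    for (PostingsBlock block : blocks) {
      float threshold = top.size() < n ? Float.NEGATIVE_INFINITY : top.peek();
      if (block.maxScore() <= threshold) {
        continue; // whole block skipped: no doc in it can enter the top-N
      }
      for (int doc : block.docs()) {
        float s = block.score(doc);
        if (s > threshold) {
          top.offer(s);
          if (top.size() > n) {
            top.poll();
          }
          threshold = top.size() < n ? Float.NEGATIVE_INFINITY : top.peek();
        }
      }
    }
    return top;
  }
}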
2. Time-machine indexes
=======================
This is really a variant of the above, only the tombstones record
timestamps (and of course the index is allowed to hold duplicates of
documents).
We can already do an approximation of this by limiting query evaluation
only to the latest segments (if we can guarantee that segment creation /
merging follows monotonically increasing timestamps). But using
tombstones we could merge segments from different periods of time, as
long as we guarantee that we don't split&shuffle blocks of postings that
belong to the same timestamp.
Query evaluation that concerns a time range would then be able to skip
directly to the right tombstones based on timestamps (plus some
additional filtering if the tombstones are too coarse-grained). No idea
how to implement this with the current API - maybe with filters, as
above? Note that the current flex API always assumes that postings need
to be fully decoded for evaluation, because the evaluation algorithms
are codec-independent. Perhaps we could come up with an API that allows
us to customize the evaluation algorithms based on the codec
implementation?
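For illustration, here's a tiny sketch of what timestamp-based block
skipping could look like; BlockMeta and its min/max timestamps are
hypothetical per-block metadata, not anything the current codecs
actually record:

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of time-range skipping over timestamp "tombstones" (not Lucene
 * code). Each block of postings is assumed to carry the min/max timestamp
 * of the documents it covers; BlockMeta is a stand-in for that metadata.
 */
public class TimeMachineSketch {

  public static final class BlockMeta {
    final long minTimestamp, maxTimestamp;
    final long filePointer; // where the block's postings start on disk

    BlockMeta(long minTimestamp, long maxTimestamp, long filePointer) {
      this.minTimestamp = minTimestamp;
      this.maxTimestamp = maxTimestamp;
      this.filePointer = filePointer;
    }
  }

  /**
   * Return only the blocks that may contain docs in [from, to]; everything
   * else is skipped without decoding. Blocks that merely overlap the range
   * still need per-doc filtering on the exact timestamps afterwards.
   */
  public static List<BlockMeta> blocksToDecode(List<BlockMeta> blocks,
                                               long from, long to) {
    List<BlockMeta> result = new ArrayList<>();
    for (BlockMeta b : blocks) {
      if (b.maxTimestamp >= from && b.minTimestamp <= to) {
        result.add(b);
      }
    }
    return result;
  }
}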
3. Caching results as an in-memory inverted index
=================================================
I can't find the paper right now ... perhaps it was by Torsten Suel, who
did a lot of research on the topic of caching. In Solr we use caches to
keep docsets from past queries, and we can do some limited intersections
for simple boolean queries. The idea here is really simple: since we
already pull in results and doc fields (and we know, from rewritten
queries, which terms contribute to these results, so we could provide
those too), we could use this information to build a memory-constrained
inverted index that answers not only simple boolean queries via
intersections of bitsets, but possibly also other queries that require
full query evaluation - and under some metric we could decide whether
the results are exact, good enough, or need to be re-evaluated against
the full index. We could then periodically prune this index based on
LFU, LRU or some such strategy.
Hmm, part of this idea is here, I think:
http://www2008.org/papers/pdf/p387-zhangA.pdf or here:
http://www2005.org/cdrom/docs/p257.pdf
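A minimal sketch of such a cache, assuming we simply map terms from
rewritten queries to bitsets of cached doc ids and evict with plain LRU
- none of this reflects Solr's actual cache implementation, and the
class name and size bound are purely illustrative:

import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of a memory-bounded results cache organized as a tiny inverted
 * index. Terms from rewritten queries map to bitsets of doc ids seen in
 * past results; simple boolean queries are answered by bitset operations.
 * Eviction is plain LRU via LinkedHashMap's access order; LFU or anything
 * smarter would slot into the same place.
 */
public class ResultCacheIndexSketch {

  private final int maxEntries;
  private final Map<String, BitSet> postings;

  public ResultCacheIndexSketch(int maxEntries) {
    this.maxEntries = maxEntries;
    this.postings = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
        return size() > ResultCacheIndexSketch.this.maxEntries; // LRU pruning
      }
    };
  }

  /** Record that the given term matched the given (cached) doc id. */
  public void add(String term, int docId) {
    postings.computeIfAbsent(term, t -> new BitSet()).set(docId);
  }

  /** AND-intersection of two cached terms; an empty result may still mean
   *  "not cached", so callers must fall back to the full index. */
  public BitSet and(String term1, String term2) {
    BitSet a = postings.get(term1);
    BitSet b = postings.get(term2);
    if (a == null || b == null) {
      return new BitSet(); // cache miss: evaluate against the full index
    }
    BitSet result = (BitSet) a.clone();
    result.and(b);
    return result;
  }
}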
BTW, there are dozens of papers on caching in search engines, for
example this one:
http://www.hugo-zaragoza.net/academic/pdf/blanco_SIGIR2010b.pdf - here
the author argues against throwing away all cached lists after an index
update (which is what we do in Solr), and instead for keeping those
lists that are likely to give the same results as before the update.
4. Phrase indexing
==================
Again, I lost the reference to the paper that describes this ... I'll
find it later. Phrase indexing is of course well known, and has
well-known benefits and costs (mostly prohibitive, except for a very
limited number of phrases). The idea here is to index phrases in such a
way that the term dictionary (and postings) consists only of relatively
long phrases, and postings for all shorter phrases subsumed by the long
phrases are put in the same posting lists. Now, the dictionary also
needs to store pointers from each leading term of the shorter phrases to
the corresponding longer-phrase entry, so that we can find the right
postings given a shorter phrase. The postings are also augmented with
bitmasks that indicate which terms of the phrase match in which document
on the list.
(Hmm, maybe it was this paper?
http://www.mpi-inf.mpg.de/~bast/papers/autocompletion-sigir.pdf)
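To illustrate the dictionary layout, here's a rough sketch, assuming
short phrases are just aliases into the long-phrase entries and each
posting carries a bitmask over the long phrase's term slots. All names
are hypothetical, and the sketch ignores the positional check needed to
verify an actual phrase occurrence inside a matching document:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the phrase-dictionary layout described above (not Lucene code). */
public class PhraseIndexSketch {

  /** A posting: doc id plus a bitmask over the long phrase's term slots. */
  public static final class PhrasePosting {
    final int docId;
    final int termMask; // bit i set => i-th term of the long phrase occurs in docId

    PhrasePosting(int docId, int termMask) {
      this.docId = docId;
      this.termMask = termMask;
    }
  }

  /** Dictionary entry for a long phrase, which owns the posting list. */
  public static final class LongPhraseEntry {
    final String[] terms;
    final List<PhrasePosting> postings;

    LongPhraseEntry(String[] terms, List<PhrasePosting> postings) {
      this.terms = terms;
      this.postings = postings;
    }
  }

  /** Alias from a shorter phrase into its containing long phrase. */
  public static final class SubPhraseRef {
    final LongPhraseEntry parent;
    final int requiredMask; // bits of the parent's terms that form this sub-phrase

    SubPhraseRef(LongPhraseEntry parent, int requiredMask) {
      this.parent = parent;
      this.requiredMask = requiredMask;
    }
  }

  // leading term of a shorter phrase -> pointer into the long-phrase entry
  private final Map<String, SubPhraseRef> subPhraseDict = new HashMap<>();

  /** Register a shorter phrase as an alias into its containing long phrase. */
  public void addSubPhrase(String shortPhraseKey, LongPhraseEntry parent, int requiredMask) {
    subPhraseDict.put(shortPhraseKey, new SubPhraseRef(parent, requiredMask));
  }

  /** Look up a short phrase and collect docs whose bitmask covers all its terms. */
  public void collectMatches(String shortPhraseKey, List<Integer> out) {
    SubPhraseRef ref = subPhraseDict.get(shortPhraseKey);
    if (ref == null) {
      return; // not subsumed by any indexed long phrase
    }
    for (PhrasePosting p : ref.parent.postings) {
      if ((p.termMask & ref.requiredMask) == ref.requiredMask) {
        out.add(p.docId);
      }
    }
  }
}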
5. Chunk-level indexing
=======================
It's basically a regular index, only we add terms with coarse-grained
position information - instead of storing positions for every posting
(or none) we store only "chunk" numbers, where "chunk" could be
interpreted as a sentence (or a page, or a paragraph, or a chunk ;) ).
From the point of view of the API this would translate to several
postings with position increment 0, i.e. several terms would end up at
the same positions. Obviously, this lossy encoding of term proximity
would save a lot of space and would speed up proximity query evaluation,
at the cost of matching with coarse "slop" - but even then we would know
that the slop is limited to the chunk size, which is often good enough.
Phrase/span scorers would have to understand that they are looking for
terms that have the same (equal) "chunk" number, and score them
accordingly (whereas the regular phrase scorer looks for postings with
posIncr != 0, or posIncr == 1 for exact phrases).
The following paper discusses this concept in detail:
http://www.isi.edu/~metzler/papers/elsayed-cikm11.pdf and this one
(paywall): http://www.springerlink.com/index/T5355418276V7115.pdf
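On the analysis side this could be as simple as a TokenFilter that
collapses positions within a chunk - something like the sketch below,
which assumes the upstream tokenizer emits a hypothetical "<chunk>"
boundary token at sentence/paragraph breaks (that marker is my
assumption, not an existing convention):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Sketch of a filter that replaces exact positions with coarse "chunk"
 * positions: the first token after a chunk boundary advances the position
 * by 1, every other token in the chunk gets a position increment of 0, so
 * all terms of a chunk end up at the same position.
 */
public final class ChunkPositionFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int nextIncrement = 1; // the first token of every chunk advances by 1

  public ChunkPositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if ("<chunk>".contentEquals(termAtt)) { // hypothetical boundary marker
        nextIncrement = 1;                    // next real token starts a new chunk
        continue;                             // the marker itself is dropped
      }
      posIncrAtt.setPositionIncrement(nextIncrement);
      nextIncrement = 0;                      // rest of the chunk stays at the same position
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    nextIncrement = 1;
  }
}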
6. Stored fields compaction for efficient snippet generation
============================================================
This time I have the links to the papers:
http://www.springerlink.com/index/j2774187g532603t.pdf and
http://www.edbt.org/Proceedings/2011-Uppsala/papers/edbt/a10-ceccarelli.pdf
. The idea again is quite simple: instead of using the full text for
snippet generation and highlighting, why not choose the best candidate
snippets in advance, and store/cache/highlight only those?
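A toy sketch of the indexing-time selection step (not the method from
the papers - the per-term weights stand in for whatever importance
measure is used, e.g. something derived from query logs or collection
statistics): score the candidate sentences, then keep only the k best
ones as the stored/highlightable representation of the document.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch of choosing the best candidate snippets ahead of query time. */
public class SnippetSelectionSketch {

  /** Score a sentence as the sum of the weights of its known terms. */
  static double score(String sentence, Map<String, Double> termWeights) {
    double s = 0;
    for (String token : sentence.toLowerCase().split("\\W+")) {
      s += termWeights.getOrDefault(token, 0.0);
    }
    return s;
  }

  /** Pick the k highest-scoring sentences to store instead of the full text. */
  public static List<String> selectSnippets(List<String> sentences,
                                            Map<String, Double> termWeights,
                                            int k) {
    List<String> sorted = new ArrayList<>(sentences);
    sorted.sort(Comparator.comparingDouble(
        (String sent) -> score(sent, termWeights)).reversed());
    return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
  }
}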
And finally some other odd-ball links to cool papers:
* Hauff, C. (2010). Predicting the effectiveness of queries and
retrieval systems.
http://eprints.eemcs.utwente.nl/17338/01/final_version_LR.pdf -
concerning the evaluation of query complexity and the routing of queries
to the indexes (or nodes) best able to answer them. See also
http://www.morganclaypool.com/doi/pdf/10.2200/S00235ED1V01Y201004ICR015
(behind a paywall). Now that we can efficiently construct subsets of
indexes on the fly, I'm really tempted to implement the tiered search
model that I presented at Lucene Revolution, unless someone beats me to it.
* F. Claude, A. Farina (2009). Re-Pair Compression of Inverted Lists.
http://arxiv.org/pdf/0911.3318 - on the surface it's a wild idea that
apparently works... it's an LZ-like compression method for postings,
plus a set of algorithms for intersecting these lists without
decompression.
I think that's it for now... Enjoy!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com