There was some interesting work done on optimizing queries that include very common words (stop words), which I think overlaps with your problem. See this blog post from the HathiTrust.

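In a nutshell, queries that included terms with very large postings lists (i.e. terms that occur very frequently) were slow, and their approach to dealing with this was to index n-grams (i.e. pairs and triplets of adjacent tokens) in addition to single tokens. A multi-token query can then be answered from the much shorter postings list of the n-gram rather than by intersecting the long lists of its individual terms. However, I'm not sure this would help much if your queries typically include only a single token.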
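
To make the idea concrete, here is a minimal sketch in plain Python (not Lucene code) of an inverted index that stores unigrams, bigrams and trigrams of tags. The function name build_index, the second sample sentence and the tag VVFIN are my own assumptions for illustration; only "ART ADV ADJA NN" comes from your mail.

```python
from collections import defaultdict


def build_index(docs, max_n=3):
    """Map each n-gram (a tuple of tags) to its postings: (doc_id, start_position)."""
    index = defaultdict(list)
    for doc_id, tags in enumerate(docs):
        for n in range(1, max_n + 1):
            # Slide a window of width n over the tag sequence.
            for pos in range(len(tags) - n + 1):
                index[tuple(tags[pos:pos + n])].append((doc_id, pos))
    return index


# Hypothetical toy corpus: each document is a sequence of POS tags.
docs = [
    "ART ADV ADJA NN".split(),
    "ART NN VVFIN ART ADJA NN".split(),
]
index = build_index(docs)

# A single-token query still has to read the full postings list of that tag.
unigram_hits = index[("ADJA",)]

# A three-tag query is one lookup in a shorter list; no intersection is needed.
phrase_hits = index[("ART", "ADJA", "NN")]

print(len(index[("ART",)]), len(phrase_hits))
```

Even in this tiny corpus the trigram list is shorter than the list for "ART" alone, and the gap grows with corpus size. The price is a larger index, and a query for "ADJA" on its own gains nothing, which is the caveat above. In Lucene itself, I believe a shingle filter in the analysis chain is the usual way to emit such n-gram tokens, but I have not tried it on tag sequences.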


On 07/30/2012 09:07 AM, Carsten Schnober wrote:
Dear list,
I'm considering using Lucene for indexing sequences of part-of-speech
(POS) tags instead of words; for those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe their morpho-syntactic function. Instead of sequences of words, I
would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search (efficiently) for occurrences of "ADJA".

The question is whether Lucene can be applied to deal with that data
cleverly, because the statistical properties of such pseudo-texts are very
distinct from those of natural language texts and make me wonder whether
Lucene's inverted indexes are suitable. Especially the small vocabulary size
(<50 distinct tokens, depending on the tagging system) is problematic, I suppose.

First trials for which I have implemented an analyzer that just outputs
Lucene tokens such as "ART", "ADV", "ADJA", etc. yield results that are
not exactly perfect regarding search performance, in a test corpus with
a few million tokens. The number of tokens in production mode is
expected to be much larger, so I wonder whether this approach is
promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober
