Re: Small Vocabulary

Mike Sokolov Mon, 06 Aug 2012 11:30:05 -0700

There was some interesting work done on optimizing queries including very common words (stop words) that I think overlaps with your problem. See this blog post http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 from the Hathi Trust.

The upshot in a nutshell was that queries including terms with very large postings lists (i.e. terms that occur very often) were slow, and the approach they took to dealing with this was to index n-grams (i.e. pairs and triplets of adjacent tokens). However, I'm not sure this would help much if your queries will typically include only a single token.
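
To make the n-gram idea concrete, here is a minimal sketch in plain Python (not Lucene code). It builds a toy inverted index over POS-tag sequences and indexes pairs and triplets of adjacent tags alongside the single tags. The function names shingles and build_index, the "_" separator, and the three sample tag sequences are all made up for illustration. In Lucene itself the same effect would presumably come from a shingle-producing token filter in the analysis chain rather than hand-written code.

```python
from collections import defaultdict


def shingles(tokens, n):
    """Return the n-grams of adjacent tokens, joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, max_n=3):
    """Map every 1..max_n-gram to its postings list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        tokens = text.split()
        for n in range(1, max_n + 1):
            for pos, gram in enumerate(shingles(tokens, n)):
                index[gram].append((doc_id, pos))
    return index


# Hypothetical sample data: three short tag sequences.
docs = [
    "ART ADV ADJA NN",
    "ART ADJA NN VVFIN",
    "ART NN VVFIN ART ADJA ADJA NN",
]
index = build_index(docs)

# Single tags have the longest postings lists, because the vocabulary is tiny.
print("ART:", len(index["ART"]))
print("ADJA:", len(index["ADJA"]))

# A two-tag query can be answered from the much shorter bigram list
# instead of intersecting the two long single-tag lists.
print("ART_ADJA:", index["ART_ADJA"])
print("ART_ADV_ADJA:", index["ART_ADV_ADJA"])
```

Note that a query for a single tag such as "ADJA" still has to walk the full single-tag postings list, so the extra n-gram terms only pay off for multi-tag queries. They also make the index larger, since every position now contributes up to three terms.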


-Mike

On 07/30/2012 09:07 AM, Carsten Schnober wrote:

Dear list,
I'm considering using Lucene for indexing sequences of part-of-speech
(POS) tags instead of words; for those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe their morpho-syntactic function. Instead of sequences of words, I
would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search (efficiently) for occurrences of "ADJA".

The question is whether Lucene can be applied to deal with that data
cleverly, because the statistical properties of such pseudo-texts are very
distinct from those of natural language texts and make me wonder whether
Lucene's inverted indexes are suitable. Especially the small vocabulary size
(<50 distinct tokens, depending on the tagging system) is problematic, I suppose.

First trials for which I have implemented an analyzer that just outputs
Lucene tokens such as "ART", "ADV", "ADJA", etc. yield results that are
not exactly perfect regarding search performance, in a test corpus with
a few million tokens. The number of tokens in production mode is
expected to be much larger, so I wonder whether this approach is
promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober


