Dear list,

I'm considering using Lucene to index sequences of part-of-speech (POS) tags instead of words. For those who don't know, POS tags are linguistically motivated labels that are assigned to tokens (words) to describe their morpho-syntactic function. Instead of sequences of words, I would like to index sequences of tags, for instance "ART ADV ADJA NN". The aim is to be able to search efficiently for occurrences of a given tag such as "ADJA".
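To make the setup concrete, here is a minimal sketch in plain Java (no Lucene dependency) of a positional inverted index over tag sequences. It is only a toy illustration of the data structure, not Lucene's actual implementation; the class name `PosTagIndexSketch` and its methods are hypothetical and chosen for this example only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index over POS tags (illustration only, not Lucene code). */
public class PosTagIndexSketch {

    // Maps each tag to the ascending list of token positions where it occurs.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Total number of tags indexed so far; also the next free position.
    private int tokenCount = 0;

    /** Splits a tag sequence on whitespace and records the position of every tag. */
    public void add(String tagSequence) {
        for (String tag : tagSequence.trim().split("\\s+")) {
            if (tag.isEmpty()) {
                continue; // an empty input string yields one empty token
            }
            postings.computeIfAbsent(tag, k -> new ArrayList<>()).add(tokenCount);
            tokenCount++;
        }
    }

    /** Returns all positions of the given tag, or an empty list if it never occurs. */
    public List<Integer> positions(String tag) {
        return postings.getOrDefault(tag, Collections.<Integer>emptyList());
    }

    /** Returns the fraction of all indexed tokens that carry the given tag. */
    public double share(String tag) {
        return tokenCount == 0 ? 0.0 : (double) positions(tag).size() / tokenCount;
    }

    public static void main(String[] args) {
        PosTagIndexSketch index = new PosTagIndexSketch();
        index.add("ART ADV ADJA NN");
        index.add("ART ADJA NN");
        System.out.println(index.positions("ADJA")); // prints [2, 5]
        System.out.println(index.share("ADJA"));     // 2 of 7 tokens
    }
}
```

In this toy index, a lookup for "ADJA" has to return every position at which the tag occurs. With only a handful of distinct tags, each posting list covers a large fraction of the whole corpus, so its length grows in proportion to the corpus size.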
The question is whether Lucene can handle this kind of data well. The statistical properties of such pseudo-texts are very different from those of natural language texts, which makes me wonder whether Lucene's inverted indexes are suitable. I suppose the small vocabulary size (<50 distinct tokens, depending on the tagging system) is especially problematic. For my first trials, I implemented an analyzer that just outputs Lucene tokens such as "ART", "ADV", "ADJA", etc. On a test corpus with a few million tokens, the search performance was less than satisfactory. The number of tokens in production is expected to be much larger, so I wonder whether this approach is promising at all.

Does Lucene (4.0?) provide optimization techniques for extremely small vocabulary sizes?

Thank you very much,
Carsten Schnober

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Next Generation Corpus Analysis Platform