Tokenization and PrefixQuery

Yann-Erwan Perio Fri, 14 Feb 2014 03:18:32 -0800

Hello,

I am designing a system in which each document has one field containing
values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of
letters and digits (at most 7 characters per item), each separated by a
space, with possibly 200 items per field and no limit on the number of
documents (although I would not expect more than a few million
documents). The order of these values is important, and I want to
search for them always starting with the first value and including
as many following values as needed: for instance, "Ae1" and "Ae1 Br2"
would be possible search values.


At first, I indexed these using a space-delimited analyzer and ran
PrefixQueries. I encountered some performance issues though, so I ended
up building my own tokenizer, which creates a token for every leading
combination ("Ae1", "Ae1 Br2"...) up to a certain limit, called the
analysis depth. I then dynamically create TermQueries to match these
tokens when the search has no more items than the analysis depth, and
PrefixQueries when it has more (the whole string is also indexed as a
single token). The performance was great, because TermQueries are very
fast, and PrefixQueries are not bad either when the number of relevant
documents is small (which happens to be the case when searching beyond
the analysis depth). I have two questions, however: one regarding the
PrefixQuery, and one regarding the general design.
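
To make the scheme concrete, here is a minimal sketch of the idea in plain
Java, without any Lucene classes. It is not my actual tokenizer: the class
name PrefixTokens and both method names are made up for this illustration,
and the extra whole-string token is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixTokens {

    /** Returns the leading combinations of items, at most `depth` of them. */
    static List<String> prefixes(String fieldValue, int depth) {
        String[] items = fieldValue.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < items.length && i < depth; i++) {
            if (i > 0) {
                current.append(' '); // items are joined by a single space
            }
            current.append(items[i]);
            tokens.add(current.toString());
        }
        return tokens;
    }

    /**
     * True when the search string has at most `depth` items, i.e. when it
     * was indexed as an exact token and a term lookup is enough. Longer
     * searches have to fall back to a prefix match on the whole-string token.
     */
    static boolean useTermQuery(String search, int depth) {
        return search.trim().split("\\s+").length <= depth;
    }
}
```

With an analysis depth of 2, the value "Ae1 Br2 Cy8" yields the tokens
"Ae1" and "Ae1 Br2"; a search for "Ae1 Br2" is served by a term lookup,
while "Ae1 Br2 Cy8" needs a prefix match.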

Regarding the PrefixQuery: it seems that it stops matching documents
once the searched string exceeds a certain length. Is that the expected
behavior, and if so, can I / should I manage this length?

Regarding the general design: I have adopted a hybrid
TermQuery/PrefixQuery approach, letting clients customize the analysis
depth, so as to keep a balance between performance and the size of
the index. I am however not sure this is a good idea: would it be
better to tokenize the full string (i.e. an infinite analysis depth,
so that only TermQueries are needed)? Or could my design be replaced by
an altogether different, more successful analysis approach?
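
As a rough illustration of the size side of that trade-off, the sketch
below counts the characters emitted as prefix tokens for one field. It
assumes the worst case where every item has the maximum 7 characters, it
ignores whatever compression the index applies to terms, and the class and
method names are again made up for the example.

```java
public class PrefixIndexSize {

    /** Total characters across the first `depth` prefix tokens of one field. */
    static long prefixChars(int items, int itemLength, int depth) {
        long total = 0;
        int limit = Math.min(items, depth);
        for (int k = 1; k <= limit; k++) {
            // the k-th prefix holds k items plus the k-1 spaces between them
            total += (long) k * itemLength + (k - 1);
        }
        return total;
    }
}
```

For 200 items of 7 characters, the raw field is 1599 characters long. A
depth of 10 adds 430 characters of prefix tokens per document, whereas full
tokenization (depth 200) adds 160600.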

Thank you in advance for your insights.

Kind regards.
