On Mon, Jan 27, 2014 at 3:48 AM, Andreas Brandl <m...@3.141592654.de> wrote:
> Is there some limitation on the length of fields? How do I get around this?
[cut]
> My overall goal is to index (arbitrary sized) text files and run a regular
> expression search using lucene's RegexpQuery. I suspect the
> KeywordAnalyzer to cause the inconsistent behaviour - is this the right
> analyzer to use for a RegexpQuery?
The limit is most likely the one in DocumentsWriter, where MAX_TERM_LENGTH == 16383. The javadoc for addDocument() says it throws an error when this limit is exceeded.

What we do for RegexpQuery is still tokenise the text, but document the caveat that the regular expression will only match individual tokens. I think KeywordAnalyzer will probably give better results, but you're going to hit this limitation past a certain size.

Going in the other direction, if you tokenise the text character by character, you might be able to write a regular expression engine which uses span queries to match the regular expression against the terms. I don't know how that would perform, but ever since writing a per-character tokeniser, I have been wondering whether it would be a decent way to do it.

TX
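P.S. In case it helps, here is a rough, untested sketch of what I mean by KeywordAnalyzer plus RegexpQuery, written from memory against a 4.x-era API; the field name, version constant and example text are just placeholders. It indexes the whole field value as one term, which is exactly where the term-length limit bites for large files.

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RegexpOnKeywordField {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        // KeywordAnalyzer emits the entire field value as a single token,
        // so that one term has to stay under the limit discussed above.
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());
        IndexWriter writer = new IndexWriter(dir, iwc);

        Document doc = new Document();
        // "content" is just a placeholder field name.
        doc.add(new TextField("content", "short example text", Field.Store.YES));
        writer.addDocument(doc);  // this is where an over-long term would fail
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Lucene regexps are anchored: the expression must match the whole
        // term, which with KeywordAnalyzer means the whole field value.
        RegexpQuery q = new RegexpQuery(new Term("content", "short .* text"));
        TopDocs hits = searcher.search(q, 10);
        System.out.println("hits: " + hits.totalHits);
        reader.close();
    }
}

Past the limit, the addDocument() call above is where the failure shows up, so that is the place to catch it or to fall back to a tokenised field with the per-token caveat I mentioned.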