Hi,

----- Original Message -----
> On Mon, Jan 27, 2014 at 3:48 AM, Andreas Brandl <m...@3.141592654.de>
> wrote:
> > Is there some limitation on the length of fields? How do I get
> > around this?
> [cut]
> > My overall goal is to index (arbitrary sized) text files and run a
> > regular
> > expression search using lucene's RegexpQuery. I suspect the
> > KeywordAnalyzer to cause the inconsistent behaviour - is this the
> > right
> > analyzer to use for a RegexpQuery?
> 
> The limit is most likely that one in DocumentsWriter, where
> MAX_TERM_LENGTH == 16383. addDocument() says it throws an error when
> this limit is exceeded.

Thanks, that's it. The behaviour seems to have changed in newer Lucene versions 
(4.6), though:

IndexWriter#MAX_TERM_LENGTH:
Absolute hard maximum length for a term. If a term arrives from the analyzer 
longer than this length, it is skipped and a message is printed to infoStream, 
if set (see setInfoStream(java.io.PrintStream)).

For me, MAX_TERM_LENGTH is 32766, and that is the limit mentioned in the warning I get:
IW 0 [Tue Jan 28 12:07:30 CET 2014; main]: WARNING: document contains at least 
one immense term (whose UTF8 encoding is longer than the max length 32766), all 
of which were skipped.  Please correct the analyzer to not produce such terms.  
The prefix of the first immense term is: ...

So that makes perfect sense.
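
For reference, this is roughly how it shows up for me (a minimal, untested 
sketch against the Lucene 4.6 API; the field name "content" and the test value 
are just for illustration). With KeywordAnalyzer the whole field value becomes 
a single term, so anything whose UTF-8 encoding exceeds 
IndexWriter.MAX_TERM_LENGTH (32766 bytes) gets skipped:

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class ImmenseTermDemo {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriterConfig cfg =
                new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());
            cfg.setInfoStream(System.out); // the "immense term" warning is printed here
            IndexWriter writer = new IndexWriter(dir, cfg);

            // Build a value longer than IndexWriter.MAX_TERM_LENGTH (32766 bytes
            // as UTF-8); KeywordAnalyzer will emit it as one single term.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < IndexWriter.MAX_TERM_LENGTH + 10; i++) {
                sb.append('a');
            }

            Document doc = new Document();
            doc.add(new TextField("content", sb.toString(), Field.Store.NO));
            writer.addDocument(doc); // the term is skipped, the document is still added
            writer.close();
            dir.close();
        }
    }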

> 
> What we do for RegexpQuery is that we still tokenise the text, but we
> explain the caveat that the regular expression will match individual
> tokens. I think KeywordAnalyzer will probably give better results,
> but
> you're going to hit this limitation past a certain size.
> 
> Going in the other direction, if you tokenise the text
> character-by-character, you might be able to write a regular
> expression engine which uses span queries to match the regular
> expression to the terms. I don't know how that would perform, but
> ever
> since writing a per-character tokeniser, I have been wondering if it
> would be a decent way to do it.
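
The per-character idea sounds interesting. For a plain literal like "abc" over 
a character-by-character tokenised field, I imagine the span-based matching 
would look roughly like this (untested sketch, the field name "chars" is made 
up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // One SpanTermQuery per character, combined into an ordered, gap-free near query.
    SpanQuery[] chars = new SpanQuery[] {
        new SpanTermQuery(new Term("chars", "a")),
        new SpanTermQuery(new Term("chars", "b")),
        new SpanTermQuery(new Term("chars", "c"))
    };
    SpanNearQuery literal = new SpanNearQuery(chars, 0, true); // slop 0, in order

Alternation and repetition would presumably have to be translated into 
SpanOrQuery and friends, which is where I'd expect it to get hairy.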

I've written an indexed regex search engine using Lucene and trigram tokens, 
based on the idea in [1]. I'm still evaluating performance, but so far it looks 
very good: it even outperforms an in-memory implementation (all docs in memory, 
sequential scan using Pattern#matches) in cases where the regex matching itself 
is quite costly. If you're curious, I can provide source code and an evaluation 
in a couple of weeks (it is part of my master's thesis).
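
Very roughly, the filtering core looks like this (a simplified sketch, not the 
actual thesis code; the field name "trigrams" and the helper are made up). A 
required literal substring is extracted from the regex, turned into a 
conjunction of trigram terms, and the candidate documents are then verified 
with java.util.regex:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Candidates returned by this query still have to be re-checked with
    // java.util.regex.Pattern, since the trigram index only narrows the set.
    static BooleanQuery trigramQuery(String requiredLiteral) {
        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i + 3 <= requiredLiteral.length(); i++) {
            q.add(new TermQuery(new Term("trigrams", requiredLiteral.substring(i, i + 3))),
                  Occur.MUST);
        }
        return q;
    }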

That is why I'm curious to see how Lucene's AutomatonQuery implementation 
performs compared to the trigram solution. With the term-length limit above in 
mind, though, I guess the two can't really be compared on arbitrary-sized 
documents.
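
To be concrete, the built-in side of that comparison would be something like 
the following sketch (field name and pattern are made up); note that 
RegexpQuery matches against whole terms, so with KeywordAnalyzer the field 
value has to stay under the limit above:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RegexpQuery;
    import org.apache.lucene.search.TopDocs;

    // dir is the Directory holding the keyword-analyzed index from above.
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    RegexpQuery q = new RegexpQuery(new Term("content", ".*foo[0-9]+bar.*"));
    TopDocs hits = searcher.search(q, 10);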

Do you know of any other ways to do efficient regex search, either with Lucene 
or with some other method?

Thanks,

Best Regards
Andreas

[1] http://swtch.com/~rsc/regexp/regexp4.html
