[ https://issues.apache.org/jira/browse/LUCENE-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104519#comment-14104519 ]
Steve Rowe commented on LUCENE-5897:
------------------------------------

I'll try to figure out a way to limit the search, as you say, to maxTokenLength(). I worry about two things, though, both of which are currently handled (albeit badly in these adversary cases):

# In a rule with alternates, one of which is satisfied below the limit, the satisfied alternate should still produce a match when a partially matching alternate exceeds the limit and is aborted.
# When rule A matches partially, exceeds the limit, and is aborted, and rule B matches a prefix that is under the limit, rule B should produce a match.

> performance bug ("adversary") in StandardTokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-5897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5897
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>
> There seem to be some conditions (I don't know how rare, or exactly what they are) that cause StandardTokenizer to essentially hang on its input. I haven't looked hard yet, but since it's essentially a DFA, I think something weird might be going on.
> An easy way to reproduce is with 1MB of underscores; it will just hang forever.
> {code}
> public void testWorthyAdversary() throws Exception {
>   // 1MB of underscores: no token should ever be emitted for this input,
>   // but the tokenizer effectively hangs instead of returning.
>   char[] buffer = new char[1024 * 1024];
>   Arrays.fill(buffer, '_');
>   int tokenCount = 0;
>   Tokenizer ts = new StandardTokenizer();
>   ts.setReader(new StringReader(new String(buffer)));
>   ts.reset();
>   while (ts.incrementToken()) {
>     tokenCount++;
>   }
>   ts.end();
>   ts.close();
>   assertEquals(0, tokenCount);
> }
> {code}
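[Editorial note, not part of the original message: the maxTokenLength() mentioned in the comment above is StandardTokenizer's existing per-token length cap, which the comment proposes to reuse as a bound on the scanner's match search. A minimal sketch of how that cap is configured today, assuming the same analysis APIs as the quoted test; the value 255 is the tokenizer's default, and the exact handling of over-long tokens varies by Lucene version:]

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch only: shows where the existing per-token length cap is set. The fix
// discussed in the comment would additionally stop the scanner from exploring
// a potential match beyond this many chars, instead of scanning to end of input.
public class MaxTokenLengthSketch {
  public static void main(String[] args) throws Exception {
    StandardTokenizer ts = new StandardTokenizer();
    ts.setMaxTokenLength(255); // illustrative value; 255 is the default cap
    ts.setReader(new StringReader("example text with a veeeeeeeeeeeeeeeeeery long run"));
    ts.reset();
    int tokenCount = 0;
    while (ts.incrementToken()) {
      tokenCount++;
    }
    ts.end();
    ts.close();
    System.out.println("tokens: " + tokenCount);
  }
}
{code}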