[ https://issues.apache.org/jira/browse/LUCENE-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106684#comment-14106684 ]
ASF subversion and git services commented on LUCENE-5897:
---------------------------------------------------------
Commit 1619730 from [[email protected]] in branch 'dev/trunk'
[ https://svn.apache.org/r1619730 ]
LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and
UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text
partially matching certain grammar rules. The scanner default buffer size was
reduced, and scanner buffer growth was disabled, resulting in much, much faster
tokenization for these text sequences.
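
For context on why a smaller, non-growing buffer helps: JFlex scanners use maximal munch, looking ahead as far as a rule can keep matching and then backing up to the last accepting state. The toy model below is a sketch, not the generated scanner code (scanWork and maxLookahead are illustrative names): it shows how a rule that partially matches a long run of underscores does quadratic work when lookahead is unbounded, and linear work when lookahead is capped the way a fixed-size scanner buffer caps it.

{code}
// Toy model of maximal-munch scanning with backup; NOT the JFlex-generated
// code. A rule that keeps partially matching '_' but never reaches an
// accepting state forces the scanner to look far ahead and then restart one
// position later, so unbounded lookahead costs O(n^2) over n underscores,
// while a lookahead cap (like a fixed, non-growing scanner buffer) costs
// O(n * cap).
static long scanWork(char[] input, int maxLookahead) {
  long work = 0;
  int pos = 0;
  while (pos < input.length) {
    int lookahead = 0;
    // Extend the partial match as far as the cap and the input allow.
    while (pos + lookahead < input.length
        && input[pos + lookahead] == '_'
        && lookahead < maxLookahead) {
      lookahead++;
      work++;
    }
    // No accepting state was reached: back up and restart one char later.
    pos++;
  }
  return work;
}
{code}

On 1MB of underscores this model examines roughly 5 * 10^11 characters with unbounded lookahead, but only about 2.7 * 10^8 with a small cap (say, 255 characters), which matches both the "hang forever" symptom reported below and the speedup this commit describes.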
> performance bug ("adversary") in StandardTokenizer
> --------------------------------------------------
>
> Key: LUCENE-5897
> URL: https://issues.apache.org/jira/browse/LUCENE-5897
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-5897.patch
>
>
> There seem to be some conditions (I don't know how rare or what conditions)
> that cause StandardTokenizer to essentially hang on input: I haven't looked
> hard yet, but since it's essentially a DFA, I think something weird might be
> going on.
> An easy way to reproduce is with 1MB of underscores; it will just hang
> forever.
> {code}
> public void testWorthyAdversary() throws Exception {
>   char[] buffer = new char[1024 * 1024];
>   Arrays.fill(buffer, '_');
>   int tokenCount = 0;
>   Tokenizer ts = new StandardTokenizer();
>   ts.setReader(new StringReader(new String(buffer)));
>   ts.reset();
>   while (ts.incrementToken()) {
>     tokenCount++;
>   }
>   ts.end();
>   ts.close();
>   // Underscores never form a token on their own, so none are expected.
>   assertEquals(0, tokenCount);
> }
> {code}
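
The quadratic blowup is easy to observe empirically. Below is a rough benchmark sketch, not part of the patch, that reuses the setup from the test above; before the fix the elapsed time should roughly quadruple each time the input size doubles, while after the fix it should only roughly double.

{code}
// Benchmark sketch (assumed harness, not part of the patch): time
// tokenization while doubling the input size.
for (int size = 1 << 14; size <= 1 << 20; size <<= 1) {
  char[] buffer = new char[size];
  Arrays.fill(buffer, '_');
  Tokenizer ts = new StandardTokenizer();
  ts.setReader(new StringReader(new String(buffer)));
  ts.reset();
  long start = System.nanoTime();
  while (ts.incrementToken()) {
    // drain; no tokens are expected for pure underscores
  }
  ts.end();
  ts.close();
  System.out.println(size + " chars: "
      + (System.nanoTime() - start) / 1_000_000 + " ms");
}
{code}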