[ https://issues.apache.org/jira/browse/LUCENE-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106684#comment-14106684 ]
ASF subversion and git services commented on LUCENE-5897:
---------------------------------------------------------

Commit 1619730 from [~sar...@syr.edu] in branch 'dev/trunk'
[ https://svn.apache.org/r1619730 ]

LUCENE-5897, LUCENE-5400: The JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text that partially match certain grammar rules. The scanner's default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences.

> performance bug ("adversary") in StandardTokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-5897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5897
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-5897.patch
>
>
> There seem to be some conditions (I don't know how rare, or what the
> triggers are) that cause StandardTokenizer to essentially hang on its
> input. I haven't looked hard yet, but since it is essentially a DFA I
> think something weird might be going on.
> An easy way to reproduce is with 1MB of underscores; it will just hang
> forever.
> {code}
> import java.io.StringReader;
> import java.util.Arrays;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
>
> public void testWorthyAdversary() throws Exception {
>   char buffer[] = new char[1024 * 1024];
>   Arrays.fill(buffer, '_');
>   int tokenCount = 0;
>   Tokenizer ts = new StandardTokenizer();
>   ts.setReader(new StringReader(new String(buffer)));
>   ts.reset();
>   while (ts.incrementToken()) {
>     tokenCount++;
>   }
>   ts.end();
>   ts.close();
>   // '_' matches no token rule, so the tokenizer should emit nothing
>   assertEquals(0, tokenCount);
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
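To illustrate the idea behind the commit (disabling scanner buffer growth so a long partial match can never force an ever-growing, repeatedly re-scanned buffer), here is a minimal self-contained sketch. This is NOT the actual JFlex-generated scanner code; `BoundedScanner`, `tokenize`, and the `MAX_TOKEN` cap of 255 are hypothetical names chosen for illustration, and the token grammar here is deliberately trivial (runs of letters/digits):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fix's principle: instead of growing the scan
// buffer when a candidate token outruns it, cap the token at a fixed length
// and emit it. Scanning then stays a single linear pass over the input,
// even for adversarial text that partially matches a rule for a long time.
class BoundedScanner {
    // Fixed cap standing in for the scanner's fixed buffer size;
    // growth is "disabled" by emitting the token once the cap is hit.
    private static final int MAX_TOKEN = 255;

    // Emit maximal runs of letters/digits, truncated at MAX_TOKEN chars.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
                if (current.length() == MAX_TOKEN) { // cap reached: emit and restart
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() > 0) {       // rule stopped matching: emit
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());          // trailing token at end of input
        }
        return tokens;
    }
}
```

On the adversary input (a megabyte of underscores), no character ever starts a token, so the sketch returns an empty list in one O(n) pass; the real StandardTokenizer achieves the analogous bound with its fixed-size scanner buffer and max token length.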