[ https://issues.apache.org/jira/browse/LUCENE-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105745#comment-14105745 ]

Steve Rowe commented on LUCENE-5897:
------------------------------------

bq. do we need a separate max buffer size parameter? can it just be an impl 
detail based on max token length?

It depends on whether we think anybody will want the (apparently minor) benefit 
of having a larger buffer, regardless of max token length.
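
For reference, a minimal sketch of the knob that already exists, assuming the trunk API where StandardTokenizer has a no-arg constructor and a setMaxTokenLength() setter (the separate max buffer size under discussion would be a second, independent parameter):

{code}
import java.io.StringReader;
import java.util.Arrays;
import org.apache.lucene.analysis.standard.StandardTokenizer;

  // A sketch only, not the committed fix: with token length capped, the
  // scanner should be able to bound its buffering instead of accumulating
  // an entire 1MB run of underscores.
  public void maxTokenLengthSketch() throws Exception {
    char[] buffer = new char[1024 * 1024];
    Arrays.fill(buffer, '_');
    StandardTokenizer ts = new StandardTokenizer();
    ts.setMaxTokenLength(255); // the default StandardAnalyzer limit
    ts.setReader(new StringReader(new String(buffer)));
    ts.reset();
    while (ts.incrementToken()) {
      // consume tokens; underscores alone never produce any
    }
    ts.end();
    ts.close();
  }
{code}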

> performance bug ("adversary") in StandardTokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-5897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5897
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>
> There seem to be some conditions (I don't know how rare or what conditions) 
> that cause StandardTokenizer to essentially hang on input: I haven't looked 
> hard yet, but as it's essentially a DFA I think something weird might be going 
> on.
> An easy way to reproduce is with 1MB of underscores; it will just hang 
> forever.
> {code}
>   public void testWorthyAdversary() throws Exception {
>     char[] buffer = new char[1024 * 1024];
>     Arrays.fill(buffer, '_');
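>     // a single 1MB run of '_' with no token boundary anywhere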
>     int tokenCount = 0;
>     Tokenizer ts = new StandardTokenizer();
>     ts.setReader(new StringReader(new String(buffer)));
>     ts.reset();
>     while (ts.incrementToken()) {
>       tokenCount++;
>     }
>     ts.end();
>     ts.close();
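>     // underscores alone never form a token, so none should be emitted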
>     assertEquals(0, tokenCount);
>   }
> {code} 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
