Robert Muir created LUCENE-5897:
-----------------------------------

             Summary: performance bug ("adversary") in StandardTokenizer
                 Key: LUCENE-5897
                 URL: https://issues.apache.org/jira/browse/LUCENE-5897
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Robert Muir


There seem to be some conditions (I don't know how rare, or exactly what conditions) 
that cause StandardTokenizer to essentially hang on its input: I haven't looked hard 
yet, but since it's essentially a DFA, I think something weird might be going on.

An easy way to reproduce is with 1MB of underscores: it will just hang forever.
{code}
  public void testWorthyAdversary() throws Exception {
    char[] buffer = new char[1024 * 1024];
    Arrays.fill(buffer, '_');
    int tokenCount = 0;
    Tokenizer ts = new StandardTokenizer();
    ts.setReader(new StringReader(new String(buffer)));
    ts.reset();
    while (ts.incrementToken()) {
      tokenCount++;
    }
    ts.end();
    ts.close();
    assertEquals(0, tokenCount);
  }
{code} 
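The suspected failure mode is easy to model outside Lucene. Below is a toy maximal-munch scanner, a hedged sketch and *not* Lucene's actual JFlex-generated code: the class and method names ({{BacktrackDemo}}, {{scanCost}}) and the token grammar (optional {{'_'}} prefix followed by letters) are invented for illustration. When a speculative lookahead fails to reach an accepting state, a longest-match scanner falls back and re-starts one character later, so an all-underscore input costs O(n^2) character reads:

```java
import java.util.Arrays;

/**
 * Toy maximal-munch scanner (hypothetical grammar, not Lucene's):
 * a token is zero or more '_' characters followed by at least one letter.
 * On input that is all underscores, every start position reads ahead to
 * end-of-input before giving up, backs off, and retries one character
 * later -- O(n^2) reads in total.
 */
public class BacktrackDemo {

  /** Returns the number of character reads the scanner performs. */
  static long scanCost(String input) {
    long reads = 0;
    int pos = 0;
    while (pos < input.length()) {
      int i = pos;
      int lastAccept = -1; // end offset of the longest accepted match from pos
      // Speculative prefix: '_' may precede a word, so keep consuming.
      while (i < input.length() && input.charAt(i) == '_') {
        reads++;
        i++;
      }
      // Letters turn the speculation into an accepted token.
      while (i < input.length() && Character.isLetter(input.charAt(i))) {
        reads++;
        i++;
        lastAccept = i;
      }
      if (lastAccept > pos) {
        pos = lastAccept; // emit token, resume after it
      } else {
        pos++;            // entire lookahead wasted: advance one char, retry
      }
    }
    return reads;
  }

  public static void main(String[] args) {
    // Doubling the input roughly quadruples the work: sum 1..n = n(n+1)/2.
    for (int n : new int[] {1_000, 2_000, 4_000}) {
      char[] buf = new char[n];
      Arrays.fill(buf, '_');
      System.out.println(n + " underscores -> " + scanCost(new String(buf)) + " reads");
    }
  }
}
```

If StandardTokenizer's generated DFA behaves anything like this on long runs of word-extender characters, 1MB of underscores would cost on the order of 10^12 operations, which would look exactly like a hang.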



--
This message was sent by Atlassian JIRA
(v6.2#6252)
