[jira] [Commented] (LUCENE-5897) performance bug ("adversary") in StandardTokenizer

Robert Muir (JIRA) Wed, 20 Aug 2014 18:45:52 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104932#comment-14104932 ]


Robert Muir commented on LUCENE-5897:
-------------------------------------

Well, one concern is the 'adversary' case, but I'm also a little concerned 
that the behavior might impact ordinary performance. So I'm stretching a bit 
and trying to figure out how com.ibm.icu.text.BreakIterator (which implements 
the same algorithm) doesn't get hung in such an adversary case.

I looked at http://icu-project.org/docs/papers/text_boundary_analysis_in_java/

especially: "If the current state is an accepting state, the break position is 
after that character. Otherwise, the break position is after the last character 
that caused a transition to an accepting state. (In other words, we keep track 
of the break position, updating it to after the current position every time we 
enter an accepting state. This is called "marking" the position.)"

So more generally, can we optimize the general case to also remove what appears 
to be a backtracking algorithm? I know JFlex is more general than what ICU 
offers, so it's like comparing apples and oranges, but I can't help but wonder...

> performance bug ("adversary") in StandardTokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-5897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5897
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>
> There seem to be some conditions (I don't know how rare or what conditions) 
> that cause StandardTokenizer to essentially hang on input: I haven't looked 
> hard yet, but as it's essentially a DFA I think something weird might be going 
> on.
> An easy way to reproduce is with 1MB of underscores; it will just hang 
> forever.
> {code}
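>   public void testWorthyAdversary() throws Exception {
>     // Build 1MB (1024 * 1024 chars) of underscores as the adversarial input.
>     char[] buffer = new char[1024 * 1024];
>     Arrays.fill(buffer, '_');
>     int tokenCount = 0;
>     Tokenizer ts = new StandardTokenizer();
>     ts.setReader(new StringReader(new String(buffer)));
>     ts.reset();
>     // Consume the whole stream, counting any tokens produced.
>     while (ts.incrementToken()) {
>       tokenCount++;
>     }
>     ts.end();
>     ts.close();
>     // Underscores alone should yield no tokens.
>     assertEquals(0, tokenCount);
>   }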
> {code} 



