[ https://issues.apache.org/jira/browse/LUCENE-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104932#comment-14104932 ]
Robert Muir commented on LUCENE-5897:
-------------------------------------

Well, I guess one concern is the 'adversary' case, but I'm a little concerned the behavior might impact ordinary performance, so I'm just stretching a bit and trying to figure out how com.ibm.icu.text.BreakIterator (which implements the same algorithm) doesn't get hung in such an adversary case.

I looked at http://icu-project.org/docs/papers/text_boundary_analysis_in_java/ especially:

"If the current state is an accepting state, the break position is after that character. Otherwise, the break position is after the last character that caused a transition to an accepting state. (In other words, we keep track of the break position, updating it to after the current position every time we enter an accepting state. This is called "marking" the position.)"

So more generally, can we optimize the general case to also remove what appears to be a backtracking algorithm? I know JFlex is more general than what ICU offers, so it's like comparing apples and oranges, but I can't help but wonder...

> performance bug ("adversary") in StandardTokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-5897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5897
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>
> There seem to be some conditions (I don't know how rare or what conditions)
> that cause StandardTokenizer to essentially hang on input: I haven't looked
> hard yet, but as it's essentially a DFA I think something weird might be going
> on.
> An easy way to reproduce is with 1MB of underscores, it will just hang
> forever.
> {code}
>   public void testWorthyAdversary() throws Exception {
>     char buffer[] = new char[1024 * 1024];
>     Arrays.fill(buffer, '_');
>     int tokenCount = 0;
>     Tokenizer ts = new StandardTokenizer();
>     ts.setReader(new StringReader(new String(buffer)));
>     ts.reset();
>     while (ts.incrementToken()) {
>       tokenCount++;
>     }
>     ts.end();
>     ts.close();
>     assertEquals(0, tokenCount);
>   }
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
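
For illustration, the "marking" approach quoted from the ICU paper in the comment can be sketched as a toy longest-match scanner. This is a minimal sketch, not ICU or JFlex code, and it does not reproduce the StandardTokenizer hang: the class MarkingScanner, its three-state DFA for tokens of the form [a-z]+(_[a-z]+)*, and all method names are hypothetical and chosen only to show where the mark is updated and where the scanner falls back to it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy longest-match scanner illustrating "marking"; hypothetical, not ICU or JFlex code. */
final class MarkingScanner {
    // Hypothetical DFA for tokens of the form [a-z]+(_[a-z]+)*
    private static final int START = 0;  // nothing consumed yet
    private static final int WORD = 1;   // accepting: last character was a letter
    private static final int JOIN = 2;   // non-accepting: '_' seen after a word
    private static final int DEAD = -1;  // no transition available

    private static int step(int state, char c) {
        boolean letter = c >= 'a' && c <= 'z';
        switch (state) {
            case START: return letter ? WORD : DEAD;
            case WORD:  return letter ? WORD : (c == '_' ? JOIN : DEAD);
            case JOIN:  return letter ? WORD : DEAD;
            default:    return DEAD;
        }
    }

    /** Returns the exclusive end offset of the longest token starting at 'from', or -1 if none. */
    static int longestMatch(CharSequence text, int from) {
        int state = START;
        int mark = -1;  // last position at which the DFA was in an accepting state
        for (int i = from; i < text.length(); i++) {
            state = step(state, text.charAt(i));
            if (state == DEAD) {
                break;  // stop reading ahead; the answer is the last mark
            }
            if (state == WORD) {
                mark = i + 1;  // entered an accepting state: mark after this character
            }
        }
        return mark;
    }

    /** Splits text into tokens, skipping any character that cannot start a token. */
    static List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = longestMatch(text, pos);
            if (end < 0) {
                pos++;  // no token starts here
            } else {
                tokens.add(text.subSequence(pos, end).toString());
                pos = end;  // resume at the mark, not at the read-ahead position
            }
        }
        return tokens;
    }
}
```

For the input "ab_cd_ ef", the scanner reads one character past the last mark (the trailing '_'), then falls back to offset 5 and emits "ab_cd", followed by "ef". The characters between the mark and the read-ahead position are the ones that get scanned again, which is the part of the cost that depends on how far the automaton can run without re-entering an accepting state.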