[ https://issues.apache.org/jira/browse/LUCENE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593306#comment-14593306 ]
Daniel Collins commented on LUCENE-6584:
----------------------------------------

I think the point is that in Lucene 4.7, this update was made:

{quote}
LUCENE-5357: Upgrade StandardTokenizer and UAX29URLEmailTokenizer to Unicode 6.3; update UAX29URLEmailTokenizer's recognized top level domains in URLs and Emails from the IANA Root Zone Database.
{quote}

but that never made it to the Javadoc page.

> Docs on StandardTokenizer don't mention the behaviour change in
> Version.LUCENE_4_7_0
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6584
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.4
>            Reporter: Trejkaz
>            Priority: Minor
>
> The following test shows that the behaviour of StandardTokenizer differs once
> you start passing Version.LUCENE_4_7_0 or greater:
> {code}
> import java.io.StringReader;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.util.Version;
> import org.junit.Test;
> import static org.hamcrest.Matchers.is;
> import static org.junit.Assert.assertThat;
> public class TestStandardTokenizerStandalone
> {
>     @Test
>     public void testLucene4_6_1() throws Exception
>     {
>         doTest(Version.LUCENE_4_6_1);
>     }
>     @Test
>     public void testLucene4_7_0() throws Exception
>     {
>         doTest(Version.LUCENE_4_7_0);
>     }
>     public void doTest(Version version) throws Exception
>     {
>         try (TokenStream stream = new StandardTokenizer(version, new StringReader(makeLongString(2550))))
>         {
>             stream.reset();
>             assertThat(stream.incrementToken(), is(false));
>         }
>     }
>     private String makeLongString(int length)
>     {
>         StringBuilder builder = new StringBuilder(length);
>         for (int i = 0; i < length; i++)
>         {
>             builder.append('x');
>         }
>         return builder.toString();
>     }
> }
> {code}
> However, the Javadoc only mentions the behaviour changes in versions 3.1 and
> 3.4.
> The constructor for passing the version is deprecated, presumably under the
> false impression that no changes occurred during Lucene 4. I know the Version
> parameter was killed off entirely in version 5, which presumably means that
> people who tokenised stuff in Lucene 4.6 or earlier have now been trapped and
> have to copy the tokeniser from Lucene 4 to keep their queries working.