Hello, I'm using Lucene 6.2.0 and expecting the following test to pass:
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.IOException;
import java.io.StringReader;

public class TestStandardTokenizer extends BaseTokenStreamTestCase {
    public void testLongToken() throws IOException {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        final int maxTokenLength = tokenizer.getMaxTokenLength(); // default is 255
        // Build a string of 'a' repeated (maxTokenLength + 5) times, followed by " abc"
        final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
        tokenizer.setReader(new StringReader(longToken));
        // Expected: the over-long token is discarded entirely, leaving only "abc".
        // Actual contents: "a" repeated 255 times, then "aaaaa", then "abc".
        assertTokenStreamContents(tokenizer, new String[]{"abc"});
    }
}

It seems like StandardTokenizer considers a completely filled buffer to be a successfully extracted token (1), and it also emits the tail of a too-long token as a separate token (2). Maybe (1) is disputable (though I think it is a bug), but I believe (2) is clearly a bug.

Best regards,
Alexey Makeev
makeev...@mail.ru
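P.S. In case it helps others hitting this: below is a minimal workaround sketch of my own, not an official Lucene recommendation. The idea is to raise the tokenizer's split threshold so that long tokens stay in one piece, and then drop them whole with a LengthFilter. The 4096 threshold and the LongTokenWorkaround class name are illustrative assumptions.

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.StringReader;

public class LongTokenWorkaround {
    public static TokenStream tokenize(String text) {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        // Raise the split threshold well above any token we want to keep,
        // so an over-long token is not chopped into a 255-char head plus a tail.
        tokenizer.setMaxTokenLength(4096); // illustrative value, my assumption
        tokenizer.setReader(new StringReader(text));
        // Then drop whole tokens longer than the original 255-char limit.
        return new LengthFilter(tokenizer, 1, 255);
    }
}

Note this only mitigates the issue: a token longer than 4096 characters would still be split, and any tail fragment of 255 characters or fewer would slip past the filter.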