Hi Alexey,

The behavior you mention is an intentional change from the behavior in Lucene 
4.9.0 and earlier, where tokens longer than maxTokenLength were silently 
ignored: see LUCENE-5897 [1] and LUCENE-5400 [2].

The new behavior is as follows: token matching rules are no longer allowed to 
match against input char sequences longer than maxTokenLength.  If a rule 
would match a sequence longer than maxTokenLength, but it also matches at 
maxTokenLength chars or fewer, has the highest priority among the rules 
matching at that length, and no other rule matches more chars, then a token is 
emitted for that rule at the matching length.  Rule-matching iteration then 
simply continues from that point as normal.  If the same rule matches against 
the remainder of the sequence that the first match would have covered had 
maxTokenLength been longer, another token at the matched length is emitted, 
and so on.

As you noted, this can effectively split the sequence at maxTokenLength 
intervals.
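
For example, with the default maxTokenLength of 255, your input ends up split 
like this (a sketch along the lines of your test, not something I've run; it 
assumes a test class extending BaseTokenStreamTestCase like yours):

    // 260 'a' chars followed by " abc", as in your test
    String run = new String(new char[260]).replace("\0", "a");
    StandardTokenizer tokenizer = new StandardTokenizer();  // maxTokenLength defaults to 255
    tokenizer.setReader(new StringReader(run + " abc"));

    // The 260-char run comes out as a 255-char token plus the 5-char
    // remainder, followed by "abc":
    assertTokenStreamContents(tokenizer,
        new String[]{ run.substring(0, 255), run.substring(255), "abc" });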

You can fix the problem by setting maxTokenLength higher; this has the side 
effect of growing the buffer, so the unwanted token splitting no longer 
happens.  If that results in tokens larger than you would like, you can remove 
them with LengthFilter.
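
Roughly (an untested sketch; the concrete limits below are illustrative, not 
recommendations, and longText stands in for whatever input you're analyzing):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import java.io.StringReader;

    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setMaxTokenLength(10000);               // grow the buffer so long runs aren't split
    tokenizer.setReader(new StringReader(longText));  // longText: placeholder for your input

    // Then drop any token longer than you actually want to keep:
    TokenStream stream = new LengthFilter(tokenizer, 1, 255);

LengthFilter keeps tokens whose length falls within the min/max range and 
drops the rest, so the oversized run disappears instead of being split into 
pieces.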

FYI there is discussion on LUCENE-5897 about separating buffer size from 
maxTokenLength, starting here: 
<https://issues.apache.org/jira/browse/LUCENE-5897?focusedCommentId=14105729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14105729>
 - ultimately I decided that few people would benefit from the increased 
configuration complexity.

[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400

--
Steve
www.lucidworks.com

> On Nov 11, 2016, at 6:23 AM, Alexey Makeev <makeev...@mail.ru.INVALID> wrote:
> 
> Hello,
> 
> I'm using lucene 6.2.0 and expecting the following test to pass:
> 
> import org.apache.lucene.analysis.BaseTokenStreamTestCase;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> 
> import java.io.IOException;
> import java.io.StringReader;
> 
> public class TestStandardTokenizer extends BaseTokenStreamTestCase
> {
>     public void testLongToken() throws IOException
>     {
>         final StandardTokenizer tokenizer = new StandardTokenizer();
>         final int maxTokenLength = tokenizer.getMaxTokenLength();
> 
>         // string with the following contents: a...maxTokenLength+5 times...a abc
>         final String longToken =
>                 new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
> 
>         tokenizer.setReader(new StringReader(longToken));
>         
>         assertTokenStreamContents(tokenizer, new String[]{"abc"});
>         // actual contents: "a" 255 times, "aaaaa", "abc"
>     }
> }
> 
> It seems like StandardTokenizer considers a completely filled buffer to be a 
> successfully extracted token (1), and it also emits the tail of a too-long 
> token as a separate token (2). Maybe (1) is disputable (I think it is a bug), 
> but I think (2) is a bug. 
> 
> Best regards,
> Alexey Makeev
> makeev...@mail.ru

