[
https://issues.apache.org/jira/browse/LUCENE-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949082#comment-15949082
]
Steve Rowe commented on LUCENE-7760:
------------------------------------
+1
From
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201611.mbox/%[email protected]%3e],
where I most recently responded to a user question about the situation - this
should be useful as a seed for javadoc fixes:
{noformat}
The behavior you mention is an intentional change from the behavior in Lucene
4.9.0 and earlier, when tokens longer than maxTokenLength were silently
ignored: see LUCENE-5897[1] and LUCENE-5400[2].

The new behavior is as follows: token matching rules are no longer allowed to
match against input char sequences longer than maxTokenLength. If a rule that
would match a longer sequence also matches at maxTokenLength chars or fewer,
has the highest priority among all rules matching at that length, and no other
rule matches more chars, then a token is emitted for that rule at the matching
length, and rule-matching iteration simply continues from that point as
normal. If the same rule matches against the remainder of the sequence that
the first rule would have matched had maxTokenLength been longer, then another
token at the matched length is emitted, and so on. Note that this can result
in effectively splitting the sequence at maxTokenLength intervals, as you
noted.
You can fix the problem by setting maxTokenLength higher - this has the side
effect of growing the buffer and avoids the unwanted token splitting. If this
results in tokens larger than you would like, you can remove them with
LengthFilter.
FYI there is discussion on LUCENE-5897 about separating buffer size from
maxTokenLength, starting here:
<https://issues.apache.org/jira/browse/LUCENE-5897?focusedCommentId=14105729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14105729>
- ultimately I decided that few people would benefit from the increased
configuration complexity.
[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400
{noformat}
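The net effect described above - an over-long run being emitted in maxTokenLength-sized chunks rather than dropped, with LengthFilter available to discard the pieces - can be illustrated with a small self-contained sketch. This is not Lucene code: the whitespace split is an assumption standing in for StandardTokenizer's grammar-based rules, and `tokenize`/`lengthFilter` are hypothetical helpers approximating the observed behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxTokenLengthSketch {

    // Approximates the observed post-LUCENE-5897 behavior: a word longer
    // than maxTokenLength is emitted in maxTokenLength-sized chunks
    // instead of being silently dropped. Whitespace splitting stands in
    // for the real grammar rules.
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            for (int i = 0; i < word.length(); i += maxTokenLength) {
                tokens.add(word.substring(i,
                        Math.min(i + maxTokenLength, word.length())));
            }
        }
        return tokens;
    }

    // Approximates LengthFilter: keeps only tokens whose length falls
    // within [min, max].
    static List<String> lengthFilter(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() >= min && t.length() <= max) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same input as the test case in the issue description below.
        List<String> tokens = tokenize("ab cd toolong xy z", 5);
        System.out.println(tokens); // [ab, cd, toolo, ng, xy, z]
        // Dropping the split fragments of "toolong" after the fact:
        System.out.println(lengthFilter(tokens, 1, 4)); // [ab, cd, ng, xy, z]
    }
}
```

Note that `lengthFilter` cannot distinguish a chunk of a too-long word from a legitimately short token (the leftover "ng" above survives a min length of 1), which is why raising maxTokenLength first, then filtering, is the suggested order.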
> StandardAnalyzer/Tokenizer.setMaxTokenLength's javadocs are lying
> -----------------------------------------------------------------
>
> Key: LUCENE-7760
> URL: https://issues.apache.org/jira/browse/LUCENE-7760
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.6
>
>
> The javadocs claim that too-long tokens are discarded, but in fact they are
> simply chopped up. The following test case unexpectedly passes:
> {noformat}
> public void testMaxTokenLengthNonDefault() throws Exception {
>   StandardAnalyzer a = new StandardAnalyzer();
>   a.setMaxTokenLength(5);
>   assertAnalyzesTo(a, "ab cd toolong xy z",
>                    new String[]{"ab", "cd", "toolo", "ng", "xy", "z"});
>   a.close();
> }
> {noformat}
> We should at least fix the javadocs ...
> (I hit this because I was trying to also add {{setMaxTokenLength}} to
> {{EnglishAnalyzer}}).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)