[
https://issues.apache.org/jira/browse/LUCENE-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949082#comment-15949082
]
Steve Rowe commented on LUCENE-7760:
------------------------------------
+1
From
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201611.mbox/%[email protected]%3e],
where I most recently responded to a user question about the situation - this
should be useful as a seed for javadoc fixes:
{noformat}
The behavior you mention is an intentional change from the behavior in Lucene
4.9.0 and earlier, when tokens longer than maxTokenLength were silently
ignored: see LUCENE-5897[1] and LUCENE-5400[2].

The new behavior is as follows: token matching rules are no longer allowed to
match against input char sequences longer than maxTokenLength. If a rule that
would match a longer sequence also matches at maxTokenLength chars or fewer,
has the highest priority among all rules matching at that length, and no other
rule matches more chars, then a token is emitted for that rule at the matching
length, and rule-matching iteration simply continues from that point as
normal. If the same rule matches against the remainder of the sequence that
the first rule would have matched had maxTokenLength been longer, then another
token at the matched length is emitted, and so on. Note that this can result
in effectively splitting the sequence at maxTokenLength intervals, as you
noted.
You can fix the problem by setting maxTokenLength higher - this has the side
effect of growing the buffer and avoids the unwanted token splitting. If this
results in tokens larger than you would like, you can remove them with
LengthFilter.
FYI there is discussion on LUCENE-5897 about separating buffer size from
maxTokenLength, starting here:
<https://issues.apache.org/jira/browse/LUCENE-5897?focusedCommentId=14105729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14105729>
- ultimately I decided that few people would benefit from the increased
configuration complexity.
[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400
{noformat}
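The net effect described above - an over-long run being emitted in maxTokenLength-sized chunks rather than dropped, with LengthFilter available to discard the pieces - can be illustrated with a small self-contained sketch. This is not Lucene code: the whitespace split is an assumption standing in for StandardTokenizer's grammar-based rules, and `tokenize`/`lengthFilter` are hypothetical helpers approximating the observed behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxTokenLengthSketch {

    // Approximates the observed post-LUCENE-5897 behavior: a word longer
    // than maxTokenLength is emitted in maxTokenLength-sized chunks
    // instead of being silently dropped. Whitespace splitting stands in
    // for the real grammar rules.
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            for (int i = 0; i < word.length(); i += maxTokenLength) {
                tokens.add(word.substring(i,
                        Math.min(i + maxTokenLength, word.length())));
            }
        }
        return tokens;
    }

    // Approximates LengthFilter: keeps only tokens whose length falls
    // within [min, max].
    static List<String> lengthFilter(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() >= min && t.length() <= max) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same input as the test case in the issue description below.
        List<String> tokens = tokenize("ab cd toolong xy z", 5);
        System.out.println(tokens); // [ab, cd, toolo, ng, xy, z]
        // Dropping the split fragments of "toolong" after the fact:
        System.out.println(lengthFilter(tokens, 1, 4)); // [ab, cd, ng, xy, z]
    }
}
```

Note that `lengthFilter` cannot distinguish a chunk of a too-long word from a legitimately short token (the leftover "ng" above survives a min length of 1), which is why raising maxTokenLength first, then filtering, is the suggested order.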
> StandardAnalyzer/Tokenizer.setMaxTokenLength's javadocs are lying
> -----------------------------------------------------------------
>
> Key: LUCENE-7760
> URL: https://issues.apache.org/jira/browse/LUCENE-7760
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.6
>
>
> The javadocs claim that too-long tokens are discarded, but in fact they are
> simply chopped up. The following test case unexpectedly passes:
> {noformat}
> public void testMaxTokenLengthNonDefault() throws Exception {
>   StandardAnalyzer a = new StandardAnalyzer();
>   a.setMaxTokenLength(5);
>   assertAnalyzesTo(a, "ab cd toolong xy z",
>                    new String[]{"ab", "cd", "toolo", "ng", "xy", "z"});
>   a.close();
> }
> {noformat}
> We should at least fix the javadocs ...
> (I hit this because I was trying to also add {{setMaxTokenLength}} to
> {{EnglishAnalyzer}}).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)