[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043454#comment-14043454 ]

Jack Krupansky commented on LUCENE-5785:
----------------------------------------

It is worth keeping in mind that a "token" isn't necessarily the same as a 
"term". It may indeed be desirable to limit the length of terms in the Lucene 
index for tokenized fields, but all too often an initial token is further 
broken down using token filters (e.g., word delimiter filter) so that the final 
term(s) are much shorter than the initial token. So, 256 may be a reasonable 
limit for indexed terms, but not a great limit for initial tokenization in a 
complex analysis chain.
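
For illustration, here is a minimal sketch (assuming the Lucene 4.x analysis 
APIs; constructor signatures vary a bit across versions) of a chain where the 
whitespace tokenizer emits one long token and the word delimiter filter breaks 
it into much shorter indexed terms:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenVsTermDemo {
  public static void main(String[] args) throws Exception {
    // One whitespace-delimited "token", but four much shorter indexed terms.
    String input = "alpha-beta-gamma-delta";
    WhitespaceTokenizer tok =
        new WhitespaceTokenizer(Version.LUCENE_48, new StringReader(input));
    TokenStream ts = new WordDelimiterFilter(
        tok, WordDelimiterFilter.GENERATE_WORD_PARTS, null);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // alpha, beta, gamma, delta
    }
    ts.end();
    ts.close();
  }
}

The 256-character cap applies to that initial token, before any such splitting 
happens, which is why it bites for semi-structured input even when the final 
terms would be short.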

Whether the default token length limit should be changed as part of this issue 
is an open question. Personally, I'd prefer a more reasonable limit such as 
4096, but as long as the limit can be raised via a tokenizer attribute, that 
should be enough for now.
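
For reference, the standard tokenizer already exposes the knob this issue asks 
for on the character tokenizer. A rough sketch (again assuming Lucene 4.x; the 
4096 here is just the value suggested above, not an existing default):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class LongTokenDemo {
  public static void main(String[] args) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++) sb.append('x'); // a single 1000-char "word"

    StandardTokenizer tok =
        new StandardTokenizer(Version.LUCENE_48, new StringReader(sb.toString()));
    // Raise the per-token limit well past the default; without this the
    // over-long token would not come through intact.
    tok.setMaxTokenLength(4096);

    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.length()); // 1000
    }
    tok.end();
    tok.close();
  }
}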


> White space tokenizer has undocumented limit of 256 characters per token
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5785
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5785
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.8.1
>            Reporter: Jack Krupansky
>            Priority: Minor
>
> The white space tokenizer breaks tokens at 256 characters, which is a 
> hard-wired limit of the character tokenizer abstract class.
> The limit of 256 is obviously fine for normal, natural language text, but 
> excessively restrictive for semi-structured data.
> 1. Document the current limit in the Javadoc for the character tokenizer. Add 
> a note to any derived tokenizers (such as the white space tokenizer) that 
> token size is limited as per the character tokenizer.
> 2. Add a setMaxTokenLength method to the character tokenizer, à la the 
> standard tokenizer, so that an application can control the limit. This should 
> probably go on the character tokenizer abstract class so that the derived 
> tokenizer classes can inherit it.
> 3. Disallow a token size limit of 0.
> 4. A limit of -1 would mean no limit.
> 5. Add a "token limit mode" method - "skip" (what the standard tokenizer 
> does), "break" (current behavior of the white space tokenizer and its derived 
> tokenizers), and "trim" (what I think a lot of people might expect).
> 6. Not sure whether to change the current behavior of the character tokenizer 
> ("break" mode) to match the standard tokenizer ("skip" mode), or to change it 
> to "trim" mode, which is my preference and likely what people would expect.
> 7. Add matching attributes to the tokenizer factories for Solr, including 
> Solr XML javadoc.
> At a minimum, this issue should address the documentation problem.
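
For anyone wanting to reproduce the "break" behavior described in point 5, a 
small sketch (assuming Lucene 4.8.x) that feeds a single 300-character word 
through the whitespace tokenizer; it comes back as two tokens, split at the 
character tokenizer's internal per-token limit:

import java.io.StringReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceLimitRepro {
  public static void main(String[] args) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 300; i++) sb.append('a'); // one long whitespace-free "word"

    WhitespaceTokenizer tok =
        new WhitespaceTokenizer(Version.LUCENE_48, new StringReader(sb.toString()));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // Expected: two lines, the first at the ~256-char limit and the
      // second the remainder, instead of a single 300-char token.
      System.out.println(term.length());
    }
    tok.end();
    tok.close();
  }
}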



