[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041194#comment-14041194 ]

Jack Krupansky commented on LUCENE-5785:
----------------------------------------

The pattern tokenizer can be used as a workaround for the whitespace tokenizer, 
since the pattern tokenizer doesn't have that hard-wired token length limit.
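To illustrate why the workaround helps: with group = -1, Lucene's PatternTokenizer uses the regex as a delimiter, so token length is bounded only by the input, not by a fixed buffer. The class and method names below are hypothetical stand-ins, not Lucene API; this is a minimal stdlib sketch of the same split semantics, showing a 300-character run surviving intact where the whitespace tokenizer would break it at 256.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of delimiter-style splitting (PatternTokenizer with group = -1):
// tokens are whatever lies between whitespace runs, with no length cap.
public class PatternSplitDemo {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : WHITESPACE.split(input)) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A 300-character "token" of semi-structured data, e.g. a long identifier.
        String longToken = "x".repeat(300);
        List<String> tokens = tokenize("foo " + longToken + " bar");
        // The whitespace tokenizer would split the 300-char run into two pieces;
        // pattern-based splitting keeps it whole.
        System.out.println(tokens.size() + " tokens, longest = "
                + tokens.get(1).length() + " chars");
    }
}
```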


> White space tokenizer has undocumented limit of 256 characters per token
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5785
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5785
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.8.1
>            Reporter: Jack Krupansky
>            Priority: Minor
>
> The white space tokenizer breaks tokens at 256 characters, which is a 
> hard-wired limit of the character tokenizer abstract class.
> The limit of 256 is obviously fine for normal, natural language text, but 
> excessively restrictive for semi-structured data.
> 1. Document the current limit in the Javadoc for the character tokenizer. Add 
> a note to any derived tokenizers (such as the white space tokenizer) that 
> token size is limited as per the character tokenizer.
> 2. Add a setMaxTokenLength method to the character tokenizer, à la the 
> standard tokenizer, so that an application can control the limit. This should 
> probably be added to the character tokenizer abstract class, so that derived 
> tokenizer classes can inherit it.
> 3. Disallow a token size limit of 0.
> 4. A limit of -1 would mean no limit.
> 5. Add a "token limit mode" method - "skip" (what the standard tokenizer 
> does), "break" (the current behavior of the white space tokenizer and its 
> derived tokenizers), and "trim" (what I think a lot of people might expect).
> 6. Not sure whether to change the current behavior of the character tokenizer 
> ("break" mode) to match the standard tokenizer ("skip" mode), or to "trim" 
> mode, which is my preference and likely what most people would expect.
> 7. Add matching attributes to the tokenizer factories for Solr, including 
> Solr XML javadoc.
> At a minimum, this issue should address the documentation problem.
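The proposed limit modes in items 3-6 can be sketched as follows. This is a hypothetical illustration, not existing Lucene code: the enum, method, and class names are invented here, and the "skip" characterization of the standard tokenizer follows the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed "token limit mode" behaviors.
// None of these names exist in Lucene; they illustrate proposals 3-6 above.
public class LimitModeDemo {
    enum LimitMode { SKIP, BREAK, TRIM }

    static List<String> applyLimit(List<String> tokens, int maxLen, LimitMode mode) {
        if (maxLen == -1) return new ArrayList<>(tokens); // -1 means no limit (item 4)
        if (maxLen <= 0) {
            throw new IllegalArgumentException("limit must be positive or -1"); // item 3
        }
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() <= maxLen) { out.add(t); continue; }
            switch (mode) {
                case SKIP:  // standard tokenizer behavior: drop the oversized token
                    break;
                case BREAK: // current whitespace tokenizer behavior: emit fixed-size pieces
                    for (int i = 0; i < t.length(); i += maxLen) {
                        out.add(t.substring(i, Math.min(i + maxLen, t.length())));
                    }
                    break;
                case TRIM:  // keep the first maxLen chars, discard the rest
                    out.add(t.substring(0, maxLen));
                    break;
            }
        }
        return out;
    }
}
```

For example, with maxLen = 4, the token "abcdef" is dropped in SKIP mode, becomes "abcd" and "ef" in BREAK mode, and becomes "abcd" in TRIM mode.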



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
