[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041194#comment-14041194 ]
Jack Krupansky commented on LUCENE-5785:
----------------------------------------

The pattern tokenizer can be used as a workaround for the white space tokenizer, since it doesn't have that hard-wired token length limit.

> White space tokenizer has undocumented limit of 256 characters per token
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5785
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5785
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.8.1
>            Reporter: Jack Krupansky
>            Priority: Minor
>
> The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class.
> The limit of 256 is obviously fine for normal, natural-language text, but excessively restrictive for semi-structured data.
> 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer.
> 2. Add a setMaxTokenLength method to the character tokenizer, as in the standard tokenizer, so that an application can control the limit. This should probably be added to the character tokenizer abstract class, so that other derived tokenizer classes can inherit it.
> 3. Disallow a token size limit of 0.
> 4. A limit of -1 would mean no limit.
> 5. Add a "token limit mode" method - "skip" (what the standard tokenizer does), "break" (the current behavior of the white space tokenizer and its derived tokenizers), and "trim" (what I think a lot of people might expect).
> 6. Not sure whether to change the current behavior of the character tokenizer ("break" mode) to match the standard tokenizer, or to make it "trim" mode, which is my choice and likely what people would expect.
> 7. Add matching attributes to the tokenizer factories for Solr, including the Solr XML javadoc.
> At a minimum, this issue should address the documentation problem.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
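For readers following along in Solr, the workaround mentioned in the comment can be sketched as a field type that uses the pattern tokenizer to split on whitespace. The field type name below is arbitrary; only the tokenizer line carries the idea (group="-1" makes the pattern act as a split delimiter):

```xml
<!-- Illustrative sketch only: a whitespace-splitting field type that avoids
     the 256-character token limit by using the pattern tokenizer. -->
<fieldType name="text_ws_unlimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+" group="-1"/>
  </analyzer>
</fieldType>
```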
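To make the three proposed limit modes in item 5 concrete, here is a toy sketch in plain Java. This is not the Lucene API — the `LimitMode` enum and `tokenize` method are hypothetical names invented for illustration; the real tokenizers work incrementally over character buffers rather than via `String.split`:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the three proposed token-limit modes; not Lucene code.
public class TokenLimitModes {
    enum LimitMode { SKIP, BREAK, TRIM }

    static List<String> tokenize(String text, int maxTokenLength, LimitMode mode) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) continue;
            if (raw.length() <= maxTokenLength) {
                tokens.add(raw);
            } else {
                switch (mode) {
                    case SKIP:
                        // Standard-tokenizer behavior: drop the oversized token entirely.
                        break;
                    case BREAK:
                        // Current whitespace-tokenizer behavior: emit fixed-size chunks.
                        for (int i = 0; i < raw.length(); i += maxTokenLength) {
                            tokens.add(raw.substring(i, Math.min(i + maxTokenLength, raw.length())));
                        }
                        break;
                    case TRIM:
                        // Keep the first maxTokenLength chars, discard the remainder.
                        tokens.add(raw.substring(0, maxTokenLength));
                        break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "abc xxxxxxxxxx"; // second token is 10 chars, limit is 4
        System.out.println(tokenize(input, 4, LimitMode.SKIP));  // prints [abc]
        System.out.println(tokenize(input, 4, LimitMode.BREAK)); // prints [abc, xxxx, xxxx, xx]
        System.out.println(tokenize(input, 4, LimitMode.TRIM));  // prints [abc, xxxx]
    }
}
```

The sketch shows why "trim" is the least surprising choice for most users: the token stream keeps one token per input word, whereas "break" multiplies tokens and "skip" silently loses them.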