[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043432#comment-14043432 ] Shawn Heisey commented on LUCENE-5785: -- A hard limit on the number of characters in a token is probably unavoidable. Every tokenizer I've actually looked at has a limit. For some it's up to 4096 bytes. 256 bytes seems REALLY small, even for an abstract base class. I'd hope for 1024 as a minimum. Here's a radical idea: Make the limit configurable for all tokenizers, and expose that config option in the Solr schema. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Added the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043454#comment-14043454 ] Jack Krupansky commented on LUCENE-5785: It is worth keeping in mind that a token isn't necessarily the same as a term. It may indeed be desirable to limit the length of terms in the Lucene index for tokenized fields, but all too often an initial token is further broken down using token filters (e.g., word delimiter filter) so that the final term(s) are much shorter than the initial token. So, 256 may be a reasonable limit for indexed terms, but not a great limit for initial tokenization in a complex analysis chain. Whether the default token length limit should be changed as part of this issue is open. Personally I'd prefer a more reasonable limit such as 4096. But as long as the limit can be upped using a tokenizer attribute, that should be enough for now. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Added the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043670#comment-14043670 ] Jack Krupansky commented on LUCENE-5785: bq. Make the limit configurable for all tokenizers, and expose that config option in the Solr schema. I wouldn't mind having a Solr-only, core/schema-specific default setting. Not like max Boolean clause which was a Java static for Lucene and quite a mess in terms of the order cores were loaded. In short, leave the default as 256 in Lucene, but we could have Solr default to something much less restrictive, like 4096, and in addition to the tokenizer-specific attribute, the user could specify a global (for the core/schema) override. One key advantage of the schema-global override is that the user could leave the existing field types intact. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Added the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041194#comment-14041194 ] Jack Krupansky commented on LUCENE-5785: The pattern tokenizer can be used as a workaround for the white space tokenizer since it doesn't have that hard-wired token length limit. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Added the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org