[jira] [Comment Edited] (SOLR-10186) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

Erick Erickson (JIRA) Wed, 22 Feb 2017 07:20:59 -0800

    [ https://issues.apache.org/jira/browse/SOLR-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15878455#comment-15878455 ]


Erick Erickson edited comment on SOLR-10186 at 2/22/17 3:20 PM:
----------------------------------------------------------------

Because the only change we'd need to make for KeywordTokenizer is in 
KeywordTokenizerFactory, where we'd have to detect a new parameter and pass 
it to the KeywordTokenizer c'tor. The CharTokenizer-based factories would 
require changes in both the factories and the tokenizers themselves. It seems 
unnecessary to have two separate JIRAs, as all the changes would be pretty 
trivial.
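
A minimal sketch of the factory-side change described above, in plain Java. 
It does not use the real Lucene/Solr classes: SketchKeywordTokenizer and 
SketchKeywordTokenizerFactory are hypothetical stand-ins, and the parameter 
name "maxTokenLen" is an assumption made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in sketch; these are NOT the real Lucene/Solr classes. */
public class KeywordFactorySketch {

    /** Default size quoted in the issue description. */
    public static final int DEFAULT_BUFFER_SIZE = 256;

    /** Hypothetical tokenizer whose c'tor accepts the buffer size. */
    public static final class SketchKeywordTokenizer {
        public final int bufferSize;

        public SketchKeywordTokenizer(int bufferSize) {
            if (bufferSize <= 0) {
                throw new IllegalArgumentException("bufferSize must be > 0, got " + bufferSize);
            }
            this.bufferSize = bufferSize;
        }
    }

    /** Hypothetical factory that is sensitive to an optional parameter. */
    public static final class SketchKeywordTokenizerFactory {
        private final int maxTokenLen;

        public SketchKeywordTokenizerFactory(Map<String, String> args) {
            Map<String, String> remaining = new HashMap<>(args);
            // "maxTokenLen" is an assumed name; fall back to the default when absent.
            String raw = remaining.remove("maxTokenLen");
            this.maxTokenLen = (raw == null) ? DEFAULT_BUFFER_SIZE : Integer.parseInt(raw.trim());
            if (!remaining.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + remaining);
            }
        }

        /** Passes the configured value straight to the tokenizer c'tor. */
        public SketchKeywordTokenizer create() {
            return new SketchKeywordTokenizer(maxTokenLen);
        }
    }

    public static void main(String[] args) {
        Map<String, String> configured = new HashMap<>();
        configured.put("maxTokenLen", "1024");
        System.out.println(new SketchKeywordTokenizerFactory(configured).create().bufferSize);
        System.out.println(new SketchKeywordTokenizerFactory(new HashMap<>()).create().bufferSize);
    }
}
```

With the parameter absent the sketch keeps the existing default, so schemas 
that never mention it would behave as before.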


was (Author: erickerickson):
Because the only change we'd need to make for KeywordTokenizer is in 
KeywordTokenizerFactory where we'd have to be sensitive to the presence of a 
new parameter and use it in the KeywordTokenizer c'tor. The CharacterTokenizer 
based factories would require changes in both the factory methods and the 
tokenizers themselves and it seems silly to have two separate JIRAs.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10186
>                 URL: https://issues.apache.org/jira/browse/SOLR-10186
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>            Priority: Minor
>
> Is there a good reason that we hard-code a 256 character limit for the 
> CharTokenizer? Changing this limit requires people to copy/paste 
> incrementToken into a new class, since incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 
> characters), but doing so requires code rather than configuration in the 
> schema.
> For KeywordTokenizer, this is a Solr-only change. For the CharTokenizer 
> classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> and their factories, it would take adding a c'tor to the base class in Lucene 
> and using it in the factory.
> Any objections?
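
To illustrate what a configurable limit would mean for a whitespace-style 
tokenizer, here is a rough, self-contained Java sketch. It is not Lucene 
code: MaxLenWhitespaceSketch and its tokenize method are hypothetical, and it 
assumes that a run longer than the limit is split into chunks rather than 
dropped. It also works on Java chars, so a surrogate pair could be split at a 
chunk boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in sketch of a whitespace tokenizer with a configurable token limit. */
public class MaxLenWhitespaceSketch {

    /**
     * Splits text on whitespace. Any run of non-whitespace chars longer than
     * maxTokenLen is emitted as consecutive chunks of at most maxTokenLen chars
     * (an assumption of this sketch).
     */
    public static List<String> tokenize(String text, int maxTokenLen) {
        if (maxTokenLen <= 0) {
            throw new IllegalArgumentException("maxTokenLen must be > 0, got " + maxTokenLen);
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
                // The limit is a parameter here instead of a hard-coded constant.
                if (current.length() == maxTokenLen) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abcdef gh", 4)); // prints [abcd, ef, gh]
    }
}
```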



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
