[jira] Commented: (SOLR-211) regex split() Tokenizer

Ken Krugler (JIRA) Wed, 25 Apr 2007 17:26:35 -0700

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491852 ]


Ken Krugler commented on SOLR-211:
----------------------------------

I think we must be working on similar types of projects :)

I did something similar to the above, but in two different ways:

# I extended WhitespaceTokenizerFactory to take optional pattern & replacement 
parameters. If these exist, then I apply them to the input before the tokenizer 
gets called. This lets me take a bunch of XML going into a Solr field and strip 
out everything except the content of the one XML field that I want to index.
# I added a CSVTokenizerFactory, which takes an optional split character and an 
optional remapping file. This lets me get a field like "Java,Python,C#" and 
turn it into "java python csharp", which are the index tokens I need, while 
leaving the display text as-is.
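
As a rough illustration of the first approach, here is a minimal sketch in plain Java, with no Solr or Lucene classes involved. The class name, the tokenize() helper, the sample XML, and the <content> element are all made up for the example; this is not the actual factory code, just the "replace first, then split on whitespace" idea under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: not the real WhitespaceTokenizerFactory extension.
class PatternThenWhitespace {

    // Apply the optional pattern/replacement, then split on whitespace.
    // A null pattern means "no preprocessing", mirroring the optional parameters.
    static List<String> tokenize(String input, Pattern pattern, String replacement) {
        String cleaned = (pattern == null)
                ? input
                : pattern.matcher(input).replaceAll(replacement);

        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed pattern: keep only the text inside a <content> element.
        Pattern keepContent = Pattern.compile("(?s).*?<content>(.*?)</content>.*");
        String xml = "<doc><id>42</id><content>regex split tokenizer</content></doc>";

        // Prints [regex, split, tokenizer]
        System.out.println(tokenize(xml, keepContent, "$1"));
    }
}
```

Note that if the pattern does not match at all, the input passes through unchanged and the raw XML gets whitespace-tokenized, so a real version would probably want to handle that case explicitly.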

I don't know if your new PatternTokenizerFactory could replace either of these, 
though. For the first case, I still want the whitespace tokenization after 
I've stripped off all the junk I don't want. And for the second, I need to be 
able to do the remapping.
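
To make the second case concrete, here is a small plain-Java sketch of splitting on a separator and then remapping each value. Everything in it is an assumption for illustration: the class and method names are invented, an in-memory Map stands in for the remapping file (whose format isn't described here), and lowercasing is done in the helper only because the example output above is lowercase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: not the real CSVTokenizerFactory.
class CsvRemapSketch {

    // Split on a single separator character, lowercase each value (an assumption),
    // and replace it with its remapped form when the map has an entry for it.
    static List<String> tokenize(String field, char separator, Map<String, String> remap) {
        List<String> tokens = new ArrayList<>();
        // Pattern.quote keeps separators such as '|' or '.' from acting as regex syntax.
        for (String raw : field.split(Pattern.quote(String.valueOf(separator)))) {
            String key = raw.trim().toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue;
            }
            tokens.add(remap.getOrDefault(key, key));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stand-in for the optional remapping file.
        Map<String, String> remap = new HashMap<>();
        remap.put("c#", "csharp");

        // Prints [java, python, csharp]
        System.out.println(tokenize("Java,Python,C#", ',', remap));
    }
}
```

A bare string.split( regex ) covers the splitting step, but the lookup in the loop is the part that a split-only tokenizer would not give me.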

> regex split() Tokenizer
> -----------------------
>
>                 Key: SOLR-211
>                 URL: https://issues.apache.org/jira/browse/SOLR-211
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Ryan McKinley
>         Assigned To: Ryan McKinley
>         Attachments: SOLR-211-RegexSplitTokenizer.patch, 
> SOLR-211-RegexSplitTokenizer.patch, SOLR-211-RegexSplitTokenizer.patch
>
>
> A TokenizerFactory that makes tokens from:
>   string.split( regex );

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
