CombiningFilter to recombine tokens into a single token for sorting
-------------------------------------------------------------------

                 Key: LUCENE-3413
                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
             Project: Lucene - Java
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 2.9.3
            Reporter: Chris A. Mattmann
            Priority: Minor


I whipped up this CombiningFilter for the following use case:

I've got a bunch of titles (e.g., of books), such as:

The Grapes of Wrath
Tommy Tommerson saves the World
Top of the World
The Tales of Beedle the Bard
Born Free

etc.

I want to sort these titles using a String field that includes stopword 
analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping), 
etc. I created an analysis chain in Solr for this that was based off of 
*alphaOnlySort*, which looks like this:

{code:xml}
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" 
omitNorms="true">
   <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             useful when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java Regular expression to replace any sequence of characters
             matching a pattern with an arbitrary replacement string, 
             which may include back references to portions of the original
             string matched by the pattern.
             
             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.
             
             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
          -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        /> 
   </analyzer>
</fieldType>

{code}

The issue with alphaOnlySort is that it doesn't support stopword removal or 
synonyms: those filters operate on individual tokens, whereas KeywordTokenizer 
(which does no tokenization) emits the full string as a single token. I needed 
a filter that would let me switch alphaOnlySort's analysis chain from 
KeywordTokenizer to WhitespaceTokenizer, plus a way to recombine the tokens at 
the end. Take "The Grapes of Wrath". I needed a way for it to get turned 
into:

{noformat}
grapes of wrath
{noformat}

And then to combine those tokens into a single token:

{noformat}
grapesofwrath
{noformat}

The attached CombiningFilter takes care of that. It probably doesn't do it 
very efficiently (I used a StringBuffer), but I'm open to suggestions on how 
to make it better. 
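
For anyone who wants the gist of the combining step without opening the 
attachment, it can be sketched roughly as below. This is a standalone 
illustration and not the attached filter: it does not use the Lucene 
TokenStream API at all, and the class name CombineSketch, its two methods, and 
the tiny stopword set are hypothetical names made up for the example. It uses 
StringBuilder, the unsynchronized counterpart of StringBuffer, on the 
assumption that the buffer is only touched by one thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the analysis chain; not the attached CombiningFilter.
class CombineSketch {
    // Assumed stopword set, for illustration only. A real chain would load
    // its stopwords from configuration.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    // Mimics: whitespace tokenizing, lowercasing, stripping non a-z
    // characters, and stopword removal.
    static List<String> analyze(String title) {
        List<String> tokens = new ArrayList<>();
        for (String raw : title.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The combining step: concatenate all surviving tokens into one string.
    static String combine(List<String> tokens) {
        StringBuilder combined = new StringBuilder();
        for (String token : tokens) {
            combined.append(token);
        }
        return combined.toString();
    }

    public static void main(String[] args) {
        // prints "grapesofwrath"
        System.out.println(combine(analyze("The Grapes of Wrath")));
    }
}
```

With the assumed stopword set, "Born Free" comes out as bornfree and "The 
Tales of Beedle the Bard" as talesofbeedlebard, so the titles sort on their 
first significant word.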

One other thing: this analyzer apparently works fine for analysis (e.g., it 
produces the desired tokens), but when sorting in Solr I'm getting null sort 
tokens. I still need to figure out why. 

Here ya go!



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
