[
https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540073#comment-13540073
]
Chris A. Mattmann commented on LUCENE-3413:
-------------------------------------------
Hi Guys, there seems to be some interest on the list in such a capability:
http://lucene.472066.n3.nabble.com/Which-token-filter-can-combine-2-terms-into-1-td4028482.html
(or at least it sounds similar). Is anyone interested in working with me to
commit this?
> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
> Key: LUCENE-3413
> URL: https://issues.apache.org/jira/browse/LUCENE-3413
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 2.9.3
> Reporter: Chris A. Mattmann
> Priority: Minor
> Attachments: LUCENE-3413.Mattmann.090311.patch.txt,
> LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles (e.g., of books), such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping),
> etc. I created an analysis chain in Solr for this that was based off of
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
>            omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>     -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be
>          useful when you want your sorting to be case insensitive
>     -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>     <!-- The PatternReplaceFilter gives you the flexibility to use
>          Java Regular expression to replace any sequence of characters
>          matching a pattern with an arbitrary replacement string,
>          which may include back references to portions of the original
>          string matched by the pattern.
>
>          See the Java Regular Expression documentation for more
>          information on pattern and replacement string syntax.
>
>          http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>     -->
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^a-z])" replacement="" replace="all"
>     />
>   </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or
> synonyms, because those filters operate on individual tokens, whereas
> KeywordTokenizer emits the full string as a single token (it does no
> tokenization). I needed a filter that would allow me to change alphaOnlySort
> and its analysis chain from using KeywordTokenizer to using
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So,
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. It doesn't do it super
> efficiently I'm guessing (since I used a StringBuffer), but I'm open to
> suggestions on how to make it better.
> One other thing: this analyzer apparently works fine for analysis
> (i.e., it produces the desired tokens); however, when sorting in Solr I'm
> getting null sort tokens. I need to figure out why.
> Here ya go!
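
For anyone skimming the quoted description, here is a rough plain-Java sketch of the chain it describes: split on whitespace, lowercase, strip non-letters, drop stopwords, then recombine into one sort key. It is only an illustration and not the code in the attached patch. It has no Lucene dependency, and the names CombineSketch, analyze, combine, and the one-word stopword list are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CombineSketch {
    // Hypothetical stopword list; a real Solr chain would read one from a file.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the"));

    // Mimics whitespace tokenization, lowercasing, the [^a-z] pattern
    // replacement from alphaOnlySort, and stopword removal.
    public static List<String> analyze(String title) {
        List<String> tokens = new ArrayList<>();
        for (String raw : title.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The recombination step: concatenate all surviving tokens into one string.
    public static String combine(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Intermediate tokens, then the single combined sort key.
        System.out.println(analyze("The Grapes of Wrath"));
        System.out.println(combine(analyze("The Grapes of Wrath")));
    }
}
```

The sketch uses StringBuilder rather than StringBuffer because it skips the synchronization that StringBuffer performs on every call. Whether that makes a measurable difference inside a real TokenFilter is not something this example tests.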