On 10/6/09 3:32 PM, "Chris Hostetter" <hossman_luc...@fucit.org> wrote:
> : I'll try to explain with an example. Given the term 'it!' in the title, it
> : should match both 'it' and 'it!' in the query as an exact match. Currently,
> : this is done by using a synonym entry (and index time SynonymFilter) as
> : follows:
> :
> : it! => it, it!
> :
> : Now, the above holds true for all cases where you have a title token of the
> : form [aA-zZ]*!. Handling all of those cases requires adding synonyms
> : manually for each case which is not easy to manage and does not scale.
> :
> : I am hoping to do the same by using an index time filter that takes in a
> : pattern like the PatternReplace filter and adds the newly created token
> : instead of replacing the original one. Does this make sense? Am I missing
> : something that would break this approach?
>
> something like this would be fairly easy to implement in Lucene, but
> somewhat confusing to try and configure in Solr. I was going to suggest
> that you use something like...
>
> <filter class="solr.PatternReplaceFilterFactory"
> pattern="(^.*)\!?$)" replacement="$1 $2" replace="all" />
>
> ..and then have a subsequent filter that splits the tokens on the
> whitespace (or any other special character you could use in the
> replacement) ... but apparently we don't have any built in filters that
> will just split tokens on a character/pattern for you. that would also be
> fairly easy to write if someone wants to submit a patch.

There is a solr.PatternTokenizerFactory class which likely fits the bill in
this case. The related question I have is this: is it possible to have
multiple Tokenizers in your analysis chain?

Prasanna.
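
P.S. For reference, a minimal sketch of the "add the new token instead of
replacing the original" idea discussed above, in plain Java with no Lucene or
Solr dependency. The class name KeepOriginalExpander and the method expand are
hypothetical, made up for illustration; this is not an existing filter. It
only shows the per-token logic. A real Lucene TokenFilter would presumably
emit the extra token at the same position as the original, the way synonyms
are stacked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr: illustrates the logic only.
final class KeepOriginalExpander {

    // For each input token, returns the tokens to index at that position:
    // the original first, then the rewritten form if the whole token matches
    // the pattern and the rewrite is non-empty and differs from the original.
    static List<List<String>> expand(List<String> tokens, Pattern pattern, String replacement) {
        List<List<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            List<String> stacked = new ArrayList<>();
            stacked.add(token); // always keep the original token
            Matcher m = pattern.matcher(token);
            if (m.matches()) {
                String rewritten = m.replaceAll(replacement);
                if (!rewritten.isEmpty() && !rewritten.equals(token)) {
                    stacked.add(rewritten); // extra token, same position
                }
            }
            positions.add(stacked);
        }
        return positions;
    }
}
```

Called with the tokens "is", "it!", "ok", the pattern ^([a-zA-Z]+)!$ and the
replacement $1, the second position carries both "it!" and "it", while the
other tokens pass through unchanged. That is the effect of the synonym entry
above, without one entry per term.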