Note that my example does not actually use PatternReplaceCharFilterFactory twice: the second one is a PatternReplaceFilterFactory (there is no "Char" in its name).
CharFilters operate before tokenizers, and regular filters operate after tokenizers.

Steve

> -----Original Message-----
> From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
> Sent: Wednesday, November 30, 2011 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Leaving certain tokens intact during indexing and search
>
> That's pretty helpful, thanks! Especially since I didn't understand so far
> that I could use a filter like PatternReplaceCharFilterFactory both as a
> charFilter and as a filter.
>
> In the meantime I had figured out another alternative,
> involving WordDelimiterFilterFactory. But I had to
> use WhitespaceTokenizerFactory instead of StandardTokenizerFactory, which
> means that I had to use extra PatternReplaceCharFilterFactory filters to
> get rid of leading/trailing punctuation.
>
> Again, thanks!
>
> Marian
>
> 2011/11/30 Steven A Rowe <sar...@syr.edu>
>
> > Hi Marian,
> >
> > Extending the StandardTokenizer(Factory) java class is not the way to go
> > if you want to change its behavior.
> >
> > StandardTokenizer is generated from a JFlex <http://jflex.de/>
> > specification, so you would need to modify the specification to include
> > your special slash-containing-word rule, then regenerate the java code, and
> > then compile it.
> >
> > It would be much simpler to use a PatternReplaceCharFilter
> > <http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html>
> > to convert the slashes into unusual (sequences of) characters that won't be
> > broken up by the analyzer you're using, then add a PatternReplaceFilter to
> > convert the unusual sequences back to slashes. E.g. if you used "-blah-"
> > as the unusual sequence (note: people have also reported using a single
> > character drawn from a script that would otherwise not be used in the text,
> > e.g. a Chinese ideograph in English text), "AB/1234/5678" could become
> > "AB-blah-1234-blah-5678".
> >
> > Here's an (untested!) analyzer specification that would do this:
> >
> > <analyzer>
> >   <charFilter class="solr.PatternReplaceCharFilterFactory"
> >               pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
> >               replacement="$1-blah-$2-blah-$3"/>
> >   <tokenizer class="solr.StandardTokenizerFactory"/>
> >   <filter class="solr.PatternReplaceFilterFactory" pattern="-blah-"
> >           replacement="/" replace="all"/>
> >   <filter class="solr.LowerCaseFilterFactory"/>
> > </analyzer>
> >
> > Steve
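
For anyone following the thread, here is a rough Python simulation of the ordering described above: the char filter rewrites the raw text before tokenization, and the token filters rewrite each token afterward. It does not use Solr or Lucene at all. The tokenizer is a hypothetical stand-in that simply splits on whitespace, slashes, and a few punctuation characters, so it says nothing about whether StandardTokenizer itself would keep "-blah-" intact. The function names are made up for illustration.

```python
import re

# Hypothetical stand-in for a tokenizer: tokens are runs of characters that
# are not whitespace, slashes, or common punctuation. Real Solr tokenizers
# follow different rules; this only mimics "slashes split tokens".
TOKEN_RE = re.compile(r"[^\s/,.;:!?]+")

# Same pattern as the charFilter in the quoted analyzer. Python writes group
# references as \1 where the Solr config writes $1.
SLASH_ID_RE = re.compile(r"([A-Z]+)/([0-9]+)/([0-9]+)")
PLACEHOLDER = "-blah-"


def char_filter(text):
    """Stage 1: runs on the raw text, before tokenization."""
    return SLASH_ID_RE.sub(r"\1-blah-\2-blah-\3", text)


def tokenize(text):
    """Stage 2: split the (already rewritten) text into tokens."""
    return TOKEN_RE.findall(text)


def token_filters(tokens):
    """Stage 3: runs on each token; restore slashes, then lowercase."""
    return [tok.replace(PLACEHOLDER, "/").lower() for tok in tokens]


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


def analyze_without_char_filter(text):
    return token_filters(tokenize(text))


if __name__ == "__main__":
    sample = "Order AB/1234/5678 shipped."
    print(analyze(sample))
    print(analyze_without_char_filter(sample))
```

With the char filter in place the slash-containing identifier survives as one token; without it, the stand-in tokenizer splits it into three. The token-level replacement cannot be moved in front of the tokenizer, because at that point the slashes would come back and be split again.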