That's pretty helpful, thanks! Especially since I hadn't realized until now that a pattern replacement can be applied both as a charFilter (PatternReplaceCharFilterFactory) and as a filter (PatternReplaceFilterFactory).
In the meantime I had figured out another alternative, involving WordDelimiterFilterFactory. But I had to use WhitespaceTokenizerFactory instead of StandardTokenizerFactory, which means that I had to use extra PatternReplaceCharFilterFactory filters to get rid of leading/trailing punctuation.

Again, thanks!
Marian

2011/11/30 Steven A Rowe <sar...@syr.edu>

> Hi Marian,
>
> Extending the StandardTokenizer(Factory) java class is not the way to go
> if you want to change its behavior.
>
> StandardTokenizer is generated from a JFlex <http://jflex.de/>
> specification, so you would need to modify the specification to include
> your special slash-containing-word rule, then regenerate the java code, and
> then compile it.
>
> It would be much simpler to use a PatternReplaceCharFilter
> <http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html>
> to convert the slashes into unusual (sequences of) characters that won't be
> broken up by the analyzer you're using, then add a PatternReplaceFilter to
> convert the unusual sequences back to slashes. E.g. if you used "-blah-"
> as the unusual sequence (note: people have also reported using a single
> character drawn from a script that would otherwise not be used in the text,
> e.g. a Chinese ideograph in English text), "AB/1234/5678" could become
> "AB-blah-1234-blah-5678".
>
> Here's an (untested!) analyzer specification that would do this:
>
> <analyzer>
>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>               pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
>               replacement="$1-blah-$2-blah-$3"/>
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="-blah-"
>           replacement="/" replace="all"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
>
> Steve
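
For anyone reading this thread later, here is a rough Python sketch of the order of operations in Steve's analyzer: replace on the raw text, tokenize, then replace back on each token. It is only an illustration and not Solr code. The function names are made up, and toy_tokenizer is a deliberately simple stand-in that splits on whitespace and slashes only. Whether the real StandardTokenizer keeps "-blah-" intact depends on the Solr version, so that part would still need checking.

```python
import re

def char_filter(text):
    # Plays the role of the charFilter: runs on the raw text before
    # tokenization and hides the slashes behind a placeholder.
    return re.sub(r"([A-Z]+)/([0-9]+)/([0-9]+)", r"\1-blah-\2-blah-\3", text)

def toy_tokenizer(text):
    # ASSUMPTION: a toy stand-in, not StandardTokenizer. It splits on
    # whitespace and slashes only, so the placeholder survives.
    return [t for t in re.split(r"[\s/]+", text) if t]

def token_filter(tokens):
    # Plays the role of the two filters: runs on each token after
    # tokenization, restores the slashes, then lowercases.
    return [t.replace("-blah-", "/").lower() for t in tokens]

def analyze(text):
    return token_filter(toy_tokenizer(char_filter(text)))

print(analyze("order AB/1234/5678 and/or"))
print(toy_tokenizer("AB/1234/5678"))
```

The first call keeps the identifier together as "ab/1234/5678" while "and/or" is still split, because the pattern only matches letters/digits/digits. The second call skips the char filter and shows the identifier being broken into three tokens.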