RE: Leaving certain tokens intact during indexing and search

Steven A Rowe Wed, 30 Nov 2011 07:34:51 -0800

Hi Marian,

Extending the StandardTokenizer(Factory) Java class is not the way to go if you 
want to change its behavior.

StandardTokenizer is generated from a JFlex <http://jflex.de/> specification, 
so you would need to modify the specification to include your special 
slash-containing-word rule, then regenerate the Java code, and then compile it.

It would be much simpler to use a PatternReplaceCharFilter 
<http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html>
to convert the slashes into an unusual character sequence that won't be 
broken up by the analyzer you're using, then add a PatternReplaceFilter to 
convert the unusual sequence back to slashes.  E.g. if you used "-blah-" as 
the unusual sequence, "AB/1234/5678" would become "AB-blah-1234-blah-5678".  
(People have also reported using a single character drawn from a script that 
would otherwise not appear in the text, e.g. a Chinese ideograph in English 
text.)

Here's an (untested!) analyzer specification that would do this:

<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
              replacement="$1-blah-$2-blah-$3"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="-blah-" replacement="/" replace="all"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
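
If you want to sanity-check the two regular expressions before wiring them 
into Solr, here is a minimal plain-Java sketch. It only mimics the two regex 
steps with java.util.regex; it does not run the tokenizer, and the class and 
method names (SlashRoundTrip, protect, restore) are made up for illustration, 
not part of Solr or Lucene.

```java
import java.util.regex.Pattern;

public class SlashRoundTrip {
    // Same pattern as the charFilter in the analyzer spec above.
    private static final Pattern PROTECT =
        Pattern.compile("([A-Z]+)/([0-9]+)/([0-9]+)");

    // Placeholder sequence from the example; any sequence absent from the text would do.
    private static final String PLACEHOLDER = "-blah-";

    // Step 1 (charFilter stage): hide the slashes in matching identifiers.
    static String protect(String text) {
        return PROTECT.matcher(text)
                      .replaceAll("$1" + PLACEHOLDER + "$2" + PLACEHOLDER + "$3");
    }

    // Step 2 (token filter stage): turn every placeholder back into a slash.
    static String restore(String token) {
        return token.replace(PLACEHOLDER, "/");
    }

    public static void main(String[] args) {
        String input = "see AB/1234/5678 and/or foo";
        String hidden = protect(input);
        System.out.println(hidden);          // see AB-blah-1234-blah-5678 and/or foo
        System.out.println(restore(hidden)); // see AB/1234/5678 and/or foo
    }
}
```

Note that the slash in "and/or" is left alone, since it doesn't match the 
pattern. Whether the placeholder actually survives tokenization depends on the 
tokenizer you use, which this sketch does not exercise, so check the result on 
Solr's analysis admin page.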

Steve

> -----Original Message-----
> From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
> Sent: Wednesday, November 30, 2011 9:41 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Leaving certain tokens intact during indexing and search
> 
> Thanks for the quick response!
> 
> Are you saying that I should extend WhitespaceTokenizerFactory to create
> my own? Or should I simply use it?
> 
> Because, I guess tokenizing on spaces wouldn't be enough. I would need
> tokenizing on slashes in other positions, just not within strings matching
> ([A-Z]+/[0-9]+/[0-9]+).
> 
> Marian
> 
> 
> 2011/11/30 Erick Erickson <erickerick...@gmail.com>
> 
> > There are about a zillion tokenizers; for what you're describing,
> > WhitespaceTokenizerFactory is a good candidate.
> >
> > See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> > for a partial list, and it has links to the authoritative docs.
> >
> > Best
> > Erick
> >
> >
