Question about Solr Fieldtypes, Chaining of Tokenizers

Matthew Hall Fri, 03 Dec 2010 10:15:03 -0800

Hey folks, I'm working with a fairly specific set of requirements for our corpus that need a somewhat tricky text type for both indexing and searching.


The chain currently looks like this:


<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
               pattern="(.*?)(\p{Punct}*)$"
               replacement="$1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />

<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

<filter class="solr.PatternReplaceFilterFactory"
               pattern="\p{Punct}"
               replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

Now you will notice that I'm trying to add a second tokenizer to this chain at the very end. This is because of the final replacement of punctuation with whitespace: at that point I'd like to break these tokens up further into smaller tokens.

The reason for this is that our corpus mixes normal English words with scientific text. For example, you could expect a string like "The symposium of Tg<The>(RX3fg+and) gene studies" being added to the index, and parts of those phrases being searched on.

We want to be able to remove the stopwords in the mostly English parts of these types of statements, which the whitespace tokenizer, followed by removing trailing punctuation, followed by the stop filter, takes care of. We do not want to remove references to genetic information contained in allele symbols and the like.
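To make the intended behavior concrete, here is a rough Python sketch of it outside Solr. It is only an approximation of the chain above, not Solr code: the `analyze` function and the `STOPWORDS` set are made up for illustration (the real list lives in stopwords.txt), stemming and position increments are left out, and `string.punctuation` is assumed to stand in for Java's `\p{Punct}`.

```python
import re
import string

# Hypothetical stopword list for illustration; the real one is stopwords.txt.
STOPWORDS = {"the", "of", "and", "a", "an", "in"}

PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"
TRAILING_PUNCT = re.compile(PUNCT_CLASS + "+$")
ANY_PUNCT = re.compile(PUNCT_CLASS)


def analyze(text):
    """Approximate the desired chain, including the final re-tokenization."""
    tokens = text.split()                                  # whitespace tokenizer
    tokens = [TRAILING_PUNCT.sub("", t) for t in tokens]   # strip trailing punctuation
    tokens = [t.lower() for t in tokens]                   # lowercase
    # Stopwords are removed only when they are whole tokens at this stage.
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    # Stemming is omitted in this sketch.
    result = []
    for t in tokens:
        # Replace remaining punctuation with spaces, then split again.
        # This is the step that a second tokenizer would have provided.
        result.extend(ANY_PUNCT.sub(" ", t).split())
    return result


print(analyze("The symposium of Tg<The>(RX3fg+and) gene studies"))
# ['symposium', 'tg', 'the', 'rx3fg', 'and', 'gene', 'studies']
```

Note that the standalone "The" and "of" are dropped, while the "the" and "and" inside the allele symbol survive, because the stopword step runs before the symbol is broken apart.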

Sadly, as far as I can tell, you cannot chain tokenizers in the schema.xml, so does anyone have some suggestions on how this could be accomplished?

Oh, and let me add that the WordDelimiterFilter comes really close to what I want, but since we are unwilling to upgrade our Solr version to trunk (we are on 1.4.x at the moment), the inability to turn off the automatic phrase queries makes it a no go. We need to be able to make searches on "left/right" match "right/left."
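The difference between the two matching behaviors can be sketched in a few lines of Python. This is a toy model, not how Solr evaluates queries: `split_on_punct`, `matches_as_phrase`, and `matches_any_order` are hypothetical helpers, and "any order" is modeled as every query term being present in the indexed terms.

```python
import re
import string

ANY_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def split_on_punct(token):
    """Lowercase a token and break it into sub-tokens at punctuation."""
    return ANY_PUNCT.sub(" ", token.lower()).split()


def matches_as_phrase(query, indexed):
    """Phrase-style match: sub-tokens must appear in the same order."""
    return split_on_punct(query) == split_on_punct(indexed)


def matches_any_order(query, indexed):
    """Independent-term match: every query sub-token must be present."""
    return set(split_on_punct(query)) <= set(split_on_punct(indexed))


print(matches_as_phrase("left/right", "right/left"))   # False
print(matches_any_order("left/right", "right/left"))   # True
```

The first result is what the automatic phrase query gives us today; the second is the behavior we need.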

My searches through the old material on this subject aren't really showing me much except some advice on using the copyField directive. But my understanding is that this will simply take your original input to the field, and then analyze it in two different ways depending on the field definitions. It would be very nice if it were copying the already analyzed version of the text... but that's not what it's doing, right?


Thanks for any advice on this matter.

Matt
