Re: Question about Solr Fieldtypes, Chaining of Tokenizers

Grant Ingersoll Sat, 04 Dec 2010 17:19:11 -0800

Could you expand on your example and show the output you want?  FWIW, you could 
simply write a token filter that does the same thing as the WhitespaceTokenizer.
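
To make the intended output concrete, here is a rough Python simulation of the 
quoted chain with the final whitespace split done as a filter step rather than 
a second tokenizer. It is only a sketch, not Solr or Lucene code: the function 
name analyze and the STOPWORDS set (a stand-in for stopwords.txt) are 
assumptions, and the Snowball stemming step is left out.

```python
import re
import string

# Hypothetical stand-in for stopwords.txt; the real list is not shown in the thread.
STOPWORDS = {"the", "of", "and", "a", "an"}

# string.punctuation is the ASCII punctuation set, matching Java's \p{Punct}.
PUNCT = re.escape(string.punctuation)
TRAILING_PUNCT = re.compile(rf"[{PUNCT}]+$")
ANY_PUNCT = re.compile(rf"[{PUNCT}]")


def analyze(text):
    """Simulate the quoted analysis chain (stemming omitted)."""
    # WhitespaceTokenizer: split on runs of whitespace.
    tokens = text.split()
    # First PatternReplaceFilter: drop trailing punctuation from each token.
    tokens = [TRAILING_PUNCT.sub("", t) for t in tokens]
    # LowerCaseFilter.
    tokens = [t.lower() for t in tokens]
    # StopFilter: drop stopwords (and tokens emptied by the previous steps).
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    # Second PatternReplaceFilter plus the "whitespace-splitting filter":
    # turn remaining punctuation into spaces, then split each token again.
    result = []
    for t in tokens:
        result.extend(ANY_PUNCT.sub(" ", t).split())
    return result


if __name__ == "__main__":
    print(analyze("The symposium of Tg<The>(RX3fg+and) gene studies"))
    print(analyze("left/right"), analyze("right/left."))
```

Under these assumptions the stopwords outside the allele symbol are removed, 
while "the" and "and" inside Tg<The>(RX3fg+and) survive as separate tokens, and 
"left/right" and "right/left." yield the same two tokens in a different order.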


-Grant

On Dec 3, 2010, at 1:14 PM, Matthew Hall wrote:

> Hey folks, I'm working with a fairly specific set of requirements for our 
> corpus that needs a somewhat tricky text type for both indexing and searching.
> 
> The chain currently looks like this:
> 
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.PatternReplaceFilterFactory"
>               pattern="(.*?)(\p{Punct}*)$"
>               replacement="$1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
> <filter class="solr.SnowballPorterFilterFactory"
>               language="English"
>               protected="protwords.txt"/>
> <filter class="solr.PatternReplaceFilterFactory"
>               pattern="\p{Punct}"
>               replacement=" "/>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> 
> You will notice that I'm trying to add a second tokenizer at the very end 
> of this chain. This is because the final filter replaces punctuation with 
> whitespace, and at that point I'd like to break the resulting tokens up 
> further into smaller tokens.
> 
> The reason for this is that our corpus mixes ordinary English words with 
> scientific notation.  For example, you could expect a string like "The 
> symposium of Tg<The>(RX3fg+and) gene studies" to be added to the index, and 
> parts of such phrases to be searched on.
> 
> We want to be able to remove the stopwords in the mostly English parts of 
> these types of statements, which the whitespace tokenizer, followed by 
> removing trailing punctuation, followed by the stop filter takes care of.  We 
> do not want to remove references to genetic information contained in allele 
> symbols and the like.
> 
> Sadly as far as I can tell, you cannot chain tokenizers in the schema.xml, so 
> does anyone have some suggestions on how this could be accomplished?
> 
> Oh, and let me add that the WordDelimiterFilter comes really close to what I 
> want, but since we are unwilling to move our Solr version to trunk (we are on 
> the 1.4x version at the moment), the inability to turn off the automatic 
> phrase queries makes it a no-go.  We need searches on "left/right" to match 
> "right/left."
> 
> My searches through the old material on this subject aren't really showing me 
> much except some advice on using the copyField directive.  But my 
> understanding is that this will simply take your original input to the field 
> and then analyze it in two different ways depending on the field definitions. 
> It would be very nice if it were copying the already analyzed version of the 
> text... but that's not what it's doing, right?
> 
> Thanks for any advice on this matter.
> 
> Matt
> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
