Mike, thanks a lot for your example. So the idea here would be that you
put the LowerCaseFilter after the SynonymFilter, and then you get exactly
this flexibility?

e.g.
WhitespaceTokenizer
SynonymFilter -> no lowercasing of tokens is done, as it "analyzes"
your synonyms with just the tokenizer
LowerCaseFilter

but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter -> the synonyms are lowercased, as it "analyzes"
synonyms with the tokenizer+filter
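
A toy sketch of the two orderings, using Mike's "PET" example. It assumes the proposed behavior (the synonyms file is analyzed with the tokenizer plus whatever filters precede the SynonymFilter). The function names are made up for illustration and are not Lucene or Solr APIs; matching is single-token only.

```python
# Toy model: plain functions stand in for the analysis components.
# All names here are hypothetical and are not Lucene/Solr APIs.

def whitespace_tokenize(text):
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def build_synonym_map(rules, filters_before_synonyms):
    # Assumed behavior: rule sources are analyzed with the tokenizer plus
    # every filter that sits before the synonym filter in the chain.
    syn_map = {}
    for source, target in rules.items():
        tokens = whitespace_tokenize(source)
        for f in filters_before_synonyms:
            tokens = f(tokens)
        syn_map[" ".join(tokens)] = whitespace_tokenize(target)
    return syn_map

def apply_synonyms(tokens, syn_map):
    # Single-token matching only; keeps the original and appends the expansion.
    out = []
    for t in tokens:
        out.append(t)
        out.extend(syn_map.get(t, []))
    return out

RULES = {"PET": "positron emission tomography"}

def synonyms_then_lowercase(text):
    # WhitespaceTokenizer -> SynonymFilter -> LowerCaseFilter
    syn_map = build_synonym_map(RULES, [])
    return lowercase(apply_synonyms(whitespace_tokenize(text), syn_map))

def lowercase_then_synonyms(text):
    # WhitespaceTokenizer -> LowerCaseFilter -> SynonymFilter
    syn_map = build_synonym_map(RULES, [lowercase])
    return apply_synonyms(lowercase(whitespace_tokenize(text)), syn_map)

print(synonyms_then_lowercase("a PET scan"))
print(synonyms_then_lowercase("my pet dog"))
print(lowercase_then_synonyms("my pet dog"))
```

In the first ordering only upper-case "PET" is expanded and "pet" is left alone; in the second, "pet" picks up the expansion too, because the synonym source was lowercased along with the text.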

It's already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... it's just
arbitrary that they are only being analyzed with the "tokenizer".
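
To make that concrete, here is a rough sketch of what "analyzed with just the tokenizer" does to a synonym source under two different tokenizers. The helpers are hypothetical stand-ins, not Lucene classes, and the lowercasing tokenizer is only an approximation (split on non-letters, then lowercase).

```python
import re

# Hypothetical stand-ins, not Lucene APIs.

def whitespace_tokenize(text):
    return text.split()

def lowercase_tokenize(text):
    # Rough imitation of a tokenizer that splits on non-letters and lowercases.
    return [m.lower() for m in re.findall(r"[^\W\d_]+", text)]

def synonym_keys(sources, tokenizer):
    # Synonym sources analyzed with the tokenizer only, with no later filters.
    return [" ".join(tokenizer(s)) for s in sources]

print(synonym_keys(["PET"], whitespace_tokenize))
print(synonym_keys(["PET"], lowercase_tokenize))
```

The same synonyms file ends up keyed on "PET" in one case and "pet" in the other, purely depending on whether the lowercasing happens inside the tokenizer or in a separate filter.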

On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov <soko...@ifactory.com> wrote:
> Suppose your analysis stack includes lower-casing, but your synonyms are
> only supposed to apply to upper-case tokens.  For example, "PET" might be a
> synonym of "positron emission tomography", but "pet" wouldn't be.
>
> -Mike
>
> On 04/26/2011 09:51 AM, Robert Muir wrote:
>>
>> On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com>  wrote:
>>
>>
>>>
>>> But somehow this feels bad (well, so does sticking word variations
>>> in what's supposed to be a synonyms file), partly because it means
>>> that the person adding new synonyms would need to know what they
>>> stem to (or always check it against Solr before editing the file).
>>>
>>
>> When creating the synonym map from your input file, the factory
>> currently uses only your Tokenizer to pre-process the synonyms
>> file.
>>
>> One idea would be to use the tokenstream up to the synonymfilter
>> itself (including filters). This way if you put a stemmer before the
>> synonymfilter, it would stem your synonyms file, too.
>>
>> I haven't totally thought the whole thing through to see if there's a
>> big reason why this wouldn't work (the synonymfilter is complicated,
>> sorry). But it does seem like it would produce more consistent
>> results... and perhaps the inconsistency isn't so obvious since in the
>> default configuration the synonymfilter is directly after the
>> tokenizer.
>>
>
