Re: Automatic synonyms for multiple variations of a word

Mike Sokolov Tue, 26 Apr 2011 14:03:47 -0700

Yes, I see. Makes sense. It is a bit hard to see a "bad" case for your proposal in that light. Here is one other example; I'm not sure whether it presents difficulties or not, and may be a bit contrived, but hey, food for thought at least:

Say you have set up synonyms between names and commonly-used pseudonyms or alternate names that should not be stemmed:


Malcolm X <=> Malcolm Little
Prince <=> Rogers Nelson Prince
Little Kim <=> Kimberly Denise Jones
Biggy Smalls etc.

You don't want "Malcolm Littler" or "Littlest Kim" or "Big Small" to match anything. And "Princely" shouldn't bring up the artist.

But you also have regular linguistic synonyms (not names) that *should* be stemmed (as in the original example). So little <=> small should imply littler <=> smaller and so on via stemming.

Ideally you could put one SynonymFilter before the stemming and the other one after. In that case do the SynonymFilters get composed? I can't think of a believable example where that would cause a problem, but maybe you can?
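
The two-filter arrangement described above can be sketched with a toy model. This is plain Python, not the Lucene/Solr API: toy_stem, synonym_filter and analyze are hypothetical helpers, the stemmer is a crude suffix stripper, and the second rule set is stemmed by hand. Under those assumptions the filters do compose: a token injected by the first filter is stemmed and then expanded again by the second.

```python
def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("est", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def synonym_filter(tokens, rules):
    """Append the other side of a rule when one side occurs contiguously."""
    out = list(tokens)
    for left, right in rules:
        for src, dst in ((left, right), (right, left)):
            n = len(src)
            if any(tokens[i:i + n] == src for i in range(len(tokens) - n + 1)):
                for t in dst:
                    if t not in out:
                        out.append(t)
    return out


# Name rules are matched on raw, unstemmed, case-sensitive tokens.
NAME_RULES = [
    (["Malcolm", "X"], ["Malcolm", "Little"]),
    (["Prince"], ["Rogers", "Nelson", "Prince"]),
]
# Linguistic rules are matched after stemming, so they are written as stems.
WORD_RULES = [(["littl"], ["small"])]


def analyze(text):
    tokens = text.split()                              # whitespace tokenizer
    tokens = synonym_filter(tokens, NAME_RULES)        # filter 1: before stemming
    tokens = [toy_stem(t.lower()) for t in tokens]     # lowercase + stem
    return synonym_filter(tokens, WORD_RULES)          # filter 2: after stemming


print(analyze("Malcolm Littler"))  # ['malcolm', 'littl', 'small']
print(analyze("Malcolm X"))        # ['malcolm', 'x', 'littl', 'small']
print(analyze("Princely"))         # ['princely']
```

In this sketch "Malcolm Littler" and "Princely" never trigger the name rules, but "Malcolm X" ends up carrying "small" because the injected "Little" flows through the stemmer and the second filter. Whether that composition is a problem in practice is the open question; the sketch only shows that it happens in the toy model.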


-Mike


On 04/26/2011 04:25 PM, Robert Muir wrote:

Mike, thanks a lot for your example: the idea here would be you would
put the lowercasefilter after the synonymfilter, and then you get this
exact flexibility?

e.g.
WhitespaceTokenizer
SynonymFilter ->  no lowercasing of tokens is done, as it "analyzes"
your synonyms with just the tokenizer
LowerCaseFilter

but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter ->  the synonyms are lowercased, as it "analyzes"
synonyms with the tokenizer+filter

It's already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... it's just
arbitrary that they are only being analyzed with the "tokenizer".
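
A minimal sketch of the two orderings above, using the "PET" example from earlier in the thread. It is plain Python rather than the Lucene/Solr API; analyze_rules, chain_a and chain_b are hypothetical names, and the sketch assumes the proposed behavior in which the synonym rules are analyzed with every component that precedes the SynonymFilter.

```python
RULES = {"PET": "positron emission tomography"}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_filter(tokens, syn):
    """Keep each token and append its expansion, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(syn.get(t, []))
    return out


def analyze_rules(rules, pre_filters):
    """Run both sides of each rule through the filters that precede the
    synonym filter (assumption: this mirrors the proposal in the thread)."""
    syn = {}
    for key, value in rules.items():
        k, v = [key], value.split()
        for f in pre_filters:
            k, v = f(k), f(v)
        syn[k[0]] = v
    return syn


def chain_a(text):
    # WhitespaceTokenizer -> SynonymFilter -> LowerCaseFilter
    syn = analyze_rules(RULES, [])
    return lowercase(synonym_filter(text.split(), syn))


def chain_b(text):
    # WhitespaceTokenizer -> LowerCaseFilter -> SynonymFilter
    syn = analyze_rules(RULES, [lowercase])
    return synonym_filter(lowercase(text.split()), syn)


print(chain_a("PET scan"))  # ['pet', 'positron', 'emission', 'tomography', 'scan']
print(chain_a("pet food"))  # ['pet', 'food']
print(chain_b("pet food"))  # ['pet', 'positron', 'emission', 'tomography', 'food']
```

With the synonym step first, only upper-case "PET" expands; with lowercasing first, "pet food" expands as well, so the position of the filter selects the behavior.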

On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov <soko...@ifactory.com> wrote:

Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens.  For example, "PET" might be a
synonym of "positron emission tomography", but "pet" wouldn't be.

-Mike

On 04/26/2011 09:51 AM, Robert Muir wrote:

On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:

But somehow this feels bad (well, so does sticking word variations in
what's supposed to be a synonyms file), partly because it means that the
person adding new synonyms would need to know what they stem to (or
always check it against Solr before editing the file).

when creating the synonym map from your input file, currently the
factory actually uses your Tokenizer only to pre-process the synonyms
file.

One idea would be to use the tokenstream up to the synonymfilter
itself (including filters). This way if you put a stemmer before the
synonymfilter, it would stem your synonyms file, too.
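
As a rough illustration of that idea, the sketch below builds a synonym map two ways: from the raw rule text (tokenizer only) and from rule text run through the same filters that precede the synonym step. It is a toy model in plain Python, not the Solr factory code; toy_stem, build_synonym_map and expand are hypothetical helpers, and only single-token rules are handled.

```python
def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("est", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text, filters):
    tokens = text.split()  # whitespace tokenizer
    for f in filters:
        tokens = [f(t) for t in tokens]
    return tokens


def build_synonym_map(rules, filters):
    """Analyze both sides of each single-token rule with the given filters."""
    syn = {}
    for left, right in rules:
        (l,) = analyze(left, filters)
        (r,) = analyze(right, filters)
        syn.setdefault(l, set()).add(r)
        syn.setdefault(r, set()).add(l)
    return syn


def expand(text, filters, syn):
    out = []
    for tok in analyze(text, filters):
        out.append(tok)
        out.extend(sorted(syn.get(tok, ())))
    return out


rules = [("little", "small")]
pre = [str.lower, toy_stem]  # filters placed before the synonym step

tokenizer_only = build_synonym_map(rules, [])   # rules see the tokenizer only
full_chain = build_synonym_map(rules, pre)      # rules see tokenizer + filters

print(expand("little", pre, tokenizer_only))   # ['littl']
print(expand("Littler", pre, full_chain))      # ['littl', 'small']
print(expand("smallest", pre, full_chain))     # ['small', 'littl']
```

In the tokenizer-only case the map is keyed on "little" while the stream carries the stem "littl", so even the unmodified word fails to match. Analyzing the rules with the full chain keeps both sides in the same form.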

I haven't totally thought the whole thing through to see if there's a
big reason why this wouldn't work (the synonymfilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isn't so obvious since in the
default configuration the synonymfilter is directly after the
tokenizer.
