Hi Mark, thanks for your response.

Here are my thoughts on your suggestion:

I believe it would be a good idea to merge similar query expansion code.
I also agree that the fuzzy query situation is similar to the synonym
query use case, in the sense of having a root term with some related,
de-boosted terms around it. However, I also see some significant
differences between the two scenarios. I'll try to explain my view, but
first I'll briefly divide our solution into sub-elements:
1. Expanding the root term to related terms.
        1.a     Creating a de-boosted additional term.
        1.b     Making sure the IDF normalization (sumOfSquares/normalize)
                for root terms (entered by the user) takes into account
                only the original term, and normalizes the expanded term
                accordingly.
2. Result aggregation for different terms generated from the same root.
        2.a     In the synonym solution, we attempted to sum the
                occurrences of all terms generated from the same root, as
                if they were the same term, and only then apply the tf
                function to this frequency. I.e. if "car" and "automobile"
                appear 5 times each in the same document, calculate the
                term frequency for a frequency of 10 (= 2 x 5 :-). This
                replaces the OR (aggregated as either max or sum) approach
                usually taken in such cases.
        2.b     Applying 2.a alone carries a problem: how do you count
                rare and common terms together? Applied as is, it may give
                a bonus to common terms, whose frequency is presumed to be
                "cheap" and which Lucene compensates for with a low IDF
                value. Here, however, that frequency is counted under the
                IDF of the root term. The solution was to normalize the
                frequencies to the root term (BTW, an inverse
                normalization to 1.b).
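
To make part 2 more concrete, here is a minimal sketch in Python (not the
Java of the actual implementation). The idf and tf formulas are modelled on
Lucene's DefaultSimilarity. The scaling by (idf_variant / idf_root)^2 in the
2.b step is only an assumption for illustration, since the exact
normalization is not spelled out in this mail. The names combined_tf, freqs
and doc_freqs are made up for the example.

```python
import math


def idf(doc_freq, num_docs):
    # Modelled on Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(freq):
    # Modelled on Lucene's DefaultSimilarity: square root of the frequency.
    return math.sqrt(freq)


def combined_tf(root, freqs, doc_freqs, num_docs):
    """Part 2: count all variants of one root as a single term.

    freqs     -- term -> occurrences in the current document
    doc_freqs -- term -> number of documents containing the term
    """
    root_idf = idf(doc_freqs[root], num_docs)
    total = 0.0
    for term, freq in freqs.items():
        # 2.b (ASSUMED formula): scale a variant's frequency by how its IDF
        # compares to the root's, so common variants are not over-counted.
        scale = (idf(doc_freqs[term], num_docs) / root_idf) ** 2
        total += freq * scale
    # 2.a: apply tf once, to the summed frequency.
    return tf(total)
```

With equally rare "car" and "automobile" appearing 5 times each, the scale
is 1 for both and the result is tf(10). If "automobile" is much more common
than "car", its 5 occurrences contribute less than 5 to the sum.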

Going back to the discussion on how similar FuzzyLikeThisQuery and
SynonymsQuery are, I believe that part 1 of our solution also applies to
the fuzzy case, with the addition of a differential boost factor (your
suggestion of edit distance as a measure for this de-boost makes perfect
sense to me). On the other hand, part 2 of our solution does not seem
helpful to me in the fuzzy case. I would assume that different fuzzy
forms will not appear in the same document, but will tend to appear in
different documents. Say you are looking for my surname "gome". Some
people may misspell it as "gomeh" or "goma", but I would dare to state
that if both "gomeh" and "gome" appear together in the same document,
the author meant two different things. Naturally, the possibility of
typos exists, but it would probably not generate a lot of occurrences.
For that reason, I would not adopt part 2 for the fuzzy case, but stick
to aggregating the results in an OR fashion (i.e. "gome" OR "gomeh" OR
"goma"). Applying part 2 of the synonym solution is harmless IMHO, but
the scorer works harder (a lot of SQR/SQRT).
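
As a rough illustration of the fuzzy case described above (part 1 plus
plain OR aggregation), here is a small Python sketch. It is not the
FuzzyLikeThisQuery code. The boost formula in variant_boost is a
hypothetical choice (1 minus edit distance over the longer length), and
or_score and its parameters are names invented for the example.

```python
import math


def edit_distance(a, b):
    # Standard Levenshtein distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def variant_boost(root, variant):
    # HYPOTHETICAL de-boost: the further from the root, the lower the boost.
    d = edit_distance(root, variant)
    return max(0.0, 1.0 - d / max(len(root), len(variant)))


def or_score(root, freqs, root_idf, aggregate=max):
    """Score one document for a root term and its fuzzy variants.

    freqs     -- variant -> occurrences in the document
    root_idf  -- IDF of the root term, used for every variant (part 1.b)
    aggregate -- max or sum, the two usual OR aggregations
    """
    parts = [variant_boost(root, term) * math.sqrt(freq) * root_idf ** 2
             for term, freq in freqs.items() if freq > 0]
    return aggregate(parts) if parts else 0.0
```

Each variant is scored on its own with its own tf, so no frequencies are
summed across variants; only the boost and the shared root IDF tie them
together.
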
BTW, in my other post on this matter, I referred to another case where
we applied part 1 of this solution: searching through 2 fields, main and
secondary (for example, content text vs. title). Here we are talking
about the same term, but it is expanded to different fields, sometimes
with very different IDF values. The problem here is again similar to the
synonym and fuzzy cases, as IDFs tend to mix across fields and yield
unexpected results.
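
A small Python sketch of the two-field case, under stated assumptions: the
main field ("content" here) plays the role of the root, its IDF is used for
both fields, and the secondary field gets a boost of its own. The field
names, the boosts argument and the choice to sum the per-field parts are
all illustrative, not taken from our code.

```python
import math


def idf(doc_freq, num_docs):
    # Modelled on Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_score(freq_by_field, doc_freq_by_field, num_docs,
                root_field="content", boosts=None):
    """Score one term expanded over several fields with a single IDF.

    freq_by_field     -- field -> occurrences of the term in this document
    doc_freq_by_field -- field -> documents containing the term in that field
    boosts            -- field -> boost (fields not listed default to 1.0)
    """
    boosts = boosts or {}
    # ASSUMPTION: only the root field's IDF is used, so a term that is rare
    # in the title but common in the content is not inflated.
    root_idf = idf(doc_freq_by_field[root_field], num_docs)
    score = 0.0
    for field, freq in freq_by_field.items():
        score += boosts.get(field, 1.0) * math.sqrt(freq) * root_idf ** 2
    return score
```

Because only the root field's document frequency enters the formula, the
score does not change when the term's document frequency in the title
changes.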

Summary :-)

My feeling is that part 1 makes sense for other cases of query expansion
from a root term, such as the fuzzy case, but part 2 does not, although
it will still generate valid results. Part 2 is generally implemented as
a specialized scorer, so perhaps the Query class can be unified, with
the specialized synonym scorer constructed only where you wish to "count
as one term", which I believe is not the case for the fuzzy situation.

Thanx for your time,
Ziv Gome

-----Original Message-----
From: mark harwood [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 06, 2006 5:22 PM
To: java-user@lucene.apache.org
Subject: Re: Search for synonyms - implemenetation for review

Sounds like you've been tackling a number of the
issues I was concerned with in "fuzzy" searching. It's
essentially the same problem - the user types one word
and the engine searches for several variants.

The FuzzyLikeThisQuery class in the "queries" module
of the contrib area in SVN contains similar code. It
addresses idf and coord issues introduced with fuzzy
variants. 

It's probably worth considering having one
implementation for generically scoring variants
whether they are produced by fuzzy algorithms or
synonyms or any other means. In either case there
could be a "cost" factor associated with variants
which could be based on the fuzzy edit distance from
the root term or synonym "relatedness" to the root
term.

I'll have a look at your implementation with this in
mind when I have a bit more time.

Cheers,
Mark



                